Writing & Running Pipelines on the Open Grid Engine using QMake. Wibowo Arindrarto DTLS Focus Meeting 15.04.2014

Makefile (re)introduction
- Atomic recipes / rules that define full pipelines
- Initially written for compiling source code files into executables
- Repurposed for data processing pipelines

Makefile Recipes
- Atomic recipes / rules that define full pipelines

    ## This is my Makefile
    all: sample.bam

    %.sam: %.fastq hg19.ref
        bowtie --sam $(word 2,$^) $< > $@

    %.bam: %.sam hg19.fa
        samtools view -bt $(word 2,$^) -o $@ $<

- Workflows are determined implicitly, depending on the main target
- Some useful shorthands for common patterns, e.g. in the rule %.sam: %.fastq
    '%'  denotes the common part (stem) of the file names
    $^   list of all dependencies
    $<   first dependency
    $@   the target being built
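To make the shorthands concrete, here is a minimal Python sketch of how a pattern rule might be matched against a target and how the automatic variables are then filled in. It is an illustration only, not how make is implemented; the function names match_stem and expand_rule are made up for this example, and only the single $(word N,$^) form used on the slide is handled.

```python
def match_stem(pattern, target):
    """Return the text that '%' matches in `target`, or None if no match."""
    prefix, _, suffix = pattern.partition("%")
    long_enough = len(target) > len(prefix) + len(suffix)  # stem must be non-empty
    if target.startswith(prefix) and target.endswith(suffix) and long_enough:
        return target[len(prefix):len(target) - len(suffix)]
    return None


def expand_rule(target_pattern, prereq_patterns, recipe, target):
    """Expand one pattern rule for a concrete target.

    Returns (prerequisites, command), or None if the rule does not apply.
    """
    stem = match_stem(target_pattern, target)
    if stem is None:
        return None
    prereqs = [p.replace("%", stem, 1) for p in prereq_patterns]
    command = recipe
    # $(word N,$^) is 1-based; replace it before the bare $^ is touched.
    for n, name in enumerate(prereqs, start=1):
        command = command.replace("$(word %d,$^)" % n, name)
    first = prereqs[0] if prereqs else ""
    command = (command.replace("$@", target)
                      .replace("$<", first)
                      .replace("$^", " ".join(prereqs)))
    return prereqs, command


prereqs, command = expand_rule(
    "%.sam", ["%.fastq", "hg19.ref"],
    "bowtie --sam $(word 2,$^) $< > $@",
    "sample.sam",
)
print(prereqs)
print(command)
```

Asking for sample.bam would apply the %.bam rule first, which in turn requests sample.sam and triggers the expansion above; that chaining is what "determined implicitly" refers to.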

Why Makefiles?
- Many alternatives already exist: Ruffus, Snakemake, bpipe, etc.
- None parallelizes as easily as (q)make on our cluster
- Coming up with a defined approach for Makefile pipelines & writing helper scripts is the way to go for now
- This is not a solved problem!

Makefiles are neat
- Parallelization across cores (make) and nodes (qmake)
- Resume runs from failure points
- Easy to define dependencies among steps
- Close to the shell environment
- Already used in some of our earlier internal pipelines, e.g. GAPSS3
- Big upgrade from shell / python / perl scripts!
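Resuming from a failure point falls out of the timestamp check make performs: a target is rebuilt only if it is missing or older than one of its prerequisites. The Python sketch below mimics that check under simplifying assumptions (a missing prerequisite is simply treated as "rebuild"); needs_rebuild and demo are hypothetical names for this illustration.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """True if `target` is missing, or any prerequisite is missing or newer."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(
        not os.path.exists(p) or os.path.getmtime(p) > target_mtime
        for p in prerequisites
    )


def demo():
    """Simulate a run that produced sample.sam but failed before sample.bam."""
    with tempfile.TemporaryDirectory() as workdir:
        fastq = os.path.join(workdir, "sample.fastq")
        sam = os.path.join(workdir, "sample.sam")
        bam = os.path.join(workdir, "sample.bam")
        # Create the two existing files with explicit, ordered timestamps.
        for path, mtime in ((fastq, 1000), (sam, 2000)):
            open(path, "w").close()
            os.utime(path, (mtime, mtime))
        return needs_rebuild(sam, [fastq]), needs_rebuild(bam, [sam])


print(demo())  # (False, True): alignment is skipped, only the BAM step reruns
```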

GAPSS3
- Makefile-based pipeline for exome and genome alignment
- Designed to be run on multi-core machines (or a cluster)
- Run as a regular Makefile:
    $ make -f Exome.mk
    $ qmake -cwd -inherit -- -j 5 -f Exome.mk
- Worked as intended, but highlights areas where we can improve...

Problems Encountered
Bioinformatics pipelines (vs software build systems):
- More moving parts (aligners, variant callers): it should be easier to swap parts in and out
- More experimental in nature: it should be easy to play with program option flags
- More investigative in nature: it should be easy to generate reports for diagnosis

Rig: Framework on Top of Make
Core idea
- Modules: a module is a single unit that performs a useful function
- Each module is standalone, but modules can also be combined to create another module
Implementation
- Each module: recipe file + config file
- Two types of modules: tool wrappers and pipelines

Rig: Framework on Top of Make
Logged by default
- stdout, stderr, and job details (qmake only)
Dynamic options
- Variables defined inside the config
- Command-line flags can be changed on the fly:
    $ qmake -- -j 5 -f pipeline.mk OPT_BOWTIE_m=1
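The dynamic-options idea rests on a standard make behaviour: a NAME=VALUE argument on the command line overrides an ordinary assignment of the same variable in the makefile. The Python sketch below models that precedence. How Rig actually turns such variables into tool flags is not shown in the slides, so the OPT_<TOOL>_<flag> naming convention and the --flag rendering are assumptions inferred from the single OPT_BOWTIE_m=1 example; parse_overrides and tool_flags are hypothetical helpers.

```python
def parse_overrides(argv):
    """Collect NAME=VALUE arguments, the form make accepts for variables."""
    overrides = {}
    for arg in argv:
        name, sep, value = arg.partition("=")
        if sep and name.isidentifier():
            overrides[name] = value
    return overrides


def tool_flags(tool, config, overrides):
    """Merge config defaults with overrides and render flags for one tool.

    Assumption: variables named OPT_<TOOL>_<flag> become '--<flag> <value>'.
    """
    prefix = "OPT_%s_" % tool.upper()
    merged = dict(config)
    merged.update(overrides)  # command-line values win over config values
    flags = []
    for name in sorted(merged):
        if name.startswith(prefix):
            flags += ["--" + name[len(prefix):], merged[name]]
    return flags


# Defaults as they might appear in a module's config file (hypothetical values).
config = {"OPT_BOWTIE_m": "3", "OPT_BOWTIE_k": "1"}
argv = ["-j", "5", "-f", "pipeline.mk", "OPT_BOWTIE_m=1"]
print(tool_flags("bowtie", config, parse_overrides(argv)))
```

Keeping the defaults in the config file and the overrides on the command line means an experiment with a different flag value needs no edit to the module itself.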

Module Structure
sample.mk: recipes

    %.sam: %.1.fastq %.2.fastq
        $(BOWTIE) $(IDX) $^ > $@

    %.bam: %.sam
        $(SAMTOOLS) view -bt $(REF) -o $@ $<

sample.mkc: config

    INPUTS := mine.1.fastq mine.2.fastq
    IDX := /usr/local/indices/hg19
    REF := /usr/local/genomes/hg19.fa
    BOWTIE := /usr/bin/bowtie
    SAMTOOLS := /usr/bin/samtools

- Cleaner separation of module logic & module components
- Easier setup of required variables (e.g. reporting variables)

Module Types
Pipeline
- Multiple recipes (similar to GAPSS)
Tool wrapper
- Single recipe, 'wraps' a single command line tool

    $ bowtie --m 1 /usr/local/indices/hg19 sample.fq > sample.sam
    $ qmake -- -j 5 -f pipeline.mk OPT_BOWTIE_m=1

    %.sam: %.fq
        bowtie /usr/local/indices/hg19 $^ --m 1 $< > $@

    # instead
    %.sam: %.fasta
        $(MAKE) $(MODULE_ALIGNER) $(OPT_ALIGNER)

Other Additions
- Python scripts to handle boilerplate code

    $ rig_gen.py tool bowtie2
    # creates a template tool wrapper named bowtie2

- Python module for exploring job logs

    >>> from rig import RigRun
    >>> run = RigRun('my_pipeline', '/path/to/log/directory')
    >>> for module in run:
    ...     for job in module:
    ...         print job.id, job.start
    my_module.12412 datetime.datetime(2013, 6, 26, 0, 54, 16, 867242)

- Nameset files for defined input patterns

In Progress
- Tool wrappers: 41 (bowtie, cufflinks, sickle, etc.)
- Pipelines: 14
    - Customizable QC pipeline (FastQC, sickle, cutadapt)
    - Gentrap v2.0 (using the QC pipeline, 2 new aligners)
    - GATK best practices pipeline (one module per phase)
    - Deepsage pipeline using genome & 'transcriptome' alignment
    - and more..
- Identical logging for make and qmake
- Tests..?

Compromises
- File sync problem: not cleanly handled by qmake
- We had to hack into it and use a custom shell wrapper to ensure dependencies are available before each job
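The slides do not show the custom wrapper itself. As a rough illustration of the idea only, the Python sketch below polls until every dependency is visible on the local filesystem before a job would be started; it is a stand-in written in Python rather than shell, and the function name, timeout, and polling interval are all assumptions.

```python
import os
import time


def wait_for_files(paths, timeout=60.0, interval=1.0):
    """Block until every path exists.

    Returns True once all files are visible, or False if `timeout`
    seconds pass first (e.g. a shared filesystem that is slow to sync).
    """
    deadline = time.monotonic() + timeout
    while True:
        missing = [p for p in paths if not os.path.exists(p)]
        if not missing:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# A path that never appears makes the wait give up after the timeout.
print(wait_for_files(["/nonexistent/sample.sam"], timeout=0.05, interval=0.01))
```

A wrapper built on this would run the real recipe command only when the function returns True, and fail the job otherwise so that make can report the step as broken.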

Acknowledgements
- Jeroen Laros
- Leon Mei
- Martijn Vermaat
- Martin van den Kerkhoff
- Michiel van Galen
- Peter van 't Hof
- Zuotian Tatum
- Sander van der Zeeuw
- Wai Yi Leung

Initial Development Model
- Single git repository: core library, pipelines, tool wrappers
- Dependency problem: one tool, multiple pipelines?
- Challenge: how to version a repo within a repo?
- Choice between git submodules and git subtree

subtree vs submodule
git subtree:
- Add sub repository as a folder under the main repository (as a remote)
- Can push to sub-repository selectively
- Can pull entire sub-repository history
git submodule:
- Add sub repository under the main repository, but not under git (?)
- Requires an additional .gitmodules file (which is versioned)
- Cloning is messy..

Workflow
- Create tool wrappers, push to remote repo
- Create pipeline, and then add modules:
    git remote add mod_tool ...
    git subtree add -P {path} --squash mod_tool/master
- Work on pipeline, work on tool
- Pull from tool repo:
    git subtree merge -P {path} --squash mod_tool/master
- Push to tool repo:
    git subtree push -P {path} mod_tool master
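For scripting this workflow, the commands can be assembled as argument lists and handed to subprocess. The sketch below only builds and prints them; nothing is executed. It uses git subtree pull (fetch and merge in one step) in place of the slide's subtree merge, and the remote URL, prefix path, and the helper name subtree_commands are hypothetical.

```python
import shlex


def subtree_commands(remote, url, prefix, branch="master"):
    """Return the git commands for one tool module as argv lists."""
    prefix_arg = "--prefix=" + prefix
    return {
        # Register the tool repository as a named remote.
        "remote": ["git", "remote", "add", remote, url],
        # Import the tool into the pipeline repo, squashing its history.
        "add": ["git", "subtree", "add", prefix_arg, "--squash", remote, branch],
        # Bring in later changes made in the tool repo.
        "pull": ["git", "subtree", "pull", prefix_arg, "--squash", remote, branch],
        # Send changes made under the prefix back to the tool repo.
        "push": ["git", "subtree", "push", prefix_arg, remote, branch],
    }


commands = subtree_commands(
    "mod_tool",
    "git@example.org:tools/bowtie.git",  # hypothetical URL
    "modules/bowtie",                    # hypothetical path in the pipeline repo
)
for step in ("remote", "add", "pull", "push"):
    print(" ".join(shlex.quote(arg) for arg in commands[step]))
```

Each list can be passed to subprocess.run(..., check=True) from the root of the pipeline repository once the values are replaced with real ones.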

git subtree considerations
Advantages:
- Everything is a regular file
- Makes releases easy
- Clone as usual
Downsides:
- History becomes messy

Demo time!
- bioassisst: simple pipeline that maps FASTQ files into BAM files