Writing & Running Pipelines on the Open Grid Engine using QMake Wibowo Arindrarto DTLS Focus Meeting 15.04.2014
Makefile (re)introduction

Atomic recipes / rules that define full pipelines
Initially written for compiling source code files into executables
Repurposed for data processing pipelines
Makefile Recipes

Atomic recipes / rules that define full pipelines

## This is my Makefile
all: sample.bam

%.sam: %.fastq hg19.ref
	bowtie --sam $(word 2,$^) $< > $@

%.bam: %.sam hg19.fa
	samtools view -bt $(word 2,$^) -o $@ $<

Workflows are determined implicitly, depending on the main target
Some useful aliases for common patterns, e.g. %.sam: %.fastq
  '%'  denotes the common part (the stem)
  $^   list of all dependencies
  $<   first dependency
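The automatic variables above can be checked with a throwaway Makefile. This is a minimal sketch (directory and file names are invented; a one-line `;` recipe is used to sidestep tab indentation in the heredoc):

```shell
# Toy Makefile that echoes its automatic variables instead of running
# real tools, so we can see how %, $<, $^ and $(word) expand.
mkdir -p /tmp/mkdemo && cd /tmp/mkdemo
touch sample.fastq hg19.ref

cat > Makefile <<'EOF'
all: sample.sam
# '%' matches the stem ("sample"); hg19.ref is the second dependency
%.sam: %.fastq hg19.ref ; @echo "target=$@ first=$< all=$^ second=$(word 2,$^)" && touch $@
EOF

make
# target=sample.sam first=sample.fastq all=sample.fastq hg19.ref second=hg19.ref
```

Running `make` a second time does nothing, because `sample.sam` is already newer than its dependencies.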
Why Makefiles?

Many alternatives already exist
  Ruffus, Snakemake, bpipe, etc.
None parallelizes as easily as (q)make on our cluster
Coming up with a defined approach for Makefile pipelines & writing helper scripts is the way to go for now
This is not a solved problem!
Makefiles are neat

Parallelization ~ for cores (make) and nodes (qmake)
Resume runs from failure points
Easy to define dependencies among steps
Close to the shell environment
Already used in some of our earlier internal pipelines, e.g. GAPSS3
Big upgrade from shell / python / perl scripts!
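The resume-from-failure behaviour can be sketched with a two-step toy pipeline (file names are invented; `cp` stands in for the real tools):

```shell
# Two chained rules; deleting an intermediate result and re-running
# make rebuilds only the missing step -- this is what lets a failed
# pipeline resume from its failure point instead of starting over.
mkdir -p /tmp/resumedemo && cd /tmp/resumedemo
touch in.fastq

cat > Makefile <<'EOF'
all: out.bam
out.sam: in.fastq ; @cp $< $@ && echo step1-ran
out.bam: out.sam ; @cp $< $@ && echo step2-ran
EOF

make          # both steps run
rm out.bam    # simulate a job that died before writing its output
make          # only step2-ran is printed; step 1 is not redone
```

With `make -j N` (or `qmake -- -j N`) independent rules additionally run in parallel, since make knows the full dependency graph.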
GAPSS3

Makefile-based pipeline for exome and genome alignment
Designed to be run on multi-core machines (or a cluster)
Run as a regular Makefile:
$ make -f Exome.mk
$ qmake -cwd -inherit -- -j 5 -f Exome.mk
Worked as intended, but highlighted areas where we can improve...
Problems Encountered

Bioinformatics pipelines (vs software build systems):
More moving parts (aligners, variant callers)
  should be easier to swap parts in and out
More experimental in nature
  should be easy to play with program option flags
More investigative in nature
  should be easy to generate reports for diagnosis
Rig: Framework on Top of Make

Core idea
  Modules: single units that each perform a useful function
  Each module is standalone but can also be combined to create another module
Implementation
  Each module: recipe file + config file
  Two types of modules: tool wrappers and pipelines
Rig: Framework on Top of Make

Logged by default
  Variables defined inside the config
  stdout, stderr, and job details (qmake only)
Dynamic options
  Command-line flags change options on the fly
$ qmake -- -j 5 -f pipeline.mk OPT_BOWTIE_m=1
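Plain make already supports this overriding: a variable given on the command line beats any `?=` default in the makefile. A minimal sketch (the file and the default value are invented):

```shell
mkdir -p /tmp/optdemo && cd /tmp/optdemo

cat > pipeline.mk <<'EOF'
# default option value; a command-line assignment overrides it
OPT_BOWTIE_m ?= 0
all: ; @echo "bowtie -m $(OPT_BOWTIE_m)"
EOF

make -f pipeline.mk                   # prints: bowtie -m 0
make -f pipeline.mk OPT_BOWTIE_m=1    # prints: bowtie -m 1
```

qmake passes such `VAR=value` arguments through to make unchanged, which is what makes the `OPT_BOWTIE_m=1` invocation above work on the cluster.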
Module Structure

sample.mk: recipes
%.sam: %.1.fastq %.2.fastq
	$(BOWTIE) $(IDX) $^ > $@
%.bam: %.sam
	$(SAMTOOLS) view -bt $(REF) -o $@ $<

sample.mkc: config
INPUTS   := mine.1.fastq mine.2.fastq
IDX      := /usr/local/indices/hg19
REF      := /usr/local/genomes/hg19.fa
BOWTIE   := /usr/bin/bowtie
SAMTOOLS := /usr/bin/samtools

Cleaner separation of module logic & module components
Easier setup of required variables (e.g. reporting variables)
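One plain-make way to get this recipe/config split (Rig's actual loading mechanism is not shown in the slides) is to `include` the config file from the recipe file; the file contents below are invented for the sketch:

```shell
mkdir -p /tmp/moddemo && cd /tmp/moddemo

cat > sample.mkc <<'EOF'
# config: only variable definitions live here
GREETING := hello-from-config
EOF

cat > sample.mk <<'EOF'
# recipes: pull in the config, then use its variables
include sample.mkc
all: ; @echo $(GREETING)
EOF

make -f sample.mk    # prints: hello-from-config
```

Keeping all paths and tool locations in the `.mkc` file means a module can be retargeted to another machine by editing one file, without touching the recipes.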
Module Types

Pipeline: multiple recipes (similar to GAPSS)
Tool wrapper: single recipe, 'wraps' a single command line tool

$ bowtie -m 1 /usr/local/indices/hg19 sample.fq > sample.sam

%.sam: %.fq
	bowtie -m 1 /usr/local/indices/hg19 $< > $@

# instead, a pipeline delegates to the tool wrapper module:
%.sam: %.fasta
	$(MAKE) $(MODULE_ALIGNER) $(OPT_ALIGNER)

$ qmake -- -j 5 -f pipeline.mk OPT_BOWTIE_m=1
Other Additions

Python scripts to handle boilerplate code
$ rig_gen.py tool bowtie2  # creates a template tool wrapper named bowtie2

Python module for exploring job logs
>>> from rig import RigRun
>>> run = RigRun('my_pipeline', '/path/to/log/directory')
>>> for module in run:
...     for job in module:
...         print job.id, job.start
my_module.12412 datetime.datetime(2013, 6, 26, 0, 54, 16, 867242)

Nameset files for defined input patterns
In Progress

Tool wrappers: 41
  bowtie, cufflinks, sickle, etc.
Pipelines: 14
  Customizable QC pipeline (FastQC, sickle, cutadapt)
  Gentrap v2.0 (using the QC pipeline, 2 new aligners)
  GATK best practices pipeline (one module per phase)
  Deepsage pipeline using genome & 'transcriptome' alignment
  and more..
Identical logging for make and qmake
Tests..?
Compromises File sync problem: not cleanly handled by qmake We had to hack into it and use a custom shell wrapper to ensure dependencies are available before each job.
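One way such a shell wrapper can look — a hedged sketch, not the actual Rig implementation: the wrapper name, the `RIG_DEPS` variable, and the retry limit are all invented here, and a real setup would derive `RIG_DEPS` from the rule's dependency list rather than hard-coding it:

```shell
mkdir -p /tmp/syncdemo && cd /tmp/syncdemo

cat > wait_shell.sh <<'EOF'
#!/bin/sh
# make invokes us as: wait_shell.sh -c '<recipe line>'
# Poll until every file listed in $RIG_DEPS is visible (bounded retries,
# to paper over NFS attribute-cache lag on cluster nodes), then hand
# the recipe to the real shell.
for dep in $RIG_DEPS; do
    tries=0
    while [ ! -e "$dep" ] && [ "$tries" -lt 30 ]; do
        sleep 1
        tries=$((tries + 1))
    done
done
exec /bin/sh "$@"
EOF
chmod +x wait_shell.sh

cat > Makefile <<'EOF'
SHELL := ./wait_shell.sh
all: out.txt
# hard-coded for the sketch; a real setup would fill this in per rule
out.txt: export RIG_DEPS = dep.txt
out.txt: dep.txt ; cat $< > $@
EOF

echo data > dep.txt
make
```

Because `SHELL` is set inside the Makefile, every recipe line transparently goes through the wrapper; the recipes themselves stay unchanged.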
Acknowledgements Jeroen Laros Leon Mei Martijn Vermaat Martin van den Kerkhoff Michiel van Galen Peter van 't Hof Zuotian Tatum Sander van der Zeeuw Wai Yi Leung
Initial Development Model

Single git repository: core library, pipelines, tool wrappers
Dependency problem: one tool, multiple pipelines?
Challenge: how to version a repo within a repo?
Choice between git submodules and git subtree
subtree vs submodule

git subtree:
  Adds the sub-repository as a folder under the main repository (as a remote)
  Can push to the sub-repository selectively
  Can pull the entire sub-repository history
git submodule:
  Adds the sub-repository under the main repository, but not under git (?)
  Requires an additional .gitmodules file (which is versioned)
  Cloning is messy..
Workflow

Create tool wrappers, push to remote repo
Create pipeline, and then add modules:
  git remote add mod_tool ...
  git subtree add -P {path} --squash mod_tool/master
Work on pipeline, work on tool
Pull from tool repo:
  git subtree merge -P {path} --squash mod_tool/master
Push to tool repo:
  git subtree push -P {path} mod_tool/master
git subtree considerations

Advantages:
  Everything is a regular file
  Makes releases easy
  Clone as usual
Downsides:
  History becomes messy
Demo time!

bioassisst: a simple pipeline that maps FASTQ files into BAM files