Bringing Hadoop into Bioinformatics with Cloudgene and CloudMan

Similar documents

Cloudflow A Framework for MapReduce Pipeline Development in Biomedical Research

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Big Data Training - Hackveda

Hadoopizer : a cloud environment for bioinformatics data analysis

Big Data Use Case: Business Analytics

File S1: Supplementary Information of CloudDOE

Running and Enhancing your own Galaxy. Daniel Blankenberg The Galaxy Team

Workshop on Hadoop with Big Data

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Hadoop & Spark Using Amazon EMR

Introduction to Big Data Training

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

TRAINING PROGRAM ON BIGDATA/HADOOP

Support for data-intensive computing with CloudMan

Ansible in Depth WHITEPAPER. ansible.com

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Pivotal HD Enterprise

Cloud-Based Big Data Analytics in Bioinformatics

Building Your Big Data Team

Enabling cloud for e-science with OpenNebula

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Peers Techno log ies Pv t. L td. HADOOP

Student Project 1 - Explorative Data Analysis with Hadoop and Spark

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Open source software for building a private cloud

Upcoming Announcements

Learning Tree Training Pre-approved Training for Continuing Education Units (CEUs)

How To Write A Nosql Database In Spring Data Project

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Map Reduce & Hadoop Recommended Text:

-> Integration of MAPHiTS in Galaxy

From Spark to Ignition:

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

HOPS: Hadoop Open Platform-as-a-Service

Cloud computing - Architecting in the cloud

Automating Big Data Benchmarking for Different Architectures with ALOJA

The Future of Data Management

Unified Batch & Stream Processing Platform

CSE-E5430 Scalable Cloud Computing. Lecture 4

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Hadoop Job Oriented Training Agenda

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Big Data Web Analytics Platform on AWS for Yottaa

Data processing goes big

S06: Open-Source Stack for Cloud Computing

CLOUD MANAGED SERVICES FRAMEWORK E-BOOK

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

Azure Data Lake Analytics

Amazon Elastic MapReduce. Jinesh Varia Peter Sirota Richard Cole

ITG Software Engineering

Real Time Big Data Processing

Data Analytics Infrastructure

Razvoj Java aplikacija u Amazon AWS Cloud: Praktična demonstracija

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Synthetic Data Generation for Realistic Analytics Examples and Testing

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Amazon EC2 Product Details Page 1 of 5

Big Data at Cloud Scale

New solutions for Big Data Analysis and Visualization

Ali Ghodsi Head of PM and Engineering Databricks

Chase Wu New Jersey Ins0tute of Technology

The Inside Scoop on Hadoop

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Timofey Turenko. Kirill Krinkin St-Petersburg Electrotechnical University

Cloudera Manager Training: Hands-On Exercises

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

ASAM ODS, Peak ODS Server and openmdm as a company-wide information hub for test and simulation data. Peak Solution GmbH, Nuremberg

WHAT S NEW IN SAS 9.4

HPC ABDS: The Case for an Integrating Apache Big Data Stack

Native Connectivity to Big Data Sources in MSTR 10

ArcGIS for Server in the Cloud

Big Data and Hadoop. Module 1: Introduction to Big Data and Hadoop. Module 2: Hadoop Distributed File System. Module 3: MapReduce

Keyword: YARN, HDFS, RAM

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Scalable Architecture on Amazon AWS Cloud

Oracle Big Data Fundamentals Ed 1 NEW

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Integrating SAP BusinessObjects with Hadoop. Using a multi-node Hadoop Cluster

Deploy Your First CF App on Azure with Template and Service Broker. Thomas Shao, Rita Zhang, Bin Xia Microsoft Azure Team

Transcription:

Bringing Hadoop into Bioinformatics with Cloudgene and CloudMan Sebastian Schönherr, Lukas Forer, Davor Davidovic, Hansi Weissensteiner, Florian Kronenberg, Enis Afgan Dublin, BOSC 2015

All started at BOSC 2012

BOSC 2012

BOSC 2012 - CloudMan Cluster on the Cloud for everyone Configures Galaxy automatically Features Private/public cloud support, Instance sharing, dynamic cluster scaling, Persistent storage, re-launch your cluster Enis Afgan, Johns Hopkins University & RBI

CloudMan 2015 Cloud manager in several cloud infrastructures Amazon AWS: Since 2010 Nectar: Since 2012 Jetstream: Coming late 2015 EGI ENGAGE H2020 project Deploy your own version of Galaxy on the Cloud Using Ansible playbook + Packer https://github.com/galaxyproject/galaxy-cloudmanplaybook

BOSC 2012

BOSC 2012 - Cloudgene Improve usability of Hadoop in Bioinformatics A graphical execution platform for Hadoop programs Interface to integrate programs (YAML) Combine several programs into a workflow Setting up a Hadoop cluster on the cloud Lukas Forer Sebastian Schönherr - Medical University of Innsbruck

Cloudgene 2015 From a general workflow system to a Software-as- A-Service platform Dedicated service for a given workflow Already 2 services up and running Supports Hadoop YARN Stack MRv2, Apache Spark Combine Hadoop + Pig + Command Line Programs + R (RMarkdown) programs into one workflow Automatic file staging

BOSC 2012 - Cloudgene + CloudMan Similar ideas, different context Galaxy Workflowsystem Per job parallelization using SGE Cluster in the cloud Cloudgene Workflowsystem Per task parallelization using Hadoop

BOSC 2012 - Cloudgene + CloudMan

Project started in 2014 Platform for Big Data Bioinformatics Analysis Combine the projects CloudMan for Hadoop cluster provisioning Cloudgene for Hadoop execution Find a suitable use case

MapReduce in Bioinformatics https://www.biostars.org/p/115260/ S. Schoenherr VO NoSQL 14

A Real World Use case Michigan Imputation Server Cloudgene as the underlying framework Our workflow includes QC + Phasing + Imputation Cooperation with Center of Statistical Genetics, University of Michigan https://imputationserver.sph.umich.edu Gonçalo Abecasis Michael Boehnke Christian Fuchsberger

Overall Workflow Reference Panels: 1000 Genomes / Hapmap / HRC

Benefits Why CloudMan? Provide our services on private & public clouds Data sensitivity Provide best practices pipeline to everyone Reach a wide user community (Nectar, Jetstream)

Benefits Why Cloudgene? Well-tested platform for running (Hadoop) services Provides user management, admin dashboards,... Focus on the service implementation itself, not on the infrastructure Service 1: Michigan Imputation Server Service 2: mtdna-server Detecting heteroplasmies and contamination in mtdna NGS data http://mtdna-server.uibk.ac.at Service 3:? (Maybe after this meeting)

Software Stack Bioinformatics Workflows Cloudgene MapReduce Platform

Software Stack Bioinformatics Imputation Workflows Server Cloudgene MapReduce Platform CloudMan Infrastructure Manager

Current Project Status Hadoop + Cloudgene running on CloudMan Fully distributed mode Run a WordCount YARN example with Cloudgene Current work Install services as apps (Cloudgene), scaling of cluster (CloudMan) Updates / Screenshots https://wiki.galaxyproject.org/cloudman/services

Codefest 2015 Build a Docker Image for Hadoop + Cloudgene We integrated mtdna-server docker pull seppinho/cdh5-pseudo-mtdnaserver Hadoop Galaxy Adapter (CRS4) Perfect fit Export our workflow and integrate it into Galaxy (tbd)

Acknowledgement CloudMan Enis Afgan and Davor Davidovic wiki.galaxyproject.org/cloudman Cloudgene Lukas Forer and Sebastian Schönherr cloudgene.uibk.ac.at Michigan Imputation Server Gonçalo Abecasis; Michael Boehnke; Christian Fuchsberger imputationserver.sph.umich.edu

Thanks to BOSC!