Cloud-Based Big Data Analytics in Bioinformatics



Similar documents
Cloud-Based Big Data Analytics in Bioinformatics: A Review

Hadoopizer : a cloud environment for bioinformatics data analysis

Managing and Conducting Biomedical Research on the Cloud Prasad Patil

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Analyze Human Genome Using Big Data

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

New solutions for Big Data Analysis and Visualization

High Throughput Sequencing Data Analysis using Cloud Computing

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo

Cloud Computing for Big Data

Development of Bio-Cloud Service for Genomic Analysis Based on Virtual

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

Research Article Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

Cloud-based Analytics and Map Reduce

Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis

What is Cloud Computing? Tackling the Challenges of Big Data. Tackling The Challenges of Big Data. Matei Zaharia. Matei Zaharia. Big Data Collection

Large-scale DNA Sequence Analysis in the Cloud: A Stream-based Approach

HADOOP IN THE LIFE SCIENCES:

Analysis of ChIP-seq data in Galaxy

Support for data-intensive computing with CloudMan

Improving Current Hadoop MapReduce Workflow and Performance

Big Data Challenges in Bioinformatics

Early Cloud Experiences with the Kepler Scientific Workflow System

Cloud Courses Description

The Cloud at Your Service

Globus Genomics Tutorial GlobusWorld 2014

Delivering the power of the world s most successful genomics platform

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

Course Design Document: IS429: Cloud Computing and SaaS Solutions. Version 1.0

Cheminformatics in the Cloud. Michael A. Dippolito DeltaSoft, Inc. 3-June-2009 ChemAxon European User Group Meeting

Scientific and Technical Applications as a Service in the Cloud

How To Understand Cloud Computing

A Primer of Genome Science THIRD

Cloud Computing for Education Workshop

NCTA Cloud Architecture

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

HIGH-SPEED BRIDGE TO CLOUD STORAGE

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Twister4Azure: Data Analytics in the Cloud

SAP HANA Enabling Genome Analysis

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Teaching Computational Thinking using Cloud Computing: By A/P Tan Tin Wee

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Cloud Computing Summary and Preparation for Examination

Cloud Computing. Adam Barker

Running Agilent GeneSpring MPP on the Cloud

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Cloud Computing. Chapter 1 Introducing Cloud Computing

Sriram Krishnan, Ph.D.

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Comparing Cloud Computing Resources for Model Calibration with PEST

Cloud Computing Services and its Application

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Ch. 4 - Topics of Discussion

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons

Bursting to a Hybrid Cloud for Services OFC 2015

What s New in Pathway Studio Web 11.1

White Paper on CLOUD COMPUTING

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Research on Storage Techniques in Cloud Computing

Deep Sequencing Data Analysis

Cloud Computing. Alex Crawford Ben Johnstone

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

CLOUD COMPUTING. When It's smarter to rent than to buy

Cloud Computing. Chapter 1 Introducing Cloud Computing

Bioinformatics Grid - Enabled Tools For Biologists.

School of Nursing. Presented by Yvette Conley, PhD

Cloud Computing Now and the Future Development of the IaaS

<Insert Picture Here> Enterprise Cloud Computing: What, Why and How

-> Integration of MAPHiTS in Galaxy

GeneProf and the new GeneProf Web Services

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

Personalized Medicine and IT

A Priority Shift Mechanism to Reduce Wait time in Cloud Environment

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

INCREASING THE CLOUD PERFORMANCE WITH LOCAL AUTHENTICATION

Transcription:

Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1

Introduction 2

Big Data Analytics Big Data are a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The process of research into massive amounts of such data to reveal hidden patterns and secret correlations is thus called big data analytics. High-throughput technologies in bioinformatics 3

High-throughput Technologies Next Generation Sequencing (NGS) Virtual Screening Genotyping SNP (Single Nucleotide Polymorphism) discovery Gene expression Proteomics ----------- Sequence mapping Sequence analysis Peak caller for ChIP-seq data Identification of epistatic interactions of SNPs Drug discovery Data storage Data complexity Memory management Computational power 4

Example: Next Generation Sequencing (NGS).ACGTACCT 5 http://www.ncbi.nlm.nih.gov/genbank/statistics. Accessed on Oct 21, 2014

What Next? 6

Cloud Computing computing in which dynamically scalable and virtualized resources are provided as a service over the internet. Unified, location- independent platform for data & computation Pay-per-use & available even to small labs Affordable costs Linear scaling with parallel execution Time effective; large no. of nodes Up-to-date 7 [Bateman, A. & Wood, M. Bioinformatics 25, 1475 (2009).]

Cloud-based services in bioinformatics are grouped into Data as a Service (DaaS), 8 Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). [Dai et al. Biology Direct 2012, 7:43. http://www.biology-direct.com/content/7/1/43]

Langmead, B., Schatz, M. C., Lin, J., Pop, M., & Salzberg, S. L. (2009). Searching for SNPs with cloud computing. Genome Biol, 10(11), R134. 9

Langmead, B et. al., (2009). Genome Biology, vol. 10, no. 11 21/11/2014 2.7B reads genotyped in < 3hrs on EC2 using a 320 CPU cluster for a cost of $85. 10

CloudBurst Running time on EC2 High-Medium Instance Cluster Comparison of CloudBurst running time while scaling the size of the cluster for mapping 7M reads to human chromosome 22 with at most four mismatches on the EC2 Cluster. The 96- core cluster is 3.5X faster than the 24- core cluster. 11 [Schatz. M. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, Vol.25 no. 11 2009, pp. 1363-1369. ]

Guo, X et. Al., (2014). BMC Bioinformatics 2014, 15:102 21/11/2014 Genome-Wide Epistatic Interactions Computing nodes are sampled from 1-40 with 5 as interval. The red, blue, cyan and grey curve show functions of speed-up of M=10 000, M= 50 000, and M= 100 000, with size fixed to 800. # Windows Azure platform used. 12

Surflex-Dock on Amazon EC2 One C1.xlarge EC2= modern computer with 4 cores (-90hrs) An EC2 cluster of 40 c1.xlarge instances will thus perform in 2 hrs. Virtual screening (HTS) experiment for 375k ligand-receptor dockings on 160 cores was completed in < 9hrs for a total of less than $20, to find a lead in drug discovery. 13 www.cetara.com/cloud_computing_enables_cost_effective_virtual_screening_using_the_amazon_cloud

PeakRanger for ChIP-seq Data Performance of PeakRanger in cloud parallel computing. A) test with fixed number of nodes and data sets of increasing size; B)test with increasing numbers of nodes and data sets with fixed sizes. 14 [Feng, X et. Al., PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics 2011, 12:139.]

Challenges of Cloud computing Security:- trustability issues. How confidential is the data? Data is stored in a redundant manner. Limited control over remote storage. Data transfer: large data sets need to be moved to the cloud. slow upload speed. Some vendors have offered to ship data on hard drives. Scalability:- need for parallelization, and depends on the algorithm itself if MapReduce suits the use case & desired effects. Usability:- bioinformatics involves a pipeline of scripts and a command line interface. Setting up nodes in the cloud involves a lot of command line and some deeper understanding. Need for technical expertise for new apps. 15

Data Management Besides software tools that benefit the computational power and scalability of a cloud systems for data management were developed; Amazon hosts a lot of public datasets on their S3 storage system which are accessible for every AWS customer. This removes time intensive uploading processes. Same datasets can be reused by other researchers. Popular datasets include the annotated human genome, GenBank, HapMap or UniGene. 16

Conclusion & Future Work Cloud computing is an alternative solution to big data challenges in bioinformatics as shown by the use cases explored. High speed transfer technologies can be used in future to aid big data transfer Despite challenges still pending with data security, future efforts must be devoted to open and publicly accessible bioinformatics clouds to the whole scientific community to accelerate diagnosis, prognosis, drug discovery and personalized medicine. Cloud computing of big data in bioinformatics can be done on several platforms which provide a lightweight programming environment for development of pipelines. Bioinformatics clouds should integrate both data and software tools 17

Thank You 18