GC3 Use cases for the Cloud




GC3: Grid Computing Competence Center. GC3 Use cases for the Cloud: some real-world examples suited for cloud systems. Antonio Messina <antonio.messina@uzh.ch>, Trieste, 24.10.2013

Who am I? I am a System Architect at the GC3, the Grid Computing Competence Center of the University of Zurich. Our mission is to foster research, education, infrastructure, and usage of distributed computing at the University of Zurich. We maintain computational infrastructures: HPC cluster(s), a cloud (called Hobbes), and an ARC grid site. We develop software to enable users to run on these infrastructures, and we organize schools and training events on Python, cloud systems, (Grid) and more...

Take-home message: two real-world examples demonstrate that scientists have many kinds of computational needs. Some of these run on cloud infrastructures, and their users are happier than before...

Disclaimer

Akos Dobay, Institute of Evolutionary Biology and Environmental Studies, University of Zurich. Research areas: epidemiology and infectious diseases, evolutionary biology, gene regulation and epigenetics, computational and theoretical biology, computer-based modelling of biological systems.

Pseudomonas aeruginosa. Like many other bacteria, they produce molecules called public goods. These public goods diffuse through the environment and bind to environmental resources (e.g. iron particles); the bacteria then consume the bound public goods.

Cheaters. Some bacteria are cheaters: they only consume public goods, like people who do not pay taxes. If there are too many cheaters, the population faces extinction; otherwise, the population grows until it runs out of physical space. The goal is to study the evolution of this kind of system, ideally without having a lab full of in vitro cultures.

Evolution of a bacterial population. This movie shows a simulation of a population of bacteria: they produce public goods, consume them, reproduce, and die.

What decides the evolution of the population? Many factors have to be studied: the environment, the initial disposition of the cells, the production rate of public goods, the durability of the public goods, the death rate of the cells, and the reproduction rate.

biointeract computes random starting conditions and runs 2000 simulations for one set of parameters. It accepts 4 different parameters from the command line and a configurable number of cells. The output of multiple simulations is later used for statistical analysis. For each parameter we investigate 10 different values, which means 1000 different runs.

biointeract (2). Computational needs: each simulation takes at least 1 minute (150 cells); each run takes from 12 hours to 3 days (depending on the number of cells); no input files; small output file; low memory consumption; single-core. gbiointeract computes all the combinations of parameters and runs them on one or more clouds and on batch systems, as sketched below.
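As an illustration only (the real driver, gbiointeract, is built differently, and the biointeract command-line flags below are hypothetical), a minimal Python sketch of such a parameter sweep over 10 values per swept parameter could look like this:

```python
import itertools
import subprocess

# Hypothetical parameter grid: 10 values for each of three swept
# parameters, i.e. 10 x 10 x 10 = 1000 combinations, matching the
# "1000 different runs" figure. Real parameter names may differ.
death_rates = [0.01 * i for i in range(1, 11)]
production_rates = [0.1 * i for i in range(1, 11)]
durabilities = [10 * i for i in range(1, 11)]

for death, prod, durability in itertools.product(
        death_rates, production_rates, durabilities):
    # Each combination is one independent, single-core run; in
    # production these would be submitted to a cloud or batch
    # system instead of being executed sequentially here.
    subprocess.run([
        "biointeract",
        "--death-rate", str(death),
        "--production-rate", str(prod),
        "--durability", str(durability),
    ], check=True)
```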

Why not HPC clusters? (g)biointeract can also run on HPC clusters. However, it would be a waste of precious resources: no need for fast parallel storage, no need for a fast, low-latency (and expensive) network, no need for big storage, no need for an SMP machine.

Parameter sweep and related use cases: periodically re-running an application or a workflow when new data is generated, and coping with peak cycles; processing large data sets, which requires large-scale tightly coupled systems, large-memory systems, or large, fast scratch storage; modifying input parameters applied to a computational model, which requires some combination of large-scale tightly coupled systems, large-memory systems, and large, fast scratch storage; assessing a range of parameter combinations in a prototypical embarrassingly parallel use case, e.g. high energy physics.

Other parameter sweep use cases (1): Vanessa, "Alpine cryosphere-related slope movements: measurements and analysis". Data come from GPS stations planted in the Alps, which send information on their position and inclination; this is used to understand movements of the slope. A small set of R scripts calibrates the geographical model: 17,598 runs, 48,747 CPU-hours.

Other parameter sweep use cases (2): Joel, "Mountain cryosphere subgrid parameterisation and computation". GEOtop is a distributed hydrological model that works on a DEM (digital elevation model): the bigger and more detailed the area, the bigger the DEM and the longer the simulation takes. A preprocessing system intelligently reduces the points where GEOtop has to run, yielding four orders of magnitude less computation. Calibrating the model took 8,588 runs and 114,000 CPU-hours.

Dr. Michal Okoniewski and Marek Wiewiorka, Functional Genomics Center Zurich. Data analyses in genomic and transcriptomic projects; support of experimental design, statistical analysis, data integration, and annotation processing.

Did you know that horses can suffer from asthma? What makes this into this [images shown on the slide]? RNA and DNA sequencing is needed to be able to answer this kind of question.

Sequencing prices are dropping. Chart: cost per genome (link).

RNA sequencing. On an Illumina HiSeq 2000, a blood sample from horses is sequenced to a raw file, then converted to a BAM file (about 1 GB in size); a BAM file can contain a full genome. FGCZ already has 300 samples. The goal is to find genes that switch on or off in the case of horse asthma.

RNA sequencing. BAM files contain information on RNA presence and quantity from a genome at a given moment in time; RNA continually changes, as opposed to the static genome. RNA sequencing is useful, for instance, to observe cellular pathway alterations during infection and gene expression level changes in cancer studies. With multiple samples it is possible to do statistical analysis: aggregating the reads mapped to a specific genome region into genes or exons, looking for patterns, and finding specific shapes of genome coverage and statistically significant differences between them. A minimal sketch of this per-region counting step follows.
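As an illustration only (this is not the FGCZ framework; the file names and coordinates below are made up), counting the reads that map to a set of exon regions across several indexed BAM files could be done with pysam:

```python
import pysam

# Hypothetical exon regions of interest: (contig, start, end).
regions = [
    ("chr1", 100_000, 101_500),
    ("chr1", 205_000, 206_200),
]

# Hypothetical per-sample BAM files (indexed, i.e. a .bai is present).
samples = ["horse_healthy.bam", "horse_asthma.bam"]

counts = {}
for path in samples:
    with pysam.AlignmentFile(path, "rb") as bam:
        # Aggregate reads per region; these per-gene/per-exon counts
        # are the input to the downstream statistical comparison.
        counts[path] = [bam.count(c, start, end)
                        for c, start, end in regions]

print(counts)
```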

Computational infrastructure. Hadoop is "a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models".

Hadoop. HDFS spreads data over multiple (up to thousands of) servers, and each machine offers local computation and storage. Jobs are sent to the machines that hold the data to be computed on (data locality). No hardware high availability is needed: failures are detected and handled at the application level. The cluster can easily be grown or shrunk. HDFS works very well for mostly immutable files.
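To make the "simple programming models" concrete, here is a generic Hadoop Streaming sketch in Python (not the Scala framework described next; the input format, one gene ID per mapped read, is an assumption for illustration):

```python
#!/usr/bin/env python
# Generic Hadoop Streaming sketch: count reads per gene.
# Run roughly as:
#   hadoop jar hadoop-streaming.jar -input reads/ -output counts/ \
#     -mapper "count_reads.py map" -reducer "count_reads.py reduce"
# Assumes each input line is a read annotated with the gene it maps
# to, e.g. "GENE_ID<tab>read_data" (a made-up format).
import sys

def mapper():
    # Emit one (gene, 1) pair per mapped read.
    for line in sys.stdin:
        gene = line.split("\t", 1)[0]
        print(f"{gene}\t1")

def reducer():
    # Input arrives sorted by key, so runs of equal genes can be summed.
    current, total = None, 0
    for line in sys.stdin:
        gene, count = line.rstrip("\n").split("\t")
        if gene != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = gene, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```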

Implementation. Queries are written in Scala (in the future, R will also be supported). The framework generates multiple jobs from a query; the jobs are sent to the nodes of the Hadoop cluster, and each job works on some of the BAM files stored on that specific node. The results are then collected and aggregated. Deployment of the Hadoop cluster on the cloud is done using elasticluster.

Why not HPC clusters? In the future, near real-time computations will be performed, and these do not play well with batch systems. Thanks to HDFS there is no need for parallel storage (data is read locally) and no need for a fast network interconnect. The Hadoop-based framework allows easy and fast implementation of statistical queries over the dataset.

Thank you and many thanks to Akos, Michal and Marek! Questions?