1 GC3: Grid Computing Competence Center GC3 Use cases for the Cloud Some real world examples suited for cloud systems Antonio Messina Trieste,
2 Who am I System Architect at the GC3 - Grid Computing Competence Center of the University of Zurich our mission is to foster research, education, infrastructure, and usage of distributed computing at the University of Zurich. we maintain computational infrastructures: HPC cluster(s) A Cloud (called Hobbes) An ARC grid site we develop software to enable users to run on these infrastructures we organize schools and training events for Python, Cloud systems, (Grid) and more...
3 Take-home message Two real-world examples to demonstrate that Scientists have many kind of computational needs Some of them run on cloud infrastructures and they are happier than before...
4 Take-home message Two real-world examples to demonstrate that Scientists have many kind of computational needs Some of them run on cloud infrastructures and they are happier than before...
5 Take-home message Two real-world examples to demonstrate that Scientists have many kind of computational needs Some of them run on cloud infrastructures and they are happier than before...
6 Take-home message Two real-world examples to demonstrate that Scientists have many kind of computational needs Some of them run on cloud infrastructures and they are happier than before...
9 Akos Dobay Institute of Evolutionary Biology and Environmental Studies University of Zurich Epidemiology and infectious diseases Evolutionary biology Gene regulation and epigenetics Computational and theoretical biology Computer-based modelling of biological systems
10 Pseudomonas Aeruginosa Like many other bacteria, they produce molecules (public goods) public goods goes around and bind to environmental resources (e.g. iron particles) the bacteria then consume these public goods
11 Cheaters Some bacteria are cheaters: they only consume public goods.... like people who do not pay taxes. If there are too many cheaters, the population will face extinction. Otherwise, the population will grow until there is physical space. The goal is to study the evolution of this kind of systems, possibly without having a lab full of in vitro cultures.
12 Evolution of a bacteria population This movie will show a simulation of a population of bacteria. They produce public goods, consume them, reproduce and die.
13 What decides the evolution of the population? Many factors have to be studied: environment initial disposition of the cells production rate of public goods durability of the public goods death rate of the cells reproduction rate
14 biointeract compute random starting conditions run 2000 simulations for one set of parameters accepts 4 different parameters from command line configurable number of cells output of multiple simulations are later used for statistical analysis. For each parameter, we investigate 10 different values. which means 1000 different runs
15 biointeract (2) Computational needs: each simulation takes at least 1 minute (150 cells) each run takes from 12hours to 3 days (depending on the number of cells) no input files small output file low memory consumption single-core gbiointeract compute all the combinations of parameters run on one or more cloud and on batch systems
16 Why not HPC clusters (g)biointeract can run also on HPC clusters. However, it would be a waste of precious resources: no need for a fast parallel storage no need for a fast, low-latency (and expensive) network no need for a big storage no need for an SMP machine
17 parameter sweep Periodically re-run an application/a workflow when new data is generated and coping with peak cycles. Process large data sets, requiring large-scale tightly coupled systems, large memory systems of large, fast scratch storage. Modify input parameters applied to a computational model, which requires any combination of large-scale tightly coupled system, large memory system, large fast scratch storage. Assess a range of parameter combinations based on a prototypical embarrassingly parallel use case, e.g. high energy physics.
18 Other parameter sweep use cases (1) Vanessa, Alpine cryosphere-related slope movements: measurements and analysis. Data coming from GPS stations planted on the Alps. they send information on their position and inclination used to understand movements of the slope Small set of R scripts to calibrate the geographical model. 17,598 runs 48,747 cpu-hours
19 Other parameter seep use cases (2) Joel, Mountain cryosphere subgrid parameterisation and computation. GEOtop is a distributed hydrological model. GEOtop uses a DEM. The bigger and more detailed the area, the bigger is the DEM, the longer takes the simulation. Preprocessing system to intelligently reduce the points where GEOtop has to run 4 order of magnitude less computation Need to calibrate the model: 8,588 runs 114,000 cpu-hours
20 Dr. Michal Okoniewski Marek Wiewiorka Functional Genomics Center Zurich Data analyses in genomic and transcriptomic projects. Support of experimental design, statistical analysis, data integration and annotation processing.
22 Did you know that horses can suffer from asthma? What makes this RNA and DNA sequencing is needed to be able to answer this kind of questions.
23 Did you know that horses can suffer from asthma? What makes this into this? RNA and DNA sequencing is needed to be able to answer this kind of questions.
24 Sequencing machines prices is dropping Cost per genome. link
25 RNA sequencing Illumina HiSeq 2000 blood sample from horses sequenced to a raw file then converted to a BAM file (1GB big) a BAM file can contain a full genome FGCZ has already 300 samples. Goal is: finding genes that switch on or off in case of horse asthma.
26 RNA sequencing BAM files contains information on RNA presence and quantity from a genome at a given moment in time. RNA continually changes, as opposed to static genome. RNA sequencing is useful for instance to observe cellular pathway alterations during infection and gene expression level changes in cancer studies. Having multiple samples, it s possible to do statistical analysis, aggregating reads mapped to a specific genome region in genes or in exons, thus looking for patterns, and find specific shapes of coverage of the genome, and statistically significant differences between them.
27 Computational infrastructure Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
28 Hadoop HDFS spreads data over multiple (thousands) servers each machine offer local computation and storage jobs are sent to the machines which holds the data to compute (data locality) no need for hardware high-availability, failures are detected at application level the cluster can be easily grown or shrunk HDFS works very well for mostly immutable files.
29 Implementation Queries are written in Scala (in the future also R will be supported) the framework generates multiple jobs from the query jobs are sent on the nodes of the Hadoop cluster each job works on some of the BAM files stored on that specific node results are then collected and aggregated Deployment of the Hadoop cluster on the cloud is done using elasticluster
30 Why not HPC clusters? In the future, near real-time computations will be performed, which do not play well with batch systems thanks to HDFS, no need for parallel storage: data is read locally. no need for fast network interconnect a framework which uses Hadoop is used, that allows for easy and fast implementation of statistical queries over the dataset
31 Thank you and many thanks to Akos, Michal and Marek! Questions?
The Massachusetts Open Cloud (MOC) October 11, 2012 Abstract The Massachusetts open cloud is a new non-profit open public cloud that will be hosted (primarily) at the MGHPCC data center. Its mission is
Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com W H I T E P A P E R B i g D a t a : W h a t I t I s a n d W h y Y o u S h o u l d C a r e Sponsored
Final Report DFN-Project GRIDWELTEN: User Requirements and Environments for GRID-Computing 5/30/2003 Peggy Lindner 1, Thomas Beisel 1, Michael M. Resch 1, Toshiyuki Imamura 2, Roger Menday 3, Philipp Wieder
Emergence and Taxonomy of Big Data as a Service Benoy Bhagattjee Working Paper CISL# 2014-06 May 2014 Composite Information Systems Laboratory (CISL) Sloan School of Management, Room E62-422 Massachusetts
Cloud Computing and Grid Computing 360-Degree Compared 1,2,3 Ian Foster, 4 Yong Zhao, 1 Ioan Raicu, 5 Shiyong Lu firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com 1 Department
Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy firstname.lastname@example.org Data Availability or Data Deluge? Some decades
TRANSACTIONS ON CLOUD COMPUTING, VOL. AA, NO. B, JUNE 2014 1 Deploying Large-Scale Data Sets on-demand in the Cloud: Treats and Tricks on Data Distribution. Luis M. Vaquero 1, Member, IEEE Antonio Celorio
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
B.2 Executive Summary As demonstrated in Section A, Compute Canada (CC) supports a vibrant community of researchers spanning all disciplines and regions in Canada. Providing access to world- class infrastructure
Solving Big Data Challenges for Enterprise Application Performance Management Tilmann Rabl Middleware Systems Research Group University of Toronto, Canada email@example.com Sergio Gómez Villamor
white paper Boosting Retail Revenue and Efficiency with Big Data Analytics A Simplified, Automated Approach to Big Data Applications: StackIQ Enterprise Data Management and Monitoring Abstract Contents
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics Herodotos Herodotou Duke University firstname.lastname@example.org Fei Dong Duke University email@example.com Shivnath Babu Duke
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania firstname.lastname@example.org, email@example.com
An Oracle White Paper June 2013 Oracle: Big Data for the Enterprise Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure
Background A White Paper Optimizing your Call Center through Simulation By Bill Hall, Call Center Services and Dr. Jon Anton, Purdue University The challenge for today's call centers is providing value-added
Estimating the Cost of a GIS in the Amazon Cloud An Esri White Paper August 2012 Copyright 2012 Esri All rights reserved. Printed in the United States of America. The information contained in this document
Peter Reimann b Michael Reiter a Holger Schwarz b Dimka Karastoyanova a Frank Leymann a SIMPL A Framework for Accessing External Data in Simulation Workflows Stuttgart, March 20 a Institute of Architecture
Introduction to InfiniBand for End Users Industry-Standard Value and Performance for High Performance Computing and the Enterprise Paul Grun InfiniBand Trade Association INTRO TO INFINIBAND FOR END USERS
Big Data Computing and Clouds: Trends and Future Directions Marcos D. Assunção a,, Rodrigo N. Calheiros b, Silvia Bianchi c, Marco A. S. Netto c, Rajkumar Buyya b, arxiv:1312.4722v2 [cs.dc] 22 Aug 2014
The HAS Architecture: A Highly Available and Scalable Cluster Architecture for Web Servers Ibrahim Haddad A Thesis in the Department of Computer Science and Software Engineering Presented in Partial Fulfillment
Big Workflow: More than Just Intelligent Workload Management for Big Data Michael Feldman White Paper February 2014 EXECUTIVE SUMMARY Big data applications represent a fast-growing category of high-value
David Chappell December 2011 WHAT IS AN APPLICATION PLATFORM? Sponsored by Microsoft Corporation Copyright 2011 Chappell & Associates Just about every application today relies on other software: operating
Analysis of dynamic sensor networks: power law then what? (Invited Paper) Éric Fleury, Jean-Loup Guillaume CITI / ARES INRIA INSA de Lyon F-9 Villeurbanne FRANCE Céline Robardet LIRIS / CNRS UMR INSA de
UNIVERSITY OF OSLO Department of Informatics Performance Measurement of Web Services Linux Virtual Server Muhammad Ashfaq Oslo University College May 19, 2009 Performance Measurement of Web Services Linux
MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...