Resource management and runtime environments in the exascale computing era

Guillaume Huard
MOAIS and MESCAL INRIA Projects, CNRS LIG Laboratory, Grenoble University, France

Introduction
Large-scale platforms have become a reality: using grids, parallel applications can run on thousands of processor cores. There are two main models for grids: structured grids (lightweight grids) and unstructured grids (P2P overlay grids).
This talk focuses on high-performance computation on structured grids and anticipates their evolution as they grow toward exaflop-range computing power.

Structured approach
Computing centers interconnect their clusters to form lightweight grids. These grids have a hierarchical structure: each cluster is made of homogeneous resources, so network and CPU disparities appear only between distinct clusters. Reliability is reasonable: unavailability is usually limited to a few machines, and the backbone and services are reliable. The French academic grid, Grid5000, is built on this model.

Challenges in HPC on structured grids
Scalability: required for both algorithms and runtime systems.
Adaptivity: computation and data must be balanced and placed to match the capabilities of the computing resources and the capacity of the communication links.
Efficiency: computation on a grid is expensive (energy consumption, cost), so efficient platform usage is mandatory.
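
As a minimal illustration of the adaptivity requirement, the sketch below splits a fixed number of independent tasks among nodes in proportion to an assumed relative speed per node, using largest-remainder rounding so that the shares always add up to the total. The function name `split_by_capacity` and the speed values are hypothetical, and the sketch ignores communication link capacities, which real placement would also have to consider.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: distribute `tasks` independent tasks among nodes in
// proportion to their relative speeds. Largest-remainder rounding guarantees
// that the returned shares sum exactly to `tasks`.
std::vector<long> split_by_capacity(long tasks, const std::vector<double>& speeds) {
    assert(!speeds.empty() && tasks >= 0);
    const double total = std::accumulate(speeds.begin(), speeds.end(), 0.0);
    assert(total > 0.0);

    std::vector<long> share(speeds.size());
    std::vector<double> remainder(speeds.size());
    long assigned = 0;
    for (std::size_t i = 0; i < speeds.size(); ++i) {
        const double exact = tasks * speeds[i] / total;  // ideal fractional share
        share[i] = static_cast<long>(exact);             // round down
        remainder[i] = exact - share[i];
        assigned += share[i];
    }

    // Give the leftover tasks to the nodes with the largest fractional parts.
    std::vector<std::size_t> order(speeds.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return remainder[a] > remainder[b];
    });
    for (long k = 0; k < tasks - assigned; ++k)
        ++share[order[static_cast<std::size_t>(k)]];
    return share;
}
```

For example, `split_by_capacity(10, {1.0, 1.0, 2.0})` yields shares 3, 2 and 5: the node assumed twice as fast receives half of the work, and the single leftover task goes to the first node with the largest fractional part.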

Outline
Computing on lightweight grids
Application safety and efficiency
Middleware interactions and data management
Green computing and platform administration

OAR: managing resources
OAR is the batch scheduler used on the Grid5000 clusters. It supports classical batch and interactive submission of parallel jobs, an elaborate resource query scheme (precise reservation of nodes, processors or cores, switch location, available memory, ...), and job dependencies, which enable computation workflows.
OAR also provides low-level node management: effective node cleaning using cpusets, and an interface with Kadeploy for environment deployment.

OAR scheduling snapshot
OAR supports backfilling and fair-sharing policies.
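
The slide only names backfilling. To show what a backfilling pass does, the code below implements a simplified EASY-style variant on a single pool of identical processors: the first waiting job that does not fit gets a reservation at the earliest time enough processors will be free (the "shadow" time), and later jobs may start immediately only if they cannot delay that reservation. This is a textbook variant chosen for illustration, not OAR's own scheduler; the `Running`/`Waiting` types and the assumption that requested walltimes are exact are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Running { int procs; long end; };              // job currently executing
struct Waiting { int id; int procs; long walltime; };  // queued job (hypothetical fields)

// Simplified EASY backfilling on `total` identical processors at time `now`.
// Returns the ids of the queued jobs that may start now, in start order.
std::vector<int> easy_backfill(int total, long now, std::vector<Running> running,
                               const std::vector<Waiting>& queue) {
    int free_procs = total;
    for (const Running& r : running) free_procs -= r.procs;

    std::vector<int> started;
    std::size_t head = 0;
    // 1. Start jobs in queue order as long as they fit.
    while (head < queue.size() && queue[head].procs <= free_procs) {
        free_procs -= queue[head].procs;
        running.push_back({queue[head].procs, now + queue[head].walltime});
        started.push_back(queue[head].id);
        ++head;
    }
    if (head == queue.size()) return started;

    // 2. Reserve processors for the blocked head job: the shadow time is the
    //    earliest completion after which enough processors are available.
    const Waiting& blocked = queue[head];
    assert(blocked.procs <= total);
    std::sort(running.begin(), running.end(),
              [](const Running& a, const Running& b) { return a.end < b.end; });
    int available = free_procs;
    long shadow = now;
    for (const Running& r : running) {
        if (available >= blocked.procs) break;
        available += r.procs;
        shadow = r.end;
    }
    int extra = available - blocked.procs;  // processors the reservation leaves unused

    // 3. Backfill later jobs that do not delay the reservation.
    for (std::size_t i = head + 1; i < queue.size(); ++i) {
        const Waiting& job = queue[i];
        if (job.procs > free_procs) continue;
        const bool ends_before_shadow = now + job.walltime <= shadow;
        if (ends_before_shadow || job.procs <= extra) {
            free_procs -= job.procs;
            if (!ends_before_shadow) extra -= job.procs;
            started.push_back(job.id);
        }
    }
    return started;
}
```

With 10 processors, 6 of them busy until t = 10, and an 8-processor job blocked at the head of the queue, a short 2-processor job that finishes by t = 10 is backfilled, and so is a long 2-processor job, because the reservation leaves 2 processors unused at t = 10; a long 4-processor job is not.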

Enabling the grid with OAR
Efficient platform use: best-effort jobs for opportunistic computation, and dynamic nodes for appropriate management of volatile resources.
Abstraction for large sets of tasks: array jobs, and the CiGri system for life-cycle management of bags of tasks.
Setup of large parallel applications: advance reservations to coordinate clusters, and checkpoint/resubmit to test global gang scheduling or fault tolerance.

TakTuk: adaptive deployment of parallel executions
Node administration: launch the same command on all nodes of a platform, such as uptime to gather statistics about recent machine availability, or dig, ping, ifconfig, ... to diagnose network issues.
Parallel application development: launch the same parallel program on all nodes (like mpirun), for example the slaves of a master/slave application, all participants of a symmetric parallel application, a self-organizing (P2P) system, or daemons (monitoring), while redirecting I/O to and from the initiating node.

Existing tools
Flat deployment tools: pdsh/dsh (IBM Cluster Tools suite). They behave like: foreach host in hosts do fork ssh $host command. The connections are naturally pipelined by the OS, but deployment takes linear time.
Distributed deployment: gexec (Ganglia Cluster Suite). Remote gexec daemons take part in the deployment, forming a deployment tree that completes in logarithmic time. However, gexec requires daemon installation and does not adapt to heterogeneity or node failures.
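
To make the linear versus logarithmic contrast concrete, the sketch below counts deployment steps under an idealized model that is an assumption of this example only: every remote connection takes one time step, a flat tool's initiator contacts one host per step, and in a tree deployment every node already reached (the initiator included) contacts one new host per step. Real tools such as pdsh overlap connections within a window, so this only shows the asymptotic trend; both function names are hypothetical.

```cpp
#include <cassert>

// Idealized model (assumption): one connection takes one time step.

// Flat tool: the initiator contacts the n hosts one after the other.
long flat_steps(long n) {
    assert(n >= 0);
    return n;
}

// Tree deployment: at each step every node already reached, the initiator
// included, contacts one new host, so the reached set doubles per step.
long tree_steps(long n) {
    assert(n >= 0);
    long reached = 1;  // the initiator
    long steps = 0;
    while (reached - 1 < n) {  // reached - 1 = remote hosts deployed so far
        reached *= 2;
        ++steps;
    }
    return steps;
}
```

Under this model, deploying 1000 hosts takes 1000 steps with the flat scheme and 10 steps with the tree, since 2^10 - 1 = 1023 is the first value reaching 1000.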

Optimal deployment
On homogeneous machines, the theoretically optimal deployment mixes concurrent connection processes, parallel connection initiation, and distribution of remote connection tasks.
(Figure: timeline of the connection steps performed by node 1, node 2 and node 3.)

Dynamic deployment
Node and network performance vary, because of heterogeneous architectures in different clusters, load due to the OS or hung processes (zombies, infinite loops), external contention (network, centralized services), cache effects, swapping, other users, ...
The TakTuk algorithm tries to do things as soon as possible: the deployment engine is distributed (using remote executions), nodes initiate several parallel connections, and idle nodes obtain the remaining deployment tasks by work stealing.
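
The work-stealing part of this algorithm can be illustrated with a toy round-based simulation. The assumptions belong to this sketch, not to TakTuk's implementation: every connection takes one round, a node contacts at most one host per round, a newly deployed host joins as an idle node, and at the end of each round every idle node steals half of the pending hosts of the most loaded node. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simulation: returns the number of rounds needed to deploy
// `hosts` remote hosts from a single initiator that initially holds the
// whole host list.
int simulate_ws_deployment(int hosts) {
    assert(hosts >= 0);
    std::vector<int> pending(1, hosts);  // pending[i]: hosts node i still has to contact
    int deployed = 0, rounds = 0;
    while (deployed < hosts) {
        ++rounds;
        // Every node with pending work contacts one host, which joins idle.
        const std::size_t nodes = pending.size();
        for (std::size_t i = 0; i < nodes; ++i) {
            if (pending[i] == 0) continue;
            --pending[i];
            pending.push_back(0);
            ++deployed;
        }
        // Work stealing: each idle node takes half of the largest pending list.
        for (std::size_t t = 0; t < pending.size(); ++t) {
            if (pending[t] != 0) continue;
            auto victim = std::max_element(pending.begin(), pending.end());
            const int half = *victim / 2;
            *victim -= half;
            pending[t] = half;
        }
    }
    return rounds;
}
```

With 7 hosts, the simulation finishes in 3 rounds (1, then 2, then 4 new hosts per round), the same count as the ideal doubling tree; stealing is what spreads the host list so that every deployed node quickly has something to do.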

TakTuk deployment compared to other tools
(Figure: performance versus pdsh and gexec; two plots of execution time (s) against number of nodes, one comparing pdsh and taktuk with a connection window, the other comparing gexec with arity 2 against taktuk over ssh and over rsh with a window.)
Advantages: no installation is required on remote nodes (TakTuk can self-propagate); it adapts to node load and is insensitive to node failures.

TakTuk features unique for the grid
Heterogeneity and hierarchy: any part of the deployment can be statically specified (e.g., a partial topology enforced by cluster front nodes), deployed nodes receive a logical numbering, and distinct machines can execute different commands.
Application support using the deployment connections: a control communication layer, and file transfer capabilities (send, receive, multicast, gather).

KAAPI: parallel programming library
KAAPI is a middleware for adaptive computation on multi-core architectures, clusters and grids. It offers a high-level API for resource abstraction, Athapascan, with a fork keyword to create parallel tasks and a shared keyword to declare shared data.
Objectives: write once, run anywhere, with guaranteed performance.

KAAPI example: C++, fork and shared

    // Tasks are function objects. Sum is declared first so that
    // Fibonacci can fork it.
    struct Sum {
        // shared_w: write access, shared_r: read access
        void operator()(a1::shared_w<int> result,
                        a1::shared_r<int> sr1,
                        a1::shared_r<int> sr2) {
            result.write(sr1.read() + sr2.read());
        }
    };

    struct Fibonacci {
        void operator()(int n, a1::shared_w<int> result) {
            if (n < 2)
                result.write(n);
            else {
                a1::shared<int> subresult1;
                a1::shared<int> subresult2;
                // Each fork creates a task; Sum reads the two subresults,
                // so it runs once both have been written.
                a1::fork<Fibonacci>()(n - 1, subresult1);
                a1::fork<Fibonacci>()(n - 2, subresult2);
                a1::fork<Sum>()(result, subresult1, subresult2);
            }
        }
    };

KAAPI workflow
The KAAPI application constructs a data flow graph. KAAPI maps the tasks onto resources, either by work stealing (dynamic load balancing) or by static placement, and manages communications (shared memory or network).
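
The work-stealing mapping can be illustrated with a small lock-based sketch in standard C++17. It is not KAAPI's implementation, which is far more elaborate and schedules data-flow tasks: here tasks are independent closures, each worker owns a deque, pops its own tasks from the back (LIFO), and when idle steals from the front (FIFO) of another worker's deque. The class `WorkStealingPool` and the function `parallel_sum` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical lock-based work-stealing pool: one deque per worker.
class WorkStealingPool {
public:
    explicit WorkStealingPool(std::size_t workers) : queues_(workers) { assert(workers > 0); }

    // Push a task onto a worker's deque; call before run().
    void submit(std::size_t worker, std::function<void()> task) {
        assert(worker < queues_.size());
        std::lock_guard<std::mutex> lock(queues_[worker].mutex);
        queues_[worker].tasks.push_back(std::move(task));
        ++pending_;
    }

    // Start one thread per worker and return once every task has run.
    void run() {
        std::vector<std::thread> threads;
        for (std::size_t w = 0; w < queues_.size(); ++w)
            threads.emplace_back([this, w] { worker_loop(w); });
        for (std::thread& t : threads) t.join();
    }

private:
    struct Queue {
        std::mutex mutex;
        std::deque<std::function<void()>> tasks;
    };

    // The owner takes from the back (LIFO); thieves take from the front (FIFO).
    std::optional<std::function<void()>> take(std::size_t w, bool owner) {
        std::lock_guard<std::mutex> lock(queues_[w].mutex);
        auto& q = queues_[w].tasks;
        if (q.empty()) return std::nullopt;
        std::function<void()> task;
        if (owner) { task = std::move(q.back()); q.pop_back(); }
        else       { task = std::move(q.front()); q.pop_front(); }
        return task;
    }

    void worker_loop(std::size_t self) {
        const std::size_t n = queues_.size();
        while (pending_.load() > 0) {
            std::optional<std::function<void()>> task = take(self, true);
            // Idle: try the other workers in turn, starting with the next one.
            for (std::size_t k = 1; !task && k < n; ++k)
                task = take((self + k) % n, false);
            if (task) {
                (*task)();
                --pending_;
            } else {
                std::this_thread::yield();
            }
        }
    }

    std::vector<Queue> queues_;
    std::atomic<std::size_t> pending_{0};
};

// Usage: sum 1..n with one task per term, all initially on worker 0,
// so the other workers only obtain work by stealing.
long parallel_sum(std::size_t workers, int n) {
    std::atomic<long> total{0};
    WorkStealingPool pool(workers);
    for (int i = 1; i <= n; ++i)
        pool.submit(0, [&total, i] { total += i; });
    pool.run();
    return total.load();
}
```

Taking local work from the back favors cache locality for the owner, while stealing the oldest tasks from the front tends to move larger pieces of work, which is the usual rationale for this deque discipline.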

TRIVA: application execution visualization
Developed in collaboration with UFRGS (Brazil). A 3D view (resources in 2D, plus time) outlines both the application topology and the network topology. TRIVA is generic and extensible: it is based on Pajé, a generic trace description language. Treemap views of the available synthetic data scale to thousands of processes.

Large-scale experiments
KAAPI/TakTuk won the Plugtest (an ETSI event) for three consecutive years, running N-Queens and financial applications on nearly 4000 cores.
The 2008 edition used G5K and InTrigger, with mixed communications: TakTuk communications between the grids and TCP/IP within each grid.
IDHAL experiments: coupling highly heterogeneous machines, namely the G5K grid, Brazilian grids, Luxembourg clusters, individual volunteer machines linked via DSL modems, and machines from PlanetLab.

Evolution forecast for structured grids
Next step: interconnect several structured grids into a larger one. This raises several new issues: a hierarchical network in which nodes communicate with their neighbors only, front-node forwarding for inter-grid communications, and more node failures (even during short executions). These issues meet those of unstructured grids (P2P grids, PlanetLab). Of course, the former lightweight-grid issues worsen: scale, heterogeneity and energy consumption.

KAAPI ongoing work
Deepen the run-anywhere concept.
Node dynamicity: fault tolerance through application checkpoint/restart, with CCK (a coordinated checkpoint protocol) and TIC (a distributed, theft-induced checkpointing protocol); interaction with the deployment tool to add or remove resources during the computation.
Heterogeneity handling: hierarchical work stealing (aware of high-latency networks) and NUMA-aware scheduling.
Complete implementation of an adaptive parallel STL.
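
Hierarchical work stealing can be sketched as a victim-selection policy: an idle worker first looks for work inside its own cluster and crosses a high-latency link only when no local worker has work to give. This is an illustrative policy with hypothetical names (`Worker`, `choose_victim`), not the KAAPI algorithm, which the slide does not detail; real implementations usually pick victims at random rather than scanning all workers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

struct Worker {
    int cluster;  // cluster the worker belongs to
    int pending;  // number of tasks it could give away
};

// Hypothetical policy: return the worker to steal from, preferring the
// thief's own cluster; at each level the most loaded candidate wins.
std::optional<std::size_t> choose_victim(const std::vector<Worker>& workers, std::size_t thief) {
    assert(thief < workers.size());
    std::optional<std::size_t> local, remote;
    for (std::size_t i = 0; i < workers.size(); ++i) {
        if (i == thief || workers[i].pending <= 0) continue;
        auto& best = (workers[i].cluster == workers[thief].cluster) ? local : remote;
        if (!best || workers[i].pending > workers[*best].pending) best = i;
    }
    return local ? local : remote;  // go remote only if no local victim exists
}
```

For instance, a thief in cluster 0 picks a local worker holding 3 tasks rather than a remote worker holding 10, trading some load balance for cheaper transfers.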

TRIVA ongoing work
Improve scalability: user navigation in large volumes of information, and well-chosen data aggregation for relevant overviews.
Aggregation example, the treemap: transform a data summary (e.g., the number of steals) into a visually relevant square. This can be applied at each level: core, processor, node, cluster, grid.
Behavior pattern identification: manipulation of object classes, correlated events, ...; detection of common patterns.
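
The treemap aggregation can be sketched with the simple slice-and-dice layout: a metric such as the number of steals is summed up the resource hierarchy, and each rectangle is split among its children in proportion to their totals, alternating vertical and horizontal cuts at each level. TRIVA's actual layout algorithm is not given in the slides; the `Node`/`Rect` types, the `example_tree` helper and the choice of slice-and-dice (rather than a squarified layout, which gives the near-square cells mentioned above) are assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Rect { double x, y, w, h; };

// Hypothetical resource-tree node (grid > cluster > node > processor > core).
struct Node {
    std::string name;
    double value = 0.0;          // leaf metric, e.g. number of steals
    std::vector<Node> children;
    Rect rect{};                 // filled in by layout()
};

// Aggregate bottom-up: an inner node's value is the sum of its children's.
double aggregate(Node& n) {
    if (!n.children.empty()) {
        n.value = 0.0;
        for (Node& c : n.children) n.value += aggregate(c);
    }
    return n.value;
}

// Slice-and-dice: split r among children proportionally to their values,
// cutting vertically at even depths and horizontally at odd depths.
void layout(Node& n, Rect r, int depth = 0) {
    assert(r.w >= 0.0 && r.h >= 0.0);
    n.rect = r;
    if (n.children.empty() || n.value <= 0.0) return;
    double offset = 0.0;  // fraction of r already used by previous children
    for (Node& c : n.children) {
        const double frac = c.value / n.value;
        Rect sub = (depth % 2 == 0)
            ? Rect{r.x + offset * r.w, r.y, frac * r.w, r.h}
            : Rect{r.x, r.y + offset * r.h, r.w, frac * r.h};
        layout(c, sub, depth + 1);
        offset += frac;
    }
}

// Tiny example: cluster c0 has two cores with 1 and 3 steals,
// cluster c1 is a single leaf with 4 steals.
Node example_tree() {
    Node grid{"grid", 0.0, {Node{"c0", 0.0, {Node{"core0", 1.0}, Node{"core1", 3.0}}},
                            Node{"c1", 4.0}}};
    aggregate(grid);
    layout(grid, Rect{0.0, 0.0, 1.0, 1.0});
    return grid;
}
```

In the example, both clusters total 4 steals and each receives half of the unit square; inside c0, core1 gets three quarters of its cluster's area.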

TakTuk ongoing work
Improve distributed application support: application management extensions, support for unions of deployment networks, and an interface between the batch scheduler and applications.
Data management: efficient broadcast of large data files using direct connections rather than the deployment network, based on the K-item broadcast algorithms of Santos et al.
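
Why splitting a large file into items helps a broadcast can be seen with the simplest pipelined scheme, a chain. Assuming, for this sketch only, that sending one item over one link takes one time step and that each node forwards items to its successor as soon as it has them, broadcasting K items to P receivers takes K + P - 1 steps, instead of K · P steps if the source sent the whole file to each receiver in turn. This is not the Santos et al. algorithm, only an illustration of the pipelining effect; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step-by-step simulation of a pipelined chain broadcast: node 0 is the
// source, nodes 1..receivers form the chain, and each link carries at most
// one item per step.
long chain_broadcast_steps(int items, int receivers) {
    assert(items >= 1 && receivers >= 1);
    std::vector<int> have(static_cast<std::size_t>(receivers) + 1, 0);  // items held, in order
    have[0] = items;  // the source holds the whole file
    long steps = 0;
    while (have[receivers] < items) {
        ++steps;
        // Update from the end of the chain so an item moves one hop per step.
        for (int i = receivers; i >= 1; --i)
            if (have[i] < have[i - 1]) ++have[i];
    }
    return steps;
}

// Non-pipelined alternative: the source sends every item to each receiver in turn.
long sequential_broadcast_steps(int items, int receivers) {
    return static_cast<long>(items) * receivers;
}
```

For 10 items and 5 receivers, the chain needs 14 steps against 50 for the sequential scheme; the gap grows with the number of items, which is why large files are cut into pieces before being broadcast.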

OAR ongoing work
Flexibility and application support.
Green OAR: dynamic changes of machine power states (based on history and models), and energy-sensitive scheduling (consumption/speed trade-off).
OAR API for interactions with applications: dynamic addition and removal of a job's resources.
Cluster administration: OAR live CD, and support for virtualized clusters.
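
The power-state decision mentioned for Green OAR can be illustrated with a classic break-even rule: switching an idle node off only saves energy if the expected idle period is long enough to amortize the shutdown and boot costs. The sketch compares the two energy budgets over an expected idle period; the `PowerModel` fields, the function name and the numbers below are hypothetical, and OAR's actual history-based models are not described in the slides.

```cpp
#include <cassert>

// Hypothetical per-node power model.
struct PowerModel {
    double idle_watts;       // power drawn by an idle, powered-on node
    double off_watts;        // residual power when switched off
    double shutdown_joules;  // energy spent shutting down
    double boot_joules;      // energy spent booting
    double shutdown_s;       // shutdown duration (seconds)
    double boot_s;           // boot duration (seconds)
};

// True if switching the node off for an expected idle period of `idle_s`
// seconds uses strictly less energy than staying idle, and the node can be
// back up before the period ends.
bool should_switch_off(const PowerModel& m, double idle_s) {
    assert(idle_s >= 0.0);
    const double transition_s = m.shutdown_s + m.boot_s;
    if (idle_s < transition_s) return false;  // cannot complete the off/on cycle
    const double stay_on = m.idle_watts * idle_s;
    const double cycle = m.shutdown_joules + m.boot_joules
                       + m.off_watts * (idle_s - transition_s);
    return cycle < stay_on;
}
```

For instance, with 100 W when idle, 5 W when off, and 18 kJ spent per shutdown/boot cycle lasting 180 s, switching off pays for a 300 s idle period (18.6 kJ versus 30 kJ) but not for a 180 s one, where both budgets are equal.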

Thank you for your attention. Any questions?
OAR: N. Capit, G. Da-Costa, Y. Georgiou, G. Huard, C. Martin, G. Mounié, P. Neyron, and O. Richard. A batch scheduler with high level components. In CCGrid 2005.
KAAPI: T. Gautier, X. Besseron, and L. Pigeon. KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In PASCO 2007.
TakTuk: B. Claudel, G. Huard, and O. Richard. TakTuk, adaptive deployment of remote executions. In HPDC 2009 (to appear).
TRIVA: L. M. Schnorr, G. Huard, and P. O. A. Navaux. 3D approach to the visualization of parallel applications and grid monitoring information. In Grid 2008.

Provisioning and Resource Management at Large Scale (Kadeploy and OAR)

Provisioning and Resource Management at Large Scale (Kadeploy and OAR) Provisioning and Resource Management at Large Scale (Kadeploy and OAR) Olivier Richard Laboratoire d Informatique de Grenoble (LIG) Projet INRIA Mescal 31 octobre 2007 Olivier Richard ( Laboratoire d Informatique

More information

Enabling Large-Scale Testing of IaaS Cloud Platforms on the Grid 5000 Testbed

Enabling Large-Scale Testing of IaaS Cloud Platforms on the Grid 5000 Testbed Enabling Large-Scale Testing of IaaS Cloud Platforms on the Grid 5000 Testbed Sébastien Badia, Alexandra Carpen-Amarie, Adrien Lèbre, Lucas Nussbaum Grid 5000 S. Badia, A. Carpen-Amarie, A. Lèbre, L. Nussbaum

More information

3D Approach to the Visualization of Parallel Applications and Grid Monitoring Information

3D Approach to the Visualization of Parallel Applications and Grid Monitoring Information 3D Approach to the Visualization of Parallel Applications and Grid Monitoring Information Lucas Mello Schnorr, Guillaume Huard, Philippe Olivier Alexandre Navaux Instituto de Informática Universidade Federal

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems TADaaM Team - Nicolas Denoyelle - Brice Goglin - Emmanuel Jeannot August 24, 2015 1. Context/Motivations

More information

BSC vision on Big Data and extreme scale computing

BSC vision on Big Data and extreme scale computing BSC vision on Big Data and extreme scale computing Jesus Labarta, Eduard Ayguade,, Fabrizio Gagliardi, Rosa M. Badia, Toni Cortes, Jordi Torres, Adrian Cristal, Osman Unsal, David Carrera, Yolanda Becerra,

More information

Load balancing in SOAJA (Service Oriented Java Adaptive Applications)

Load balancing in SOAJA (Service Oriented Java Adaptive Applications) Load balancing in SOAJA (Service Oriented Java Adaptive Applications) Richard Olejnik Université des Sciences et Technologies de Lille Laboratoire d Informatique Fondamentale de Lille (LIFL UMR CNRS 8022)

More information

Distributed communication-aware load balancing with TreeMatch in Charm++

Distributed communication-aware load balancing with TreeMatch in Charm++ Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration

More information

How To Understand The Concept Of A Distributed System

How To Understand The Concept Of A Distributed System Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz ens@ia.pw.edu.pl, akozakie@ia.pw.edu.pl Institute of Control and Computation Engineering Warsaw University of

More information

Kerrighed / XtreemOS cluster flavour

Kerrighed / XtreemOS cluster flavour Kerrighed / XtreemOS cluster flavour Jean Parpaillon Reisensburg Castle Günzburg, Germany July 5-9, 2010 July 6th, 2010 Kerrighed - XtreemOS cluster flavour 1 Summary Kerlabs Context Kerrighed Project

More information

INTERNET OF THE THINGS (IoT): An introduction to wireless sensor networking middleware

INTERNET OF THE THINGS (IoT): An introduction to wireless sensor networking middleware 1 INTERNET OF THE THINGS (IoT): An introduction to wireless sensor networking middleware Dr Antoine Bagula ISAT Laboratory, University of Cape Town, South Africa Goal of the lecture 2 The lecture intends

More information

159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354

159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354 159.735 Final Report Cluster Scheduling Submitted by: Priti Lohani 04244354 1 Table of contents: 159.735... 1 Final Report... 1 Cluster Scheduling... 1 Table of contents:... 2 1. Introduction:... 3 1.1

More information

Resource Utilization of Middleware Components in Embedded Systems

Resource Utilization of Middleware Components in Embedded Systems Resource Utilization of Middleware Components in Embedded Systems 3 Introduction System memory, CPU, and network resources are critical to the operation and performance of any software system. These system

More information

Grid Scheduling Dictionary of Terms and Keywords

Grid Scheduling Dictionary of Terms and Keywords Grid Scheduling Dictionary Working Group M. Roehrig, Sandia National Laboratories W. Ziegler, Fraunhofer-Institute for Algorithms and Scientific Computing Document: Category: Informational June 2002 Status

More information

Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com

Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com Hybrid Software Architectures for Big Data Laurence.Hubert@hurence.com @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line

More information

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis Middleware and Distributed Systems Introduction Dr. Martin v. Löwis 14 3. Software Engineering What is Middleware? Bauer et al. Software Engineering, Report on a conference sponsored by the NATO SCIENCE

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

Driving force. What future software needs. Potential research topics

Driving force. What future software needs. Potential research topics Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

More information

Big Data Management in the Clouds and HPC Systems

Big Data Management in the Clouds and HPC Systems Big Data Management in the Clouds and HPC Systems Hemera Final Evaluation Paris 17 th December 2014 Shadi Ibrahim Shadi.ibrahim@inria.fr Era of Big Data! Source: CNRS Magazine 2013 2 Era of Big Data! Source:

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

MapCenter: An Open Grid Status Visualization Tool

MapCenter: An Open Grid Status Visualization Tool MapCenter: An Open Grid Status Visualization Tool Franck Bonnassieux Robert Harakaly Pascale Primet UREC CNRS UREC CNRS RESO INRIA ENS Lyon, France ENS Lyon, France ENS Lyon, France franck.bonnassieux@ens-lyon.fr

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

Distributed Systems LEEC (2005/06 2º Sem.)

Distributed Systems LEEC (2005/06 2º Sem.) Distributed Systems LEEC (2005/06 2º Sem.) Introduction João Paulo Carvalho Universidade Técnica de Lisboa / Instituto Superior Técnico Outline Definition of a Distributed System Goals Connecting Users

More information

Cluster, Grid, Cloud Concepts

Cluster, Grid, Cloud Concepts Cluster, Grid, Cloud Concepts Kalaiselvan.K Contents Section 1: Cluster Section 2: Grid Section 3: Cloud Cluster An Overview Need for a Cluster Cluster categorizations A computer cluster is a group of

More information

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study DISTRIBUTED SYSTEMS AND CLOUD COMPUTING A Comparative Study Geographically distributed resources, such as storage devices, data sources, and computing power, are interconnected as a single, unified resource

More information

Scheduling and Resource Management in Computational Mini-Grids

Scheduling and Resource Management in Computational Mini-Grids Scheduling and Resource Management in Computational Mini-Grids July 1, 2002 Project Description The concept of grid computing is becoming a more and more important one in the high performance computing

More information

HPC Programming Framework Research Team

HPC Programming Framework Research Team HPC Programming Framework Research Team 1. Team Members Naoya Maruyama (Team Leader) Motohiko Matsuda (Research Scientist) Soichiro Suzuki (Technical Staff) Mohamed Wahib (Postdoctoral Researcher) Shinichiro

More information

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun CS550 Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun Email: sun@iit.edu, Phone: (312) 567-5260 Office hours: 2:10pm-3:10pm Tuesday, 3:30pm-4:30pm Thursday at SB229C,

More information

Enterprise Application Monitoring with

Enterprise Application Monitoring with Enterprise Application Monitoring with 11/10/2007 Presented by James Peel james.peel@altinity.com / www.altinity.com 1 Who am I? James Peel - james.peel@altinity.com Job: Managing Director of Altinity

More information

MAQAO Performance Analysis and Optimization Tool

MAQAO Performance Analysis and Optimization Tool MAQAO Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Evaluation Team, University of Versailles S-Q-Y http://www.maqao.org VI-HPS 18 th Grenoble 18/22

More information

Experimenting OAR in a virtual cluster environment for batch schedulers comparative evaluation

Experimenting OAR in a virtual cluster environment for batch schedulers comparative evaluation Experimenting OAR in a virtual cluster environment for batch schedulers comparative evaluation Joseph Emeras (Joseph.Emeras@imag.fr) Yiannis Georgiou (Yiannis.Georgiou@imag.fr) 1 Context OAR [3] is the

More information

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets!! Large data collections appear in many scientific domains like climate studies.!! Users and

More information

Improving the performance of data servers on multicore architectures. Fabien Gaud

Improving the performance of data servers on multicore architectures. Fabien Gaud Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, 2010

More information

Rodrigo Fernandes de Mello, Evgueni Dodonov, José Augusto Andrade Filho

Rodrigo Fernandes de Mello, Evgueni Dodonov, José Augusto Andrade Filho Middleware for High Performance Computing Rodrigo Fernandes de Mello, Evgueni Dodonov, José Augusto Andrade Filho University of São Paulo São Carlos, Brazil {mello, eugeni, augustoa}@icmc.usp.br Outline

More information

Distributed Systems. REK s adaptation of Prof. Claypool s adaptation of Tanenbaum s Distributed Systems Chapter 1

Distributed Systems. REK s adaptation of Prof. Claypool s adaptation of Tanenbaum s Distributed Systems Chapter 1 Distributed Systems REK s adaptation of Prof. Claypool s adaptation of Tanenbaum s Distributed Systems Chapter 1 1 The Rise of Distributed Systems! Computer hardware prices are falling and power increasing.!

More information

Principles and characteristics of distributed systems and environments

Principles and characteristics of distributed systems and environments Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single

More information

A Cost-Evaluation of MapReduce Applications in the Cloud

A Cost-Evaluation of MapReduce Applications in the Cloud 1/23 A Cost-Evaluation of MapReduce Applications in the Cloud Diana Moise, Alexandra Carpen-Amarie Gabriel Antoniu, Luc Bougé KerData team 2/23 1 MapReduce applications - case study 2 3 4 5 3/23 MapReduce

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

Scalable monitoring and configuration tools for grids and clusters

Scalable monitoring and configuration tools for grids and clusters Scalable monitoring and configuration tools for grids and clusters Philippe Augerat 1, Cyrill Martin 2, Benhur Stein 3 1 INPG, ID laboratory, Grenoble 2 BULL, INRIA, ID laboratory, Grenoble 3 UFSM, Santa-Maria,

More information

Design Patterns of Scalable Cluster System Software

Design Patterns of Scalable Cluster System Software Design Patterns of Scalable Cluster System Software Bibo Tu 1,2, Ming Zou 1,2, Jianfeng Zhan 1, Lei Wang 1 and Jianping Fan 1 1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing

More information

A High Performance Computing Scheduling and Resource Management Primer

A High Performance Computing Scheduling and Resource Management Primer LLNL-TR-652476 A High Performance Computing Scheduling and Resource Management Primer D. H. Ahn, J. E. Garlick, M. A. Grondona, D. A. Lipari, R. R. Springmeyer March 31, 2014 Disclaimer This document was

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems Chapter 1 Introduction System failures have been one of the biggest obstacles in operating today s largescale computing systems. Fault localization, i.e., identifying direct or indirect causes of failures,

More information

Batch Systems. provide a mechanism for submitting, launching, and tracking jobs on a shared resource

Batch Systems. provide a mechanism for submitting, launching, and tracking jobs on a shared resource PBS INTERNALS PBS & TORQUE PBS (Portable Batch System)-software system for managing system resources on workstations, SMP systems, MPPs and vector computers. It was based on Network Queuing System (NQS)

More information
