Big Data Management in the Clouds and HPC Systems

Transcription:

Big Data Management in the Clouds and HPC Systems. Hemera Final Evaluation, Paris, 17th December 2014. Shadi Ibrahim, Shadi.ibrahim@inria.fr

Era of Big Data! (Source: CNRS Magazine, 2013)

KerData's Core Research Challenges

Applications:
- Massive data analysis on clouds (e.g. MapReduce)
- Post-Petascale HPC simulations on supercomputers

Focus 1: Scalable big data management on IaaS and PaaS clouds
Challenge: understand how to reconcile performance, scalability, security and quality of service according to the requirements of data-intensive applications

Focus 2: Scalable big data management on Post-Petascale HPC systems
Challenge: go beyond the limitations of current file-based approaches

Focus 1: Scalable Big Data Management in IaaS and PaaS Clouds

Scalable Map-Reduce Processing

ANR Project Map-Reduce (ARPEGE, 2010-2014)
Partners: Inria (teams: KerData - leader, AVALON, Grand Large), Argonne National Lab, UIUC, JLPC, IBM, IBCP
Goal: high-performance Map-Reduce processing through concurrency-optimized data processing
URL: mapreduce.inria.fr

Some results:
- Versioning-based concurrency management for increased data throughput (the BlobSeer approach)
- Efficient intermediate data storage in pipelines
- Substantial improvements with respect to Hadoop
- Application to efficient VM deployment
- Intensive, long-run experiments done on Grid'5000, up to 300 nodes / 500 cores
- Plans: joint deployment on Grid'5000 + FutureGrid (USA)

Papers: JPDC, Concurrency and Computation: Practice and Experience, ACM HPDC 2011 (acceptance rate 12.9%), ACM HPDC 2012, IEEE/ACM CCGrid 2012, 2013 and 2014, Euro-Par 2012, IEEE IPDPS 2013
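
To make the first result above more concrete, here is a toy sketch of versioning-based concurrency control in the spirit of BlobSeer; it is an illustration of the general multiversioning idea, not the BlobSeer implementation, and all class and method names are hypothetical.

```python
# Toy sketch: writers never modify data in place; each write publishes a new
# immutable version, so readers and writers never block each other and readers
# always see a consistent snapshot. Not the BlobSeer code.
import threading

class VersionedBlob:
    def __init__(self):
        self._versions = [{}]           # version 0: empty snapshot
        self._lock = threading.Lock()   # only serializes version publication

    def write(self, offset, data):
        """Create a new version overlaying `data` at `offset` on the latest snapshot."""
        with self._lock:
            snapshot = dict(self._versions[-1])   # copy-on-write of the chunk map
            snapshot[offset] = data
            self._versions.append(snapshot)
            return len(self._versions) - 1        # the new version number

    def read(self, offset, version=None):
        """Read from a specific version (default: latest) without any locking."""
        snapshot = self._versions[-1 if version is None else version]
        return snapshot.get(offset)

blob = VersionedBlob()
v1 = blob.write(0, b"hello")
v2 = blob.write(4096, b"world")
print(blob.read(0, version=v1), blob.read(4096))   # b'hello' b'world'
```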

Impact: Transfer to Commercial Clouds

The A-Brain Microsoft Research - Inria Project
Partners: KerData and PARIETAL (Inria), Microsoft ATL Europe

Approach:
1. Preliminary experiments on Grid'5000
2. Apply the approach to Microsoft Azure clouds

[Figure: the A-Brain application correlates brain imaging data Y (anatomical, functional and diffusion MRI, q ~ 10^5-6 variables) with genetic data X (DNA arrays (SNP/CNV), gene expression data and others, p ~ 10^6 variables) over a cohort of N ~ 2000 subjects]

TomusBlobs software (based on BlobSeer):
- 45% gain over Azure Blobs
- Scalability up to 1000 cores
- Demo available

http://www.irisa.fr/kerdata/doku.php?id=abrain

Exposing Data Locality in MapReduce

Data locality is crucial for Hadoop's performance, yet the map scheduler ignores the state of the replication: e.g., 58% of Facebook's jobs achieve only 5% node locality and 59% rack locality (EuroSys 2010).

Maestro: replica-aware map scheduling, yielding a 35% performance improvement on 100 nodes.

"Maestro: Replica-aware map scheduling for MapReduce", S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, S. Wu. CCGrid 2012
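
The slide only names the scheduling idea, so here is a minimal, hypothetical Python sketch of what replica-aware chunk-to-node assignment can look like: the most constrained chunks (fewest replicas) are placed first, on the least-loaded node holding one of their replicas, so that map tasks stay node-local. Node and chunk names are illustrative assumptions; this is not the Maestro or Hadoop code.

```python
# Simplified replica-aware assignment, in the spirit of Maestro (not its implementation).
from collections import defaultdict

def replica_aware_assignment(chunk_replicas):
    """chunk_replicas: dict chunk_id -> list of nodes hosting a replica.

    Returns dict node -> list of chunks, trying to keep every map task local
    by assigning the chunks with the fewest placement options first.
    """
    node_load = defaultdict(int)      # chunks already assigned per node
    assignment = defaultdict(list)

    # Schedule the most constrained chunks first (fewest replicas).
    for chunk, replicas in sorted(chunk_replicas.items(), key=lambda kv: len(kv[1])):
        # Among the nodes holding a replica, pick the least loaded one,
        # which spreads local tasks and avoids starving other nodes of local work.
        target = min(replicas, key=lambda n: node_load[n])
        assignment[target].append(chunk)
        node_load[target] += 1
    return assignment

# Example with hypothetical nodes and a replication factor of 2.
chunks = {"c1": ["n1", "n2"], "c2": ["n1", "n3"], "c3": ["n2", "n3"], "c4": ["n1", "n2"]}
print(dict(replica_aware_assignment(chunks)))
```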

Energy-Efficient Big Data Processing

Why energy-efficient Hadoop?
- Hadoop is THE engine for Big Data processing in data centers
- Power bills have become a substantial part of the total cost of ownership (TCO) of data centers
- It is therefore essential to explore and optimize energy efficiency when running Big Data applications in Hadoop

Investigating the Impact of CPU Frequency Scaling on Power Efficiency in Hadoop

MapReduce applications are diverse, and each application goes through multiple phases with different resource profiles (CPU, disk I/O, network):
- For some applications, CPU load is high (98%) during almost 75% of the job run time
- For others, CPU load is high (80%) during only 15% of the job run time

There is therefore a significant potential for energy saving by scaling down the CPU frequency when peak CPU performance is not needed.

Investigating the Impact of CPU Frequency Scaling on Power Efficiency in Hadoop

[Figure: power profiles of the Pi and Sort applications under different DVFS settings]

We observe that a given DVFS setting is not only sub-optimal across different MapReduce applications, but also sub-optimal across different stages of the same application. This motivates building a dynamic frequency tuning tool specifically tailored to MapReduce application types and execution stages.

"Towards Efficient Power Management in MapReduce: Investigation of CPU-Frequencies Scaling on Power Efficiency in Hadoop", S. Ibrahim et al., ARMS-CC 2014
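
As a rough illustration of such a phase-aware tuning tool (not the tool from the paper), the sketch below switches the Linux cpufreq governor when the job moves between execution stages. It assumes the standard cpufreq sysfs interface and sufficient privileges; the phase names and phase-to-policy mapping are illustrative.

```python
# Minimal per-phase DVFS sketch for Linux using the cpufreq sysfs interface.
import glob

def set_governor(governor):
    """Write the given cpufreq governor (e.g. 'performance', 'powersave')
    to every CPU's scaling_governor file."""
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write(governor)

# Hypothetical mapping from MapReduce execution stage to a DVFS policy:
# CPU-bound stages keep the highest frequency, I/O-bound stages scale down.
PHASE_POLICY = {
    "map_compute": "performance",   # e.g. Pi-like jobs: ~98% CPU load
    "shuffle": "powersave",         # network/disk bound
    "reduce_merge": "powersave",    # disk bound
    "reduce_compute": "performance",
}

def on_phase_change(phase):
    # Fall back to the kernel's on-demand policy for unknown phases.
    set_governor(PHASE_POLICY.get(phase, "ondemand"))
```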

Consistency Management in the Cloud

Replication has become an essential feature in storage systems and is extensively leveraged in the cloud for fast access, enhanced performance and high availability.

How do we ensure a consistent state across all the replicas?
- Strong consistency: fresh reads, but high latency
- Eventual consistency: low latency and scalable, but stale reads

Expose the tight relation between consistency and performance, monetary cost, and power consumption (e.g., PNUTS).

Bismar: Cost-Efficient Consistency Management for the Cloud

[Figure: monetary cost ($) and stale read rate (%) for the One, Two, Quorum and Bismar consistency policies]

Bismar reduces the monetary cost by 31% (compared to Quorum) while resulting in only 3.5% stale data reads.

"Harmony: Towards automated self-adaptive consistency in cloud storage", H.E. Chihoub, S. Ibrahim, G. Antoniu, M.S. Pérez. CLUSTER 2012
"Consistency in the cloud: When money does matter!", H.E. Chihoub, S. Ibrahim, G. Antoniu, M.S. Pérez. CCGrid 2013
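
To make the adaptive-consistency idea concrete, here is a minimal sketch in the spirit of Harmony/Bismar using the DataStax Python driver for Cassandra: reads use the cheap ONE consistency level while an estimated stale-read rate stays under the application's tolerance, and fall back to QUORUM otherwise. The keyspace, table, contact point and stale-read estimator are placeholder assumptions, not the models from the papers.

```python
# Adaptive per-request consistency sketch (not the Harmony/Bismar implementation).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
session = cluster.connect("demo_ks")      # hypothetical keyspace

def choose_consistency(estimated_stale_read_rate, tolerated_rate=0.035):
    """Use cheap eventual consistency while the estimated stale-read rate stays
    below the application's tolerance, otherwise fall back to QUORUM."""
    if estimated_stale_read_rate <= tolerated_rate:
        return ConsistencyLevel.ONE
    return ConsistencyLevel.QUORUM

def read_value(key, estimated_stale_read_rate):
    cl = choose_consistency(estimated_stale_read_rate)
    query = SimpleStatement("SELECT value FROM kv WHERE key = %s", consistency_level=cl)
    return session.execute(query, (key,)).one()
```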

Energy vs. Consistency

Exploring the tight relationship between consistency and energy.
[Figure: coefficient of variation of per-node energy consumption for the One, Quorum and All consistency levels; variation is high at low consistency levels]

We observe that at least three main factors contribute to energy consumption in a Cassandra cluster:
- The consistency model
- The workload access patterns
- The degree of concurrency

Eventual consistency introduces usage variability between storage nodes, which motivates an adaptive configuration of the storage cluster.

Joint Laboratory on Extreme Scale Computing: Inria, University of Illinois at Urbana-Champaign (UIUC), Argonne National Laboratory (ANL), Barcelona Supercomputing Center (BSC)

Focus 2: Scalable Big Data Management on Post-Petascale HPC Systems

The Need for Scalable, Smart, yet Efficient I/O Management

Tomorrow's supercomputers will have millions of cores, with an increasing gap between compute and I/O performance.
[Figure: evolution from ~10,000 cores to 100,000+ cores and petabytes of data]

Efficient I/O Using Dedicated Cores

Damaris: Dedicated Adaptable Middleware for Application Resources Inline Steering
Main idea: dedicate cores in each SMP node to data management, decoupling the simulation from the parallel file system.

"Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O", M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf. CLUSTER 2012
http://damaris.gforge.inria.fr/
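
The following mpi4py sketch illustrates the dedicated-core idea only; it is not the Damaris API. The ranks of each node are grouped with a shared-memory communicator split, local rank 0 is set aside to absorb data and perform I/O, and the remaining ranks run the simulation and hand their results over without touching the file system. File names and payloads are illustrative.

```python
# Dedicated-core I/O sketch with MPI (run with: mpiexec -n <N> python this_script.py).
from mpi4py import MPI

world = MPI.COMM_WORLD
# Group the ranks that share a node, then dedicate local rank 0 to I/O.
node_comm = world.Split_type(MPI.COMM_TYPE_SHARED)
is_io_core = (node_comm.Get_rank() == 0)

if is_io_core:
    # Dedicated core: receive data from the simulation cores of this node
    # and write it out, overlapping I/O with the next computation iteration.
    n_clients = node_comm.Get_size() - 1
    with open(f"out_node_rank{world.Get_rank()}.txt", "w") as f:
        for _ in range(n_clients):
            data = node_comm.recv(source=MPI.ANY_SOURCE, tag=0)
            f.write(repr(data) + "\n")
else:
    # Simulation core: compute, then hand the results to the local dedicated core.
    local_result = {"rank": world.Get_rank(), "value": 42.0}   # placeholder payload
    node_comm.send(local_result, dest=0, tag=0)
```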

Towards Smart In-Situ Visualization

On the dedicated cores, the VisIt runtime (libsim) connects through a Damaris plugin to the Damaris server; on the simulation cores, the simulation uses the Damaris client; the setup is described in XML and visualization is driven from the user's machine.

- Connect visualization software to the simulation
- Perform visualization while the simulation runs
- Perform smart visualization, i.e. reduce image resolution when needed

http://damaris.gforge.inria.fr/

Energy-Efficient I/O Management

Power bills have become a substantial part of the total cost of ownership (TCO) of supercomputers: a typical supercomputer with thousands of cores consumes several megawatts of power.
Performance has long been the major focus of the HPC community: the No. 1 supercomputer, Tianhe-2, delivers 33.8 PFLOPS of performance but with a 24 MW power consumption.
Energy consumption will increase even further as we reach the era of exascale systems.

Energy-Efficient I/O Management

We evaluate the energy consumption of different I/O approaches based on dedicated cores or dedicated nodes. At least three factors contribute to the observed variations:
- The adopted I/O approach
- The output frequency
- The architecture of the system on which the HPC application runs

"A performance and energy analysis of I/O management approaches for exascale systems", O. Yildiz, M. Dorier, S. Ibrahim, G. Antoniu. DIDC 2014

Mitigating I/O Interference

CALCioM: Cross-Application Layer for Coordinated I/O Management
Goal: make applications communicate their I/O behavior to one another, make them coordinate to avoid interfering, and choose the best coordination strategy dynamically.
Experiments were run on Grid'5000 and on Surveyor (BlueGene/P, ANL).

"CALCioM: Mitigating I/O interference in HPC systems through cross-application coordination", M. Dorier, G. Antoniu, R. Ross, D. Kimpe, S. Ibrahim. IPDPS 2014
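
As a crude stand-in for the coordination strategies studied in CALCioM, the sketch below simply serializes the I/O phases of concurrent applications with an advisory lock on a shared file system. The real system coordinates through cross-application communication and can also choose to interrupt or interleave accesses, so this only conveys the "do not interfere" policy; the lock-file path is an assumption.

```python
# Serialize I/O bursts of concurrent applications via a shared advisory lock.
import fcntl
from contextlib import contextmanager

@contextmanager
def io_phase(lock_path="/shared/io-coordination.lock"):   # hypothetical shared path
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # wait until no other app is doing I/O
        try:
            yield                                # perform this application's I/O burst
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Usage: wrap each checkpoint/output phase.
# with io_phase():
#     write_checkpoint()   # hypothetical application routine
```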

Predicting I/O Patterns

Goal: predict the spatial and temporal I/O patterns. Omnisc'IO uses context-free grammars to predict where, when, and how much I/O will occur, at run time, with negligible overhead and negligible memory footprint.
Results were obtained with CM1, Nek5000, GTC and LAMMPS on Grid'5000.

"Omnisc'IO: A grammar-based approach to spatial and temporal I/O patterns prediction", M. Dorier, S. Ibrahim, G. Antoniu, R. Ross. SC'14
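
Omnisc'IO itself builds a Sequitur-style context-free grammar over the stream of I/O "symbols" (call contexts) to predict upcoming accesses. The simplified sketch below conveys the prediction idea with a longest-matching-suffix lookup instead of a grammar; the symbols are hypothetical and this is not the Omnisc'IO algorithm.

```python
# Simplified online I/O-pattern predictor (illustration only).
class IOPredictor:
    def __init__(self, max_context=8):
        self.history = []          # observed sequence of I/O symbols
        self.max_context = max_context

    def observe(self, symbol):
        """Record one I/O event symbol, e.g. (stack_hash, size_class)."""
        self.history.append(symbol)

    def predict_next(self):
        """Return the symbol that followed the longest suffix of the history
        seen earlier, or None if the pattern is new."""
        h = self.history
        for ctx in range(min(self.max_context, len(h) - 1), 0, -1):
            suffix = h[-ctx:]
            for i in range(len(h) - ctx - 1, -1, -1):
                if h[i:i + ctx] == suffix:
                    return h[i + ctx]
        return None

# Example: a periodic checkpoint pattern (symbols are hypothetical).
p = IOPredictor()
for s in ["hdr", "var_a", "var_b", "hdr", "var_a"]:
    p.observe(s)
print(p.predict_next())   # -> "var_b"
```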

Thank you! Shadi Ibrahim, Inria Research Scientist, KerData Team. Shadi.ibrahim@inria.fr