High-Performance Big Data Management Across Cloud Data Centers
Radu Tudoran
PhD Advisors: Gabriel Antoniu (INRIA), Luc Bougé (ENS Rennes)
KerData research team, IRISA/INRIA Rennes
Doctoral Work: Context
- Big Data characterized by Volume, Variety, Velocity, Veracity and Value
- Exponential growth of data: from 130 exabytes in 2005 to 20,000 exabytes projected for 2020 (2012 report)
- Cloud computing: data centers connected over the Internet provide compute, processing, storage, monitoring and communication
Geographically-Distributed Processing
- CERN ATLAS: PBs of data distributed for storage across multiple institutions
- Ocean Observatory: data sources located in geographically distant regions
- Large IT web services: data processing exceeds single-site limits
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing: multi-site MapReduce; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
A Big Data Case Study: The A-Brain Application
- Value: find the correlations between brain imaging markers and genetic data in order to understand behavioral variability and diseases
- Data: image data of dimension n_voxels x n_subjects; genetic (SNP) data of dimension n_snps x n_subjects
- Variety: multi-modal joint analysis; the data contains outliers from the acquisition process
- Volume: n_voxels = 10^6, n_snps = 10^6, n_subjects = 10^3; the data space potentially reaches the TB to PB level
- Veracity: biologically significant results and false-detection control require 10^4 permutations
- Velocity: not the case here, but other examples are coming in a few slides
Data Management on Public Clouds
- Typical setup: cloud compute nodes read and write through the cloud-provided storage service
- How about data locality?
Our Approach: TomusBlobs
- Collocate computation and data in PaaS clouds by federating the virtual disks of the compute nodes
- Self-configuration, automatic deployment and scaling of the data management system
- Applied to MapReduce and workflow processing
Leveraging TomusBlobs for MapReduce Processing
- Clients, mappers and reducers coordinate through Azure Queues
- New MapReduce prototype (no Hadoop available on Azure at that point)
- BlobSeer adopted as the storage backend
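The queue-based coordination pattern can be sketched as follows. This is a minimal illustration only: an in-memory queue and a plain dict stand in for Azure Queues and the federated TomusBlobs storage, and all names are hypothetical, not the thesis prototype's API.

```python
# Minimal sketch of queue-based MapReduce coordination (illustrative only).
# queue.Queue stands in for Azure Queues; a shared dict stands in for the
# TomusBlobs storage federated from the compute nodes' virtual disks.
import queue
from collections import Counter

map_queue, reduce_queue = queue.Queue(), queue.Queue()
tomusblobs = {}  # stands in for the federated virtual-disk storage

def client_submit(chunks):
    for i, chunk in enumerate(chunks):
        tomusblobs[f"input/{i}"] = chunk     # stage input into the shared storage
        map_queue.put(f"input/{i}")          # enqueue one map task per chunk

def map_worker():
    while not map_queue.empty():
        key = map_queue.get()
        partial = Counter(tomusblobs[key].split())   # example map: word count
        tomusblobs[f"partial/{key}"] = partial
        reduce_queue.put(f"partial/{key}")            # notify reducers via the queue

def reduce_worker():
    total = Counter()
    while not reduce_queue.empty():
        total += tomusblobs[reduce_queue.get()]
    return total

client_submit(["big data in the cloud", "data management in the cloud"])
map_worker()
print(reduce_worker().most_common(3))
```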
Initial A-Brain Experimentation
- Scenario: 100-node deployment on Azure; comparison with an Azure Blobs-based MapReduce
- Result: TomusBlobs is 3x-4x faster than the cloud remote storage
Beyond MapReduce: Map-IterativeReduce
- Produces a unique final result through parallel reduction
- No central control entity
- No synchronization barrier
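The reduction idea can be illustrated with a short sketch: reducers repeatedly combine partial results taken from a shared pool until a single result remains, so no central coordinator or barrier is required. The pool, the reduction ratio and the combine function below are illustrative choices, not the thesis implementation.

```python
# Sketch of the Map-IterativeReduce idea: any idle reducer grabs a few partial
# results, combines them, and puts the new partial back until one result is left.
import queue

def map_iterative_reduce(map_outputs, combine, reduction_ratio=2):
    pool = queue.Queue()
    for partial in map_outputs:
        pool.put(partial)
    remaining = pool.qsize()
    while remaining > 1:
        batch = [pool.get() for _ in range(min(reduction_ratio, remaining))]
        pool.put(combine(batch))          # the combined value becomes a new partial
        remaining = pool.qsize()
    return pool.get()                     # unique final result

# usage: summing partial sums produced by mappers
print(map_iterative_reduce([1, 2, 3, 4, 5], combine=sum))  # -> 15
```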
The Efficiency of Full Reduction
- Benchmarks: Most Frequent Words (data sets from 3.2 GB to 32 GB) and the initial A-Brain experimentation (data sets from 5 GB to 50 GB)
- Experimental setup: 200-node deployment on Azure
- Result: Map-IterativeReduce reduces the execution timespan to half
TomusBlobs for Workflow Processing
- Exploit workflow specificities: data access patterns, file manipulation, batch processing
TomusBlobs for Workflow Processing
- Multiple transfer solutions: FTP, in-memory, BitTorrent
- Adapt the transfer to the data access pattern (see the sketch below)
- Adaptive replication strategies for higher performance
- Integration with the Microsoft Generic Worker
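A hypothetical decision rule, only to illustrate how a transfer mechanism could be matched to the observed access pattern; the thresholds and pattern names are assumptions, and the actual TomusBlobs policy may differ.

```python
# Hypothetical rule matching the transfer mechanism to a workflow access pattern.
def choose_transfer(pattern, file_size_mb, consumers):
    if pattern == "broadcast" and consumers > 3:
        return "bittorrent"   # many readers of the same file: swarm-style dissemination
    if pattern == "pipeline" and file_size_mb < 64:
        return "in-memory"    # small intermediate files passed producer -> consumer
    return "ftp"              # default point-to-point file transfer

print(choose_transfer("broadcast", 512, consumers=20))  # -> bittorrent
```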
Workflow Processing on the Cloud
- Workloads: a synthetic workflow and the BLAST scientific workflow
- Experimental setup: 100 Azure nodes, Generic Worker engine
- Result: TomusBlobs adaptively chooses the best transfer strategy each time
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing: multi-site MapReduce; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
A-Brain: find the correlations between imaging data (~10^5-6 voxels) and genetic data (~10^6 SNPs) over ~2,000 subjects
Single-Site Computation on the Cloud
- Estimated timespan on a single-core machine: 5.3 years
- Parallelized and executed on the Azure cloud across 350 cores using TomusBlobs
- Achievements: reduced the execution time to 5.6 days; demonstrated that univariate analysis is too sensitive to outliers
- New challenge: adopt a more robust analysis, which increases the computation requirement to 86 years on a single core, motivating geographically distributed processing
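A back-of-the-envelope check of the reported speedup, assuming near-ideal scaling across the 350 cores:

```latex
% 5.3 years \approx 5.3 \times 365 \approx 1935 days
T_{350} \approx \frac{T_{1}}{350} = \frac{5.3 \times 365\ \text{days}}{350} \approx 5.5\ \text{days},
```

which is consistent with the measured 5.6 days.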
Going Geo-Distributed Across Azure Data Centers
- Hierarchical multi-site MapReduce: Map-IterativeReduce within each site, global reduce across sites
- Data management: TomusBlobs (intra-site), cloud storage (inter-site)
- The iterative-reduce technique minimizes the transfers of partial results
- Spreads the network bottleneck away from a single data center
Executing the A-Brain Application at Large Scale
- Multi-site processing: East US, North US and North EU Azure data centers
- Experiments performed on 1,000 cores; experiment duration: ~14 days
- More than 210,000 hours of computation used
- Cost of the experiments: 20,000 euros (VM price, storage, outbound traffic)
- 28,000 map jobs (each lasting about 2 hours) and ~600 reduce jobs; more than 1 TB of data transferred
- Scientific discovery: provided the first statistical evidence of the heritability of functional signals in a failed-stop task in the basal ganglia
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing (multi-site MapReduce): Map-IterativeReduce; hierarchical two-tier MapReduce cloud framework; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
To Cloud or Not to Cloud Data?
- Baseline: inter-site transfers over the WAN via the cloud-provided storage service
- Limitations: no (or weak) SLA guarantees; high-latency, low-throughput transfers
Addressing the SLA Issues for Inter-Site Transfers
- Architecture: a Decision Manager (model, schedule), Monitor Agents and Transfer Agents deployed in each data center
- Three design principles: environment awareness (model the cloud performance); real-time adaptation of data transfers; cost effectiveness (maximize throughput or minimize costs)
Modeling Cloud Data Transfers
- Sampling method: estimate the cloud performance based on the monitoring trace
- Average and variability estimated for each metric, updated according to the weight given to fresh samples: from 0 (no trust) to 1 (full trust)
- Predictive transfers: express the expected transfer time and cost
- Dynamically adjust the transfer quotas across routes (see the sketches below)
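A minimal sketch of such a monitoring-based estimator, assuming a weighted update of the mean and variability and a simple size/throughput time prediction; the class name, the weight value and the per-GB price are hypothetical, not the thesis model.

```python
# Sketch of a monitoring-based link estimator (names and constants are assumptions).
# Each new throughput sample updates the average and variability with a trust
# weight w in [0, 1]: w = 0 ignores fresh samples, w = 1 trusts only the latest one.
class LinkEstimator:
    def __init__(self, weight=0.3):
        self.w = weight
        self.mean = None      # estimated throughput (MB/s)
        self.var = 0.0        # estimated variability

    def update(self, sample):
        if self.mean is None:
            self.mean = sample
        else:
            self.var = (1 - self.w) * self.var + self.w * (sample - self.mean) ** 2
            self.mean = (1 - self.w) * self.mean + self.w * sample

    def predict_time(self, size_mb):
        return size_mb / self.mean                 # predicted transfer time (s)

    def predict_cost(self, size_gb, price_per_gb=0.12):
        return size_gb * price_per_gb              # outbound-traffic cost (assumed price)

est = LinkEstimator()
for s in [40, 55, 35, 60]:    # monitored throughput samples in MB/s
    est.update(s)
print(est.mean, est.predict_time(1024))
```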
Addressing Inter-Site Transfer Performance: Multi-Path Transfers
- Leverage network parallelism: aggregate the inter-site bandwidth by sending data in parallel through intermediate nodes within the source and destination data centers
Addressing Inter-Site Transfer Performance: Multi-Hop Transfers
- Further increase network parallelism: avoid network throttling by also considering alternative routes through other data centers
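One way to picture the quota adjustment across the direct and multi-hop routes is to split the chunks of a transfer in proportion to each route's estimated throughput. The route names and throughput values below are assumed, and the real scheduler also accounts for cost; this only illustrates the proportional split.

```python
# Sketch: split transfer quotas across routes proportionally to estimated throughput.
def split_quotas(routes, total_chunks):
    """routes: dict route_name -> estimated throughput (MB/s)."""
    total_tp = sum(routes.values())
    return {r: round(total_chunks * tp / total_tp) for r, tp in routes.items()}

routes = {
    "direct NorthUS->NorthEU": 40,   # assumed throughput estimates, MB/s
    "via WestEU": 25,
    "via EastUS": 15,
}
print(split_quotas(routes, total_chunks=100))
# -> roughly {'direct NorthUS->NorthEU': 50, 'via WestEU': 31, 'via EastUS': 19}
```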
When Money Meets Performance
- How much are you willing to pay for performance?
- How much is it actually worth paying?
Comparing to Existing Solutions
- Experimental setup: up to 10 nodes on the Azure cloud
- Transfers between the North Central US and North EU Azure data centers
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing (multi-site MapReduce): Map-IterativeReduce; hierarchical two-tier MapReduce cloud framework; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
To Stream or Not to Stream? (example: the CERN ATLAS experiment)
Towards Dynamic Batch-Based Streaming
- Pipeline: event source -> batching -> serialization -> network transfer -> de-serialization -> CEP engine at the destination
- Per-event latency: L_Event = L_Batching + L_Serialization + L_Transfer + L_Deserialization + L_Reordering; choose the batch size N_b that minimizes it (argmin over N_b)
- The latency L is modeled from the stream context: event size, throughput, arrival rate, routes, serialization/de-serialization technology, batch size
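A toy batch-size oracle in the spirit of this model: it evaluates candidate batch sizes against a per-event latency made of the components above. The per-component formulas and constants here are assumptions chosen only to show the tradeoff (waiting to fill a batch vs. amortizing per-batch costs), not JetStream's actual model.

```python
# Illustrative batch-size oracle; formulas and constants are assumptions.
def per_event_latency(nb, arrival_rate, event_size, bandwidth, rtt,
                      ser_per_event=2e-5, batch_overhead=1e-3):
    l_batching = (nb - 1) / (2 * arrival_rate)            # avg wait for the batch to fill
    l_serial = ser_per_event + batch_overhead / nb        # per-event share of serialization
    l_transfer = rtt / nb + event_size / bandwidth        # per-event share of one batch transfer
    l_deserial = ser_per_event + batch_overhead / nb      # per-event share of de-serialization
    return l_batching + l_serial + l_transfer + l_deserial

def best_batch_size(max_nb=2000, **ctx):
    return min(range(1, max_nb + 1), key=lambda nb: per_event_latency(nb, **ctx))

# usage: 5,000 events/s of 200 bytes each, ~12 MB/s inter-site bandwidth, 80 ms RTT
print(best_batch_size(arrival_rate=5000, event_size=200, bandwidth=12e6, rtt=0.08))
```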
JetStream architecture: at the event sender, events are buffered, grouped by the batch oracle and serialized before being handed to the transfer modules; at the event receiver, the transfer module de-serializes the batches and delivers the events from its buffer.
JetStream for MonALISA
- 1.1 million events streamed from the North US to the North EU Azure data center
- Independent (event-by-event) streaming takes 30,900 seconds, compared to 80 seconds with JetStream
- Automatic resource optimization; optimizes the latency vs. transfer-rate tradeoff
Variable Streaming Rates
- Elastic scaling of resources based on the load
- Environment-aware self-optimization
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing (multi-site MapReduce): Map-IterativeReduce; hierarchical two-tier MapReduce cloud framework; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
Transfer Options on the Cloud
- The default option (cloud storage): no (or minimal) configuration; high-latency, low-throughput transfers; fixed price scheme
- Multi-path transfers: aggregate inter-site throughput; fixed price scheme; managed, configured and administered by the users
How About a Transfer as a Service (TaaS)?
- Asymmetric cloud service: federated clouds; no transparent communication optimizations
- Symmetric cloud service: same cloud vendor; allows any number of communication optimizations
Is TaaS Feasible Performance-wise?
- Multi-tenant service usage: performance degradation of ~20% when the ratio of service nodes per application is decreased from 5:1 to 1:1
Performance-Based Cost Models?
- Scenario: transfer large volumes of data across Azure sites
- Cost: cost margins for the service usage can be defined based on the delivered performance
Data Transfer Market
- Flexible and dynamic pricing => a win-win situation for the cloud vendor and the users
- Why? Decrease the price to reduce idle bandwidth; increase the price to decrease network congestion (see the sketch below)
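A hypothetical pricing rule illustrating this market idea: lower the per-GB price when inter-site bandwidth sits idle, raise it when the links approach congestion. The thresholds, step and starting price are assumptions, not a proposal from the thesis.

```python
# Hypothetical utilization-driven price adjustment for the data transfer market.
def adjust_price(current_price, utilization, low=0.3, high=0.8, step=0.1):
    if utilization < low:
        return current_price * (1 - step)   # idle bandwidth: discount to attract transfers
    if utilization > high:
        return current_price * (1 + step)   # congestion: raise the price to throttle demand
    return current_price

price = 0.12  # assumed starting price per GB
for u in [0.2, 0.5, 0.9]:
    price = adjust_price(price, u)
    print(f"utilization={u:.0%} -> price={price:.3f}")
```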
Conclusions & Perspectives
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing (multi-site MapReduce): Map-IterativeReduce; hierarchical two-tier MapReduce cloud framework; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
Achievements
Publications
- 1 book chapter: in Cloud Computing for Data-Intensive Applications, Springer 2015
- 3 journal articles: Frontiers in Neuroinformatics 2014; Concurrency and Computation: Practice and Experience 2013; ERCIM Electronic Journal 2012
- 7 international conference publications: 3 papers at IEEE/ACM CCGrid 2012 and 2014 (Cluster, Cloud and Grid Computing, rank A; acceptance rates 26% and 19%); IEEE SRDS 2014 (Symposium on Reliable Distributed Systems, rank A); IEEE Big Data 2013 (acceptance rate 17%); ACM DEBS 2014 (Distributed Event-Based Systems, acceptance rate 9%); IEEE TrustCom/ISPA 2013 (rank A)
- 7 workshop papers, posters and demos: MapReduce workshop in conjunction with ACM HPDC (rank A); CloudCP in conjunction with ACM EuroSys (rank A); IPDPSW in conjunction with IEEE IPDPS (rank A); Microsoft CloudFutures, ResearchNext and PhD Summer School; demo in conjunction with ACM DEBS
Software
- PaaS data management middleware, available with the Microsoft Generic Worker
- MapReduce engine for the Azure cloud
- Cloud service for bio-informatics
- SaaS for benchmarking the performance of data stage-in to cloud data centers, available on the Azure cloud
- Middleware for batch-based, high-performance streaming across cloud sites, with a binding to Microsoft StreamInsight
External Collaborators
- Microsoft Research ATLE, Cambridge; Argonne National Laboratory; Inria Saclay; Inria Sophia Antipolis
Perspectives
- Multi-site workflows across geographically distributed sites: workflow data access patterns, self-* processing, cost/performance tradeoffs (Z-CloudFlow)
- Cloud stream processing: management of many small events, latency constraints for distributed queries
- Diversification of the cloud data management ecosystem: X-as-a-Service, uniform storage across sites, APIs for task orchestration
One size does not fit all!
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing (multi-site MapReduce): Map-IterativeReduce; hierarchical two-tier MapReduce cloud framework; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
Backup slides
Data Centers: From Few Large DCs Today to Many Small DCs Tomorrow
- Multi-site processing: integrated MapReduce processing across sites; workflow orchestration; cross-site scheduling of tasks
- Multi-site data management: uniform storage across data centers; high-performance transfer tools; Transfer as a Service; usage and data access patterns
Service Diversification
- Handling Big Data grows in complexity
- Architectural design options for the storage of many small items
- Enriched and diversified data-oriented services; smart replication strategies
- Diversification of processing: customizable user APIs towards generic business workflows; solutions providing the versatility of workflows with the simplicity of MapReduce
One size does not fit all!
Lessons Learned: Starting an Analysis in the Cloud
- Deployment start times: for each new or updated deployment on Azure, the fabric controller prepares the nodes; high deployment times (though better after the Nov. 2012 update)
- Bigger problems reported for Amazon EC2: "The most common failure is an inability to acquire all of the virtual machine images you requested because insufficient resources are available. When attempting to allocate 80 cores at once, this happens fairly frequently." (Keith R. Jackson, Lavanya Ramakrishnan, Karl J. Runge, and Rollin C. Thomas. Seeking supernovae in the clouds: a performance study. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, 2010)
- Scheduling mechanisms for efficient data access: wave vs. pipeline scheduling of map tasks
Lessons Learned: Running Big Data Applications
A real need for advanced data management functionality for running scientific Big Data processing in the clouds:
- Monitoring APIs -> monitoring and logging services for Big Data
- Current cloud storage APIs do not support even simple operations on multiple files/blobs (e.g. grep, select/filter, compress, aggregate) -> data management for geo-distributed processing
- Cloud storage delivers poor performance -> high-performance alternatives
- Inter-site data transfer is not supported -> Transfer as a Service
How much data can I transfer using 25 VMs for 10 minutes?
- Experimental setup: up to 25 nodes on the Azure cloud
- Transfers between the North Central US and North EU Azure data centers
Is It Feasible Performance-wise? Multi-Tenancy and the Impact of CPU Load on I/O
- Service access concurrency: performance degradation of ~20% when reducing the ratio of service nodes per application from 5:1 to 1:1
- CPU load on the user transfer nodes: performance degradation of up to 40%
Doctoral Work in a Nutshell: High-Performance Big Data Management Across Cloud Data Centers
- Scaling the processing (multi-site MapReduce): Map-IterativeReduce; hierarchical two-tier MapReduce cloud framework; enabling scientific discovery
- Data management across sites: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
Collocating Data and Computation
- What if some of the compute nodes become data nodes?
- Beyond Put/Get data management systems: what is a good option for building advanced data management functionality?
Which Nodes to Dedicate? (ISPA 2013)
Design principles:
- Dedicate compute nodes for managing and storing data
- Topology awareness (fault domains within the user deployment)
- No modification to the cloud middleware
A Topology-Aware Selection
- Discover the virtualized topology with a clustering approach: throughput measurements between VMs, asserting the performance
- Maximize the throughput between application nodes and storage nodes (see the sketch below)
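A simplified sketch of the selection objective: measure pairwise VM throughput, then pick as storage nodes the VMs with the best aggregate throughput to the remaining (application) nodes. The thesis uses a clustering approach; the greedy ranking and the toy measurements below are assumptions meant only to illustrate the objective.

```python
# Simplified, greedy variant of a topology-aware storage-node selection.
def select_storage_nodes(throughput, n_storage):
    """throughput[a][b]: measured throughput between VMs a and b (MB/s)."""
    nodes = list(throughput)
    def score(candidate):
        return sum(throughput[candidate][other] for other in nodes if other != candidate)
    ranked = sorted(nodes, key=score, reverse=True)   # best-connected VMs first
    return ranked[:n_storage]

# toy measurements between four VMs (assumed values)
tp = {
    "vm1": {"vm2": 90, "vm3": 40, "vm4": 35},
    "vm2": {"vm1": 90, "vm3": 45, "vm4": 30},
    "vm3": {"vm1": 40, "vm2": 45, "vm4": 85},
    "vm4": {"vm1": 35, "vm2": 30, "vm3": 85},
}
print(select_storage_nodes(tp, n_storage=2))
```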
Assessing the Storage Throughput
- Scenario: cumulative read and write throughput; experimental setup: 50 client nodes, 50 storage nodes
- Transfer improvement due to CPU and network management