Data Semantics Aware Cloud for High Performance Analytics


Data Semantics Aware Cloud for High Performance Analytics. Microsoft Future Cloud Workshop 2011, June 2, 2011. Prof. Jun Wang, Computer Architecture and Storage System Laboratory (CASS)

Acknowledgement: Thanks to NSF CCF-0811413 and NSF CAREER CCF-0953946 for their grant support. Other project members: Qiangju Xiao, Pengju Shang, Christopher Mitchell, Grant Mackey, Junyao Zhang. Department of Energy, Los Alamos National Laboratory.

Application Trend: more scientists will perform analytics on clouds! More and more scientists from many different disciplines rely on data analytics to advance scientific and engineering discovery, which can "accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming tasks from our consciousness" (I. Foster, Argonne). Scientific grand-challenge problems require global-scale interactive data sharing and collaboration.

Background about data-intensive HPC Analytics. Similar to e-science and data-intensive computing in that it manipulates oceans of data and performs dominant file I/O with moderate network I/O. Distinct feature: the HPC data access pattern, namely strided/non-contiguous access, small-sized requests, and some coordination among many parallel tasks. A minimal sketch of such a strided read follows.
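
To make the access pattern concrete, here is a minimal sketch (the file name, record size, and stride are illustrative assumptions, not from the talk) of the strided, small-request reads an HPC analysis task issues:

```csharp
// Illustrative strided read: fetch one small record out of every
// "stride" records, a pattern common in HPC analysis.
using System;
using System.IO;

class StridedReader
{
    static void Main()
    {
        const int recordSize = 256;   // small request size in bytes (illustrative)
        const int stride = 16;        // read every 16th record (illustrative)
        byte[] record = new byte[recordSize];

        using (FileStream fs = File.OpenRead("simulation_output.dat"))
        {
            for (long off = 0; off + recordSize <= fs.Length;
                 off += (long)recordSize * stride)
            {
                fs.Seek(off, SeekOrigin.Begin);   // non-contiguous seek
                fs.Read(record, 0, recordSize);   // small-sized request
                // ... hand the record to the analysis kernel ...
            }
        }
    }
}
```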

Background about data-intensive HPC Analytics

Our Preliminary Work on HPC Analytics: MRAP (ACM HPDC'10). Motivation: in HPC analytics there exists a semantics gap between the simulation output data at source sites and the data consumed by analysis programs at destination sites. Idea: pass data semantics information from the domain scientist down to the physical file storage level at the analysis sites. Do a better job of data staging: read only the data needed for analysis. Do a better job of analysis: avoid unnecessary file I/O by combining the analysis step and the data preprocessing step into one. By extending the current Hadoop framework, results show a throughput improvement of 33% using a bioinformatics app (CloudBurst) and an I/O kernel of Halo catalogs.

Challenges to enable data-intensive HPC analytics in a Cloud environment. What if we have infrastructure support similar to clouds such as Open Cirrus and the UIUC cloud? Then the problems have been solved by our ACM HPDC 2010 MRAP work. What if not, as on Amazon EC2 and Microsoft Azure? The challenges are: scalability (DIFS = co-located compute and storage + a chunk-based distributed file system); software resiliency at the data access level (multi-replication, runtime scheduling support); and the programming model (MapReduce's share-nothing design only allows accessing one source dataset per Map task).

The state of the art: HPC in Microsoft Azure and Amazon EC2. Amazon: the Amazon EC2 Cluster Compute and Cluster GPU instance types; Amazon Elastic MapReduce; Public Data Sets on AWS. Azure: IPDPS'10: "eScience in the cloud: a MODIS satellite data reprojection and reduction pipeline on the Windows Azure platform"; at the Supercomputing (SC) 2010 conference, Microsoft Corp. announced the release of NCBI BLAST on Windows Azure; Windows HPC Server 2008 R2 was released with Windows Azure support. Many other efforts are ongoing.

Limitations of Current Cloud Solutions on implementing co-located compute and storage. Different clouds (Amazon EC2, Azure, Linux Open Cirrus) share the following drawbacks. High performance overhead: the use of virtualization leads to sharing of physical disks and prevents directly harnessing the hardware resources. Local storage associated with a VM instance is not persistent: instances are randomly assigned to users, and once an instance is released it can be assigned to another user and the data on it is erased, which makes it unsuitable for constructing HDFS. Lack of physically distributed persistent storage: storage is centrally managed. Lack of a data transfer function between the local storages of instances in Azure, so data sharing across instances is disabled.

Limitations of Current Solutions on supporting co-located compute and storage (cont'd). For Amazon, some possible solutions: Constructing a DIFS using the local storage of virtual machines. If input and output data are still stored in S3, cost goes up for the HDFS replicas and the input/output transfer may become a bottleneck; if not, the risk of data loss increases since the data is not copied to permanent locations. Alternatively, taking advantage of EBS, which exists independently of the life of an instance, to store the data. EBS is charged according to the volume size you allocate rather than the data size you store, and the data communication between the instance and the EBS volume still goes through the network, which costs significant time.

Limitations of Current Solutions on supporting co-located compute and storage (cont'd)

Co-located Compute and Storage on Azure (1). Instances cannot access the local storage of other instances, so it is not possible to construct HDFS on the current infrastructure. VMs are assigned to users randomly, so it cannot be guaranteed that the instance you are using stays the same one; the data might be lost at any point of time by a failure, or after you resume your work.

Co-located Compute and Storage on Azure (2)

Our HPC Analytics Solutions for Azure. Phase 1: Data Preprocessing (i.e., Data Staging). Our previous work has shown that in HPC analytics applications, data formats and access patterns can be used to significantly improve the performance of subsequent data analysis. Our proposal: an HPC upload middleware that preprocesses the data before it is uploaded to Azure storage. Phase 2: Conducting Analysis. Duplicate a MapReduce/Hadoop context on Azure; develop a MapReduce/Hadoop framework on Azure.

Phase 1: HPC Upload Middleware. Motivation; system design; example: Gadget-II astrophysics data set.

HPC Upload Middleware -- Motivation. HPC data has special, well-defined data semantics (NetCDF, HDF5, etc.), but Azure storage does not consider them. Goal: make Azure store HPC data with awareness of its semantics by using an uploading middleware that recognizes NetCDF, HDF5, Gadget, and other structured HPC data.

HPC Upload Middleware -- Motivation

HPC Upload Middleware -- System Design (1). Two models: block blob and page blob. Data preprocessing: preprocess the HPC data using a user-defined method (e.g., a DLL implementation); generate the preprocessed HPC data together with semantics tags (i.e., metadata describing the data content). Data uploading (block blob or page blob): upload the generated HPC data blobs using the PutBlock and PutBlockList functions of CloudBlockBlob for block blob uploading, and WritePages for page blob uploading, respectively. A sketch of the block blob path follows.
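
The block blob path can be sketched as below, assuming the 2011-era Microsoft.WindowsAzure.StorageClient library that exposes PutBlock/PutBlockList; the container name, block size, and semantics tag are illustrative assumptions, and the preprocessing itself is abstracted away:

```csharp
// Hedged sketch: upload one preprocessed HPC data file as a block blob.
// Assumes the 2011-era Microsoft.WindowsAzure.StorageClient (v1.x) API;
// the container name, block size, and semantics tag are illustrative.
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.WindowsAzure.StorageClient;

class HpcBlockBlobUpload
{
    const int BlockSize = 4 * 1024 * 1024; // 4 MB per block (illustrative)

    static void Upload(CloudBlobClient client, string localFile)
    {
        CloudBlobContainer container = client.GetContainerReference("hpcdata");
        CloudBlockBlob blob =
            container.GetBlockBlobReference(Path.GetFileName(localFile));

        var blockIds = new List<string>();
        byte[] buffer = new byte[BlockSize];
        using (FileStream fs = File.OpenRead(localFile))
        {
            int n, index = 0;
            while ((n = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Block IDs must be Base64 strings of uniform length.
                string id = Convert.ToBase64String(BitConverter.GetBytes(index++));
                blob.PutBlock(id, new MemoryStream(buffer, 0, n), null);
                blockIds.Add(id);
            }
        }
        blob.PutBlockList(blockIds); // commit the uploaded blocks

        // Attach the semantics tag produced by preprocessing as blob metadata.
        blob.Metadata["semantics"] = "gadget-ii;strided-layout"; // illustrative
        blob.SetMetadata();
    }
}
```

The page blob path would use CloudPageBlob.WritePages with 512-byte-aligned offsets instead of PutBlock/PutBlockList.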

HPC Upload Middleware -- System Design (2)

HPC Upload Middleware -- Gadget II

Phase 2: Running an Analysis Program on MapReduce/HDFS on Azure. Outline: HDFS; MapReduce; challenges to deploying MapReduce/HDFS over Azure.

HDFS (1). HDFS is a chunk-based storage system comprised of one namenode, which manages metadata and block locations, and several datanodes, which store the chunks of each file. The namenode often also serves as the JobTracker, scheduling and tracking tasks, while the datanodes act as TaskTrackers performing the individual tasks. Files are split into fixed-size chunks and stored across the datanodes.

HDFS (2)

MapReduce (1). MapReduce is a programming framework and model comprised of two phases, map and reduce. Data flow: input data -> (k1, v1) pairs -> map function -> intermediate (k2, v2) pairs -> reduce function -> output (k3, v3) pairs. A small conceptual example follows.
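
The flow can be illustrated with a self-contained word-count analogue in plain C# (not Hadoop code); the in-memory GroupBy stands in for the framework's shuffle phase:

```csharp
// Conceptual MapReduce data flow: (k1,v1) -> map -> (k2,v2)
// -> group by key (shuffle) -> reduce -> (k3,v3). Word count example.
using System;
using System.Collections.Generic;
using System.Linq;

class MiniMapReduce
{
    // map: one input line -> a list of (word, 1) pairs
    static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        foreach (string w in line.Split(new[] { ' ' },
                 StringSplitOptions.RemoveEmptyEntries))
            yield return new KeyValuePair<string, int>(w, 1);
    }

    // reduce: (word, list of counts) -> (word, total)
    static KeyValuePair<string, int> Reduce(string word, IEnumerable<int> counts)
    {
        return new KeyValuePair<string, int>(word, counts.Sum());
    }

    static void Main()
    {
        string[] input = { "a b a", "b b c" };            // (k1, v1) records
        var output = input
            .SelectMany(Map)                              // map phase -> (k2, v2)
            .GroupBy(kv => kv.Key, kv => kv.Value)        // shuffle: group by k2
            .Select(g => Reduce(g.Key, g));               // reduce -> (k3, v3)
        foreach (var kv in output)
            Console.WriteLine(kv.Key + ": " + kv.Value);  // a: 2, b: 3, c: 1
    }
}
```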

MapReduce (2)

Challenges to deploy HDFS/MapReduce over Azure. Virtual machines make HDFS instances in Azure dynamic, resulting in data loss if data is stored in the VMs. Currently an instance cannot access the local storage of other instances, disabling data transfer between the local storages of instances. Centrally managed storage makes co-located computation and storage impossible.

Our current solution: constructing a logical HDFS layer on top of Azure's central storage. Idea: mimic an HDFS context by creating two tables on top of Azure's central storage infrastructure: one table manages the HDFS blocks, the other manages the Azure compute nodes (a sketch of the two schemas follows). Advantages: ports many existing MapReduce/Hadoop applications to Azure; supports co-located compute and storage. Drawback: all important data still needs to be saved to central storage.
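
A hedged sketch of what the two table entities could look like with the v1.x table client; every property name beyond PartitionKey/RowKey is an assumption for illustration:

```csharp
// Hedged sketch of the two Azure tables the logical HDFS layer uses.
// Property names other than PartitionKey/RowKey are illustrative.
using Microsoft.WindowsAzure.StorageClient;

// Table "metadatafilesummary": one entry per HDFS block.
class BlockEntry : TableServiceEntity
{
    public string BlobName { get; set; } // block blob holding this chunk
    public string VmId { get; set; }     // instance the block is assigned to
    public BlockEntry() { }              // required by the table client
    public BlockEntry(string fileName, int blockIndex)
        : base(fileName, blockIndex.ToString("D8")) { } // PK = file, RK = index
}

// Table "instancemanagetable": one entry per online VM instance.
class InstanceEntry : TableServiceEntity
{
    public InstanceEntry() { }
    public InstanceEntry(string vmId) : base("instances", vmId) { }
}
```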

HDFS attempt on Azure (1): Uploading. A table named metadatafilesummary stores the block entries, and another table named instancemanagetable stores the online VM IDs (the unique identifier of every VM instance). HDFS blocks are uploaded as block blobs: the data is split into HDFS blocks of a predefined size; the available VM instances are looked up in instancemanagetable and the blocks are distributed randomly across them; a block entry is added to metadatafilesummary; and the HDFS blocks are uploaded to Azure storage (see the sketch below).
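
Putting the uploading steps together; ListInstances, AddEntry, and UploadBlockBlob are hypothetical helpers standing in for the table and blob calls sketched above:

```csharp
// Hedged sketch of the HDFS-on-Azure upload path. ListInstances,
// AddEntry, and UploadBlockBlob are hypothetical helpers.
using System;
using System.Collections.Generic;
using System.IO;

static class HdfsOnAzureUpload
{
    static void UploadFile(string fileName, Stream data, int blockSize)
    {
        var rng = new Random();
        List<string> vmIds = ListInstances();   // rows of "instancemanagetable"
        byte[] buffer = new byte[blockSize];
        int n, index = 0;
        while ((n = data.Read(buffer, 0, blockSize)) > 0)
        {
            string vmId = vmIds[rng.Next(vmIds.Count)]; // random block placement
            string blobName = fileName + "_blk" + index;
            var entry = new BlockEntry(fileName, index)
                { BlobName = blobName, VmId = vmId };
            AddEntry(entry);                      // row in "metadatafilesummary"
            UploadBlockBlob(blobName, buffer, n); // HDFS block as a block blob
            index++;
        }
    }

    // Hypothetical helpers; real code would call the table/blob clients.
    static List<string> ListInstances() { throw new NotImplementedException(); }
    static void AddEntry(BlockEntry e) { throw new NotImplementedException(); }
    static void UploadBlockBlob(string name, byte[] buf, int len)
        { throw new NotImplementedException(); }
}
```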

HDFS attempt on Azure (2)

HDFS attempt on Azure (3): Downloading. The metadatafilesummary table stores the block entries. Fetch all the block entries for the requested file; read the block IDs and the related VM IDs; pick the entry whose VM ID matches the requesting instance, otherwise randomly pick one; the HDFS blocks are then downloaded as block blobs (see the sketch below).
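
The locality-aware replica choice in the download path can be sketched as follows; QueryEntries is a hypothetical helper that reads metadatafilesummary:

```csharp
// Hedged sketch of locality-aware block selection on download: prefer a
// block entry assigned to the requesting instance, else pick at random.
using System;
using System.Collections.Generic;
using System.Linq;

static class HdfsOnAzureDownload
{
    static BlockEntry PickReplica(string fileName, int blockIndex, string localVmId)
    {
        List<BlockEntry> candidates = QueryEntries(fileName, blockIndex);
        BlockEntry local = candidates.FirstOrDefault(e => e.VmId == localVmId);
        if (local != null)
            return local;                        // co-located: no extra hop
        return candidates[new Random().Next(candidates.Count)]; // fallback
    }

    // Hypothetical helper; real code would query "metadatafilesummary".
    static List<BlockEntry> QueryEntries(string fileName, int blockIndex)
        { throw new NotImplementedException(); }
}
```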

HDFS attempt on Azure (4)

MapReduce over Azure -- architecture

Preliminary Results of HPC uploading middleware on Azure (1). [Chart: data volume transferred for read of a 60 GB dataset (in MB) versus total data size, for block blob and page blob, each with and without preprocessing; y-axis from 0 to 70,000 MB.]

Preliminary Results of HPC uploading middleware on Azure (2). [Chart: uploading time for 6 MB (in 100 ms units), download time for 6 MB (ms), upload time for 128 MB (s), and download time for 128 MB (in 10 ms units), for block blob and page blob, each with and without preprocessing; y-axis from 0 to 1,400.]

Preliminary Results of HPC uploading middleware on Azure (3). Access time improvement, 128 MB: block blob with preprocessing 85.01%, page blob with preprocessing 84.69%. Access time improvement, 6 MB: block blob with preprocessing 41.88%, page blob with preprocessing 56.34%.

Breakdown of results. With preprocessing: uploading time = preprocessing time + add-entry time + data uploading time; downloading time = entry query time + data downloading time. Without preprocessing: uploading time = data uploading time; downloading time = data downloading time + strided access time. The preprocessing time is user-related latency; the overhead introduced by the uploading middleware is the add-entry time and the entry query time. These identities are restated below.
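
Written out as timing identities (notation ours, matching the breakdown above):

```latex
% Timing identities for the uploading middleware (notation ours).
\begin{align*}
\textbf{With preprocessing:}\quad
  T_{\mathrm{upload}}   &= T_{\mathrm{preprocess}} + T_{\mathrm{add\,entry}} + T_{\mathrm{transfer}} \\
  T_{\mathrm{download}} &= T_{\mathrm{entry\,query}} + T_{\mathrm{transfer}} \\
\textbf{Without preprocessing:}\quad
  T_{\mathrm{upload}}   &= T_{\mathrm{transfer}} \\
  T_{\mathrm{download}} &= T_{\mathrm{transfer}} + T_{\mathrm{strided\,access}}
\end{align*}
```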

Conclusion. We demonstrate the feasibility of deploying HPC analysis applications on Azure, deploying HDFS over Azure, and deploying MapReduce over HDFS on Azure. With data semantics awareness at both the data staging and analysis phases, the performance of HPC analytics programs can be significantly improved in clouds: about an 85% improvement in data access latency.

Thanks!