GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology Acknowledgement: This work is sponsored by a grant from the NSF CyberTrust program, an IBM faculty award, and an IBM SUR grant. The first author did an internship at IBM Almaden on a comparison of HDFS with GPFS.
Internet Usage and Content: Demand for a New Generation of System Software. Technology explosion and continuous growth, fueled by global computing, mobile device access, and user-generated content. Mass distribution: device numbers are exploding (cell phones +1B/yr); edge computing resources exceed those in the core. Mass centralization: yield management, optimization, and data analysis; data is the asset; move computation closer to the data.
Mobile Internet & Cloud Computing Vision: an emerging computing paradigm where data and services reside in massively scalable data centers and can be ubiquitously accessed from any connected device over the Internet. From the one-server, one-application model to a browser-based parallel application model, expanding the Internet to become the compute platform of the future. Businesses from startups to enterprises; 4+ billion phones by 2010 [Source: Nokia]; Web 2.0-enabled PCs, TVs, etc.
GeoGrid Design and Service Network Management: a decentralized, geographically location-aware overlay network for scalable and reliable delivery of location-based services. A 2-D rectangular geographical area is divided into a number of rectangular regions. Each region is assigned to an owner node in the overlay network, which processes all service requests mapped to that region. Each node is a common user PC and can join and leave the network at any time. Mobile users connect their mobile devices to a GeoGrid node. [Figure: overlay nodes and their regions, with a service request message routed toward the coordinate (832,825).]
GeoGrid Design and Service Network Management: each node maintains a neighbor list, and requests are routed with greedy forwarding. Node joining: the new node splits a region with the existing owner node. Node departure: a neighbor node takes over by merging the two rectangular regions. [Figure: neighbor lists and greedy forwarding toward the coordinate (832,825).]
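To make the greedy forwarding concrete, here is a minimal Java sketch: each node checks whether it owns the target coordinate and otherwise hands the request to the neighbor whose region center is closest to the target. The class and method names (Region, Neighbor, GeoGridNode, route) are illustrative assumptions, not taken from the GeoGrid implementation.

```java
// Minimal sketch of GeoGrid-style greedy forwarding (illustrative names only).
import java.util.List;

class Region {
    double xMin, yMin, xMax, yMax;           // rectangular region boundaries

    boolean contains(double x, double y) {
        return x >= xMin && x < xMax && y >= yMin && y < yMax;
    }

    double distanceToCenter(double x, double y) {
        double cx = (xMin + xMax) / 2, cy = (yMin + yMax) / 2;
        return Math.hypot(x - cx, y - cy);
    }
}

class Neighbor {
    String nodeId;
    Region region;
}

class GeoGridNode {
    Region ownRegion;
    List<Neighbor> neighborList;             // each node maintains its neighbor list

    /** Route a service request toward the node owning the target coordinate. */
    void route(double x, double y, byte[] request) {
        if (ownRegion.contains(x, y)) {
            process(request);                // this node owns the target region
            return;
        }
        Neighbor next = null;
        double best = Double.MAX_VALUE;
        for (Neighbor n : neighborList) {    // greedy step: neighbor closest to target
            double d = n.region.distanceToCenter(x, y);
            if (d < best) { best = d; next = n; }
        }
        forward(next, x, y, request);        // hand the request to the chosen neighbor
    }

    void process(byte[] request) { /* execute the location-based service */ }
    void forward(Neighbor n, double x, double y, byte[] request) { /* send over overlay link */ }
}
```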
GeoGrid: Scaling LBSs through Runtime Adaptation, Replication, and Load Balance. [Figure: basic GeoGrid network with 500 nodes.]
GeoGrid Research: Enhancing Reliability and Scalability of Pervasive Computing Applications. Reliability/fault tolerance: node failures, network partition failures. Scalability/performance: CPU, disk I/O, network bandwidth, virtualization. Energy efficiency (mobile clients). Availability: 24x7 uptime. Distributed and parallel computing infrastructure: virtualization. Security/trust/pervasive security: cross-layer trust management.
Reliable GeoGrid Service Network Load Visualization: highly skewed node workload distribution. [Figure: workload distribution across the network, scale 0.0 to 1.0.]
Replication and Replica-based Load Balancing. Three basic components: the service owner and executor model; executor selection from neighbor and shortcut replicas; and executor selection criteria that take multiple performance factors into account, namely the per-node workload index, the cache factor, and the proximity factor. [Figure: owner region with replicas L1-L4 placed at neighbor and shortcut nodes.]
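The slide lists three performance factors for executor selection; the sketch below shows one plausible way to combine them into a single score and pick the best candidate among the neighbor and shortcut replicas. The weights, field names, and normalization are hypothetical assumptions, not the actual GeoGrid policy.

```java
// Hypothetical executor-selection score combining the three factors from the slide.
import java.util.List;

class ExecutorCandidate {
    double workloadIndex;   // per-node workload index, assumed normalized to 0..1 (higher = more loaded)
    double cacheFactor;     // fraction of the requested data already cached locally (0..1)
    double proximityFactor; // normalized network proximity to the owner (0..1, higher = closer)
}

class ExecutorSelector {
    // Illustrative weights; the real GeoGrid policy may weight the factors differently.
    static final double W_LOAD = 0.5, W_CACHE = 0.3, W_PROXIMITY = 0.2;

    static double score(ExecutorCandidate c) {
        return W_LOAD * (1.0 - c.workloadIndex)   // prefer lightly loaded nodes
             + W_CACHE * c.cacheFactor            // prefer nodes that already cache the data
             + W_PROXIMITY * c.proximityFactor;   // prefer nearby nodes
    }

    /** Pick the neighbor or shortcut replica with the highest combined score. */
    static ExecutorCandidate select(List<ExecutorCandidate> candidates) {
        ExecutorCandidate best = null;
        for (ExecutorCandidate c : candidates) {
            if (best == null || score(c) > score(best)) best = c;
        }
        return best;
    }
}
```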
Reducing deadly failures: the replication scheme helps to reduce deadly failures.
Hot spot diameter on workload index adaptation
Workload Visualization: workload distribution without load balance at time epoch 95; with load balance at time epoch 105; and with load balance at time epoch 110. Setup: 500 nodes, 39,871 regular location query events, 35,910 hot spot location query events.
Experiences with Hadoop. Based on a summer internship project at the IBM Almaden Storage Systems Division.
Hadoop: Designed for Map-Reduce Applications. Hadoop is a framework for running applications on large-scale commodity hardware. Core: HDFS as the storage layer, map-reduce as the processing programming model. Ship computation to data. Scalable, reliable, easy to use. Tolerates node/disk failures without impact on applications. Large-scale data: single files of gigabytes to terabytes. Large-scale distributed systems: thousands of nodes. Why is Hadoop so popular? It is the first open-source distributed file system for cloud computing and the designated storage management support for map-reduce applications.
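As a concrete illustration of the map-reduce programming model that HDFS is designed to serve, here is the classic word-count job written against the standard org.apache.hadoop.mapreduce API of newer Hadoop releases (the 0.17 release used in the testbed below still exposes the older org.apache.hadoop.mapred API). This is a generic textbook example, not code from the project.

```java
// Classic word-count job illustrating the map-reduce programming model on HDFS.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);                    // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));      // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);       // combine map output locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```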
Distributed File System (HDFS). Single namespace for the entire cluster. Data coherency: write-once-read-many access model; clients can only append to existing files. Files are broken up into blocks, typically 64 MB, and each block is replicated on multiple DataNodes. Intelligent client: the client can find the locations of blocks and access data directly from the DataNodes.
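A minimal sketch of this intelligent-client model using the public org.apache.hadoop.fs.FileSystem API: the client writes a file once, then reopens it for reading, with the client library transparently contacting the NameNode for block metadata and streaming the bytes directly from the DataNodes. The file path is a made-up example.

```java
// Sketch of the HDFS intelligent-client model through the public FileSystem API.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/events.log");   // hypothetical path

    // Write once: the file becomes immutable after close (append-only model).
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("one record per line\n");
    }

    // Read many: the client obtains block metadata from the NameNode and then
    // streams the bytes directly from the DataNodes holding the replicas.
    FileStatus status = fs.getFileStatus(file);
    byte[] buf = new byte[(int) status.getLen()];
    try (FSDataInputStream in = fs.open(file)) {
      in.readFully(0, buf);                          // positioned read of the whole file
    }
    System.out.println(new String(buf, StandardCharsets.UTF_8));
  }
}
```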
Comparing Hadoop with IBM GPFS: Motivation. Both Hadoop and GPFS are designed as frameworks for running applications on large-scale commodity hardware (thousands of nodes). The goal is to understand the key properties a distributed file system needs for map-reduce applications: HDFS as storage plus map-reduce as the processing programming model; ship computation to data; scalable, reliable, easy to use; tolerates node/disk failures without impact on applications; large-scale data (single files of gigabytes to terabytes); large-scale distributed systems (thousands of nodes).
GPFS Basics. Provides high-performance I/O: divides files into blocks and stripes the blocks across disks (on multiple storage devices), reads/writes the blocks in parallel, and offers tunable block sizes (depending on your data). Block-level locking mechanism: multiple applications can access the same file concurrently, e.g., multiple editors can work on different parts of a single file simultaneously; this eliminates the additional storage, merging, and management overhead typically required to maintain multiple copies. Client-side data caching: where is data cached?
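Because GPFS presents a POSIX file system, ordinary file APIs apply. Below is a small sketch, under the assumption that the file lives on a GPFS mount and is striped across disks, of two threads reading disjoint byte ranges of the same file concurrently, which is the access pattern that block-level locking and striping are meant to serve without extra copies. The mount point and chunk size are assumptions.

```java
// Sketch: concurrent positional reads on a single (assumed GPFS-mounted) file via standard file I/O.
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ParallelRangeRead {
  public static void main(String[] args) throws Exception {
    Path file = Paths.get("/gpfs/demo/large.dat");   // hypothetical GPFS mount point
    long half;
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      half = ch.size() / 2;
    }

    // Two readers work on disjoint halves of the same file at the same time;
    // with striped blocks, the ranges can be served from different disks/nodes.
    Thread first  = new Thread(() -> readRange(file, 0, half));
    Thread second = new Thread(() -> readRange(file, half, half));
    first.start(); second.start();
    first.join(); second.join();
  }

  static void readRange(Path file, long offset, long length) {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer buf = ByteBuffer.allocate(1 << 20); // read in 1 MB chunks
      long remaining = length, pos = offset;
      while (remaining > 0) {
        buf.clear();
        int n = ch.read(buf, pos);                   // positional read, no shared file cursor
        if (n < 0) break;
        pos += n; remaining -= n;
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```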
Comparing Features of HDFS and GPFS.
Coherency: HDFS uses a simple write-once-read-many coherency model; GPFS provides POSIX semantics.
Data layout: HDFS splits files into uniform-sized blocks (64 MB) distributed across cluster nodes; GPFS achieves high performance through striping blocks across multiple nodes and disks.
Locking: HDFS has no locking; GPFS uses distributed locking.
Fault handling: HDFS relies on block replication to handle hardware failure; GPFS has a default replication factor of 2.
Architecture: HDFS is master-slave; GPFS is shared-disk.
Location metadata: HDFS exposes block locations to applications; GPFS does not.
Testbed Configuration. HDFS 0.17.0 and GPFS 3.2 on 8 nodes, each with a 2.8 GHz Pentium IV, 5 x 36 GB 10K RPM SCSI disks, 1 GB of memory, and 1 Gbps network bandwidth. Of the 5 disks per node, one holds the OS image, two are for HDFS, and two are for GPFS. Tools: Hadoop logs and Ganglia, a cluster performance monitoring tool. Methodology: create I/O workloads to run on Hadoop. Metric categories: execution time, I/O throughput, network utilization, CPU utilization.
Metadata Efficiency (0 KB files)
Metadata Efficiency (1 KB files)
Data Efficiency
Metadata Efficiency (0 KB files)
Function Shipping
Interplay between GPFS and HDFS. HDFS could benefit from GPFS's metadata scalability and data transfer efficiency. GPFS could benefit from HDFS's function shipping and location metadata.
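To illustrate the "location metadata" point, here is a sketch of how HDFS surfaces block locations through the public FileSystem.getFileBlockLocations call; this is the metadata a scheduler uses to ship computation to the data. The commented scheduler hook is hypothetical.

```java
// Sketch: querying HDFS block-location metadata that a scheduler can use to place map tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalityProbe {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = new Path(args[0]);                    // file to be processed

    FileStatus status = fs.getFileStatus(input);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // Each block reports the DataNodes holding a replica; a locality-aware
    // scheduler prefers launching the task on one of those hosts.
    for (BlockLocation block : blocks) {
      String[] hosts = block.getHosts();
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", hosts));
      // hypothetical hook: scheduler.preferProcessingOn(hosts, block.getOffset(), block.getLength());
    }
  }
}
```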
Integrating GeoGrid with Hadoop. Goal: supporting the storage-as-a-service model and location-based content delivery and dissemination. Target applications: Internet movie, music, and photo sharing through geo-tagging and the GeoGrid overlay network.
Questions? Thanks!