Big Data Analytics. Lucas Rego Drumond


Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany

Outline
1. Distributed File Systems

What is Big Data?

Big Data is about:
- Storing and accessing large amounts of (unstructured) data
- Processing high volume data streams
- Making sense of the data
- Predictive technologies

Overview
- Part III: Machine Learning Algorithms
- Part II: Large Scale Computational Models
- Part I: Distributed Database, Distributed File System

Why do we need a Distributed File System?

Read?
- Whole file?
- Specific part?

Write?
- Append to the end of the file?
- Insert content in the middle?

We want to:
- Perform multiple parallel reads and writes
- Have the files available even if one computer crashes (replication)
- Hide parallelization and distribution details

What is a Distributed File System?

File namespace:
/
/home
/home/lucas
/home/lucas/big_file

File namespace:
/
/home
/home/john
/home/john/big_file

Examples
- GFS (Google Inc.)
- HDFS (Apache Software Foundation)
- Ceph (Inktank, Red Hat)
- MooseFS (Core Technology / Gemius)
- Windows Distributed File System (DFS) (Microsoft)
- FhGFS (Fraunhofer)
- GlusterFS (Red Hat)
- Lustre
- Ibrix

Components

A typical distributed file system contains the following components:
- Clients: provide the interface to the user
- Chunk nodes: store chunks of files
- Master node: stores which parts of each file are on which chunk node

Distributed File Systems: The Google File System Architecture

[Figure: Google File System architecture]

Distributed File Systems - Storing files

Master node, file namespace:
/
/home
/home/john
/home/john/big_file

Master node, chunk locations for /home/john/big_file:
Chunk 1: C1, C7
Chunk 2: C3, C5
Chunk 3: C4, C6
Chunk 4: C2, C8

Chunk nodes: C1 C2 C3 C4 C5 C6 C7 C8
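The chunk table on this slide can be written down as a small lookup structure. The following Python sketch is only an illustration: the dictionary layout and the helper names `chunks_per_node` and `survives_failure` are assumptions made for this example, not part of GFS or HDFS. It copies the placement from the slide and checks that the file stays readable when any single chunk node fails.

```python
from collections import defaultdict

# Chunk placement copied from the slide: chunk number -> replica nodes.
CHUNK_LOCATIONS = {
    "/home/john/big_file": {
        1: ["C1", "C7"],
        2: ["C3", "C5"],
        3: ["C4", "C6"],
        4: ["C2", "C8"],
    }
}


def chunks_per_node(locations):
    """Invert the master's table: chunk node -> list of (path, chunk number)."""
    inverted = defaultdict(list)
    for path, chunks in locations.items():
        for chunk_no, nodes in sorted(chunks.items()):
            for node in nodes:
                inverted[node].append((path, chunk_no))
    return dict(inverted)


def survives_failure(locations, failed_node):
    """True if every chunk still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in nodes)
        for chunks in locations.values()
        for nodes in chunks.values()
    )


per_node = chunks_per_node(CHUNK_LOCATIONS)
all_nodes = sorted(per_node)
single_failures_ok = all(survives_failure(CHUNK_LOCATIONS, n) for n in all_nodes)
```

With two replicas per chunk on distinct nodes, `single_failures_ok` holds for this placement; losing both replicas of the same chunk (for example C1 and C7) would not be survivable.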

Read Example

Master node, chunk locations for /home/john/big_file:
Chunk 1: C1, C7
Chunk 2: C3, C5
Chunk 3: C4, C6
Chunk 4: C2, C8

Steps:
1. Client application to master node: read(/home/john/big_file, chunk 1)
2. Master node to client application: (Chunk 1 handle, {C1, C7})
3. Client application to chunk node: (Chunk 1 handle, byte range)
4. Chunk node to client application: Chunk 1 data
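The four read steps can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the classes `Master` and `ChunkNode`, the function `client_read`, the handle string `"h1"`, and the fallback to the next replica are all hypothetical names and choices for illustration, not the GFS or HDFS API.

```python
class Master:
    """Holds only metadata: (path, chunk number) -> (handle, replica node names)."""

    def __init__(self, locations):
        self.locations = locations

    def lookup(self, path, chunk_no):
        return self.locations[(path, chunk_no)]


class ChunkNode:
    """Holds the actual chunk bytes, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}

    def read(self, handle, start, end):
        return self.chunks[handle][start:end]


def client_read(master, nodes, path, chunk_no, start, end):
    # Steps 1 and 2: ask the master where the chunk lives.
    handle, replicas = master.lookup(path, chunk_no)
    # Steps 3 and 4: fetch the byte range from the first reachable replica.
    for name in replicas:
        node = nodes.get(name)
        if node is not None and handle in node.chunks:
            return node.read(handle, start, end)
    raise IOError("no live replica holds the chunk")


# Hypothetical setup mirroring the slide: chunk 1 is stored on C1 and C7.
nodes = {"C1": ChunkNode(), "C7": ChunkNode()}
for node in nodes.values():
    node.chunks["h1"] = b"hello chunk one"
master = Master({("/home/john/big_file", 1): ("h1", ["C1", "C7"])})

first_read = client_read(master, nodes, "/home/john/big_file", 1, 0, 5)
del nodes["C1"]  # simulate a crashed chunk node
second_read = client_read(master, nodes, "/home/john/big_file", 1, 6, 11)
```

Note that the file data never passes through the master; it only answers the lookup, which is what keeps it from becoming a bottleneck for reads.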

Write Example
- Make sure each replica contains the same data all the time
- One replica is designated to be the primary replica
- Master pings the nodes to make sure they are alive

Write Example

Master node, chunk locations for /home/john/big_file:
Chunk 1: C1, C7
Chunk 2: C3, C5
Chunk 3: C4, C6
Chunk 4: C2, C8

Steps:
1. Client application to master node: write(/home/john/big_file, chunk 1)
2. Master node to client application: (Chunk 1 handle, {C1, C7})
3. (Chunk 1 handle, data)
4. (Chunk 1 handle, offset)
5. Return status (success or failure)
6. done
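One possible reading of write steps 3 to 6 is sketched below in Python. The mapping is an assumption, since the slide's arrows are lost in the transcript: data is first pushed to every replica (step 3), then the write at a given offset is applied on the primary and forwarded to the other replicas (step 4), each replica reports a status (step 5), and the client is told the outcome (step 6). The class `Replica`, the function `client_write`, and the convention that the first replica in the list is the primary are hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffered = {}  # handle -> data pushed by the client, not yet applied
        self.chunks = {}    # handle -> bytearray holding the applied writes

    def push(self, handle, data):
        # Step 3: receive the data and hold it until told where to write it.
        self.buffered[handle] = data

    def apply(self, handle, offset):
        # Step 4: apply the buffered data at the given offset.
        data = self.buffered.pop(handle, None)
        if data is None:
            return False  # step 5: report failure
        chunk = self.chunks.setdefault(handle, bytearray())
        if len(chunk) < offset:
            chunk.extend(b"\x00" * (offset - len(chunk)))  # pad any gap
        chunk[offset:offset + len(data)] = data
        return True  # step 5: report success


def client_write(replicas, handle, data, offset):
    """replicas[0] plays the primary (assumed convention for this sketch)."""
    for replica in replicas:
        replica.push(handle, data)
    primary, secondaries = replicas[0], replicas[1:]
    ok = primary.apply(handle, offset)
    statuses = [s.apply(handle, offset) for s in secondaries]
    return ok and all(statuses)  # step 6: done


c1, c7 = Replica("C1"), Replica("C7")
first_ok = client_write([c1, c7], "h1", b"abc", 0)
second_ok = client_write([c1, c7], "h1", b"XY", 1)
```

Because every replica applies the same writes in the order chosen via the primary, `c1.chunks` and `c7.chunks` end up identical.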

Considerations
- Reads are very efficient operations
- Writes are efficient if they are appends to the end of the file
- Writes in the middle of a file can be problematic
- The primary replica decides the order in which writes are applied, so the data stays consistent across all replicas
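One way to see why appends are cheap and mid-file writes are not is to count the chunks each operation touches. The Python sketch below rests on an assumption that is not stated on the slide: inserting content in the middle shifts every later byte, so every chunk from the insertion point onward has to be rewritten. The function names and the toy sizes are hypothetical.

```python
def chunks_touched_by_append(file_size, data_size, chunk_size):
    """0-based chunk indices written when data_size bytes are appended."""
    first = file_size // chunk_size
    last = (file_size + data_size - 1) // chunk_size
    return list(range(first, last + 1))


def chunks_touched_by_insert(file_size, offset, data_size, chunk_size):
    """0-based chunk indices rewritten when data is inserted at offset,
    assuming all bytes after the offset shift by data_size."""
    first = offset // chunk_size
    last = (file_size + data_size - 1) // chunk_size
    return list(range(first, last + 1))


# Toy numbers: a 200-byte file, 64-byte chunks, a 10-byte payload.
append_chunks = chunks_touched_by_append(200, 10, 64)
insert_chunks = chunks_touched_by_insert(200, 10, 10, 64)
```

Under this model the append touches only the last chunk, while the insert near the start of the file touches all four chunks, and each touched chunk must also be updated on every replica.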

GFS vs. HDFS

                    HDFS                                   GFS
Chunk size          128 MB                                 64 MB
Default replicas    2 files (data and generation stamp)    3 chunk nodes
Master              NameNode                               GFS Master
Chunk nodes         DataNode                               Chunk Server
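The chunk sizes in the comparison translate directly into how many chunks the master has to track per file. The short Python calculation below is a sketch under two assumptions: "MB" is read as 2**20 bytes, and the 1024 MB file size is an arbitrary example rather than a figure from the slides.

```python
MB = 1024 * 1024  # assumption: "MB" read as 2**20 bytes


def chunk_count(file_size, chunk_size):
    """Number of chunks needed to hold file_size bytes (last chunk may be partial)."""
    return (file_size + chunk_size - 1) // chunk_size


file_size = 1024 * MB  # hypothetical example file
gfs_chunks = chunk_count(file_size, 64 * MB)
hdfs_chunks = chunk_count(file_size, 128 * MB)
```

For this example file the 64 MB chunk size gives 16 chunks and the 128 MB chunk size gives 8, so a larger chunk size means fewer entries for the master to keep.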

Google File System

Hadoop Distributed File System