BIG DATA WITH HADOOP EXPERIMENTATION. HARRISON CARRANZA, School of Computer Science and Mathematics




BIG DATA WITH HADOOP EXPERIMENTATION
HARRISON CARRANZA, Marist College
APARICIO CARRANZA, NYC College of Technology CUNY
ECC Conference 2016, Poughkeepsie, NY, June 12-14, 2016

AGENDA
- WHAT IS AND WHY USE HADOOP?
- HISTORY AND WHO USES HADOOP
- KEY HADOOP TECHNOLOGIES
- HADOOP STRUCTURE
- RESEARCH AND IMPLEMENTATION
- CONCLUSIONS

WHAT IS AND WHY USE HADOOP?
- Open source Apache project
- Hadoop Core includes: Distributed File System (distributes data) and Map/Reduce (distributes the application)
- Written in Java
- Runs on Linux, Mac OS/X, Windows

WHAT IS AND WHY USE HADOOP?
How do you scale up applications?
- Jobs process hundreds of terabytes of data
- That takes 11 days to read on 1 computer
Need lots of cheap computers
- Fixes the speed problem (15 minutes on 1,000 computers)
- Introduces reliability problems: in large clusters, computers fail every day; cluster size is not fixed; information on one computer gets lost
Need common infrastructure
- Must be efficient and reliable
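The 11-days-versus-15-minutes claim above can be sanity-checked with a back-of-the-envelope calculation. The data volume and disk speed below are assumptions (the slide only says "hundreds of terabytes"): ~100 TB read sequentially at ~100 MB/s per disk.

```python
# Back-of-the-envelope check of the scaling claim.
# Assumed (not stated on the slide): ~100 TB of data, ~100 MB/s
# sequential read throughput per machine.
DATA_TB = 100
DISK_MB_PER_S = 100

data_mb = DATA_TB * 1_000_000              # 100 TB expressed in MB
one_machine_s = data_mb / DISK_MB_PER_S    # seconds to read it on one disk

print(one_machine_s / 86_400)       # -> ~11.6 days on a single computer
print(one_machine_s / 1_000 / 60)   # -> ~16.7 minutes spread over 1,000 computers
```

With these assumptions the result lands close to the slide's figures of 11 days and 15 minutes, which is the whole motivation for scaling out rather than up.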

HISTORY AND WHO USES HADOOP
- Yahoo! became the primary contributor in 2006
- Contribution and adoption have increased over the decade since, and are expected to keep growing beyond 2018

HISTORY AND WHO USES HADOOP
- Yahoo! deployed large-scale science clusters in 2007
- Many Yahoo! Research papers emerged: WWW, CIKM, SIGIR, VLDB
- Yahoo! began running major production jobs in Q1 2008

HISTORY AND WHO USES HADOOP
- Amazon/A9, AOL, Facebook, Fox Interactive Media, Google, IBM, New York Times, Powerset (now Microsoft), Quantcast, Rackspace/Mailtrust, Veoh, Yahoo!
- More at http://wiki.apache.org/hadoop/poweredby

HISTORY AND WHO USES HADOOP
When you visit Facebook, you are interacting with data processed with Hadoop!

HISTORY AND WHO USES HADOOP
Hadoop-processed data powers:
- Content Optimization
- Search Index
- Ads Optimization
- Content Feed Processing

HISTORY AND WHO USES HADOOP
Nowadays:
- Facebook has ~20,000 machines running Hadoop
- The largest clusters are currently 2,000 nodes
- Several petabytes of user data (compressed, unreplicated)
- Facebook runs hundreds of thousands of jobs every month

KEY HADOOP TECHNOLOGIES
- Hadoop Core: Distributed File System and MapReduce framework
- Pig (initiated by Yahoo!): parallel programming language and runtime
- HBase (initiated by Powerset): table storage for semi-structured data
- ZooKeeper (initiated by Yahoo!): coordination of distributed systems
- Hive (initiated by Facebook): SQL-like query language and metastore
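The MapReduce model at the heart of Hadoop Core can be sketched in a few lines. The following is a pure-Python simulation of the classic word count, not the Hadoop Java API: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum all the counts emitted for one word.
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the map and reduce functions are user code, while splitting the input, shuffling, and re-running failed tasks are handled by the framework.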

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
- HDFS is designed to reliably store very large files across machines in a large cluster
- It is inspired by the Google File System
- HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size
- Blocks belonging to a file are replicated for fault tolerance
- The block size and replication factor are configurable per file
- Files in HDFS are "write once" and have strictly one writer at any time
HDFS goals: store large data sets, cope with hardware failure, emphasize streaming data access

HADOOP STRUCTURE
- Commodity hardware: Linux PCs with 4 local disks
- Typically a 2-level architecture with 40 nodes/rack
- Uplink from each rack is 8 gigabit; rack-internal is 1 gigabit all-to-all
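The rack numbers above imply heavy oversubscription of the uplink, which is worth making explicit because it motivates the locality optimizations described later: moving computation to the data is much cheaper than moving data across racks.

```python
# Oversubscription of the rack uplink, from the numbers on the slide:
# 40 nodes per rack, 1 Gb/s per node inside the rack, 8 Gb/s uplink.
NODES_PER_RACK = 40
NODE_GBPS = 1
UPLINK_GBPS = 8

intra_rack_gbps = NODES_PER_RACK * NODE_GBPS        # aggregate demand: 40 Gb/s
oversubscription = intra_rack_gbps / UPLINK_GBPS
print(oversubscription)  # -> 5.0, i.e. 5:1 oversubscription for cross-rack traffic
```

A node can read a local disk or a rack-local peer at full speed, but cross-rack traffic contends for one fifth of the aggregate bandwidth.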

HADOOP STRUCTURE
- Single namespace for the entire cluster, managed by a single namenode
- Files are single-writer and append-only
- Optimized for streaming reads of large files
- Files are broken into large blocks, typically 128 MB
- Blocks are replicated to several datanodes for reliability
- Clients talk to both the namenode and the datanodes; data is not sent through the namenode
- Throughput of the file system scales nearly linearly with the number of nodes
- Access from Java, C, or the command line
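The block and replication rules above can be sketched numerically. The 128 MB block size comes from the slide; the replication factor of 3 is an assumption here (it is the common HDFS default, and, as noted earlier, it is configurable per file).

```python
import math

# Sketch of how a file maps onto HDFS blocks.
BLOCK_MB = 128    # typical block size, per the slide
REPLICATION = 3   # assumed; common HDFS default, configurable per file

def block_layout(file_mb):
    # All blocks are full-size except possibly the last one.
    blocks = math.ceil(file_mb / BLOCK_MB)
    # Raw cluster storage consumed once every block is replicated.
    stored_mb = file_mb * REPLICATION
    return blocks, stored_mb

blocks, stored = block_layout(1000)  # a 1,000 MB file
print(blocks)   # -> 8 blocks (7 full 128 MB blocks + one 104 MB block)
print(stored)   # -> 3000 MB of raw storage across the datanodes
```

This is why the namenode's in-memory metadata grows with the number of blocks and files, not with raw bytes, a point the conclusions return to.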

HADOOP STRUCTURE
- Java and C++ APIs: Java works with objects, while C++ works with bytes
- Each task can process data sets larger than RAM
- Automatic re-execution on failure: in a large cluster, some nodes are always slow or flaky, and the framework re-executes failed tasks
- Locality optimizations: MapReduce queries HDFS for the locations of input data, and map tasks are scheduled close to the inputs when possible
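The automatic re-execution behavior can be sketched as a simple retry loop. This is a toy illustration with hypothetical names, not Hadoop's actual scheduler; the limit of 4 attempts mirrors what has been the Hadoop default for map and reduce task attempts.

```python
MAX_ATTEMPTS = 4  # assumed limit, mirroring Hadoop's default task-attempt count

def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
    # Re-execute a failed task, as the framework does when a node
    # turns out to be slow or flaky, up to a fixed attempt limit.
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue  # this attempt's node failed; re-execute elsewhere
    raise RuntimeError("task failed on every attempt")

# A task that fails twice (e.g. on flaky nodes) before succeeding:
def flaky_task(attempt):
    if attempt < 3:
        raise RuntimeError("node failure")
    return "done"

print(run_with_retries(flaky_task))  # -> done
```

Because map and reduce tasks are deterministic functions of their input splits, re-running a failed task on another node yields the same result, which is what makes this simple recovery strategy safe.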

RESEARCH & IMPLEMENTATION
Three different multi-node clusters were produced, each utilizing one of:
- Microsoft Windows 7
- Linux Ubuntu
- Mac OS X Yosemite
Hadoop functions were then compared in each of these three environments. The data used in this experiment came from Wikipedia's data dump site.


CONCLUSIONS
- When namenode memory usage is above 75%, reading and writing HDFS files through the Hadoop API can fail; this is caused by Java's HashMap
- The namenode's QPS (queries per second) can be optimized by tuning the handler count higher than the number of tasks that run in parallel
- The memory available affects the namenode's performance on each OS platform, so allocation of memory is essential
- Since memory has such an effect on performance, an operating system that uses less RAM for background and utility features yields better results
- Ubuntu Linux would be the ideal environment, since it uses the least amount of memory, followed by Mac OS X and finally Windows

THANK YOU!!!