Hadoop: Understanding the Big Data Processing Method




Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3

1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
2 Assistant Professor, Department of Computer Science & Engineering, Ideal Institute of Technology, Ghaziabad
3 Professor, Department of Computer Science & Engineering, Ideal Institute of Technology, UPTU, Lucknow, India

Abstract: Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data. Big data requires different approaches: techniques, tools, architectures, and data processing methods. The main focus of this paper is to survey the state-of-the-art techniques and technologies for Big Data processing with the help of the Big Data application Hadoop.

Keywords: Apache Hadoop, big data, Java, Google File System, Google MapReduce, open source, MapR, Oracle, distributed file system, HDFS, redundant array of inexpensive disks, replication factor, NameNode, DataNode, MapReduce, JobTracker, TaskTracker, yet another resource negotiator (YARN), Pig Latin, HiveQL, HBase

1. Introduction

Big Data has gained much attention from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that rapidly exceeds the capacity of conventional processing methods. Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At that point, predicted data production will be 44 times greater than it was in 2009.
As information is transferred and shared at light speed over optical fiber and wireless networks, both the volume of data and the speed of market growth increase. However, the fast growth rate of such large data generates numerous challenges, such as the rapid growth of data volume, transfer speed, data diversity, and security. This research direction facilitates the exploration of the domain and the development of optimal techniques to address Big Data.

Hadoop is a powerful technology, but it is just one component of the big data technology landscape. Hadoop is designed for specific data types and workloads. For example, it is a very cost-effective technology for staging large amounts of raw data, both structured and unstructured, which can then be refined and prepared for analytics. Hadoop can also help you avoid costly upgrades of existing proprietary databases and data warehouse appliances when their capacity is being consumed too quickly by raw, unused data and extract-load-transform processing.

2. Hadoop

Apache Hadoop [3] is a Java-based open source platform that makes processing of large data possible over thousands of distributed nodes. Hadoop's development resulted from the publication of two Google-authored whitepapers: the Google File System and Google MapReduce. Hadoop is a distributed file system and data processing engine designed to handle extremely high volumes of data in any structure; its focus is on supporting redundancy, distributed architectures, and parallel processing. It is an open-source software framework from Apache, inspired by GFS (the Google File System) and Google MapReduce.

Hadoop has two base components:
a) The Hadoop Distributed File System (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between
b) The MapReduce programming paradigm, for managing applications on multiple distributed servers

3. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a base component of the Hadoop framework that manages the data storage. It stores data in the form of data blocks (default size: 64 MB) on the local hard disk.

The input data size defines the block size to be used in the cluster; a block size of 128 MB is a good choice for a large file set. Keeping large block sizes means a smaller number of blocks need to be stored, thereby minimizing the memory required on the master node (commonly known as the NameNode) for storing the metadata information. Block size also impacts job execution time, and thereby cluster performance. A file can be stored in HDFS with a different block size, chosen during the upload process. For file sizes smaller than the block size, the Hadoop cluster may not perform optimally.

Paper ID: SUB152290 1620
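The memory argument for large blocks is simple ceiling-division arithmetic, which the sketch below makes concrete. This is an illustrative calculation only, not Hadoop API code; the 1 TB file size is a hypothetical example.

```java
public class BlockMath {
    // Number of HDFS blocks needed to store a file of the given size
    // (ceiling division: a partial final block still occupies one block entry).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long oneTb = 1024L * 1024 * 1024 * 1024;
        long mb = 1024L * 1024;
        // A 1 TB file needs 16384 blocks at 64 MB but only 8192 at 128 MB,
        // halving the per-file metadata the NameNode must keep in memory.
        System.out.println(blockCount(oneTb, 64 * mb));   // 16384
        System.out.println(blockCount(oneTb, 128 * mb));  // 8192
    }
}
```

Doubling the block size halves the number of block entries the NameNode must track, which is exactly why large-file workloads favor larger blocks.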

DataNode
a) A Block Server
   Stores data in the local file system (e.g. ext3)
   Stores metadata of a block (e.g. CRC)
   Serves data and metadata to clients
b) Block Report
   Periodically sends a report of all existing blocks to the NameNode
c) Facilitates Pipelining of Data
   Forwards data to other specified DataNodes

Functions of a NameNode
a) Manages the File System Namespace
   Maps a file name to a set of blocks
   Maps a block to the DataNodes where it resides
b) Cluster Configuration Management
c) Replication Engine for Blocks

NameNode Metadata
a) Metadata in Memory
   The entire metadata is kept in main memory
   No demand paging of metadata
b) Types of metadata
   List of files
   List of blocks for each file
   List of DataNodes for each block
   File attributes, e.g. creation time, replication factor
c) A Transaction Log
   Records file creations, file deletions, etc.

NameNode Failure
a) A single point of failure
b) Transaction Log stored in multiple directories
   A directory on the local file system
   A directory on a remote file system (NFS/CIFS)
c) Need to develop a real high-availability (HA) solution

Block Placement
a) Current strategy
   One replica on the local node
   Second replica on a remote rack
   Third replica on the same remote rack
   Additional replicas are randomly placed
b) Clients read from the nearest replicas
c) Would like to make this policy pluggable

Replication Engine
a) The NameNode detects DataNode failures
   Chooses new DataNodes for new replicas
   Balances disk usage
   Balances communication traffic to DataNodes

4. MapReduce

Figure: MapReduce Dataflow

MapReduce is the second base component of the Hadoop framework. It manages the data processing part on the Hadoop cluster. A cluster runs MapReduce jobs written in Java, or in any other language (Python, Ruby, R, etc.) via streaming. A single job may consist of multiple tasks. The number of tasks that can run on a data node is governed by the amount of memory installed, so it is recommended to have more memory installed on each data node for better cluster performance.
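The block placement strategy described above (local node, remote rack, same remote rack) can be sketched as a toy function. This is a simplified model only, with hypothetical node names; real HDFS additionally weighs node load and network topology, and exposes the policy for customization.

```java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    // Toy model of the default replica placement strategy:
    //   replica 1 -> the writer's local node
    //   replica 2 -> a node on a remote rack
    //   replica 3 -> a different node on that same remote rack
    // Assumes the remote rack has at least two nodes; names are hypothetical.
    static List<String> placeReplicas(String localNode, List<String> remoteRackNodes) {
        List<String> replicas = new ArrayList<>();
        replicas.add(localNode);              // replica 1: local node
        replicas.add(remoteRackNodes.get(0)); // replica 2: remote rack
        replicas.add(remoteRackNodes.get(1)); // replica 3: same remote rack
        return replicas;
    }

    public static void main(String[] args) {
        List<String> placed = placeReplicas("node-a1", List.of("node-b1", "node-b2"));
        System.out.println(placed); // [node-a1, node-b1, node-b2]
    }
}
```

Placing two replicas on one remote rack, rather than three separate racks, trades a little failure independence for less cross-rack write traffic, which is the design choice the text alludes to.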
Based on the application workload that will run on the cluster, a data node may require high compute power, high disk I/O, or additional memory.
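The map/reduce dataflow this section describes can be simulated in plain Java, without a cluster. The sketch below is illustrative only (it is not the Hadoop API): the "map" step emits a (word, 1) pair per token, and the grouping-and-summing stands in for the shuffle and reduce phases.

```java
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Word count expressed as the MapReduce dataflow: map emits (word, 1)
    // pairs; grouping by key models the shuffle; summing models reduce.
    static Map<String, Integer> wordCount(String[] lines) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (String line : lines) {                 // map phase: one call per record
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                grouped.merge(word, 1, Integer::sum); // shuffle + reduce: sum per key
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] input = { "big data big cluster", "data node" };
        System.out.println(wordCount(input)); // {big=2, cluster=1, data=2, node=1}
    }
}
```

On a real cluster the same logic runs split across many machines: map calls on the nodes holding the input blocks, and the per-key sums on the reducer nodes.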

MapReduce Features
a) Fine-grained Map and Reduce tasks
   Improved load balancing
   Faster recovery from failed tasks
b) Automatic re-execution on failure
   In a large cluster, some nodes are always slow or flaky
   The framework re-executes failed tasks
c) Locality optimizations
   With large data, bandwidth to the data is a problem
   MapReduce + HDFS is a very effective solution
   MapReduce queries HDFS for the locations of the input data
   Map tasks are scheduled close to the inputs when possible

Hadoop: A Big Data Processing Method

Learning the Hadoop/MapReduce technology involves:
a) Learning the platform (how it is designed and works): how big data are managed in a scalable, efficient way
b) Learning to write Hadoop jobs in different languages
   Programming languages: Java, C, Python
   High-level languages: Apache Pig, Hive
c) Learning advanced analytics tools on top of Hadoop
   RHadoop: statistical tools for managing big data
   Mahout: data mining and machine learning tools over big data
d) Learning state-of-the-art technology from recent research papers: optimization, indexing techniques, and other extensions to Hadoop

Figure: Hadoop Services Schematic

The job execution process is handled by the JobTracker and TaskTracker services. The type of job scheduler chosen for a cluster depends on the application workload: if the cluster will be running short jobs, the FIFO scheduler will serve the purpose; for a much more balanced cluster, one can choose to use the Fair Scheduler.

MapReduce: High-Level Nodes, Trackers, and Tasks
The master node runs a JobTracker instance, which accepts job requests from clients. TaskTracker instances run on the slave nodes, and each TaskTracker forks a separate Java process for its task instances.

Job Distribution
MapReduce programs are contained in a Java jar file plus an XML file containing serialized program configuration options. Running a MapReduce job places these files into HDFS and notifies the TaskTrackers where to retrieve the relevant program code. Where, then, is the data distribution?
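While the jar and configuration travel to the TaskTrackers, each map output key must also be routed to one of the reducers. Hadoop's default HashPartitioner does this with the expression (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the sketch below mirrors that idea in plain Java, outside the Hadoop API.

```java
public class PartitionSketch {
    // Mirrors the default hash partitioning: mask off the sign bit so the
    // result is non-negative, then take the remainder by the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] { "hadoop", "hdfs", "mapreduce" }) {
            // Every occurrence of a given key maps to the same reducer,
            // which is what makes reduce-side grouping correct.
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Because the partition function is deterministic, all (key, value) pairs for one key, no matter which mapper emitted them, arrive at the same reduce task.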
If the node running the JobTracker service fails, all jobs submitted to the cluster are discontinued and must be resubmitted once the node is recovered or another node is made available.

Data Distribution
a) Implicit in the design of MapReduce: all mappers are equivalent, so each maps whatever data is local to its particular node in HDFS
b) If lots of data does happen to pile up on the same node, nearby nodes will map it instead

Figure: Partition and Shuffle

Example Program: WordCount
a) map(): receives a chunk of text and outputs a set of (word, count) pairs
b) reduce(): receives a key and all its associated values, and outputs the key and the sum of the values

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        // WordCount map(): emit (word, 1) for every token in the input line
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // WordCount reduce(): sum the counts collected for each word
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        // WordCount main(): configure and submit the job
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

Using the Hadoop Distributed File System (HDFS)
HDFS can be accessed through various shell commands (see the documentation for the full list), for example:
a) hadoop fs -put <localsrc> <dst>
b) hadoop fs -get <src> <localdst>
c) hadoop fs -ls
d) hadoop fs -rm <file>

Running a Hadoop Job
a) Place the input file into HDFS:
       hadoop fs -put ./input-file input-file
b) Run either the normal or the streaming version:
       hadoop jar Wordcount.jar org.myorg.WordCount input-file output-file

       hadoop jar hadoop-streaming.jar \
           -input input-file \
           -output output-file \
           -file Streaming_Mapper.py \
           -mapper "python Streaming_Mapper.py" \
           -file Streaming_Reducer.py \
           -reducer "python Streaming_Reducer.py"

Related Apache Sub-Projects: The Hadoop Ecosystem
The Hadoop base components are surrounded by numerous ecosystem projects. Many of them are already top-level Apache (open source) projects; some, however, remain in the incubation stage. A few of the most common ecosystem components are highlighted here:
Pig: a high-level language for data analysis
HBase: table storage for semi-structured data

Zookeeper: coordination for distributed applications
Hive: an SQL-like query language and metastore
Mahout: machine learning

International Journal of Science and Research (IJSR)

5. Conclusion

This paper presents one of the Big Data processing methods, called Hadoop. The discussion covers the increase in data, the progressive demand for data processing, and the role of Big Data in the current environment of enterprise and technology. To enhance the efficiency of data processing, we have described the Hadoop method, which builds on the technologies and terminologies of Big Data. The stages in this technology include the basics, HDFS, MapReduce, and the related Apache sub-projects; all of these stages are important for the big data processing method, i.e., Hadoop. Information is increasing at an exponential rate, but information processing methods are improving relatively slowly. Currently, only a limited number of tools are available to completely address the issues in Big Data analysis. The state-of-the-art techniques and technologies in many important Big Data applications (i.e., Hadoop, HBase, and Cassandra) cannot yet solve the real problems of storage.

References

[1] Big Data Technology
[2] Harnessing-Hadoop
[3] K. Bakshi, "Considerations for Big Data: Architecture and Approach," paper published in IEEE
[4] S. Singh, "Big Data Analytics," in 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Oct. 19-20, Mumbai, India
[5] McKinsey Global Institute
[6] http://hadoop.apache.org/
[7] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, "Big data: issues and challenges moving forward," in Proceedings of the IEEE 46th Annual Hawaii International Conference on System Sciences (HICSS '13), pp. 995-1004, January 2013.