
White Paper

Big Data and Hadoop

Abhishek S, Java COE, Marlabs

Table of Contents

Abstract
Introduction
What is Big Data?
Introduction to Hadoop
    What is Hadoop?
    What is a Cluster?
    What is a Grid?
What makes Hadoop a paramount tool
    Hadoop Kernel
    MapReduce
    High Level Working of MapReduce
    Hadoop Distributed File System
    HDFS Architecture
Pros and Cons
Conclusion
References

Abstract

This paper explains the significance of Hadoop, a young and still-emerging technology. Its prime goal is to unveil the potential of the Hadoop framework to business analysts who frequently face problems in managing Big Data, with a focus on how the framework can fit the bill when confronting those problems.

Introduction

The world is moving towards cloud computing, a technological era that has only just begun. Have you ever pondered why the word "cloud" was adopted as a term in information technology? In cloud computing, "cloud" is a metaphor for the internet: cloud computing is a kind of internet-based computing in which a wide variety of services, such as storage, servers and applications, are offered to enterprises and individuals over the internet. Typically, cloud computing draws on multiple pooled computing resources rather than local servers or dedicated devices to handle complex applications. This mechanism is extremely beneficial because unused or idle computers in the network can be harnessed to solve problems too intensive for any standalone computer.

Over the past few years, several designs, prototypes and methodologies have been developed to tackle parallel computing problems, and specially designed servers were tailored to meet parallel computing requirements. The major problem was that these servers were too expensive to operate and yet did not produce the expected results. With the advent of multi-core processors and virtualization technology, these problems are diminishing, and effective, powerful tools have been built to achieve parallelization on commodity machines. One such tool is Hadoop.

Cloud computing becomes obligatory when the analysis of data cannot be performed by a single computer. Typically, when a thorough analysis is done on massive data, multiple computers are required to balance the data load and the analysis. This massive data is sometimes referred to as Big Data.

What is Big Data?

Data is one of the key components of business analytics. It is ubiquitous in every field, helping to forecast, vet, transact and consolidate any given business analytical problem; it also serves as a fail-safe by preserving the history of every event that occurs during the course of development. In today's competitive business world, the demand for data is increasing exponentially: the magnitude and variety of data available to enterprises, and the need to analyze that data in real time for maximum business benefit, are growing rapidly. With the advent of social media and networking services like Facebook and Twitter, search engines like Google, MSN and Yahoo, and numerous e-commerce and online banking services, data is proliferating at lightning speed, and much of it is unstructured or semi-structured. This is what we call Big Data.

Big Data is measured in terabytes, petabytes, exabytes and sometimes even more. Processing, managing and analyzing data of this magnitude is highly strenuous and has been a daunting task for business analysts over the years. Traditional data management and analytical tools struggle to process such large volumes under the unprecedented weight of Big Data. Hence, new approaches are emerging to cope with the problem and help enterprises gain maximum value from Big Data. One such approach is an open source framework called Hadoop.

Introduction to Hadoop

What is Hadoop?

Hadoop is an open source software framework of the Apache Software Foundation, built to support data-intensive applications running on large clusters and grids and to offer scalable, reliable, distributed computing. The Apache Hadoop framework is designed for the distributed processing of large data sets residing on clusters of computers, using a simple programming paradigm. It can scale from a single server to tens of thousands of computers, with each computer responsible for local computation and storage. The framework also detects and handles node failures at the application layer, thereby offering high availability of the service. Since a Hadoop deployment involves more than a single machine, it is important to understand what clusters and grids are; both work in a similar fashion, with a subtle difference in their setup.

What is a Cluster?

A cluster is a group of computers (nodes) with the same hardware configuration, connected to each other through a fast local area network, where each node performs a desired task. The results from all the nodes are aggregated to solve a problem that usually requires high availability of the system with low latency.

What is a Grid?

A grid is similar to a cluster, with the subtle difference that its nodes are distributed across different geographical locations and connected to each other through the internet. In addition, each node in a grid can run a different operating system on different hardware.

What makes Hadoop a paramount tool

In a distributed environment, data must be meticulously arranged across several computers to avoid inconsistency and redundancy in the results for a given problem. Moreover, the processing of this data must be done judiciously to achieve low latency. Several factors influence the speed of operation in a distributed computing system: the way data is stored, the storage algorithm that manages the distributed data, the parallel computing algorithm that processes it, the fault tolerance checks on each node, and so on.

In a nutshell, Apache Hadoop uses a programming model called MapReduce for processing and generating large data sets with parallel computing. The MapReduce programming model was introduced by Google to support distributed computing on large data sets across clusters of computers. Inspired by Google's work, Apache came up with an open source framework for distributed computing and named it Hadoop. Hadoop is written in Java, which makes it platform independent and hence easy to install and use on any commodity machine with Java installed. This eliminates the need for the heavyweight hardware that otherwise struggles to process big data.

The Apache Hadoop framework consists of three major modules:

- Hadoop Kernel
- MapReduce
- Hadoop Distributed File System

Hadoop Kernel

The Hadoop Kernel, also known as Hadoop Common, provides an efficient way to access the file systems supported by Hadoop. This common package consists of the Java Archive (JAR) files and scripts required to start Hadoop.

MapReduce

MapReduce is a programming model implemented primarily for processing large data sets. It was originally developed by Sanjay Ghemawat and Jeffrey Dean at Google. In a nutshell, the MapReduce model takes a big task and divides it into discrete tasks that can be done in parallel. At its crux, MapReduce is a composite of two functions, map and reduce.

The map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function merges all intermediate values associated with the same intermediate key. The map function is applied in parallel to every key/value pair in the data set, producing a list of pairs for each call. The MapReduce framework then collects all pairs with the same key from all the lists and groups them together, creating one group per unique key. The reduce function is applied in parallel to each group to produce a collection of values in the same domain; each reduce call can produce either one value or an empty result.

Consider the following pseudo code for counting the number of occurrences of each word in a large collection of documents:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += v;
        Emit(result);

That is:

    map (k1, v1) -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v3)

Typically, many map and reduce calls execute in parallel for maximum throughput, each running as a separate task. The intermediate output of the map phase is stored on local disk, and the final aggregated result of the reduce phase is written to the distributed file system.
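For concreteness, here is roughly how the same word count looks against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce package). This is a minimal sketch modeled on the well-known WordCount example that ships with Hadoop; the input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts collected for each unique word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Setting the reducer class as the combiner is an optional optimization: it applies the summing logic to each mapper's local output, shrinking the intermediate data before it crosses the network.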

High Level Working of MapReduce

[Figure: an application feeds input data to map task nodes; a master node coordinates the map and reduce task nodes; the reduce task nodes produce the application's output data.]

Hadoop Distributed File System (HDFS)

HDFS is a subproject of the Apache Hadoop project. Hadoop uses HDFS to achieve high-throughput data access. HDFS is built in Java and runs on top of the local file system. It was designed to read, write and process large data files ranging in size from terabytes to petabytes; an ideal file size is a multiple of 64 MB, the default HDFS block size. HDFS stores large files across multiple commodity machines: using HDFS, you can access and store large data files split across multiple computers as if you were accessing or storing local files. High reliability is achieved by replicating the data across multiple nodes, which removes the need for expensive hardware infrastructure such as RAID storage on the nodes. The default replication factor is 3, so each block of data is stored on three nodes.
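As a sketch of how this looks in client code, the snippet below copies a local file into HDFS and reads it back through Hadoop's FileSystem API. The file paths are hypothetical, and the NameNode address is assumed to come from the core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsFileExample {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths, for illustration only.
        Path local = new Path("/tmp/access.log");
        Path inHdfs = new Path("/user/analyst/access.log");

        // Copy the local file into HDFS; its blocks are replicated automatically.
        fs.copyFromLocalFile(local, inHdfs);

        // Read it back as an ordinary input stream, as if it were a local file.
        FSDataInputStream in = fs.open(inHdfs);
        try {
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }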

One of the advantages of HDFS is the data awareness shared between the JobTracker and the TaskTrackers: the JobTracker schedules map and reduce jobs on TaskTrackers with an awareness of where the data is located. For example, assume node A contains data (a, b, c, d) and node B contains data (x, y, z). The JobTracker will schedule node A to perform map/reduce tasks on (a, b, c, d), and node B will be scheduled to perform map/reduce tasks on (x, y, z). This greatly reduces the traffic that goes over the network and prevents unnecessary data transfer: bringing the data to the place where the map function resides is more expensive and time-consuming than letting the map function execute where the data resides. Few other file systems offer this advantage.

Hadoop uses several types of nodes to form a reliable cluster. The NameNode is the centerpiece of the HDFS file system. Its main job is to maintain the directory tree of all the files in the file system and to track where, across the cluster, each file's data is stored; it does not store the data of these files itself. Applications interact with the NameNode to create, copy, move and delete files in HDFS. A DataNode, in turn, stores the actual data. On startup, a DataNode connects to the NameNode and waits until the service is up and running; it then responds to requests from the NameNode for file system operations. Applications can talk to a DataNode directly once the NameNode has provided the location of the data.

Map/reduce tasks are performed by a TaskTracker node near a DataNode. An important performance tuning is to deploy the TaskTracker instance on the same server as the DataNode instance, which allows MapReduce operations to be performed close to the data. The HDFS file system typically communicates over the TCP/IP layer.
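The block locations that the scheduler relies on are visible to ordinary client code as well. As a small illustrative sketch (with a hypothetical file path), the snippet below asks the NameNode which hosts hold each block of a file; this is the same information the JobTracker uses to place tasks near the data.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/analyst/access.log"); // hypothetical path

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("block at offset " + block.getOffset()
              + " is stored on: " + String.join(", ", block.getHosts()));
        }
      }
    }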

HDFS Architecture

[Figure: an HDFS cluster — a single NameNode server coordinating several DataNode servers, each with its own local disks.]

Pros & Cons

Pros

Distributed data and computation: Data is split into multiple chunks that are distributed across several computers, and computation on each chunk is done by the node where that chunk resides. This prevents network overhead and unwanted traffic.

Easy to use and cost effective: A distributed environment is easy to set up and can run on cheap commodity hardware, whether single core, dual core or multi-core; Hadoop can even run on modern tablets or smartphones that support Linux and Java.

Reliability: Through HDFS, multiple copies of the data are maintained automatically on several nodes in the cluster, thereby ensuring the data survives any node failure.

Fault tolerance: The Hadoop framework has a powerful fault-tolerance engine that detects faults in a node in the event of failure, automatically applies the fix and recovers quickly.

Processing of huge data: HDFS was designed to manage huge files, unlike a traditional RDBMS, which runs into technical limits when processing data beyond roughly 10 TB. HDFS also offers high throughput when reading and writing such files.

Cons

Hadoop is not a substitute for a database: Databases make it easy to access, update and delete information simply by issuing SQL statements, a worldwide standard, and they offer indexing so that frequently used data can be retrieved quickly with excellent response times. To retrieve information from HDFS, one must write a MapReduce program that runs through all the data. To be fair, Hadoop is not trying to solve the database problem.

MapReduce is not always the best solution: MapReduce is a programming model for parallel computing in which each concurrently running task is independent of the others. It is designed for very large scale data mining, where traditional databases hit a physical limit, and is not suited to small-scale data mining: Hadoop adds processing overheads that are negligible for data of unimaginable size but lead to poor response times for data of manageable size. These overheads are inevitable due to its internal design.

Designed for Linux: Hadoop works efficiently on Linux platforms. On Windows, separate tools such as virtual machines or Linux emulators like Cygwin are required to simulate the Linux environment, and these add their own overheads, so the full performance benefit is never realized.

Cumbersome data migration: For maximum data throughput, the data in question should be stored in HDFS. Files intended to be processed by Hadoop therefore need to be migrated from the local file system to HDFS, which does not happen automatically and must be done manually. Moreover, the migration must be repeated whenever new data is added to the system, making it cumbersome to handle.

Conclusion

The Apache Hadoop framework is booming in the world of cloud computing and has been embraced by many enterprises for its simplicity, scalability and reliability in confronting Big Data problems. It demonstrates that even commodity desktop PCs can be used efficiently to compute complex, massive data by forming a cluster of PCs, which minimizes CPU idle time and judiciously delegates tasks to the processors, making the approach cost effective. It is only natural that, no matter how big the data is or how fast it grows, users will always crave fast data retrieval, with little thought for how complex the data's arrangement is or how difficult it is to process. The bottom line is that the speed and accuracy of analysis should not be compromised, irrespective of data size. Hence, cloud computing is the better solution to satisfy non-functional requirements like scalability and performance.

References

http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/
http://wiki.apache.org/hadoop/
http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html
http://ayende.com/blog/4435/map-reduce-a-visual-explanation

About Marlabs

Marlabs is a US-headquartered, award-winning provider of innovative Information Technology and Knowledge Process Outsourcing services. Founded in 1996 and headquartered in Piscataway, New Jersey (USA), Marlabs has a culture of success balanced between consistent year-on-year revenue growth and excelling employees. Marlabs has a strong, dedicated workforce of over 2,100 and a network of delivery centers in the USA, Canada, Mexico and India, and follows a unique multishore model utilizing global technology Centers of Excellence. Marlabs has assisted hundreds of blue-chip customers across different verticals in achieving success through both business and operational excellence.

For more information, please call us at 1-732-694-1000 or email us at sales@marlabs.com.