A Brief Outline on Bigdata Hadoop


Twinkle Gupta 1, Shruti Dixit 2
RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India

Abstract- Bigdata is a collection of large data sets. The challenges in handling bigdata include analysis, curation, sharing, capture, search, storage, transfer, visualization and privacy violation. To overcome these challenges, we use a technique known as Hadoop. Hadoop stores massive data and processes it in parallel. Hadoop has two core components, HDFS and MapReduce, which are described in the second section. In this paper, we also discuss other components of Hadoop such as HBase, Hive, Pig, Sqoop, Zookeeper, Avro and Oozie, along with the pros and cons of this technique.

Keywords- Bigdata, Hadoop, HDFS, Hive, MapReduce.

I. INTRODUCTION

A. Bigdata: Nowadays, technology is growing rapidly. New devices such as smartphones, tablets, cameras and microphones, social media such as Twitter, Facebook and LinkedIn, GPS, and aerial (remote) sensing generate large amounts of data, measured in terabytes and petabytes, and this data constitutes bigdata. The growth in data can be observed by taking the 3Vs as parameters:

Volume - Volume represents the amount of data. The data produced is structured, semi-structured or unstructured. A large amount of data is produced through social media. By 2020, it is estimated that there will be 50 times more data than today.

Variety - Data is produced in a large variety of forms such as text, images, videos and log files. This variety makes the data more complex to handle.

Velocity - Data is growing at very high speed. One minute becomes too late because tasks are performed in milliseconds.

B. Hadoop: Hadoop was created by Doug Cutting and Mike Cafarella. Hadoop is an open source software framework [4].

Open source software- Hadoop is free to download and use, and anyone can create and manage programs with it.
Framework- Hadoop provides the toolset and connections needed to develop and run software applications.

Hadoop distributes a file in small chunks over thousands of nodes and processes the data in parallel. It stores massive data and replicates it. Bigdata is analyzed, handled, operated on and utilized by Hadoop.

II. HADOOP ARCHITECTURE

Hadoop has two core components, HDFS (Hadoop Distributed File System) and MapReduce, which can run on different operating systems such as Windows, Mac OS/X, Solaris and Linux and require only commodity hardware [2]. Hadoop is designed to run on clusters of machines.
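The chunking and replication described above can be sketched roughly as follows. This is a minimal illustration in Python, not Hadoop's implementation: the function names `split_into_blocks` and `place_replicas`, the tiny block size and the round-robin placement are all assumptions made for the example. HDFS's real placement policy also takes rack location into account.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement rule for illustration only; it ignores racks,
    free space and node load.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Start at a different node for each block so load spreads evenly.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    layout = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
    for block_id, holders in layout.items():
        print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing one node in this sketch still leaves at least two copies of each block, which is the intuition behind replication as a fault-tolerance mechanism.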

A. HDFS

HDFS is a Java-based file system with a master/slave architecture. Every cluster has a master node (the name node), many slave nodes (the data nodes) and a secondary name node. A well-suited file system should provide location awareness: the name of the rack (network switch) where the worker nodes (data nodes) are placed. HDFS provides a write-once-read-many access model [3]: a file that has been created, written and closed need not be changed. To handle large datasets, HDFS offers built-in replication.

Name Node and Data Nodes

The name node holds metadata about all the data nodes. It manages the file system namespace, which allows users to store data in files. These files are split into blocks that are stored on the data nodes as instructed by the name node [10]. The default block size is 64 MB. Clients ask the name node to perform operations such as renaming, creation and deletion. The name node does not store HDFS data itself; it maps each HDFS file name to the list of blocks in the file and to the data nodes on which those blocks are stored. The data nodes perform these operations at the request of the name node and replicate the blocks. By default the replication factor is 3; a larger replication factor improves the read performance and fault tolerance of the cluster [7]. Each data node reports to the name node by sending heartbeat messages, which let the name node detect whether the data node is working properly [11].

The secondary name node acts like a data backup. It is recommended to keep the secondary name node's configuration the same as the name node's, so that it can be promoted to name node if the name node fails [7].

B. MapReduce

MapReduce is the second base component of the Hadoop framework. It is the software that provides a flexible programming model for writing applications that process data in parallel. MapReduce jobs can be written in Java or in other languages, which makes it a developer-friendly framework. MapReduce works on top of HDFS: it takes its input data from HDFS and writes the results back to HDFS after execution.
In the MapReduce model, computation is divided into two user-defined functions, the Map function and the Reduce function. The Map function takes a key/value pair as input and produces one or more intermediate key/value pairs. These intermediate key/value pairs are the input to the Reduce function, which merges all values corresponding to a single key [2]. Here, the value is a data record and the key is generally the offset of that record.
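As a rough illustration of the two user-defined functions, the classic word-count job can be simulated in a single Python process. This is only a sketch under stated assumptions: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, the grouping step stands in for the framework, and nothing here uses Hadoop's actual API.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: (offset, record) -> intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all values that share one key."""
    return word, sum(counts)


def run_job(lines):
    """Simulate the framework: call map, group by key, then call reduce."""
    groups = defaultdict(list)
    offset = 0
    for line in lines:
        for key, value in map_fn(offset, line):
            groups[key].append(value)
        offset += len(line) + 1  # key of the next record: its offset in the input
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["big data", "big hadoop"]))
```

The map function ignores its key here, which is common when the key is just the record offset; the reduce function sees each word once, together with every count emitted for it.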

Figure 2: MapReduce

A MapReduce cluster consists of a job tracker and many task trackers. The job tracker is the software program on the master node that coordinates and manages jobs and deals with fault tolerance. Every cluster has a job tracker. Job execution starts when the client program submits a job configuration to the job tracker; the configuration specifies the map, combine and reduce functions as well as the input and output paths of the data. The job tracker monitors the jobs assigned to the task trackers, and the task trackers report their status to the job tracker. A task tracker is responsible for launching parallel tasks and dividing the data among computing slots. It manages both map tasks and reduce tasks [5].

A MapReduce job is like a pipeline with many stages [11]:

1. Map input - Source data is read into the map task.
2. Map computation - The map computation takes place.
3. Partition and sort on the map side - This ensures that records are spilled to disk only once.
4. Map output - Hadoop background daemons merge the sorted partitions to disk.
5. Reduce input - The map output is read and copied to the reduce task.
6. Merge and sort on the reduce side - The data is merged and sorted.
7. Reduce computation - The reduce code is executed.
8. Reduce output - This stage writes the output to HDFS.

Each stage requires different types of resources. For efficient output, we must ensure that the pipeline is clear throughout.
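The partition-and-sort stage of the pipeline above can be sketched as follows. This is a simplified, single-process illustration: the names `partition` and `shuffle` are hypothetical, and hashing the key with CRC32 modulo the number of reducers is an assumption chosen so the example is deterministic, not a description of Hadoop's internals.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key: a stable hash of the key, modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Route each (key, value) pair to its partition, then sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=lambda pair: pair[0])
    return partitions


if __name__ == "__main__":
    pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
    for reducer_id, part in enumerate(shuffle(pairs, num_reducers=2)):
        print(reducer_id, part)
```

Because the partition depends only on the key, every value for a given key reaches the same reducer, and sorting within a partition places equal keys next to each other so the reduce step can merge them in one pass.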

HBase, Hive, Pig, Zookeeper, Sqoop, Avro and Oozie are some of the components of Hadoop; they are explained below.

Figure 3: Components of HADOOP

HBase- HBase works well with sparse data, that is, data in which most entries are unimportant or empty and only a few percent matter. It is an open source, non-relational, distributed, column-oriented database management system. It is made for random, real-time read/write access to big data. An HBase system comprises a set of tables, and these tables can serve as the input and output of MapReduce jobs running in Hadoop. It runs on top of HDFS. It does not support a structured query language like SQL [8].

Hive- Hive is open source data warehouse software that supports the analysis of large datasets. Hive provides an SQL-like query language called HiveQL (Hive Query Language). HiveQL queries are automatically converted into MapReduce jobs [6]. Hive is used by many companies such as Facebook, Netflix and Amazon.

Pig- Pig provides a high-level, SQL-like scripting language known as PigLatin. Pig is used to analyze large amounts of data, and MapReduce jobs can be run on this platform. User-defined functions for Pig can be written in languages such as Java, Python, Ruby, JavaScript or Groovy. Pig can work on both structured and unstructured data [11].

Zookeeper- Zookeeper is an application that provides centralized management, synchronization, configuration management, naming and group services across a cluster. Zookeeper is a distributed and highly reliable service [9].

Sqoop- Sqoop is a tool for transferring data between Hadoop and non-Hadoop or external structured data stores such as relational databases and data warehouses. It transfers the data in parallel, which allows the data to be analyzed more efficiently [9].

Avro- Avro provides a data serialization system and data exchange services [1]. No code generation is required to read and write data files. It is compact and fast, and data is stored in a binary format [7].

Oozie- Oozie is a Java web application whose workflows are collections of actions [1]. It has a scheduler system that schedules jobs on Hadoop [10].

III. ADVANTAGES OF HADOOP

Open Source Software- Hadoop is a platform on which developers can create and manage programs. It is free to download and use.

Framework- Hadoop provides the toolsets, connections and programs that one needs to run or develop a software application.

Distributed- Hadoop distributes the data across multiple nodes, which helps to process the data in parallel.

Scalability- Hadoop's capacity can be increased by adding more nodes.

Fault tolerant and inherent data protection- When a system fails, its jobs are redirected to another system and processing continues. Hadoop protects the data by making multiple copies of blocks.

Massive Storage- Hadoop can store huge amounts of data by dividing files into blocks.

Low Cost- Hadoop is a free, open source framework and requires only commodity hardware.

IV. DISADVANTAGES OF HADOOP

Data is not well secured in Hadoop. Hadoop has no encryption at the storage and network levels, so it is not suitable for government agencies and others that prefer to keep their data under wraps.

The two base components of Hadoop, HDFS and MapReduce, are still rough because they are under active development.

Managing Hadoop clusters is hard; many operations such as debugging and distributing software are difficult.

Hadoop is not suitable for small data, because HDFS is unable to support random reading of small files.

The entire framework is developed mostly in Java, a language that has been exploited by cyber criminals.

Of the other components, such as Hive, Sqoop, HBase, Zookeeper, Oozie, Avro and Pig, most are Apache top-level projects.

V. CONCLUSION

In this paper, we studied a technique named Hadoop, which is used to manage Big Data. Because a huge amount of data exists in the market without adequate tools to maintain it, Hadoop can be a better choice. Hadoop is scalable, flexible, reliable, fast and portable, and can be implemented on low-cost hardware. We also explained the Hadoop architecture, which consists of two core components, HDFS and MapReduce: HDFS stores huge amounts of data and MapReduce processes the data in parallel.

REFERENCES

[1] Ms. Vibhavari Chavan, Prof. Rajesh N. Phursule, Survey Paper on Big Data, International Journal of Computer Science and Information Technology, volume 5(6).
[2] Jeffrey Shafer, Scott Rixner, and Alan L. Cox, The Hadoop Distributed Filesystem: Balancing Portability and Performance, Rice University, Houston, TX.
[3] Hadoop, available at: http//hadoop.apache.org./...hdfs design, hadooptutorial.wikispaces.com/hadoop+architecture and en.wikipedia.org/wiki/apache_hadoop
[4]
[5] Poonam S. Patil, Rajesh N. Phursule, Survey Paper on Big Data Processing and Hadoop Components, International Journal of Science and Research.
[6] Sabia and Love Arora, Technologies to Handle Big Data: A Survey, Department of Computer Science and Engineering, GuruNanak Dev College University, Regional Campus, Jalandhar, India.
[7] Cognizant insights, Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizing Analytical Workloads.
[8] HBase, available at: Wikipedia.org/wiki/Apache_HBase and hortonworks.com/hadoop/hbase/
[9] wiki.apache.org/hadoop/zookeeper,
[10]
[11] Chen He, Derek Weitzel, David Swanson, Ying Lu, HOG: Distributed Hadoop MapReduce on the Grid, Computer Science and Engineering, University of Nebraska Lincoln.

About Authors

Shruti Dixit is pursuing her BE in Computer Science and Engineering at Acropolis Institute of Technology and Research, Indore.

Twinkle Gupta is pursuing her BE in Computer Science and Engineering at Acropolis Institute of Technology and Research, Indore.