Alternatives to HIVE SQL in Hadoop File Structure




Ms. Arpana Chaturvedi, Ms. Poonam Verma

ABSTRACT
Trends rise and fall, and at present social networking sites are in vogue. The data available from such enterprises and from research organizations runs into petabytes, resulting in an information explosion. This data, however, is raw and unstructured, and the main reason for collecting it is to exploit it for profitable purposes. The data is structured with Hadoop, a free Java programming framework that restructures it appropriately and provides the base for data mining processes. The framework is applicable to very large datasets. Hadoop's approach relies on distributed file systems, which in this information era have helped reduce the catastrophic failures caused by inoperative nodes in the system. In this article we discuss the Hadoop framework and the challenges it faces. Over the past few years there has been considerable research on these challenges, and many alternatives have been proposed; these are discussed here as well.

Keywords: HDFS, HIVE, SQL queries

1. INTRODUCTION
Humans are the species with supreme intellect, and our modes of communication have added to our capacity to act as an intelligent species. With the advent of social networking sites, huge amounts of information began accumulating on a daily basis; research firms and army agencies have also contributed to these huge datasets. The result is an information explosion. If agencies are unable to exploit this data for their business purposes, the effort of collecting it is wasted. Making use of this big data, a huge amount of unstructured and complicated data, requires specific techniques: when data stored in a warehouse is not retrieved at the right moment, business models stagnate. The root of the problem, therefore, is that big data needs to be structured, and this is carried out with the help of the Hadoop framework.

HDFS File System
HDFS is a distributed file system with many similarities to existing distributed file systems; unlike most of them, however, HDFS is highly fault tolerant and can be deployed on low-cost hardware. HDFS is best suited to applications with large datasets and performs best with large files, providing high-throughput access to the applications that use them.

Architecture of HDFS
HDFS follows a master/slave architecture. An HDFS cluster has a single NameNode, a master server, and a number of DataNodes, usually one per node in the cluster. The basic task of the NameNode is to manage the file system namespace and regulate access to files by clients; the DataNodes manage the storage attached to the nodes they run on. HDFS exposes a file system namespace in which user data is stored in files. A file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes file system operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from the file system's clients, and they perform block creation, deletion, and replication upon instruction from the NameNode.
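As an illustration of the NameNode's role as the keeper of the block-to-DataNode mapping, the following minimal Java sketch (not from the original paper) asks the file system for the block locations of a file through the standard org.apache.hadoop.fs client API. The cluster URI hdfs://namenode:8020 and the path /data/page_views.log are hypothetical placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockMap {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally taken from core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            FileStatus status = fs.getFileStatus(new Path("/data/page_views.log"));
            // A pure metadata query: the NameNode answers it, and no file
            // data is transferred.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                        + ", hosts " + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }

Each host printed for a block is a DataNode holding a replica of that block.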
The NameNode and the DataNode are pieces of software designed to run on commodity machines, typically with a GNU/Linux-based operating system. Because HDFS is built in Java, it is platform independent and compatible with almost any variety of machine; each machine in the cluster runs the DataNode software. Since the HDFS architecture uses a single NameNode, the NameNode is the repository of all metadata in the cluster, and the system is designed so that user data never flows through the NameNode. HDFS caters to a large set of users, and file data is expected to be accessed in a write-once, read-many pattern; if a large number of clients could modify data concurrently, the metadata would become desynchronized, so all metadata is handled by a single machine, the NameNode. The NameNode keeps this metadata, such as filenames, permissions, and the locations of each block of each file, in main memory, so that it can be accessed quickly. To open a file, a client contacts the NameNode, which returns a list of locations holding the blocks of the file. These locations identify the DataNodes that can serve the file data directly, sometimes concurrently.
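The read path just described can be made concrete with a short, hedged Java sketch (again with a hypothetical URI and path): fs.open() performs the metadata lookup at the NameNode, and the bytes are then streamed directly from the DataNodes that hold the blocks.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadPath {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"),
                    new Configuration());
            // open() contacts the NameNode for block locations only; the copy
            // below streams the data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(new Path("/data/page_views.log"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }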

The NameNode is not involved in these data transfers, which keeps its overhead low. The NameNode nevertheless needs a backup plan: if the NameNode machine fails, there are mechanisms that allow the NameNode's file system metadata to be preserved, so even after a NameNode failure the data is still recoverable. The structure of HDFS can be compared to a star topology in which the NameNode is the hub and the DataNodes are the computers attached to it. If an individual DataNode crashes, the network as a whole continues to operate; if the NameNode fails, however, the cluster remains inaccessible until it is manually restored.

Features
1) Hardware failure
2) Streaming data access
3) Large data sets
4) Simple coherency model
5) Moving computation is cheaper than moving data
6) Portability across heterogeneous hardware and software platforms

2. HARDWARE FAILURE
An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, some of them will be non-functional at any point in time, so HDFS is designed to detect faults and recover from them quickly and automatically. (A minimal sketch of the replication mechanism that makes this recovery possible appears after these feature descriptions.)

3. STREAMING DATA ACCESS
Applications that run on HDFS need streaming access to their data sets. HDFS relaxes a few POSIX requirements in order to emphasize high throughput of data access rather than low latency of data access.

4. LARGE DATA SETS
Applications that run on HDFS have large data sets, gigabytes to terabytes in size. HDFS provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and should support tens of millions of files in a single instance.

5. SIMPLE COHERENCY MODEL
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler fits this model perfectly. There is a plan to support appending writes to files in the future.

6. MOVING COMPUTATION IS CHEAPER THAN MOVING DATA
HDFS provides interfaces for applications to move themselves closer to where the data is located. When the size of the data is huge, a computation requested by an application is much more efficient if it is executed near the data it operates on; this minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located than to move the data to where the application is running.

7. PORTABILITY ACROSS HETEROGENEOUS HARDWARE AND SOFTWARE PLATFORMS
HDFS has been designed to be easily portable from one platform to another, which facilitates its widespread adoption as a platform of choice for a large set of applications.
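The sketch promised under Hardware Failure: replication is the mechanism that lets HDFS survive the loss of DataNodes, and a writer can request an explicit replication factor per file. This is a minimal, hedged example, not anything prescribed by the paper; the path, the replication factor of 3, and the 128 MB block size are illustrative choices (older Hadoop releases defaulted to 64 MB blocks).

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteReplicated {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"),
                    new Configuration());
            short replication = 3;                // survive the loss of two replicas
            long blockSize = 128L * 1024 * 1024;  // illustrative 128 MB block size
            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(new Path("/data/events.log"),
                    true, 4096, replication, blockSize)) {
                out.writeBytes("hello hdfs\n");
            }
            fs.close();
        }
    }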

HDFS is designed to store a very large amount of information (terabytes or petabytes), which requires spreading the data across a large number of machines; it also supports much larger file sizes than NFS. HDFS should store data reliably: if individual machines in the cluster malfunction, data should remain safe. HDFS should provide fast, scalable access to this information, so that a larger number of clients can be served simply by adding more machines to the cluster. Finally, HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally whenever possible. A number of additional decisions and trade-offs were made in HDFS. In particular:

Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized for streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.

Data is written to HDFS once and then read several times; updates to files after they have been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled for inclusion in Hadoop 0.19 but is not available yet.)

Due to the large size of files and the sequential nature of reads, the system does not provide a mechanism for local caching of data: the overhead of caching is great enough that data should simply be re-read from the HDFS source.

Individual machines are assumed to fail frequently, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many at the same time (for example, if a whole rack fails). While performance may degrade in proportion to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.

HIVE Overview
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware. Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides a simple query language called HiveQL, based on SQL, which enables users familiar with SQL to do ad-hoc querying, summarization, and data analysis easily. At the same time, HiveQL allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language. However, Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates; it is best used for batch jobs over large sets of immutable data (such as web logs).
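To give a feel for the ad-hoc querying HiveQL enables, here is a minimal, hedged Java sketch that submits a HiveQL query through the standard HiveServer2 JDBC driver. The endpoint jdbc:hive2://hiveserver:10000/default and the credentials are hypothetical, and the page_views table is the example table introduced in the next section; Hive compiles the query into MapReduce jobs behind the scenes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveAdHocQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 endpoint; 10000 is its usual default port.
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                        "SELECT page_url, COUNT(*) AS views "
                      + "FROM page_views GROUP BY page_url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }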
Data Units
In order of granularity, Hive data is organized into:

Databases: Namespaces that separate tables and other data units from naming conflicts.

Tables: Homogeneous units of data that share the same schema. An example is a page_views table, where each row comprises the following columns (schema): timestamp, of type INT, the Unix timestamp at which the page was viewed; userid, of type BIGINT, identifying the user who viewed the page; page_url, of type STRING, the location of the page; referer_url, of type STRING, the location of the page from which the user arrived at the current page; and IP, of type STRING, the IP address from which the page request was made.

Partitions: Each table can have one or more partition keys, which determine how the data is stored. Partitions, apart from being storage units, allow the user to efficiently identify the rows that satisfy a certain criterion: for example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table; for example, all "US" data from "2009-12-23" is one partition of the page_views table. An analysis restricted to the "US" data for 2009-12-23 can therefore run over the relevant partition alone, speeding it up significantly. Note, however, that a partition named 2009-12-23 does not necessarily contain all of, or only, the data from that date; partitions are named after dates for convenience, and it is the user's job to guarantee the relationship between partition name and data content. Partition columns are virtual columns; they are not part of the data itself but are derived on load.

Buckets (or Clusters): Data in each partition may in turn be divided into buckets, based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, one of its columns other than the partition columns. Buckets can be used to sample the data efficiently. Tables need not be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution. A hedged sketch of such a table definition appears below.
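The sketch referred to above: a hedged HiveQL definition of the page_views table with date and country partition keys and userid bucketing, submitted over the same hypothetical JDBC connection as before. The column name ts is used instead of timestamp (a reserved word in some Hive versions), and the bucket count of 32 is an arbitrary illustrative choice, not a value from the paper.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PageViewsDdl {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Partition keys (dt, country) are virtual columns reflected in
                // the directory layout, not stored in the data files themselves.
                stmt.execute(
                        "CREATE TABLE IF NOT EXISTS page_views ("
                      + "  ts INT, userid BIGINT, page_url STRING,"
                      + "  referer_url STRING, ip STRING)"
                      + " PARTITIONED BY (dt STRING, country STRING)"
                      + " CLUSTERED BY (userid) INTO 32 BUCKETS");
                // The WHERE clause names only partition columns, so Hive prunes
                // the scan to the single dt=2009-12-23/country=US partition.
                stmt.execute(
                        "SELECT page_url, COUNT(*) FROM page_views"
                      + " WHERE dt = '2009-12-23' AND country = 'US'"
                      + " GROUP BY page_url");
            }
        }
    }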

Language Capabilities
The Hive query language provides basic SQL-like operations. These operations work on tables or partitions, and include:
Ability to filter rows from a table using a WHERE clause.
Ability to select certain columns from the table using a SELECT clause.
Ability to perform equi-joins between two tables.
Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local (e.g., NFS) directory.
Ability to store the results of a query in a Hadoop DFS directory.
Ability to manage tables and partitions (create, drop, and alter).
Ability to plug in custom scripts, in the language of choice, for custom map/reduce jobs.

HIVE ARCHITECTURE
Hive Components

HIVE DATA MODEL

Alternatives for Hive SQL
1) Impala is a modern, open-source MPP SQL query engine for Apache Hadoop.
2) Tez is an application framework built on Hadoop YARN that can execute complex directed acyclic graphs of general data-processing tasks. In many ways it can be thought of as a more flexible and powerful successor to the map-reduce framework.
3) Shark is an open-source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users, and it can run unmodified Hive queries on existing warehouses.
4) Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, from gigabytes to petabytes. It was designed and written for interactive analytics and successfully addresses scalability issues where they are encountered. A single query in Presto can combine data from multiple sources, allowing analytics across an entire organization.
5) Pig uses a language called Pig Latin that specifies dataflow pipelines. Just as Hive translates HiveQL into MapReduce jobs, which it then executes, Pig generates MapReduce code from Pig Latin scripts.

8. CONCLUSION
This paper has described alternatives to Hive SQL, which can at times increase the performance and efficiency of analyses carried out on big data. Certain issues, such as speed, scalability, and security, are not solved by Hive queries; hence different alternatives have come onto the market that help to solve the issues listed above.
