CASE STUDY OF HIVE USING HADOOP

1 Sai Prasad Potharaju, 2 Shanmuk Srinivas A, 3 Ravi Kumar Tirandasu
1,2,3 SRES COE, Department of Computer Engineering, Kopargaon, Maharashtra, India
1 psaiprasadcse@gmail.com

Abstract: Hadoop is a framework of tools, not a single piece of software that you download onto your computer. These tools are used to run applications on big data: data that is huge in volume, needs to be processed quickly, and can come in a variety of forms. To manage big data, Hive is used as a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides an SQL-like language called HIVEQL. In this paper we explain how to use Hive on Hadoop with a simple real-world example: how to create a table, load data into the table from an external file, retrieve the data from the table, and read the associated statistics, such as the CPU time for each stage of query execution, the cumulative CPU time, and the time taken to fetch records.

Key Words: Hadoop, Hive, MapReduce, HDFS, HIVEQL

1. INTRODUCTION

1.1. Hadoop

Hadoop is open source and is distributed under the Apache license. It is a framework of tools, not a single piece of software that you download. These tools are used to run applications on big data, where "big" refers to the data's volume, its speed, and its variety of (often unstructured) forms. In the traditional approach big data is processed on a single powerful computer, but such a computer does a good job only up to a limit, because it is not scalable: it processes only as much as its processor type (cores) and speed and its memory capacity allow. Hadoop takes a different approach (Fig. 1): it breaks the data into smaller pieces. Breaking the data into smaller pieces is a good idea, but what about the computation? The computation is also broken into pieces, i.e. processed on different nodes, and the results are combined together.

Fig. 1. Big data and its computation are broken into pieces and the results are combined.

At the highest level Hadoop has a simple architecture that breaks the data and the computation into pieces. Hadoop has two important components: A. MapReduce and B. the Hadoop Distributed File System (HDFS). One important characteristic of Hadoop is that it works on a distributed model, and it is a Linux-based set of tools. All machines (nodes) under Hadoop have two components: A. a Task Tracker (T) [1][2] and B. a Data Node (D) [1][2] (Fig. 2). The job of the Task Tracker is to process the tasks given to its node, and the job of the Data Node is to manage the data given to its node. The nodes that process the pieces of data and carry a Task Tracker and a Data Node are called slaves. Above these slaves there is a master, which has two additional components: 1. a Job Tracker (J) [1][2] and 2. a Name Node (N) [1][2] (Fig. 2). The Job Tracker and the Task Trackers come under MapReduce; the Data Nodes and the Name Node come under HDFS.

Fig. 2. Master and slave nodes: the Job Tracker and Name Node run on the master, a Task Tracker and Data Node on every slave.
1.2. HIVE

Hive is a data warehouse system layer built on Hadoop. It allows a structure to be defined for unstructured big data, and it simplifies analysis and queries with an SQL-like scripting language called HIVEQL. Hive is not a relational database: it uses a database only to store metadata, while the data that Hive processes is stored in HDFS. Hive is not designed for online transaction processing (OLTP). Hive runs on Hadoop, a batch-processing system where jobs can have high latency and substantial overhead, so the latency of Hive queries is generally high even for small jobs. Hive is therefore not suited to real-time queries or row-level updates; it is best used for batch jobs over large sets of immutable data, such as web logs.
2. The Typical Use Case of Hive

Hive takes a large amount of unstructured data and places it into a structured view, as shown in Fig. 3, that can be used by business analysts through business tools. Hive supports use cases such as ad-hoc queries, summarization, and data analysis.

Fig. 3. Hive presents unstructured data in HDFS as a structured, table-like view.

HIVEQL can also be extended with custom scalar functions (user-defined functions, UDFs), aggregations (UDAFs), and table functions (UDTFs).
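Such an extension is registered from the Hive shell and then called like a built-in function. The following is a minimal sketch; the jar path, function name, and Java class are hypothetical, not from the paper:

hive> ADD JAR /home/sai/udfs/my-hive-udfs.jar;
hive> CREATE TEMPORARY FUNCTION normalize_name AS 'com.example.hive.udf.NormalizeName';
hive> SELECT normalize_name(name) FROM some_table;

ADD JAR makes the code available for the current session, and CREATE TEMPORARY FUNCTION binds a HIVEQL name to the UDF class.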
3. HIVE ARCHITECTURE

As shown in Fig. 4, the Hive architecture consists of the following parts.

Fig. 4. Hive architecture: the metastore, the HIVEQL engine, and the user interfaces.

A. Metastore. It stores the schema information and provides a structure for the stored data.

B. HIVEQL. This component handles query processing: compiling, optimizing, and execution. Queries are turned into MapReduce jobs, which invoke mappers and reducers for processing. HIVEQL is similar to other SQL dialects: it uses familiar relational database concepts (tables, rows, columns, schemas) and is based on the SQL-92 specification. Hive supports multi-table inserts in its query language, which means big data can be accessed and populated via tables. It converts SQL queries into MapReduce jobs, so the user does not need to know MapReduce; it also supports plugging custom MapReduce scripts into queries.
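To illustrate the multi-table insert mentioned above, the sketch below populates two tables from a single scan of one source table; the weblogs table and all column names are hypothetical:

hive> FROM weblogs
    > INSERT OVERWRITE TABLE hits_by_page SELECT page, count(*) GROUP BY page
    > INSERT OVERWRITE TABLE hits_by_user SELECT userid, count(*) GROUP BY userid;

Both inserts share one read of weblogs, which is what makes this form cheaper than running two separate queries.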
C. User Interface. Hive can be approached through a number of user interfaces: it has a web UI, it has its own command-line interface, and it can be accessed using HDInsight. Hive is a good choice when you want to query the data, when you need an answer to specific questions, and when you are familiar with SQL.

4. Using Hive with Hadoop

A. Prerequisites. The prerequisites for setting up Hive and running queries are:
1. A stable build of Hadoop
2. Java 1.6 installed on the machine
3. Basic Java programming skills
4. Basic SQL knowledge
5. Basic knowledge of Linux

To start Hive, first start all the services of Hadoop using the command $ start-all.sh, then check whether all the services are running with the jps command. If all the services are running, use the command $ hive to start Hive.

B. HIVE TABLES. Here we explain the basics of Hive tables. A Hive table consists of:
Data: typically a file or group of files in HDFS.
Schema: metadata stored in a relational database.
The schema and the data are separate. A schema can be defined for existing data, and data can be added or removed independently. You have to define a schema if you have existing data in HDFS that you want to use in Hive (a sketch follows below).
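A hedged sketch of that situation, with a hypothetical directory and column layout: an EXTERNAL table overlays a schema on files that already sit in HDFS, without moving or modifying them.

hive> CREATE EXTERNAL TABLE weblog (ip STRING, url STRING, visits INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/sai/weblog';

Dropping such a table later removes only the metadata; the files under /user/sai/weblog stay in place.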
C. Managing Tables.
See the current tables:
hive> SHOW TABLES;
Check a table's schema:
hive> DESCRIBE mytable;
Change a table's name:
hive> ALTER TABLE mytable RENAME TO mt;
Add a column:
hive> ALTER TABLE mytable ADD COLUMNS (mycol STRING);
Drop a partition:
hive> ALTER TABLE mytable DROP PARTITION (age=17);

D. Loading Data. Use LOAD DATA to import data into a Hive table:
hive> LOAD DATA LOCAL INPATH 'path of input data' INTO TABLE mytable;
The files are not modified by Hive. Use the keyword OVERWRITE [3][4] to write over a file of the same name. The default Hive warehouse location is /hive/warehouse. Hive can read all of the files in a particular directory. The schema is checked only when the data is read, and if a row does not match the schema it is read as null.

E. Insert. Use an INSERT [3] statement to populate a table with data from another Hive table. Since query results are usually large, it is best to use an INSERT clause to tell Hive where to store them. Example:
hive> CREATE TABLE age_count (age INT, cnt INT);
hive> INSERT OVERWRITE TABLE age_count SELECT age, count(age) FROM mytable GROUP BY age;

F. Performing Queries. The SELECT statement is used for queries. It supports the following (a combined example is sketched after the list):
- the WHERE clause
- UNION ALL and DISTINCT
- GROUP BY and HAVING
- the LIMIT clause
- regular-expression column specifications
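Putting several of these clauses together over the hypothetical mytable from the earlier examples, a typical query has this shape:

hive> SELECT age, count(*) AS cnt
    > FROM mytable
    > WHERE age IS NOT NULL
    > GROUP BY age
    > HAVING count(*) > 10
    > LIMIT 5;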
5. Case Study

Here we give a real example of using Hive on top of Hadoop, together with different statistical observations. In this example we have a predefined dataset (cricket.csv) with more than 20 columns and more than 10,000 records. The example below shows how to display a few records based on the attributes of the cricket data. The cricket.csv file has attributes such as name, country, Player_ID, Year, stint, teamID, run, lgID, G, G_batting, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, IBB, HBP, SH, SF, GIDP and G_old.

Follow the steps below to retrieve the data from big tables.

1. Start Hadoop:
$ start-all.sh

2. Create a folder and copy all the predefined datasets:
$ hadoop fs -mkdir sampledataset
$ hadoop fs -put /home/sai/hivedataset/ /user/sai/sampledataset
Note: hivedataset holds big data tables in .csv format; they are copied here from the local file system into Hadoop.

3. Start Hive:
$ hive

4. Create a table in Hive:
hive> CREATE TABLE temp_cricket (col_value STRING);

5. Load the data from the csv file into the table temp_cricket:
hive> LOAD DATA INPATH '/user/sai/sampledataset/hivedataset/cricket.csv' OVERWRITE INTO TABLE temp_cricket;

6. Create another table into which the extracted data will go:
hive> CREATE TABLE batting (player_id STRING, year INT, runs INT);

7. Insert data into the newly created table by extracting fields:
hive> INSERT OVERWRITE TABLE batting
    > SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
    >        regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
    >        regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run
    > FROM temp_cricket;

In the query above, player_id, year and run are extracted from the temp_cricket table and loaded into the batting table. Each regexp_extract call pulls out one comma-separated field of col_value: the {n} repetition selects the n-th field, and the final argument 1 refers to the single capturing group in the pattern.
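Before running the analysis queries it is worth spot-checking the extraction; a quick look at a few rows (the output naturally depends on the dataset) confirms that the fields landed in batting:

hive> SELECT * FROM batting LIMIT 5;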
Execute a few queries and observe the results.

Example 1. Display the year and the maximum runs scored in each year:

hive> SELECT year, max(runs) FROM batting GROUP BY year;

Total jobs = 1
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
Stage-1 map = 0%, reduce = 0%
Stage-1 map = 100%, reduce = 0%, cumulative CPU 3.45 sec
Stage-1 map = 100%, reduce = 100%, cumulative CPU 6.35 sec
MapReduce total cumulative CPU time: 6.35 seconds
Time taken: 46.09 seconds

Example 2. Display the player_id, the year, and the maximum runs scored in each year:

hive> SELECT a.year, a.player_id, a.runs
    > FROM batting a
    > JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
    > ON (a.year = b.year AND a.runs = b.runs);

Total jobs = 2
Launching Job 1 out of 2
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
Stage-2 map = 0%, reduce = 0%
Stage-2 map = 100%, reduce = 0%, cumulative CPU 3.0 sec
Stage-2 map = 100%, reduce = 100%, cumulative CPU 5.95 sec
MapReduce total cumulative CPU time: 5.95 seconds
Launching Job 2 out of 2
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
Stage-3 map = 0%, reduce = 0%
Stage-3 map = 100%, reduce = 0%, cumulative CPU 3.12 sec
Stage-3 map = 100%, reduce = 100%, cumulative CPU 3.12 sec
MapReduce total cumulative CPU time: 3.12 seconds
Time taken: 65.743 seconds

Because the statement contains two SELECTs (the outer query and the subquery), two jobs are executed. The CPU times vary with the system configuration.
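To see how a statement will be split into jobs before actually running it, Hive's EXPLAIN command prints the plan of stages without executing the query; a sketch:

hive> EXPLAIN
    > SELECT a.year, a.player_id, a.runs
    > FROM batting a
    > JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year) b
    > ON (a.year = b.year AND a.runs = b.runs);

The output lists the stage dependencies (here a group-by stage feeding a join stage) and the operators inside each stage.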
Conclusion

Hive works with Hadoop to let you query and manage large-scale data using a familiar SQL-like interface. Hive provides a command-line interface to the shell, and Microsoft HDInsight provides console access. Hive tables consist of data and a schema, and the two are kept separate for maximum flexibility. The Hive Query Language (HIVEQL) supports familiar SQL operations, including joins, subqueries, ORDER BY, SORT BY, and so on. The CPU analysis depends on the configuration of the Hadoop nodes and of the system.

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org, 2009.
[2] HDFS (Hadoop Distributed File System) architecture. http://hadoop.apache.org/common/docs/current/hdfs_design.html, 2009.
[3] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. http://www.vldb.org/pvldb/2/vldb09-938.pdf
[4] Jason Rutherglen, Dean Wampler, Edward Capriolo. Programming Hive. http://www.reedbushey.com/99Programming%20Hive.pdf