11/18/15, CS 6030


by Shatha Muhi, CS 6030

- Big Data: collections of large datasets characterized by huge volume, high velocity, and a wide variety of data.
- The Apache Hadoop framework emerged to solve big data management and processing challenges.
- Hadoop was not designed to migrate data from traditional relational databases to its HDFS.
- This is where Hive comes in.

- The Apache Software Foundation took Hive and developed it further as open source in 2008 under the name Apache Hive.


- Hive is petabyte-scale data warehouse software for managing and querying large unstructured datasets, residing in distributed storage in a Hadoop cluster, as if they were structured. It provides:
  - Tools to enable easy data extract/transform/load (ETL).
  - A mechanism to impose structure on a variety of data formats.
  - Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
  - Query execution via MapReduce.

- Hadoop is a good infrastructure based on a relational database, while Hive is just a user interface.
  - True.
  - False.

- Hive is not:
  - A relational database.
  - A design for Online Transaction Processing (OLTP).
  - A language for real-time queries and row-level updates.
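The "mechanism to impose structure on a variety of data formats" can be pictured as applying a schema to raw text at read time. The following is a minimal Python sketch of that idea, not Hive's own implementation; the schema, the tab delimiter, and the sample line are all assumptions made for illustration.

```python
# Hypothetical schema: column names paired with Python casts standing in for Hive types.
SCHEMA = [("id", int), ("name", str), ("score", float)]


def apply_schema(line, schema=SCHEMA, delimiter="\t"):
    """Split one raw text line and cast each field according to the schema."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(fields)}")
    # The file itself stays unstructured; structure exists only in the parsed row.
    return {name: cast(raw) for (name, cast), raw in zip(schema, fields)}


if __name__ == "__main__":
    print(apply_schema("7\tann\t3.5\n"))
```

The raw file is never rewritten; the structure is imposed only when a row is read, which is why the same file could be read under a different schema.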


- The Internet of Things (IoT) needs a real-time system, and Hive is the best choice because it isn't a batch system.
  - True.
  - False.

- The conjunction part of the HiveQL process engine and MapReduce in the Hive architecture is the Hadoop Distributed File System (HDFS), which uses the flavor of MapReduce.
  - True.
  - False.


1- Execute query.
2- Get plan.
3- Get metadata.
4- Send metadata.
5- Send plan.
6- Execute plan.
7- Execute job (with metadata operations).
8- Fetch results.
9- Send results.
10- Send results.

- Uses a SQL-like query language called HiveQL (HQL), unlike Pig.
- Does the dirty (hard) work of mapping data operations to the low-level MapReduce Java API, which is hard even for experienced Java programmers.
- Pluggable: the underlying execution engine can be changed from MapReduce to Tez or Spark.
- Fault-tolerant, unlike some newer engines such as Impala.
- Feature-rich, because it is the oldest engine while newer engines have fewer features. For example, it supports nested data types (structs, arrays, maps, etc.).

- One of the methods to enhance new versions of Hive is to use an execution engine other than MapReduce, such as:
  - Tez.
  - Pig.
  - Impala.
  - Spark.
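The work of mapping a data operation onto MapReduce can be illustrated by hand. The sketch below shows how a grouped count (in SQL terms, something like SELECT dept, COUNT(*) ... GROUP BY dept) breaks down into map, shuffle, and reduce phases in plain Python. The table, column names, and rows are hypothetical, and this is not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept).
ROWS = [("ann", "sales"), ("bob", "eng"), ("cat", "sales"), ("dan", "eng"), ("eve", "ops")]


def map_phase(rows):
    """Emit one (dept, 1) pair per row, as a mapper would."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_dept(rows):
    """Run the three phases in order to get a row count per department."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

Even this one-line query needs three hand-written phases, which suggests why writing the equivalent Java MapReduce job directly is considered hard.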


- Performance is Hive's main drawback, because it uses MapReduce as the execution engine.
- MapReduce is not a good choice for running ad hoc and interactive queries, because it reads from and writes to disk extensively and has a high startup cost.
- For instance, a multi-join query could take minutes, not because of the data size but because of the number of reads and writes to disk.
- Pluggable engines and vectorized query execution are the two main enhancements that reduce the effects of the performance drawback.

- The following are the features that make Hive very popular and a good choice in batch systems:
  - Oldest system.
  - Pluggable.
  - Feature-rich.
  - Vectorized query execution.
  - Shared meta store.

- Primitive types:
  - Integers: TINYINT, SMALLINT, INT, BIGINT.
  - Boolean: BOOLEAN.
  - Floating point numbers: FLOAT, DOUBLE.
  - String: STRING.
- Complex types:
  - Structs: {a INT; b INT}.
  - Maps: M['group'].
  - Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
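The access patterns for the complex types above can be mimicked with ordinary Python values. This is only an analogy, assuming a hypothetical row whose columns are named s, m, and arr; the map value "admins" is made up for the example.

```python
# A hypothetical row with Hive-style complex columns, modelled with Python values.
ROW = {
    "s": {"a": 1, "b": 2},       # stands in for a struct {a INT; b INT}
    "m": {"group": "admins"},    # stands in for a map with string keys
    "arr": ["a", "b", "c"],      # stands in for an array of strings
}


def struct_field(row, column, field):
    """Analogue of struct access by field name (s.a)."""
    return row[column][field]


def map_value(row, column, key):
    """Analogue of map access by key (M['group'])."""
    return row[column][key]


def array_item(row, column, index):
    """Analogue of array access by zero-based index (A[1])."""
    return row[column][index]


if __name__ == "__main__":
    print(struct_field(ROW, "s", "a"))
    print(map_value(ROW, "m", "group"))
    print(array_item(ROW, "arr", 1))
```

Array indexes are zero-based in both cases, which is why index 1 selects the second element, 'b'.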


- The Hive engine has rich features, including complex data types such as:
  - Struct.
  - Integer.
  - Map.
  - Array.

- Tables:
  - Analogous to tables in relational DBs.
  - Each table has a corresponding directory in HDFS.
- Partitions:
  - Analogous to indexes on partition columns.
  - Nested sub-directories in HDFS for each combination of partition column values.
  - Allow users to retrieve rows efficiently.
  - For instance, range partition tables by date.

- Buckets:
  - Split data based on the hash of a column, mainly for parallelism.
  - Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table.
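The storage layout described above can be sketched in a few lines of Python: one nested sub-directory per partition column value, and a bucket chosen by hashing a column value. The warehouse path, column names, and values are hypothetical, and zlib.crc32 is only a stand-in for whatever hash function Hive applies internally.

```python
import zlib


def partition_path(table_dir, partition_values):
    """Build the nested sub-directory path, one level per partition column."""
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_dir, *parts])


def bucket_for(value, num_buckets):
    """Pick a bucket number from a hash of the bucketing column value."""
    # crc32 is an assumption for illustration, not Hive's actual hash function.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


if __name__ == "__main__":
    # Hypothetical table directory and partition columns.
    print(partition_path("/warehouse/logs", [("dt", "2015-11-18"), ("country", "US")]))
    print(bucket_for("user42", 4))
```

A query that filters on the partition columns only has to read the matching sub-directory, while the bucket number spreads the rows of one partition across a fixed number of files.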


- Partitioning in Hive divides rows and efficiently retrieves columns:
  - True.
  - False.

- Its use of SQL syntax and Hadoop features made Hive very popular and easy to program.
- Despite its latency problem, being feature-rich has made it more widely used than the newer engines.
- Its many enhancements and its flexibility make it, to this day, the best choice for processing and querying data in a batch system.

