Big Data Hive
  2013-2014
  Laurent d'Orazio
Introduction
  Context
    Parallel computation on large data sets on commodity hardware
  Hadoop [hadoop]
    Definition
      Open source implementation of MapReduce [DG08]
    Objective
      Processing and generation of large-scale data sets
    Drawbacks
      Low level (developers are required to write custom programs)
      Hard to maintain
      Hard to reuse
Outline
  Data model
  Type system
  Language
Data model
  Principle
    Data is stored in tables
    A table is composed of rows
    A row is composed of columns
    Each column is associated with a primitive or complex type
Outline
  Data model
  Type system
  Language
Primitive types
  Signed integers
    bigint (8 bytes)
    int (4 bytes)
    smallint (2 bytes)
    tinyint (1 byte)
  Floating point numbers
    float (single precision)
    double (double precision)
  String
Complex types
  Associative arrays
    map<key-type, value-type>
  Lists
    list<element-type> (written array<element-type> in HiveQL DDL)
  Structs
    struct<field-name: field-type, ...>
  Composed complex types
    Example
      list<map<string, struct<p1:int, p2:int>>>
Operators . and []
  Operator .
    Access to a field within a struct
  Operator []
    Access to a value in a list or an associative array
  Example
    Table
      t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>);
    Instructions
      t1.li[0]
      t1.li[0]['key']
      t1.li[0]['key'].p2
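The three access expressions above can be mimicked with plain Python containers to see what each step returns. This is only an analogy under stated assumptions: the `Point` tuple and the sample row are hypothetical stand-ins for `struct<p1:int, p2:int>` and a row of `t1`, not Hive objects.

```python
from collections import namedtuple

# Hypothetical stand-in for struct<p1:int, p2:int>.
Point = namedtuple("Point", ["p1", "p2"])

# One illustrative row of t1: li is a list of maps from string to struct.
row = {
    "st": "example",
    "fl": 1.5,
    "li": [{"key": Point(p1=10, p2=20)}],
}

first_map = row["li"][0]         # t1.li[0]            -> a map
struct_value = first_map["key"]  # t1.li[0]['key']     -> a struct
p2_value = struct_value.p2       # t1.li[0]['key'].p2  -> an int
```

Each operator peels off one level of nesting: `[0]` selects a list element, `['key']` selects a map value, and `.p2` selects a struct field.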
Outline
  Data model
  Type system
  Language
    DDL
    DML
    Extensions
HiveQL
  Principles
    Subset of SQL
    Extensions for the specifics of cloud computing
DDL
  Create
  Show
  Describe
  Drop
  Alter
Create
  Objective
    Create a table
  Syntax
    CREATE TABLE <table_name> (<attribute_name_1> <type_1>, ..., <attribute_name_n> <type_n>);
  Example
    Creating the students table with the following schema
      students(num, lastname, firstname, gender, birthdate)
    create table students(num int, lastname string, firstname string, gender string, birthdate date);
Show
  Objective
    List all tables
  Syntax
    SHOW TABLES [predicate];
  Examples
    Listing all tables
      show tables;
    Listing tables whose names end with s
      show tables '.*s';
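To make the effect of the predicate concrete, here is a small Python sketch of the filtering. It assumes the predicate behaves as a regular expression matched against the whole table name, as the '.*s' example suggests; the `show_tables` helper and the sample catalog are hypothetical, and the sorting is only there to make the result deterministic.

```python
import re


def show_tables(table_names, pattern=None):
    """Return table names, optionally keeping only those fully matching pattern."""
    if pattern is None:
        return sorted(table_names)
    regex = re.compile(pattern)
    # fullmatch: the pattern must cover the entire name, not just a prefix.
    return sorted(name for name in table_names if regex.fullmatch(name))


# Hypothetical catalog content.
catalog = ["students", "persons", "course"]
all_tables = show_tables(catalog)
tables_ending_with_s = show_tables(catalog, ".*s")
```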
Describe
  Objective
    List all columns
  Syntax
    DESCRIBE <table_name>;
  Example
    Listing the columns of the students table
      describe students;
Drop
  Objective
    Delete a table
  Syntax
    DROP TABLE <table_name>;
  Example
    Removing the students table
      drop table students;
Alter
  Objective
    Update a table
      Rename
      Add a column
      Replace a column
Alter: Rename
  Syntax
    ALTER TABLE <old_table_name> RENAME TO <new_table_name>;
  Example
    Renaming table students to persons
      alter table students rename to persons;
Alter: Add column
  Syntax
    ALTER TABLE <table_name> ADD COLUMNS (<attribute_name> <type>);
  Example
    Adding a column address to table students
      alter table students add columns(address string);
Alter: Replace column
  Syntax
    ALTER TABLE <table_name> REPLACE COLUMNS (<attribute_name_1> <type_1>, ..., <new_attribute_name_i> <new_type_i>, ..., <attribute_name_n> <type_n>);
  Example
    Replacing column address in table students with column city
      alter table students replace columns(..., city string);
DML
  Data management
    Insert/load
    Delete
    Update
  Querying
Limitations
  Limitations
    Insert
      Appending to an existing table or data partition is impossible
      Existing data is overwritten
    No INSERT INTO, UPDATE or DELETE
  Advantages
    Simple concurrency protocol
    Fits the context: data loaded daily or hourly
  Example
    INSERT OVERWRITE TABLE t1 SELECT * FROM t2;
Load
  Objective
    Insert data from a file
  Syntax
    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2...)]
  Example
    Loading data into students from the local path /temp/students
      load data local inpath '/temp/students' into table students;
Insert (1)
  Objective
    Insert data through a query
  Syntax
    INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...)] select_statement1 FROM from_statement
  Example
    Inserting the students' last names into a table students_lastname
      insert overwrite table students_lastname select lastname from students;
Insert (2)
  N.B.
    Possibility to write data into the file system
  Syntax
    INSERT OVERWRITE [LOCAL] DIRECTORY <directory> <query>
  Example
    Writing students data into a directory students
      INSERT OVERWRITE LOCAL DIRECTORY '/.../students' select * from students;
DML
  Data management
  Querying
    Select
    Project
    Join
    Group by
    Etc.
SQL features
  Subqueries in the from clause
  Joins: inner, left outer, right outer and full outer
  Cartesian products
  Group by and aggregations
  Union all
  Create table as select
  Functions on primitive and complex types
Join
  Limitations
    Only equality predicates
    ANSI join syntax required
  Example
    SELECT t1.a1 as c1, t2.b1 as c2 FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
  instead of
    SELECT t1.a1 as c1, t2.b1 as c2 FROM t1, t2 WHERE t1.a2 = t2.b2;
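One way to see why an equality predicate is convenient: rows from both tables can be grouped by the join key, and only rows sharing a key ever need to meet. The following Python sketch illustrates that idea on the join above. It is an illustration only, not Hive's actual join implementation, and the `equi_join` helper and sample rows are hypothetical.

```python
from collections import defaultdict


def equi_join(t1_rows, t2_rows):
    """Compute SELECT t1.a1, t2.b1 FROM t1 JOIN t2 ON (t1.a2 = t2.b2)."""
    # Group the rows of t2 by their join key b2.
    t2_by_key = defaultdict(list)
    for r2 in t2_rows:
        t2_by_key[r2["b2"]].append(r2)

    # Each row of t1 is only compared with the t2 rows sharing its key.
    result = []
    for r1 in t1_rows:
        for r2 in t2_by_key.get(r1["a2"], []):
            result.append((r1["a1"], r2["b1"]))
    return result


# Hypothetical sample data.
t1_rows = [{"a1": "x", "a2": 1}, {"a1": "y", "a2": 2}, {"a1": "z", "a2": 3}]
t2_rows = [{"b1": "p", "b2": 1}, {"b1": "q", "b2": 1}, {"b1": "r", "b2": 3}]
joined = equi_join(t1_rows, t2_rows)
```

A non-equality predicate such as `t1.a2 < t2.b2` could not be served by this grouping, since matching rows would no longer share a single key.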
Extensions
  SELECT <-> FROM
  Support for MapReduce analysis
  Choice of programming language
  Sort on non-distribution attributes
  Multiple insertions
SELECT vs FROM
  Possibility to swap the order of the from and select clauses
MapReduce analysis
  The map step or the reduce step is optional
Programming language
  Example
    Word count with Python programs
      FROM (
        MAP doc USING 'python wc_mapper.py' AS (word, cnt)
        FROM docs
        CLUSTER BY word
      ) a
      REDUCE word, cnt USING 'python wc_reduce.py';
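The slides do not show the contents of wc_mapper.py and wc_reduce.py. The sketch below is one plausible shape for their logic, assuming the scripts exchange tab-separated lines with Hive and that CLUSTER BY word delivers all lines for a given word contiguously to the reducer. The function names and the sample documents are hypothetical; in real scripts each function would read from standard input and print to standard output.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every word of every input document."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer logic: sum the counts of consecutive lines sharing the same word."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(cnt) for _, cnt in group)}"


# Local simulation: sorted() stands in for CLUSTER BY word.
docs = ["a b a", "b c"]
mapped = sorted(map_lines(docs))
counts = list(reduce_lines(mapped))
```

The reducer relies on `groupby`, which only merges adjacent equal keys; this is why the clustering step between map and reduce matters.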
Sort
  Extensions
    Possibility to sort on a set of columns different from the ones used for the distribution
  Example
    FROM (
      FROM session_table
      SELECT sessionid, tstamp, data
      DISTRIBUTE BY sessionid SORT BY tstamp
    ) a
    REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
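The difference between the distribution column and the sort column can be simulated in a few lines of Python. This is a sketch under assumptions: the `distribute_and_sort` helper, the CRC32-based partitioning and the sample rows are hypothetical and do not reproduce Hive's partitioner; they only show that rows are routed by sessionid while each reducer's input is ordered by tstamp.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers=2):
    """Route rows to reducers by sessionid, then sort each reducer's rows by tstamp."""
    partitions = defaultdict(list)
    for sessionid, tstamp, data in rows:
        # DISTRIBUTE BY sessionid: same session always goes to the same reducer.
        reducer = zlib.crc32(str(sessionid).encode("utf-8")) % num_reducers
        partitions[reducer].append((sessionid, tstamp, data))
    for part in partitions.values():
        # SORT BY tstamp: ordering is local to each reducer.
        part.sort(key=lambda row: row[1])
    return dict(partitions)


# Hypothetical session rows: (sessionid, tstamp, data).
rows = [("s1", 3, "c"), ("s2", 1, "x"), ("s1", 1, "a"), ("s2", 2, "y"), ("s1", 2, "b")]
parts = distribute_and_sort(rows)
```

Note that the sort is per reducer: rows of different sessions that land on the same reducer are interleaved in timestamp order.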
Multiple insertions (1)
  Principle
    Inserting different transformation results into different
      Tables
      Partitions
      HDFS directories
      Local directories
    as part of the same query
  Objective
    Reducing the number of scans done on the input data
Multiple insertions (2)
  Example
    FROM t3
    INSERT OVERWRITE TABLE t2
      SELECT t3.c2, count(1)
      WHERE t3.c1 <= 20
      GROUP BY t3.c2
    INSERT OVERWRITE DIRECTORY '/output_dir'
      SELECT t3.c2, avg(t3.c1)
      WHERE t3.c1 > 20 AND t3.c1 <= 30
      GROUP BY t3.c2
    INSERT OVERWRITE LOCAL DIRECTORY '/home/dir'
      SELECT t3.c2, sum(t3.c1)
      WHERE t3.c1 > 30
      GROUP BY t3.c2;
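The benefit of a multiple insertion, one scan feeding several outputs, can be sketched in Python. The code below mirrors the three aggregations of the example (count for c1 <= 20, average for 20 < c1 <= 30, sum for c1 > 30, each grouped by c2). The `multi_insert` helper and the sample rows are hypothetical, and the three returned dictionaries merely stand in for the table and the two directories.

```python
from collections import defaultdict


def multi_insert(rows):
    """Compute the three grouped aggregates in a single pass over (c1, c2) rows."""
    counts = defaultdict(int)      # c1 <= 20: count per c2
    mid_values = defaultdict(list) # 20 < c1 <= 30: values to average per c2
    sums = defaultdict(int)        # c1 > 30: sum per c2

    for c1, c2 in rows:            # the input is scanned only once
        if c1 <= 20:
            counts[c2] += 1
        elif c1 <= 30:
            mid_values[c2].append(c1)
        else:
            sums[c2] += c1

    averages = {c2: sum(vals) / len(vals) for c2, vals in mid_values.items()}
    return dict(counts), averages, dict(sums)


# Hypothetical input rows: (c1, c2).
rows = [(10, "x"), (15, "x"), (25, "y"), (27, "y"), (40, "x"), (35, "z")]
table_output, hdfs_output, local_output = multi_insert(rows)
```

Running three separate queries instead would read the input three times, which is what the single FROM clause avoids.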