1 Hadoop Distributed File System -Kishan Patel ID#2618621

2 Emirates Airlines Schedule The schedule of Emirates Airlines was downloaded from the official Emirates website. The schedule was originally in PDF format.

3

4

5 Emirates.txt is an unstructured text file and is taken as the input for the word-count job.
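A minimal sketch of how that job can be launched, assuming the stock word-count example that ships with Hadoop; the HDFS paths are hypothetical:

  # copy the schedule text into HDFS (paths are hypothetical)
  hdfs dfs -put Emirates.txt /user/kishan/input/
  # run the stock word-count example bundled with the Hadoop distribution
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount /user/kishan/input /user/kishan/wc-output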

6

7 We need to convert the word-count output into a .txt file. So, I created a new directory where the txt file can be saved, and used the getmerge command.
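A sketch of that step, reusing the hypothetical output path from above:

  # create a local directory to hold the merged text file
  mkdir -p ~/emirates
  # concatenate the job's part-r-* files from HDFS into one local Em.txt
  hdfs dfs -getmerge /user/kishan/wc-output ~/emirates/Em.txt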

8

9 Create a new table in Hive to load the data from the Em.txt file.
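A minimal sketch, assuming the word-count output is tab-delimited (word, count) as the stock job produces; the column names and local path are hypothetical:

  -- one row per token emitted by the word-count job
  CREATE TABLE emirates (word STRING, freq INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  -- load the merged local file into the table
  LOAD DATA LOCAL INPATH '/home/kishan/emirates/Em.txt' INTO TABLE emirates;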

10 After executing SELECT * FROM EMIRATES :-

11 I create another table with the same schema, where the particular code values of the destinations will be stored.

12 The user table is used to insert data extracted from the main table.

13 The query below indicates that all destinations will be inserted into the user table one by one.

14 The query below would insert multiple values into the User table at once, but Hive does not support it.

15 So, we need to insert all destinations one by one into the User table.
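A sketch of the one-by-one workaround; the column name is hypothetical. Hive of that era accepted INSERT ... SELECT but not multi-row VALUES lists:

  -- each destination code must be appended in its own statement
  INSERT INTO TABLE user SELECT word FROM emirates WHERE word = 'ABJ';
  INSERT INTO TABLE user SELECT word FROM emirates WHERE word = 'AMS';
  -- ...repeated once per destination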

16 Create table emirates11, which has the varchar(3) data type and is used to fetch depart and arrive codes. Create table emirates22, which has the varchar(4) data type and is used to fetch depart time and arrive time. Create table emirates33, which has the varchar(5) data type and is used to fetch the flight number.

17 Create table user11 to insert depart. Create table user22 to insert arrive. Create table user33 to insert depart time. Create table user44 to insert arrive time. Create table user55 to insert flight number.
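A sketch of the pattern these two slides describe, with hypothetical column names: tokens are routed into separate tables by their width, so 3-character airport codes, 4-character times, and 5-character flight numbers end up apart:

  -- airport codes are three characters wide
  CREATE TABLE emirates11 (code VARCHAR(3));
  INSERT INTO TABLE emirates11 SELECT word FROM emirates WHERE length(word) = 3;
  -- a single-column table holding one extracted field
  CREATE TABLE user11 (depart VARCHAR(3));
  INSERT INTO TABLE user11 SELECT code FROM emirates11 WHERE code = 'DXB';
  -- ...one such statement per depart code, as on the previous slides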

18

19 User11 :- User22 :-

20 User33 :- User44 :-

21 User55 :-

22 Create table user1 to insert depart and arrive from user11 and user22. User1 :-

23 Create table user2 to insert depart time and arrive time from user33 and user44. User2 :-

24 Create table user3 to insert depart, arrive, depart time and arrive time from user1 and user2. User3 :-

25 Create table user4 to insert depart, arrive, depart time, arrive time and flight number from user3 and user55. User4 :-
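Hive rows carry no implicit position, so stitching these one-column tables back together needs an explicit key. A sketch assuming each userNN table had been populated with a hypothetical seq column numbering its rows:

  -- pair depart and arrive rows that share the same sequence number
  CREATE TABLE user1 AS
  SELECT a.depart, b.arrive
  FROM user11 a JOIN user22 b ON (a.seq = b.seq);
  -- the same join pattern builds user2, user3 and finally user4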

26 Complete view of the tables, from the word-count output to table User4:
Emirates11 (varchar(3)) → User11 (depart), User22 (arrive) → User1 (depart, arrive)
Emirates22 (varchar(4)) → User33 (depart time), User44 (arrive time) → User2 (depart time, arrive time)
Emirates33 (varchar(5)) → User55 (flight number)
User1 + User2 → User3 (depart, arrive, depart time, arrive time)
User3 + User55 → User4 (*)

27

28 Create table eschedule in Hive to load the data from the csv file. The table has five columns in total.
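A sketch of such a table, with hypothetical column names matching the five fields listed on slide 31; the local path is also an assumption:

  CREATE TABLE eschedule (
    depart STRING,   -- departure airport code
    d_time STRING,   -- departure time
    arrive STRING,   -- arrival airport code
    a_time STRING,   -- arrival time
    flight STRING    -- flight number
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  LOAD DATA LOCAL INPATH '/home/kishan/emirates/eschedule.csv' INTO TABLE eschedule;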

29 After executing select * from eschedule :-

30

31 Data Mining result :- The User table is used to store aggregate data from the Main table. The User table has five columns:
1. Depart
2. D_Time (Departure Time)
3. Arrive
4. A_Time (Arrival Time)
5. Flight Number

32 Find the schedule from Abidjan (ABJ) to Amsterdam (AMS):
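A sketch of the lookup, reusing the hypothetical eschedule columns from above:

  SELECT * FROM eschedule WHERE depart = 'ABJ' AND arrive = 'AMS';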

33 Here, each query must be executed individually.

34 Schedule from Bahrain (BAH) to Athens (ATH)

35 Number of flights that depart from Dubai (DXB); departure times from Sao Paulo (SAO)

36 Number of flights that depart from Dubai (DXB) and fly between 00:01 and 11:59; number of flights that arrive at Dubai (DXB) between 12:00 and 23:59
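Sketches of these counts, again over the hypothetical columns; zero-padded HH:MM strings compare correctly as text:

  -- morning departures from Dubai
  SELECT count(*) FROM eschedule
  WHERE depart = 'DXB' AND d_time BETWEEN '00:01' AND '11:59';
  -- afternoon and evening arrivals into Dubai
  SELECT count(*) FROM eschedule
  WHERE arrive = 'DXB' AND a_time BETWEEN '12:00' AND '23:59';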

37 Flight numbers that depart from Christchurch (CHC); flight numbers that arrive at Copenhagen (CPH) from Dubai (DXB)

38 Create a user table which has only one column, Flight Number. Flight numbers for the journey from Doha (DOH) to Dublin (DUB) :-

39 Water Treatment Plant Data Set A comma-separated file was downloaded from the UCI Machine Learning Repository website. The data contains statistical information from a water treatment plant, such as various chemical demands, pH values, etc.

40

41 There are 39 columns in total in the log file, and each row starts with a specific date. List of columns:
1. Q-E (input flow to plant)
2. ZN-E (input zinc to plant)
3. PH-E (input pH to plant)
4. DBO-E (input biological demand of oxygen to plant)
5. DQO-E (input chemical demand of oxygen to plant)
6. SS-E (input suspended solids to plant)
7. SSV-E (input volatile suspended solids to plant)
8. SED-E (input sediments to plant)

42 9. COND-E (input conductivity to plant)
10. PH-P (input pH to primary settler)
11. DBO-P (input biological demand of oxygen to primary settler)
12. SS-P (input suspended solids to primary settler)
13. SSV-P (input volatile suspended solids to primary settler)
14. SED-P (input sediments to primary settler)
15. COND-P (input conductivity to primary settler)
16. PH-D (input pH to secondary settler)
17. DBO-D (input biological demand of oxygen to secondary settler)
18. DQO-D (input chemical demand of oxygen to secondary settler)
19. SS-D (input suspended solids to secondary settler)
20. SSV-D (input volatile suspended solids to secondary settler)
21. SED-D (input sediments to secondary settler)

43 22. COND-D (input conductivity to secondary settler)
23. PH-S (output pH)
24. DBO-S (output biological demand of oxygen)
25. DQO-S (output chemical demand of oxygen)
26. SS-S (output suspended solids)
27. SSV-S (output volatile suspended solids)
28. SED-S (output sediments)
29. COND-S (output conductivity)
30. RD-DBO-P (performance input biological demand of oxygen in primary settler)
31. RD-SS-P (performance input suspended solids to primary settler)
32. RD-SED-P (performance input sediments to primary settler)
33. RD-DBO-S (performance input biological demand of oxygen to secondary settler)
34. RD-DQO-S (performance input chemical demand of oxygen to secondary settler)

44 35. RD-DBO-G (global performance input biological demand of oxygen)
36. RD-DQO-G (global performance input chemical demand of oxygen)
37. RD-SS-G (global performance input suspended solids)
38. RD-SED-G (global performance input sediments)

45 Create table water in Hive, where we can store the data from the log file.

46 Load the data from the log file into the water table.
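A sketch of the table and load, abbreviated to the first few of the 39 columns; the column names and local path are assumptions modeled on the list above:

  CREATE TABLE water (
    ddate STRING,   -- each row starts with a date
    q_e   FLOAT,    -- Q-E, input flow to plant
    zn_e  FLOAT,    -- ZN-E, input zinc to plant
    ph_e  FLOAT     -- PH-E, input pH to plant
    -- ...remaining 35 measurement columns elided
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  LOAD DATA LOCAL INPATH '/home/kishan/water/water-treatment.data' INTO TABLE water;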

47 After executing Select * from water :-

48 Data Mining Result :- Values of input flow to plant, zinc to plant, and pH to plant on 20th August 1991 :- Total number of days on which statistical data was found :-

49 Average of input flow to plant :- Maximum value of input flow to plant :-

50 Average value of performance input chemical demand of oxygen to secondary settler :- Average value of global performance input biological demand of oxygen :-
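Sketches of these queries over the hypothetical water columns; the date format is an assumption modeled on the UCI file:

  -- values for one specific day
  SELECT q_e, zn_e, ph_e FROM water WHERE ddate = 'D-20/8/91';
  -- number of days with recorded data
  SELECT count(*) FROM water;
  -- simple aggregates over the whole log
  SELECT avg(q_e), max(q_e) FROM water;
  SELECT avg(rd_dqo_s), avg(rd_dbo_g) FROM water;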

51 Create another water table which is clustered into 21 buckets :-
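A sketch of the bucketed copy, clustering on the assumed date column:

  CREATE TABLE water_b (
    ddate STRING, q_e FLOAT, zn_e FLOAT, ph_e FLOAT
    -- ...remaining columns as in water
  ) CLUSTERED BY (ddate) INTO 21 BUCKETS;
  -- have Hive route rows into the right bucket files on insert
  SET hive.enforce.bucketing = true;
  INSERT OVERWRITE TABLE water_b SELECT * FROM water;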

52 Average value of global performance input biological demand of oxygen in bucket 1 :- Distinct values of input flow to plant during 1st Aug to 30th Aug :-
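Sketches of both queries; TABLESAMPLE reads just one of the 21 buckets, and the LIKE pattern stands in for the assumed D-d/m/yy date format:

  -- aggregate over only the first bucket
  SELECT avg(rd_dbo_g) FROM water_b TABLESAMPLE (BUCKET 1 OUT OF 21 ON ddate);
  -- distinct inflow values across August 1991
  SELECT DISTINCT q_e FROM water WHERE ddate LIKE 'D-%/8/91';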

53 Hive A Warehousing Solution Over a Map-Reduce Framework INTRODUCTION :- Data in the industry is growing rapidly, and traditional warehousing is very expensive. In this situation, Hadoop is a popular warehousing solution for storing and processing extremely large data sets. Hive is an open-source data warehousing solution which runs on top of the Hadoop file system. Hive queries are compiled into map-reduce jobs and executed on Hadoop. HiveQL supports primitive types, arrays, and nested compositions. Hive is used at Facebook for both reporting and ad-hoc analyses.

54 HIVE Data Model :- Data in Hive is organized into tables, and each table has a Hadoop distributed file system directory where the corresponding table is stored, so users can easily access that data from the directory. Users can add new data formats through custom serialization and de-serialization methods. Each table may have partitions, which determine the distribution of data within sub-directories of the table directory. Data in a partition may be further divided into buckets, and each bucket is stored as a file in the partition directory.
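A small sketch of how that layout looks on disk; the table and column names are hypothetical, and the warehouse path shown is Hive's common default:

  -- a table partitioned by date and bucketed within each partition
  CREATE TABLE actions (msg STRING)
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (msg) INTO 4 BUCKETS;
  -- rows for ds='2009-03-20' land in a per-partition directory:
  --   /user/hive/warehouse/actions/ds=2009-03-20/<one file per bucket>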

55 Query Language :- Hive supports an SQL-like query language. It supports select, project, join, aggregate, union all, and subqueries. HiveQL also supports DDL and DML statements, but Hive does not support updating and deleting rows in tables. On the other side, HiveQL supports multi-table insert, so a user can run multiple queries over the same input data. Hive supports UDFs and aggregation UDAF functions. Users can embed custom map-reduce scripts.

56 Running Example: Status Meme :- When Facebook users update their status, the updates are logged into flat files in an NFS directory /logs/status_updates, which are rotated every day, and the data is loaded into Hive on a daily basis. Status_updates(userid int, status string, ds string) LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20') Each status update record has userid, status, and ds. The table is partitioned on the ds column. Profile information is available in the profiles(userid int, school string, gender int) table.

57 We will use the query below to compute the daily statistics.

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

58

59 Hive Architecture :- External Interfaces :- Hive provides user interfaces like a CLI and a WebUI, and also application programming interfaces like JDBC and ODBC. Thrift Server :- Thrift is a framework for cross-language services, where a server written in one language can also support clients in other languages. Metastore :- The metastore is the system catalog. All other components of Hive interact with the metastore. Driver :- The driver manages the life cycle of a HiveQL statement during compilation, optimization, and execution. It maintains a session handle for each Hive query statement.

60 Metastore :- The metastore is the system catalog which contains metadata about the tables stored in Hive; this metadata is specified during table creation and reused every time afterwards. The metastore also distinguishes Hive from traditional warehousing solutions. The metastore contains the following objects :- Database :- a namespace for tables in Hive. Table :- metadata for a table contains the list of columns and their types, plus storage and SerDe information. Partition :- each partition can have its own columns, SerDe, and storage info. HiveQL statements which access only metadata objects can be executed with very low latency.

61 Compiler :- The driver invokes the compiler with the HiveQL string, which can be a DDL or DML statement. The compiler converts the string into a plan. For load statements (DDL) the plan is a metadata operation; for insert statements (DML) it is a directed acyclic graph (DAG) of map-reduce jobs. The parser transforms the query string into a parse tree. The semantic analyzer transforms the parse tree into an internal query representation. The logical plan generator converts the internal query representation into a logical plan. The optimizer performs multiple passes over the logical plan and rewrites it in several ways. The physical plan generator converts the logical plan into a physical plan, creating a new map-reduce job for each of the required operations. It then assigns portions of the plan to the mappers and reducers of the map-reduce jobs.

62 Conclusion We can overwrite or insert the result of a select query into another table, but this does not work with multi-row values. For example, we cannot overwrite multiple rows in the user table at once, so we need to run the overwrite multiple times. Hive is very efficient and consistent with large data; on the other side, it has limited features for data manipulation.

63 Thank You