Hadoop and Hive
Introduction, Installation and Usage
Saatvik Shah
Data Analytics for Educational Data
May 23, 2014

1. Big Data
2. Hadoop
     What is Hadoop?
     HDFS
     Installing Hadoop
       Prerequisites
       Download and Environment
       Configuring Hadoop
     Hadoop Usage
3. Hive
     Introduction
     Hive vs. RDBMS
     Hive Installation
     Hive Usage
4. References

Big Data: Overview and Analysis
1. Three dimensions [1]
     1. Volume
     2. Velocity
     3. Variety
2. Data is complex and may be structured or unstructured
3. Examples [2]
     1. The Internet Archive stores around 2 petabytes of data and is growing at a rate of 20 terabytes per month
     2. The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year
     3. Facebook hosts approximately 10 billion photos, taking up one petabyte of storage

Hadoop: What is Hadoop?
1. Apache Hadoop is an open source project used for managing Big Data
2. Hadoop Distributed File System (HDFS) [2]
     1. Distributed file system
     2. Fault tolerant
     3. Low-cost hardware: every server is treated as an individual node
3. Hadoop architecture [2]
     1. Processors: 8-12 cores per node
     2. Number of nodes: 8-400
     3. Disk space supported: variable (a few GB to many TB)

Hadoop: HDFS
Keywords: NameNode, DataNode, Rack, Secondary NameNode, Replication Factor, Heartbeats [3]

Hadoop: Installing Hadoop - Prerequisites [3]
1. Hadoop client/user
     1. sudo addgroup hadoop
     2. sudo adduser --ingroup hadoop hduser
     3. sudo adduser hduser sudo
2. Java JDK (6 or higher)
     1. sudo apt-get install openjdk-7-jdk
     2. cd /usr/lib/jvm
     3. sudo ln -s java-7-openjdk-amd64 jdk
3. SSH
     1. sudo apt-get install openssh-server
     2. ssh-keygen -t rsa -P ""
     3. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   (authorizes the new key, so that the next step needs no password)
     4. ssh localhost

Hadoop: Installing Hadoop - Download and Environment [3]
Download [3]
     1. wget http://apache.mirrors.lucidnetworks.net/hadoop/common/stable/hadoop-2.2.0.tar.gz
     2. sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
     3. cd /usr/local
     4. sudo mv hadoop-2.2.0 hadoop
     5. sudo chown -R hduser:hadoop hadoop
Environment setup: add the following to .bashrc [3]
     1. export JAVA_HOME=/usr/lib/jvm/jdk/
     2. export HADOOP_INSTALL=/usr/local/hadoop
     3. export PATH=$PATH:$HADOOP_INSTALL/bin
     4. export PATH=$PATH:$HADOOP_INSTALL/sbin   (start-dfs.sh and start-yarn.sh live in sbin)

Hadoop: Installing Hadoop - Configuration
User-based configuration files [3]
     1. /usr/local/hadoop/etc/hadoop/core-site.xml
     2. /usr/local/hadoop/etc/hadoop/yarn-site.xml
     3. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template (copy to mapred-site.xml before editing)
     4. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Launch! [3]
     1. mkdir -p mydata/hdfs/namenode
     2. mkdir -p mydata/hdfs/datanode
     3. hdfs namenode -format
     4. start-dfs.sh
     5. start-yarn.sh
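
The slide names the four configuration files but not their contents. The following is a minimal single-node sketch, not the author's actual configuration. It writes example files into a scratch directory (hadoop-conf-sketch, a hypothetical name) so that nothing under /usr/local/hadoop is touched. The port 9000, the replication factor of 1, and the absolute location of the mydata directories are assumptions; the property names are the standard Hadoop 2.x ones.

```shell
#!/usr/bin/env bash
set -eu

# Scratch output directory (hypothetical); copy the files into
# /usr/local/hadoop/etc/hadoop/ only after reviewing them.
CONF_DIR=./hadoop-conf-sketch
# Assumed absolute location of the mydata/hdfs directories from the Launch steps.
DATA_DIR="${HOME:-/tmp}/mydata/hdfs"
mkdir -p "$CONF_DIR"

# core-site.xml: where clients find the NameNode (port 9000 is an assumption).
cat > "$CONF_DIR/core-site.xml" <<EOF
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>
EOF

# hdfs-site.xml: one replica suffices on a single node; storage paths as created above.
cat > "$CONF_DIR/hdfs-site.xml" <<EOF
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>file:$DATA_DIR/namenode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>file:$DATA_DIR/datanode</value></property>
</configuration>
EOF

# yarn-site.xml: enable the shuffle service that MapReduce jobs need.
cat > "$CONF_DIR/yarn-site.xml" <<EOF
<configuration>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
EOF

# mapred-site.xml: run MapReduce on YARN.
cat > "$CONF_DIR/mapred-site.xml" <<EOF
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
EOF

echo "Wrote example configuration to $CONF_DIR"
```

Generating the files in a scratch directory keeps the sketch safe to run on a machine without Hadoop; on a real installation the values should be checked against the cluster before copying them over.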

Hadoop: Hadoop Usage
1. Hadoop file system commands resemble their Bash counterparts
2. Setup
     1. Create a file: touch test.txt
     2. echo Hello World of HDFS > test.txt
     3. Start HDFS: start-dfs.sh, start-yarn.sh
     4. HDFS details (on my machine): http://localhost:50070/dfshealth.jsp
3. Make a directory in HDFS
     1. hadoop fs -mkdir -p /user/hduser/docs/examples/
4. List files in a directory and its subdirectories
     1. hadoop fs -ls -R /user/hduser/
5. Copy from/to the local machine to/from HDFS (option names are case-sensitive)
     1. hadoop fs -copyFromLocal test.txt /user/hduser/docs/examples/test.txt
     2. hadoop fs -copyToLocal /user/hduser/docs/examples/test.txt /tmp/test.txt
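
The usage steps above can be strung together into one script. This is a sketch under the assumption that HDFS is already running and that the paths from the slide are used; the round-trip file name /tmp/test_roundtrip.txt is made up for illustration. When the hadoop command is not on the PATH, the HDFS steps are skipped and only the local file is created.

```shell
#!/usr/bin/env bash
set -eu

# Local file with the contents used on the slide.
echo "Hello World of HDFS" > test.txt

# HDFS target directory from the slide.
HDFS_DIR=/user/hduser/docs/examples

if command -v hadoop >/dev/null 2>&1; then
    # Create the directory tree (-p also creates missing parents).
    hadoop fs -mkdir -p "$HDFS_DIR"

    # Upload only if the file is not there yet; -copyFromLocal refuses to overwrite.
    hadoop fs -test -e "$HDFS_DIR/test.txt" \
        || hadoop fs -copyFromLocal test.txt "$HDFS_DIR/test.txt"

    # Recursive listing of the user's HDFS home directory.
    hadoop fs -ls -R /user/hduser/

    # Download again and compare with the local original.
    rm -f /tmp/test_roundtrip.txt
    hadoop fs -copyToLocal "$HDFS_DIR/test.txt" /tmp/test_roundtrip.txt
    cmp test.txt /tmp/test_roundtrip.txt && echo "round trip OK"
else
    echo "hadoop not found on PATH; skipping the HDFS steps" >&2
fi
```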

Hive: Introduction
1. SQL for Hadoop: HQL (Hive Query Language) supports DDL and DML statements [4, 5]
2. Hive architecture [4, 5]
     1. Hive Metastore: schemas and statistics for data acquisition and query optimization
     2. Tables: analogous to tables in relational databases, with each table having an HDFS directory
     3. Partitions: data in a table directory is partitioned into subdirectories of that directory
     4. Buckets: data in each partition may in turn be divided into buckets based on the hash of a column in the table
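
To make tables, partitions and buckets concrete, the sketch below writes a HiveQL definition for a partitioned, bucketed table and prints a simplified bucket assignment. The table page_views, its columns, the bucket count of 4 and the directory hive-sketch are all hypothetical. The arithmetic assumes that Hive hashes a non-negative integer column to its own value and takes it modulo the number of buckets, which is a simplification for illustration only.

```shell
#!/usr/bin/env bash
set -eu

WORK_DIR=./hive-sketch   # hypothetical scratch directory
mkdir -p "$WORK_DIR"

# Each distinct view_date value becomes a subdirectory of the table directory
# (a partition); rows inside a partition are spread over 4 bucket files
# according to the hash of user_id.
cat > "$WORK_DIR/page_views.hql" <<'EOF'
CREATE TABLE IF NOT EXISTS page_views (user_id INT, url STRING)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
EOF

# Simplified bucket assignment for an integer column (assumption: hash == value).
NUM_BUCKETS=4
for user_id in 3 10 17; do
    echo "user_id=$user_id -> bucket $(( user_id % NUM_BUCKETS ))"
done

# Run the definition only when Hive is installed.
if command -v hive >/dev/null 2>&1; then
    hive -f "$WORK_DIR/page_views.hql"
else
    echo "hive not found on PATH; wrote $WORK_DIR/page_views.hql only" >&2
fi
```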

Hive: Hive vs. RDBMS [6]
Relational DBMS
     1. Small data volumes
     2. Structured data
     3. Real-time response and low latency
Hive
     1. Large data volumes
     2. Structured or unstructured data
     3. Scalable, extensible, batch job handling

Hive: Hive Installation
Download and setup [7]
     1. wget http://apache.mirrors.hoobly.com/hive/stable/apache-hive-0.13.0-bin.tar.gz
     2. sudo tar -zxvf (hive-install)
     3. sudo mv (hive-install) /usr/local/hive
Environment configuration
     1. export HIVE_PREFIX=/usr/local/hive
     2. export PATH=$PATH:$HIVE_PREFIX/bin
Launch: hive

Hive: Hive Usage [7]
Launch Hive
     $ hive
Create table
     CREATE TABLE books(id INT, name STRING, author STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Loading data (LOCAL is optional; with it the path refers to the local file system, without it to HDFS)
     LOAD DATA [LOCAL] INPATH 'books.txt' INTO TABLE books;
Extracting data
     SELECT * FROM books LIMIT 10;
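
The statements above can be tried end to end with a small script. This is a sketch, not part of the original slides: the directory hive-books-sketch and the two rows in books.txt are placeholders invented for illustration, and IF NOT EXISTS is added so that the script can be rerun. Hive is invoked only if it is installed.

```shell
#!/usr/bin/env bash
set -eu

WORK_DIR=./hive-books-sketch   # hypothetical scratch directory
mkdir -p "$WORK_DIR"

# Comma-separated sample rows matching books(id, name, author).
# These rows are placeholders, not data from the presentation.
cat > "$WORK_DIR/books.txt" <<'EOF'
1,Example Title One,Example Author A
2,Example Title Two,Example Author B
EOF

# The statements from the slide, collected into one script file.
cat > "$WORK_DIR/books.hql" <<'EOF'
CREATE TABLE IF NOT EXISTS books(id INT, name STRING, author STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'books.txt' INTO TABLE books;

SELECT * FROM books LIMIT 10;
EOF

if command -v hive >/dev/null 2>&1; then
    # Run from the scratch directory so the relative path 'books.txt' resolves.
    (cd "$WORK_DIR" && hive -f books.hql)
else
    echo "hive not found on PATH; wrote books.txt and books.hql only" >&2
fi
```

Note that each run of LOAD DATA without OVERWRITE appends to the table, so repeated runs will show the sample rows more than once.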

References I
[1] IBM Documentation of Big Data. Available at http://www-01.ibm.com/software/in/data/bigdata/. Downloaded in March 2014.
[2] T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2009.
[3] Apache Documentation. Available at http://hadoop.apache.org/docs/r0.18.2/. Downloaded in May 2014.
[4] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, Hive: a warehousing solution over a map-reduce framework, Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626-1629, 2009.

References II
[5] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, Hive - a petabyte scale data warehouse using Hadoop, in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pp. 996-1005, IEEE, 2010.
[6] W. Chen, K.-C. Yin, D.-L. Yang, and M.-C. Hung, Data migration from grid to cloud computing, Applied Mathematics & Information Sciences, vol. 7, no. 1, 2013.
[7] Hive Documentation Wiki. Available at https://cwiki.apache.org/confluence/display/hive. Downloaded in May 2014.