VOL. 5, NO. 2, August 2015 ISSN 2225-7217 ARPN Journal of Systems and Software 2009-2015 AJSS Journal. All rights reserved

Similar documents
Market Basket Analysis Algorithm on Map/Reduce in AWS EC2

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data

Big Data Analytics OverOnline Transactional Data Set

Modernizing Your Data Warehouse for Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

The Inside Scoop on Hadoop

Big Analytics in the Cloud. Matt Winkler PM, Big

Introduc)on to RHadoop Master s Degree in Informa1cs Engineering Master s Programme in ICT Innova1on: Data Science (EIT ICT Labs Master School)

Bringing Big Data to People

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Microsoft Research Microsoft Azure for Research Training

Hadoop in the Hybrid Cloud

Big Data on Microsoft Platform

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

BIG DATA What it is and how to use?

Accessing bigger datasets in R using SQLite and dplyr

A Brief Outline on Bigdata Hadoop

Microsoft Research Windows Azure for Research Training

Hadoop Big Data for Processing Data and Performing Workload

Data Refinery with Big Data Aspects

Open Source Technologies on Microsoft Azure

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Using distributed technologies to analyze Big Data

Workshop on Hadoop with Big Data

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Hadoop Technology for Flow Analysis of the Internet Traffic

Implement Hadoop jobs to extract business value from large and varied data sets

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Improving Data Processing Speed in Big Data Analytics Using. HDFS Method

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Chase Wu New Jersey Ins0tute of Technology

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Visualization with Excel Tools and Microsoft Azure

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

Application Development. A Paradigm Shift

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data Analytics on Cab Company s Customer Dataset Using Hive and Tableau

International Journal of Innovative Research in Computer and Communication Engineering

Please give me your feedback

Hadoop and Big Data Research

Microsoft Analytics Platform System. Solution Brief

How To Handle Big Data With A Data Scientist

Hadoop and Map-Reduce. Swati Gore

A Study of Data Management Technology for Handling Big Data

Big Data Too Big To Ignore

Big Data and Industrial Internet

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data Explained. An introduction to Big Data Science.

BIG DATA TRENDS AND TECHNOLOGIES

Cloud Computing with Azure PaaS for Educational Institutions

Evolution from Big Data to Smart Data

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

ASAM ODS, Peak ODS Server and openmdm as a company-wide information hub for test and simulation data. Peak Solution GmbH, Nuremberg

Oracle Big Data SQL Technical Update

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Microsoft Big Data. Solution Brief

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Distributed Framework for Data Mining As a Service on Private Cloud

Big Data Weather Analytics Using Hadoop

HDP Hadoop From concept to deployment.

Manifest for Big Data Pig, Hive & Jaql

and Hadoop Technology

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Data processing goes big

CASE STUDY OF HIVE USING HADOOP 1

BIG DATA USING HADOOP

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

On a Hadoop-based Analytics Service System

Integrating Big Data into the Computing Curricula

Advanced Analytics & IoT Architectures

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Transforming the Telecoms Business using Big Data and Analytics

HiBench Introduction. Carson Wang Software & Services Group

Moving From Hadoop to Spark

Keywords: Big Data, Hadoop, cluster, heterogeneous, HDFS, MapReduce

ITG Software Engineering

SELLING PROJECTS ON THE MICROSOFT BUSINESS ANALYTICS PLATFORM

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Assignment # 1 (Cloud Computing Security)

Big Data Scenario mit Power BI vs. SAP HANA Gerhard Brückl

Fast Data in the Era of Big Data: Twitter s Real-

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

Modern Data Warehousing

Register on projectbotticelli.com. Introduction to BI & Big Data DAX MDX Data Mining

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Constructing a Data Lake: Hadoop and Oracle Database United!

In-Memory Analytics for Big Data

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Apriori-Map/Reduce Algorithm

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

UPS battery remote monitoring system in cloud computing

High Performance Predictive Analytics in R and Hadoop:

Suresh Lakavath csir urdip Pune, India

Transcription:

Big Data Analysis of Airline Data Set using Hive Nillohit Bhattacharya, 2 Jongwook Woo Grad Student, 2 Prof., Department of Computer Information Systems, California State University Los Angeles nbhatta2 @ calstatela.edu, 2 jwoo5@calstatela.edu ABSTRACT In this paper, the analysis of the airline data set is performed using Microsoft Azure HDInsight which runs Hadoop in the cloud. Hive and Hive QL statements have been used for querying the data. Data visualization has been done by extracting the output of the HIVE query in excel and plotting the data using line and scatter plot charts. The visualization of the data shows some patterns that exist between flight diversions and flight distance, flight cancellation and flight distance and so forth. Keywords: Hadoop, HDInsight, Big Data, Hive, Data Visualization, Data Analysis. INTRODUCTION There is no doubt that a lot of excitement exists with the term Big Data. Big Data in simple words can be large-scale data which does not have a well-defined structure. The size of the data itself is so huge that it is not practically easy for a single computer to store and process all the data by itself. Traditional computing approached the problem in a different way, the focus was always to increase the processing speed and power of the computer. As the data grows exponentially, the processing power of the single computer becomes a bottleneck and thus a new approach was needed to address the issue at hand. A new way was developed where many non-expensive commodity computers all working together in harmony with each other, in order to store and process this big data in parallel that allows us to extract meaningful information from a large data set. Moreover, current technologies using the cloud infrastructure allows us to easily create clusters of computers by renting them for as much time as required and then releasing the computing resources when no longer needed. Thus with cloud technologies we get the computing power of the clusters of computers with minimal investment. The airline data has been taken from the United States Department of Transportation, Bureau of Transportation Statistics []. The data consists of the arrival and departure records of all US domestic flights from the period 202 to 204. Section 2 gives a brief introduction of the Airline Data set, Hive and HDInsight. Section 3 describes the mechanism by which the data set is analyzed. Section 4 describes the experimental observations from the data set. Section 5 is the conclusion of this paper. 2. AIRLINE DATA, HIVE AND HDINSIGHT This section briefly describes the characteristics of the Airline Data set, introduces HIVE and HDINSIGHT. 2. Characteristics of the Airline Data Total number of files: 36. File type: csv (comma separated values) Total file size: 3.65 GB. Total number of records: 8.2 Million (8,286,090) 2.2 Hive Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. HQL statements are broken down into MapReduce jobs and executed across a Hadoop cluster. Even though, HQL statements are similar to SQL statements, there are several key differences because Hive is based on Hadoop and MapReduce Operations. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, queries tend to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times. Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations [2]. 2.3 HD Insight Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. Hadoop often refers to the entire Hadoop ecosystem of components, which includes Storm and HBase clusters, as well as other technologies under the Hadoop umbrella. Azure HD Insight deploys and provisions Hadoop clusters in the cloud, by using either Linux or Windows as the underlying Operating System [3]. 3. ANALYSIS OF AIRLINE DATA USING HDINSIGHT In order to analyze the Airline Data, the data needs to be first saved to the Azure Blob Storage [4], which is a cloud data storage service provided by Microsoft Azure. For transferring the Airline Data to the Azure Blob storage, a client utility program CloudBerry Explorer for Azure Blob Storage was used [5]. 22

Azure Blob storage is a robust, general-purpose storage solution that integrates seamlessly with HDInsight. Through a Hadoop distributed file system (HDFS) interface, the full set of components in HDInsight can operate directly on structured or unstructured data in Blob storage. Storing data in Blob storage provides the ability to safely delete the HDInsight clusters that are used for computation without losing user data. Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Blob storage. It enables the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Azure Blob storage and HDFS are distinct file systems that are optimized for storage of data and computations on that data [6]. Once the Azure Blob storage account is created and the data is transferred, the HDInsight cluster can be launched from Microsoft Azure Portal. The overall structure of the system is as shown in the Fig below. ARR_DELAY BIGINT, ARR_DELAY_NEW BIGINT, ARR_DEL5 BIGINT, ARR_DELAY_GROUP BIGINT, ARR_TIME_BLK STRING, CANCELLED BIGINT, CANCELLATION_CODE STRING, DIVERTED BIGINT, CRS_ELAPSED_TIME STRING, ACTUAL_ELAPSED_TIME STRING, AIR_TIME BIGINT, FLIGHTS BIGINT, DISTANCE BIGINT, DISTANCE_GROUP BIGINT, CARRIER_DELAY BIGINT, WEATHER_DELAY BIGINT, NAS_DELAY BIGINT, SECURITY_DELAY BIGINT, LATE_AIRCRAFT_DELAY BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'wasb://hdgradstudy@gradstudy.blob.core.windows.net/ Airline_Data'; above job is 4.97 seconds. 4. EXPERIMENTAL RESULTS For analyzing the Airline Data, a cluster of 4 data nodes (4 compute machines) running Microsoft Windows Server 202 R2 Datacenter operating system was launched. Hive runs on the cluster by default when the cluster is up and running. Using the Hive query console, the data is analyzed as follows: 4. Total Number of Flights Cancelled Each Month for flights that were cancelled every month from 202 SELECT YEAR, MONTH, COUNT (CANCELLED) AS TOTAL_CANCELLED Figure : System architecture for analyzing the airline data set 3. Hive Query for Creating an External Table in Hive Once the HD Insight cluster is up and running, Hive queries can be executed on the cluster from the Hive query console window. The following HQL statement is executed to create an external table Airline. Note that the following HQL statements are very similar to SQL statements. DROP TABLE IF EXISTS Airline; CREATE EXTERNAL TABLE Airline (YEAR BIGINT, MONTH BIGINT, FL_DATE STRING, CARRIER STRING, TAIL_NUM STRING, ORIGIN STRING, DEST STRING, CRS_DEP_TIME STRING, DEP_TIME STRING, DEP_DELAY BIGINT, DEP_DELAY_NEW BIGINT, DEP_DEL5 BIGINT, DEP_DELAY_GROUP BIGINT, DEP_TIME_BLK STRING, TAXI_OUT BIGINT, WHEELS_OFF STRING, WHEELS_ON STRING, TAXI_IN BIGINT, CRS_ARR_TIME STRING, ARR_TIME STRING, GROUP BY YEAR, MONTH ORDER BY YEAR, MONTH above job is 22.496 seconds. 4.2 Total Number of Flights Diverted Each Month for flights that were diverted every month from 202 to 204 is: SELECT YEAR, MONTH, COUNT (DIVERTED) AS TOTAL_DIVERTED GROUP BY YEAR, MONTH ORDER BY YEAR, MONTH 23

above job is 28.767 seconds. The figure Fig (2) below shows the trend of cancelled and diverted flights by month from January to December for the period 202 to 204. Figure 3: Cancelled and Diverted Flights by Year Figure 2: Cancelled and Diverted Flights by Month 4.3 Total Number of Flights Cancelled Each Year for flights that were cancelled every year from 202 to 204 is: SELECT YEAR, COUNT (CANCELLED) AS TOTAL_CANCELLED GROUP BY YEAR ORDER BY YEAR above job is 36.229 seconds. 4.5 Effect of Flight Distance on Diversions for 202-204 diverted flights as a function of flight distance from 202 SELECT DISTANCE, COUNT (DIVERTED) AS DIVERTED_COUNT ORDER BY DIVERTED_COUNT DESC LIMIT 500; above job is 37. seconds. The figure Fig (4) below shows the scatter plot of the total number of flight diversions with respect to flight distance (in miles) for the period 202 to 204. 4.4 Total Number of Flights Diverted Each Year for flights that were diverted every year from 202 SELECT YEAR, COUNT (DIVERTED) AS TOTAL_DIVERTED GROUP BY YEAR ORDER BY YEAR above job is 23.7 seconds. The figure Fig (3) below shows the trend of cancelled and diverted flights by year for the period 202 to 204. Figure 4: Number of flight diversions vs flight distance 4.6 Effect of Flight Distance on Cancellations for cancelled flights as a function of flight distance from 202 SELECT DISTANCE, COUNT (CANCELLED) AS CANCELLED_COUNT ORDER BY CANCELLED_COUNT DESC 24

LIMIT 500; above job is 02.096 seconds. The figure Fig (5) below shows the scatter plot of the total number of flight cancellations with respect to flight distance (in miles) for the period 202 to 204. SELECT MONTH, AVG (DEP_DELAY) AS GROUP BY MONTH ORDER BY MONTH; above job is 05.263 seconds. The figure Fig (7) below shows the trend of the average flight departure delay as a function of month from January to December for the period 202 to 204. Figure 5: Number of flight cancellations vs flight distance 4.7 Effect of Flight Distance on Average Departure Delay for The Hive QL for querying the average flight departure delay as a function of flight distance from 202 SELECT DISTANCE, AVG (DEP_DELAY) AS ORDER BY DESC LIMIT 000; above job is.69 seconds. The figure Fig (6) below shows the scatter plot of the average flight departure delay (in minutes) with respect to flight distance for the period 202 to 204. Figure 7: Average departure delay vs month 4.9 Yearly Average Departure Delay for The Hive QL for querying the yearly average flight departure delay for 202 SELECT YEAR, AVG (DEP_DELAY) AS GROUP BY YEAR; above job is 76.79 seconds. The figure Fig (8) below shows the trend of the average flight departure delay as a function of year for the period 202 to 204. Figure 6: Average departure delay vs flight distance 4.8 Monthly Average Departure Delay for The Hive QL for querying the monthly average flight departure delay for 202 Figure 8: Average departure delay vs year 4.0 Total Number of Flights Each Year for 202-204 flights per year for 202 25

SELECT COUNT (*) WHERE YEAR = 204; above job is 0.428 seconds. The figure Fig (9) below shows the trend of the total number of flights each year for the period 202 to 204. http://www.transtats.bts.gov/dl_selectfields.asp? Table_ID=236 [2] What is Hive? http://www- 0.ibm.com/software/data/infosphere/hadoop/hive/. [3] Introduction to Hadoop in HDInsight: Big-data analysis and processing in the cloud, https://azure.microsoft.com/enus/documentation/articles/hdinsight-hadoopintroduction/ [4] An Introduction to Windows Azure BLOB Storage, https://www.simple-talk.com/cloud/cloud-data/anintroduction-to-windows-azure-blob-storage-/ [5] Explorer for Microsoft Azure Storage: Freeware Client,http://www.cloudberrylab.com/freemicrosoft-azure-explorer.aspx Figure 9: Total number of flights vs year 5. CONCLUSION From the above experimental results, we can see that interesting sets of trends and patterns exists in large data sets which helps us to get a better understanding of the data. Recent advancement in cloud technologies helps us to harness the power of parallel processing of a cluster of computers with little investment and almost no maintenance of the underlying computer hardware. From the experimental results we also see the following observations: a. Average flight departure delay is at peak during the months of June and July every year and there is a sharp increase in the average delay from November to December. b. Average flight departure delay is increasing continuously over the period 202 to 204 in spite of the fact that the total number of flights have decreased from 203 to 204. c. Highest average departure delay for flights has been observed for flight distance of less than 500 miles. d. Highest numbers of flights which are cancelled have a flight distance of less than 000 miles. e. There is a trend of increasing flights which are cancelled every year from 202 to 204. f. There is a sharp rise in the number of flights which are cancelled from the month of November to January every year for 202 to 204. REFERENCES [] Airline Data Set, United States Department of Transportation, Office of the Assistant Secretary for Research and Technology, Bureau of Transportation Statistics, [6] Upload data for Hadoop jobs in HDInsight, https://azure.microsoft.com/enus/documentation/articles/hdinsight-upload-data/ [7] Market Basket Analysis Algorithms with MapReduce, Jongwook Woo, DMKD-0050, Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Oct 28 203, Volume 3, Issue 6, pp445-452, ISSN 942-4795 AUTHOR PROFILE Nillohit Bhattacharya is currently pursuing his Master s degree in Information Systems at California State University Los Angeles. He is a member of the HiPIC (High-Performance Information Computing Center) at the Computer Information Systems Department of the University. He completed his Bachelor degree in Electronics from Pune University, India in 2005 and Post Graduate Program in Management from Symbiosis International University, India in 20. His interests include learning new and emerging technologies in Big Data and Analyzing Data Sets to extract patterns and information from them. Jongwook Woo is currently a Full Professor at Computer Information Systems at California State University Los Angeles. He is a director of the HiPIC (High-Performance Information Computing Center) at the university. He received his BS and the MS degree, both in Electronic Engineering from Yonsei University in 989 and 99, respectively. He obtained his second MS degree in Computer Science and received the PhD degree in Computer Engineering, both from University of Southern California in 998 and 200, respectively. His research interests are Information Retrieval /Integration /Sharing on Big Data, Map/Reduce, In-Memory Processing, and functional algorithm on Hadoop Parallel/Distributed/Cloud Computing, and n-tier Architecture application in e-business, smartphone, social networking and bioinformatics applications. He has published more than 40 peer reviewed conference and journal papers. He also has consulted many entertainment companies in Hollywood. 26