Big Data Weather Analytics Using Hadoop

Veershetty Dagade #1, Mahesh Lagali #2, Supriya Avadhani #3, Priya Kalekar #4
Professor, Computer Science and Engineering Department, Jain College of Engineering, Belgaum, Karnataka, India

Abstract
We want to build a platform that is flexible and scalable enough to analyze petabytes of data across a wide and growing range of weather variables. In this paper we work on data analysis using Apache Hadoop and Apache Spark. We perform experiments to decide which of the Hadoop tools, Pig or Hive, handles the queries better, and we compare their performance on a pseudo-distributed single node and on a distributed multinode Hadoop cluster.

Index Terms
Apache Hadoop, Big Data, Hive, Multinode Cluster, Singlenode Cluster.

I. INTRODUCTION
Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, is unstructured, and does not fit the structures of conventional database architectures. To gain value from this data we need an alternative way to process it. Examples of sources that generate such large amounts of data are Facebook, Twitter, weather stations, the New York Stock Exchange, and worldwide electric transmission networks. In our project we deal with a huge amount of unstructured weather data. Our paper focuses on moving from single-node data processing to processing on the Hadoop Distributed File System for faster results, and on finding the best technique to process the queries.

Weather forecasting has always been a big challenge for meteorologists, who must predict the state of the atmosphere at some future time and the weather conditions that may be expected. Knowing the future state of the weather is important for individuals and organizations alike.
Accurate weather forecasts can tell a farmer the best time to plant, an airport control tower what information to send to planes that are landing and taking off, and residents of a coastal region when a hurricane might strike. Humans have been looking for ways to forecast the weather for centuries. Scientifically based weather forecasting was not possible until meteorologists were able to collect data about current weather conditions from a relatively widespread system of observing stations and organize that data in a timely fashion. Vilhelm and Jacob Bjerknes developed a weather station network in the 1920s that allowed for the collection of regional weather data. The weather data collected by the network could be transmitted nearly instantaneously by use of the telegraph, invented in the 1830s by Samuel F. B. Morse. The age of scientific forecasting, also referred to as synoptic forecasting, was under way.

In the United States, weather forecasting is the responsibility of the National Weather Service (NWS). The future modernized structure of the NWS will include 116 weather forecast offices (WFO) and 13 river forecast centers, all collocated with WFOs. Global weather data are collected at more than 1,000 observation points around the world and then sent to central stations maintained by the World Meteorological Organization, a division of the United Nations.

There is therefore a need for a flexible platform that can maintain this Big Data and support weather forecasting based on it. The open-source Apache Hadoop and Apache Spark projects address this need: they provide high-speed clustered processing for analyzing large data sets efficiently. Hadoop and MapReduce are among the most widely used models for Big Data processing today. Hadoop is an open-source, large-scale data processing framework that supports distributed processing of large chunks of data using simple programming models.
The Apache Hadoop project consists of HDFS and Hadoop MapReduce in addition to other modules. The software is designed to harness the processing power of clustered computing while managing failures at the node level.

Apache Spark is a newer competitor in the Big Data field. Spark's design is not tied to MapReduce, and it has proved to be 100 times faster than Hadoop MapReduce in certain cases. Spark supports in-memory computing and performs much better on iterative algorithms, where the same code is executed multiple times and the output of one iteration is the input of the next one.

Apache Pig and Hive are two projects that are layered on top of Hadoop and provide higher-level languages for using Hadoop's MapReduce library. Pig provides a scripting language to describe operations such as reading, filtering, transforming, joining, and writing data, which are exactly the operations that MapReduce was originally designed for. Hive offers an even more specific and higher-level language for querying data by running Hadoop jobs, instead of directly scripting, step by step, all operations of several MapReduce jobs on Hadoop. The language is, by design, very much SQL-like. Apache Hive is still intended as a tool for long-running, batch-oriented queries over massive data, and it is not real-time in any sense.

A. Hadoop
Hadoop is an open-source framework for processing large amounts of data across clusters of computers with the use of high-level data processing languages. Its modules provide easy-to-use languages, graphical interfaces and administration tools for handling petabytes of data on thousands of computers.

Figure 1: A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.

A Hadoop cluster is a set of commodity machines networked together in one location. Data storage and processing all occur within this cloud of machines. Different users can submit computing jobs to Hadoop from individual clients, which can be their own desktop machines in locations remote from the Hadoop cluster.

The two main components of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS provides clustered file system management for large datasets, with sizes ranging from gigabytes to petabytes. MapReduce is a tool for managing and processing vast amounts of unstructured data in parallel, based on dividing a big work item into smaller independent task units.

1) HDFS Architecture: HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.

Figure 2: HDFS Architecture and Daemons

2) MapReduce: The MapReduce programming model, originally introduced by Google in 2004 and now adopted by Apache Hadoop, consists of splitting large chunks of data and processing them in 'Map' and 'Reduce' phases. MapReduce processes large datasets in parallel using many computers running in a cluster. We can extend the Mapper class with our own instructions for handling various inputs in a specific manner. During the map phase, the master computer instructs the worker computers to process their local input data, and Hadoop then performs the shuffle process, which passes the intermediate results to the reducers. Finally, the master computer collects the results from all reducers and compiles them to answer the overall query.

Figure 3: Traditional MapReduce workflow. Map tasks operate on local blocks, but intermediate data transfer during the Reduce phase is an all-to-all operation.

B. Pig
Pig was initially developed by Yahoo Research in 2006 and was later moved into the Apache Software Foundation in 2007. Pig consists of a language and an execution environment. Pig's language, called Pig Latin, is a data flow language. Pig can operate on complex data structures, even those that have several levels of nesting. Pig does not require the data to have a schema, so it is well suited to processing unstructured data. Pig is relationally complete like SQL, which means it is at least as powerful as relational algebra.

Pig is made up of two pieces: the language used to express data flows, called Pig Latin, and the execution environment used to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.

Pig is only one part of the Hadoop ecosystem, but it is the part that makes the distributed platform usable by a much wider group of users. Figure 4 shows the Hadoop ecosystem. At its core, Hadoop consists of only HDFS (Hadoop Distributed File System) and MapReduce. Using this core alone, it is possible to develop and execute distributed applications. All other elements simplify the platform and extend its capabilities. Data can be migrated from non-Hadoop data stores into HDFS using Sqoop. Pig, Hive (a data warehouse system) and Scalding (a Scala library for building MapReduce applications) are some of the applications that make use of the Hadoop MapReduce platform. Pig is important because it simplifies the processing of data for decision making, which is the core task of Hadoop.

Figure 4: Pig Ecosystem

C. Hive
Hive is a data warehouse solution for Hadoop that is flexible enough to work with large-scale data volumes, and it eliminates the need to write long and complex Java programs. Hive bridges the gap between low-level Java programming for Hadoop and SQL, and it builds on Hadoop and supports partitioning for scalability and performance. Because Hive runs its queries as Hadoop batch jobs, Hive queries have high latency even when the datasets involved are small. For interactive data browsing, queries over small data sets, or test queries, Hive tries to provide acceptable latency. It was not designed to offer real-time queries, row-level updates or online transaction processing. Hive works well with batch jobs over large sets of immutable data.

Figure 5: Hive Architecture

User Interface: Hive is a data warehouse tool that acts as an interface between the user and HDFS. Hive supports the Hive command line, Hive HDInsight (on Windows Server) and the Hive Web UI as user interfaces.
Metastore: The schema or metadata of tables, the columns in each table, databases, their data types and the HDFS mapping are stored in the database server chosen by Hive.

HiveQL Process Engine: To query the schema information in the metastore, Hive uses HiveQL, which is similar to SQL. It can replace the MapReduce approach of writing long programs: we write a query for the MapReduce job and process it, instead of writing MapReduce programs in Java.

Execution Engine: The Hive execution engine connects the HiveQL process engine and MapReduce. The execution engine processes the query and generates results that are the same as the results of the equivalent MapReduce job.

HDFS: HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters.

II. WEATHER DATA ANALYTICS
The Big Data gathered by the NCDC (National Climatic Data Center) is collected across more than 116 weather stations and more than 1,000 observation centers. The data generated by them is unstructured, which makes it challenging to analyze. A sample of the raw data is shown below.

Figure 6: Unstructured weather data

This data is transformed into an understandable format using Java programs. The huge amount of data is then loaded onto the Hadoop Distributed File System. The file system spans a number of nodes in the cluster, and once the data is loaded onto HDFS it is balanced across these nodes. HDFS is fault tolerant: because the data is replicated among the nodes, the whole setup does not crash if any one node fails, and HDFS serves the data from a replica instead. Thus, once the semi-structured data is uploaded onto HDFS, it can be used by the analysis tools. We use Apache Hive and Apache Pig to process the data; both use mappers and reducers to do so.

We use sample Pig queries on the weather dataset to compute the average temperature per year per station, the yearly minimum temperature and the yearly precipitation. One such query is shown below.

    -- Load the comma-separated weather file from HDFS (no schema declared)
    A = LOAD '/user/hdfs/weather/weather.csv' USING PigStorage(',');
    -- Project station, year (first four characters of the date), tmax and tmin
    B = FOREACH A GENERATE $1 AS station, (int) SUBSTRING((chararray) $2, 0, 4) AS year, (int) $4 AS tmax, (int) $5 AS tmin;
    -- Keep only records with tmax greater than -9999
    C1 = FILTER B BY tmax > -9999;
    D1 = GROUP C1 BY (station, year);
    -- Average tmax per station and year
    E1 = FOREACH D1 GENERATE group.station AS station, (int) group.year AS year, AVG(C1.tmax) AS avg_tmax;
    STORE E1 INTO '/user/hdfs/weather_pig3.csv' USING PigStorage(',');

This query first loads the data file to be analyzed and then generates the schema for that unstructured file. It then filters the whole data set for 'tmax > -9999', groups the remaining records by station and year, and computes the average tmax for each group. Finally, we can either store the result in a file on HDFS or dump it directly to the terminal.

Similarly, Hive queries operate on the unstructured data. A sample query for the weather data is as follows.

    CREATE EXTERNAL TABLE weather1 (stationcode STRING, station STRING, datefield STRING, prcp DOUBLE, tmax DOUBLE, tmin DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/user/hdfs/weather.csv' OVERWRITE INTO TABLE weather1;
    SELECT station, datefield, tmax FROM weather1 WHERE tmax <> [...] ORDER BY tmax DESC;

Here we first create the external table, which provides the schema for the huge unstructured data.
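The Pig query shown earlier in this section can also be read as a single map, shuffle and reduce pass. The following is a minimal Python sketch of that reading, not the code used in the experiments. It assumes the same column layout as the queries (station in column 1, date in column 2, tmax in column 4) and a date that starts with a four-digit year; the station codes, station names and readings in SAMPLE are hypothetical values made up for illustration.

```python
import csv
import io
from collections import defaultdict

MISSING = -9999  # threshold used by the FILTER step of the Pig query


def map_record(row):
    """Map phase: emit ((station, year), tmax) for one CSV row, or nothing."""
    station, date, tmax = row[1], row[2], int(row[4])
    if tmax > MISSING:
        yield (station, int(date[:4])), tmax


def shuffle(pairs):
    """Shuffle phase: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce phase: average the tmax values of one (station, year) group."""
    return key, sum(values) / len(values)


def average_tmax(lines):
    """Run map, shuffle and reduce over an iterable of CSV lines."""
    rows = csv.reader(lines)
    pairs = (pair for row in rows for pair in map_record(row))
    return dict(reduce_group(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample rows: stationcode, station, date, prcp, tmax, tmin
SAMPLE = """\
S001,BELGAUM,20140101,0,310,180
S001,BELGAUM,20140102,0,330,190
S001,BELGAUM,20150101,5,-9999,170
S002,PUNE,20140101,0,300,200
"""

result = average_tmax(io.StringIO(SAMPLE))
print(result)
```

On a real cluster the map and reduce functions would run on different nodes and the shuffle would move data over the network; here all three phases run in one process, which is enough to check a query's logic on a small sample before submitting it to Hadoop.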
The data is then loaded from the path given in the query, and finally the SELECT query is run on the data and its results are displayed on the terminal. The following image shows a snapshot of the Hive query at work.

Figure 7: Working of the Hive query on the terminal

Using various such queries, the data is visualized for various weather attributes, such as the yearly maximum temperature.

Figure 8: Yearly maximum temperature for the last 5 years

III. PERFORMANCE ASSESSMENT
As our project aims to work with the fastest tool, we evaluate the performance of Pig and Hive queries against each other. The comparison between the performance of Pig and Hive is given below. In terms of data analysis speed and efficiency, Hive proves to be better than Pig.

Figure 9: Pig and Hive query performance (series: Pig, Hive; queries 1st to 5th)

Thus, according to the experiment carried out, Hive processes the data faster than Pig.

Single Node and Distributed Node Processing
Our project also focuses on shifting from single-node processing to multinode clustered processing. A single node carries out the whole task on its own, so it is inefficient for analyzing Big Data and may even cause the system to break down. The comparison between processing on a pseudo node and on multiple nodes is given below.

Figure 10: Single node and Multinode Hive Performance (series: Single Node Hive, Multinode Hive; queries 1st to 4th)

Figure 11: Single node and Multinode Pig Performance (series: Single node Pig, Multi node Pig; queries 1st to 4th)

Thus the experiments show that clustered processing performs far better than pseudo-node processing. The main idea behind the multinode setup is that the master divides the work among multiple clients, or slaves. These slaves work on the MapReduce concept, each processing an independent chunk of data, and the setup therefore proves to be faster.

IV. CONCLUSION
With the increasing amount of data produced daily, it is impossible to process and analyze the data on a single system, and thus there is a need for a multiple-node HDFS system. Once the data is moved to HDFS, Hive proves to be the better tool for analyzing huge volumes of data. Huge weather data can thus be processed easily and efficiently on high-end systems using the Hadoop Distributed File System. The query tools make the analytics much easier by providing random access to Big Data.