Big Data Time Series Analysis


Big Data Time Series Analysis Integrating MapReduce and R

Visakh C R
Department of Information and Computer Science, Aalto University, Espoo, Finland

Abstract

The explosion of data, especially from sensor devices, calls for new and innovative methods of storage, processing and analysis. In this paper, we look at the integration of big data technologies with the statistical computing framework R. We tackle various time series analysis use cases and also develop a proof-of-concept frontend application which can interact with the data and perform the computations.

Keywords: time series, machine learning, big data, Apache Hadoop, MapReduce, Apache Hive, R

I. INTRODUCTION

We live in the era of data. Web logs, internet media, social networks and sensor devices are generating petabytes of data every day. The advent of Internet-of-Things (IoT) devices has been heralded as a game changer. It has been estimated that there will be more than 25 billion IoT devices by 2020 [1]. These devices generate a huge amount of data, and it is imperative to have a proven and scalable mechanism to store, process and analyze it. In this paper, we propose and develop a framework for storing and analyzing sensor data. Sensor data is predominantly a time series. Therefore, we integrate a proven big data platform, Apache Hive, with one of the most powerful open source statistical tools, R. We develop an architecture for the combined system and discuss how to seamlessly integrate the technologies. We discuss various machine learning use cases which are implemented using the combined power of R and Hive. We also provide a proof-of-concept dashboard developed in Shiny, which enables users to perform various analytical and machine learning computations through a web based frontend.

This paper is organized as follows: Section II introduces the various technologies and frameworks used. Section III details the data, collection methods, data models and preprocessing steps.
In Section IV, we discuss selected use cases and their implementation. Section V details the experiments and the results; we also discuss the dashboard implementation of the use cases using Shiny. Section VI summarizes the project and discusses the takeaways, and briefly outlines the future direction for this project.

II. TOOLS & TECHNOLOGIES

In this section, we discuss the various tools and technologies used in this project. We use Python scripts for collecting the data. The data is then loaded into Apache Hive tables. Hive queries spawn MapReduce jobs, which form the core programming model of Apache Hadoop. We will discuss the architecture and implementation of Hive. For implementing the time series analyses and machine learning, we use R and various R libraries. For developing the dashboard, we utilize Shiny.

A. Python

Python is a general purpose, high-level programming language. First released in 1991, Python has been widely adopted in the programming world and has been ranked among the top ten most widely used programming languages in the past few years [2]. In this project, we use Python to collect the sensor data from an HTTP server and generate the source data files needed for loading the Hive tables.

B. Apache Hadoop

Apache Hadoop is an open-source framework that allows the distributed processing of large datasets across clusters of computers. The essence of Hadoop lies in two components: a storage component known as the Hadoop Distributed File System (HDFS) and a processing component known as MapReduce [3]. HDFS is a scalable and distributed file system. It stores large files across multiple machines. The data is replicated, thereby providing high reliability. MapReduce is the programming paradigm which drives Hadoop jobs. A MapReduce job consists of two components: a Map component which performs filtering and sorting, and a Reduce component which performs the summary operation. To elaborate further, the Map component takes an input pair and produces key-value pairs. These key-value pairs are collected and provided as input to the Reduce component. The Reduce component merges the values based on the key [4]. Fig. 1 shows a high-level flow of a MapReduce job [5].
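The map/shuffle/reduce flow described above can be illustrated with a minimal in-memory sketch (not Hadoop itself; the sensor ids and speeds are made-up values mirroring the paper's use case):

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: average speed per sensor.
readings = [(4, 62.0), (5, 55.5), (4, 58.0), (5, 60.5)]
mapper = lambda r: [(r[0], r[1])]           # emit (sensor_id, speed)
reducer = lambda k, vs: sum(vs) / len(vs)   # average the speeds per key
result = reduce_phase(shuffle_phase(map_phase(readings, mapper)), reducer)
print(result)  # {4: 60.0, 5: 58.0}
```

In Hadoop the shuffle is performed by the framework across machines; here it is a single dictionary, but the mapper/reducer contract is the same.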
Fig. 1 MapReduce job flow

C. Apache Hive

Apache Hive is a data warehouse platform built on top of Hadoop. It facilitates the querying of large datasets stored in HDFS. Hive provides a declarative SQL-like query language called Hive Query Language (HQL), which translates internally to MapReduce programs, thereby providing a much simpler way to query, analyze and process the data [6]. Hive supports User Defined Functions (UDFs), which help to extend Hive and implement your own logic for processing the data. We utilize a few Hive UDFs in this project, for data processing as well as for computing moving averages and moving standard deviations. These UDFs are traditionally written in Java, though support for Python has been improving recently.

D. R

R is a programming language predominantly used for statistical computing. It is possible to perform time series analysis, classification, clustering, linear and nonlinear modeling and various visualization methods in R [7]. The R platform includes thousands of packages which extend and add more functionality. The major packages used in this project are:

a) RHive: RHive is used to connect R with Hive and thereby perform distributed computing using Hive queries in R [8]. We connect to Hive tables in R scripts using HiveServer, extract the relevant data and process it using R functions.
b) shiny: Shiny enables building web applications with R [9]. Shiny forms the core of the web based dashboard that we develop in this project and will be explained in detail later.
c) TSA: A repository of time series analysis functions [10].
d) forecast: This package provides functions for univariate time series forecasting [11].
e) tseries: This package provides functionality for time series analysis [12].
f) cluster: This package provides methods for cluster analysis [13].
g) stats: The stats package provides a wide array of statistical functions [14].

E. Shiny

Shiny is a web application framework in R. Shiny makes it easier to develop web applications, especially without writing any HTML, CSS or JavaScript [9]. In this project, we use Shiny to create various dashboards which help users to interact with the time series data and perform analysis and machine learning without touching any code. A selected few of these dashboards are showcased in Section V.

III. DATA & DATA MODELS

As discussed in the introductory section, this project is about analyzing huge amounts of time series data from sensors. In this section, we expand upon the type of data (what kind of sensor data), the data collection methods and the data preprocessing steps. We will also look at the data models developed for storing the data in Hive.

A. The Data

A time series can be defined as a sequence of data points measured over a time interval. There is a multitude of sensor devices around us generating different types of data. Since these devices measure various phenomena over time, they provide a sequence of data values over a time period; in other words, sensor devices produce time series data. In this project, we utilize data from traffic sensors. A partner company [15] has installed sensors in a five lane road in a city in Northern Finland. Each lane has its own sensor. Three lanes take traffic towards the south, while two lanes take traffic towards the north. The sensors measure the following values:

a) Sensor ID: Each sensor has been provided with a distinct sensor id. They range from 4 to 8.
b) Timestamp: The timestamp when the vehicle information was captured by the sensor.
c) Speed: The speed of the vehicle in km/h.
d) Height: The height of the vehicle in centimeters.
e) Length: The length of the vehicle in meters.
f) Snow Depth: The amount of snow on the road in millimeters.

B. Data Collection

The data from the sensors is accessed through an HTTP interface. The data collection is performed using a Python script.
In order not to overload the server, the script is executed every hour. Each execution of the script collects the data from the previous hour. The scheduling of the job is done using the cron daemon in Linux. The outline of the script is as follows: the script queries the HTTP server and reads the data using the Python library urllib2. Using another library called BeautifulSoup [16], the relevant data is extracted, cleaned and transformed into a Python list. This data is written into a CSV file.
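The extraction-and-write step can be sketched as follows. The real script uses urllib2 and BeautifulSoup on a live HTTP response; here a Python 3 stdlib parser and an inline HTML snippet stand in, and the row layout is hypothetical:

```python
import csv
import io
from html.parser import HTMLParser

# Inline stand-in for the HTTP response body (hypothetical layout).
SAMPLE = """
<table>
  <tr><td>4</td><td>2015-03-20 08:00:01</td><td>62.0</td></tr>
  <tr><td>5</td><td>2015-03-20 08:00:03</td><td>55.5</td></tr>
</table>
"""

class RowExtractor(HTMLParser):
    """Collect the cell texts of each <tr> into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

parser = RowExtractor()
parser.feed(SAMPLE)

# Write the extracted rows as CSV, ready for loading into the staging table.
buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
print(buf.getvalue())
```

In the real script the buffer would be a file on disk that the Hive LOAD step picks up each hour.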
C. Data Model

Apache Hive serves as the repository for storing the traffic sensor data. Hive provides an encapsulation of the data and serves it as a table, similar to a database table. We utilize staging tables to load the raw data from the CSV files. The data is transformed using SQL functions and loaded into a final target table. We now provide the data model and the transformations performed.

a) Staging table: To read the CSV file, we utilize a Hive Serializer/Deserializer (SerDe). The Deserializer reads a record and translates it to a Java object. The Serializer takes the Java object and converts it to a format which can be written to HDFS [17]. We also perform type conversions on the various attributes and cast them to the correct data types. In addition, we generate a derived column, day of the week: based on the timestamp, we utilize Hive UDFs to get an integer which specifies the day of the week, with Monday to Sunday represented as 1 to 7.

b) Target table: The data from the staging table is loaded into the target table. The data definition of the relevant columns is provided in Table 1.

Column Name | Data Type
ts          | TIMESTAMP
sensor_id   | INTEGER
speed       | DOUBLE
height      | DOUBLE
length      | DOUBLE
snow_depth  | DOUBLE
day_of_week | INTEGER

Table 1 Data definition of the target table

IV. USE CASES & IMPLEMENTATION

In this section, we discuss the various use cases of time series analysis and machine learning implemented in this project. We divide them into six categories: moving average, moving standard deviation, forecasting, clustering, correlation analysis and dashboards. For each use case, we first provide the mathematical and scientific background and then elaborate on its implementation in this project.

A. Moving Average

Moving average is a statistical calculation that creates a series of averages of subsets of the original time series.
For a series, the moving average is computed by taking the arithmetic mean of successive subsets of the series. The first element of the moving average is the arithmetic mean of the first n elements of the series. The window is then shifted forward, excluding the first element and including the next one, and the mean is computed again. This is repeated over the entire series. For a series x_1, x_2, ..., the moving average of order n can be defined as [18]:

\bar{x}_t = \frac{1}{n} \sum_{i=0}^{n-1} x_{t-i}    (1)

Moving average is used to smooth the fluctuations in the data. The resulting series helps to identify trends and cycles in the data. Moving average is also known as rolling average or running average.

In this project, we compute the moving average of the speed of the vehicles. From the resulting series, it becomes easier to understand the trends in vehicle speed on each lane for different times of day, different days of the week or different months. Moving average can be computed using R functions, but since R holds data objects in memory, the size of the data on which we can perform computations is limited by the memory of the system [19]. Since the data is already in HDFS, we implement a process to compute the moving average using MapReduce. The outline of the process is as follows:

i. Get the attribute name (in this case, the column of the Hive table) and the moving average time window.
ii. A Hive query extracts the data from HDFS and passes it as a Java object.
iii. Based on the time window provided, we compute the averages using a Hive UDF.
iv. The computed values are returned to R and then visualized.

By using a UDF, we are able to perform the computation using the MapReduce paradigm. The value returned from the UDF is a time series of the moving averages, which can then be plotted or analyzed using various R functions.

B. Moving Standard Deviation

Standard deviation is a measure of the spread of the values in a series. It measures the amount of variation of the values in a dataset [20].
In other words, standard deviation describes how tightly the values are clustered around the mean. It is calculated by taking the square root of the average of the squared deviations of the values from their mean. In this project, we compute the moving standard deviation of the speed of vehicles for various time periods. The objective is to understand how varied the speed observations are compared to the mean value. It also helps to understand the level of uncertainty in the speed forecasts. Moving standard deviation is also computed using a Java UDF in Hive, similar to the moving average process.
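The window computations performed by the two Hive UDFs can be sketched as follows (a minimal single-machine illustration with made-up speed readings, not the Java UDF code itself):

```python
import math

def moving_average(series, n):
    """Mean of each length-n window; mirrors the logic of the Hive UDF."""
    return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

def moving_std(series, n):
    """Population standard deviation of each length-n window."""
    out = []
    for i in range(len(series) - n + 1):
        window = series[i:i + n]
        mean = sum(window) / n
        out.append(math.sqrt(sum((x - mean) ** 2 for x in window) / n))
    return out

speeds = [60.0, 62.0, 58.0, 64.0, 66.0]   # made-up speed readings
print(moving_average(speeds, 3))          # first window (60+62+58)/3 = 60.0
print(moving_std(speeds, 3))
```

In the Hive UDF the same windowing is applied inside a MapReduce job, so the series never has to fit into R's memory.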
C. Forecasting

Time series forecasting involves the creation of a model to predict future values based on previous values. There are various techniques by which a forecasting model can be created and applied to the data. In this project, we use exponential smoothing and the autoregressive integrated moving average (ARIMA). In this section, we summarize these models and how they are implemented in this project. We also provide the details of the statistical tests used to validate the forecast models.

i. Forecast models

We introduce the two forecasting models used, exponential smoothing and ARIMA.

a. Exponential Smoothing

Exponential smoothing is generally used to create a smoothed time series. The concept is similar to moving average, but in exponential smoothing the weights of the previous values in the time series decrease exponentially over time. For a series x_t, the smoothed output s_t can be calculated using the following formula:

s_t = \alpha x_t + (1 - \alpha) s_{t-1}    (2)

where the smoothing factor \alpha is between 0 and 1. The model described by (2) is suitable for forecasting data with no trend or seasonal characteristics. Since we are dealing with traffic sensor data, especially the speed of vehicles, we utilize higher order exponential smoothing, namely Holt-Winters double exponential smoothing and triple exponential smoothing. These methods take the trend and seasonality into account by introducing extra terms in the model [21].

b. ARIMA

The autoregressive integrated moving average (ARIMA) model is a generalization of the autoregressive moving average (ARMA) model. ARMA is a forecasting model in which both autoregressive (AR) and moving average (MA) methods are applied to the time series. We first define the AR and MA parts and then the combined models. An autoregressive model of order p can be defined as:

X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t    (3)

where \varphi_1, ..., \varphi_p are the model parameters, c is a constant and \varepsilon_t is white noise.
A moving average model of order q can be defined as:

X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}    (4)

where \theta_1, ..., \theta_q are the model parameters, \mu is the expectation of X_t, and \varepsilon_t is white noise. From (3) and (4), we can define an ARMA(p, q) model with p autoregressive terms and q moving average terms as:

X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}    (5)

ARIMA introduces an additional parameter d, which refers to the integrated part and corresponds to the degree of differencing [22]. Thus, an ARIMA model can be applied to a time series even if it is not stationary, which is not the case with ARMA models.

The following steps outline the general implementation of the exponential smoothing and ARIMA models in this project:

i. Using HQL queries, we extract the relevant data from the Hive table.
ii. The extracted data is stored as a data frame in R.
iii. We perform time series transformations on the data frame using the TSA R library.
iv. The model is generated using the corresponding R function.
v. Using the forecast R library, we generate the forecasted values as needed.

ii. Statistical testing

We utilize the Ljung-Box test to check whether the autocorrelations of the residuals of the forecast differ from zero. The null and alternate hypotheses are defined as:

Null hypothesis H_0: The data are independently distributed.
Alternate hypothesis H_a: The data are not independently distributed.

The alternate hypothesis can be rephrased to say that the data exhibit serial correlation [23].

D. Correlation Analysis

Correlation can be defined as the tendency of two values to change together. A correlation value of 0 implies no relationship, -1 a perfect linear negative relationship and +1 a perfect linear positive relationship [24]. We compute the correlation coefficients between various attributes of the time series using standard R functions.

E. Clustering

Clustering is a machine learning technique in which objects which are similar to each other are grouped together. It is an unsupervised learning technique.
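The correlation computation described in the correlation analysis subsection above can be sketched as follows (the project itself uses R's built-in functions; the speed and snow-depth values below are made up for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

speed      = [60.0, 55.0, 50.0, 45.0, 40.0]  # hypothetical readings
snow_depth = [0.0, 5.0, 12.0, 18.0, 25.0]
print(pearson(speed, snow_depth))  # close to -1: speed drops as snow deepens
```

A value near -1 here matches the kind of negative speed/snow relationship reported in the results section.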
There are various methods to perform clustering, of which we consider two: a centroid model (k-means) and a distribution model (Expectation-Maximization).

a. k-means

The objective of k-means clustering is to partition n objects into k clusters, with each observation assigned to the cluster with the closest mean. Though the underlying problem is NP-hard, most implementations use heuristic methods which converge to a local optimum. For brevity, we refer the reader to [25] for a detailed explanation of k-means clustering.

b. Expectation-Maximization (EM)
The EM algorithm is an iterative method to find maximum likelihood estimates of parameters. The expectation (E) step creates a function for the expectation of the log-likelihood, and the maximization (M) step computes the parameters by maximizing the expected log-likelihood from the E step. We again refer the reader to [26] for a deeper analysis.

The objective of clustering in this project is to identify whether the vehicles can be divided into separate groups based on length and height. We implement the clustering algorithms using various R packages.

F. Dashboards

The success of a data science project depends on how the models and results can be presented to the end user. In the previous sections, we looked at the technology used and the models and use cases developed. We now describe how we develop a web based dashboard which allows a user to select the data, choose modeling techniques and visualize the results. Shiny is a web application framework for R. An application developed in Shiny consists of two components: a user interface script, which controls the layout and appearance, and a server script, which contains the code necessary to build the application. Fig. 2 shows the flow of a Shiny application:

1. The user selects input attributes and analysis functions.
2. A Hive query extracts the data and processes/aggregates it as needed.
3. The Hive output is saved as a data frame in R.
4. The analysis is performed in R.
5. The output is shown to the user through the Shiny interface.

Fig. 2 Process flow of a Shiny application

We will now describe the details of the various dashboards created.

i. Moving Average & Moving Standard Deviation

We have already discussed the implementation of the moving average and moving standard deviation computations. We created an application which serves as an interface to choose various attributes and perform the calculations. The attributes which can be selected are:

a.
Date: User can select the range of dates for which the data needs to be pulled from Hive.
b. Sensor id: User can choose the particular sensor id for which the data needs to be selected.
c. Day of week: This provides the user with an option to choose data for a particular day of the week alone (e.g. Monday or Wednesday).
d. Order: The order for the moving average or moving standard deviation can be provided.

Once the options are selected, a connection is made to Hive and the requested data is selected. The Hive UDFs that compute the moving average or moving standard deviation are executed, which in turn invoke MapReduce jobs. The final result set is saved as a data frame in R, which is then plotted and displayed in the application.

ii. Forecasting

This dashboard serves as an interface for the various forecast methods. As in the moving average dashboard, the user selects the values for various attributes. The model as well as the number of forecast steps is also chosen. Based on the values provided, data from the Hive table is extracted and saved as a data frame in R. Time series computations are performed in R and the final forecast result is displayed in the application.

iii. Clustering

This application provides an interface to choose the date, sensor id and the columns on which the clustering needs to be performed. The number of clusters needed is also provided. As in the other applications, the data from Hive is extracted and saved as a data frame in R. Clustering is performed on this data and the results are displayed in the application.

iv. Speed Comparison

The speed comparison application serves as a tool to visualize the time series, especially the speed. The following values can be provided in the application:

a. Dates: Two different dates can be chosen.
b. Sensor id.
c. Order: This value defines the smoothing order for the time series.

Based on the two dates provided, we select two sets of data from Hive.
The data is smoothed based on the order value and then displayed in the application. Comparing the speeds with plots helps in understanding the data better. The purpose of the dashboards is to serve as a proof-of-concept application which shows the seamless integration of HDFS, MapReduce and R: the data is stored in HDFS, MapReduce jobs are invoked to select the relevant data, and R functions are used for processing and plotting the data.
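The simple exponential smoothing recurrence (2) used in the forecasting use case can be sketched as follows (a minimal illustration with made-up observations; the project itself relies on R's Holt-Winters functions, which add trend and seasonality terms):

```python
def exp_smooth(series, alpha):
    """Simple exponential smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    s = [series[0]]                      # initialize with the first observation
    for x in series[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s

def forecast(series, alpha, steps):
    """With no trend or seasonality, the forecast repeats the last level."""
    level = exp_smooth(series, alpha)[-1]
    return [level] * steps

speeds = [60.0, 64.0, 58.0, 62.0]        # made-up speed observations
print(exp_smooth(speeds, 0.5))           # [60.0, 62.0, 60.0, 61.0]
print(forecast(speeds, 0.5, 3))          # [61.0, 61.0, 61.0]
```

The flat forecast is exactly why the higher order Holt-Winters variants are used for the traffic data: they extrapolate trend and seasonal patterns instead of repeating the last smoothed level.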
V. EXPERIMENTS & RESULTS

So far, we have explained the details of the technology, the various use cases and the implementation steps and strategy followed in this project. In this section, we first discuss the experimental setup and then the results for each of the use cases. Table 2 shows the technology stack and the releases used.

Technology    | Release details
Apache Hadoop |
Apache Hive   |
R             |
Python        | 2.7

Table 2 Technology stack

The configuration details of the server on which the experiments were run are given in Table 3.

OS         | Ubuntu LTS 64-bit
Processors | 2 x Intel(R) Xeon(R) 2.67GHz
CPU Cores  | 8
Memory     | 8 x 8 GB DIMM DDR3

Table 3 Server details

We will now present the results for the various use cases.

A. Moving Average

We compute the moving average for a sensor for a particular date. The moving average computation is handled by the Hive UDF. The results are saved as a data frame in R, which is then visualized. Fig. 3 shows the moving average for sensor 4 and sensor 5 for March 20. The source code is provided in Appendix A.1.

B. Moving Standard Deviation

Similar to the moving average, we compute the moving standard deviation for sensors 4 and 5 for March 20. The results are shown in Fig. 4. The source code is provided in Appendix A.2.

Fig. 4 Moving Standard Deviation

C. Forecasting

In Section IV, we explained the two methods used for forecasting, ARIMA and exponential smoothing. We now present the speed forecast results using these two methods. Fig. 5 shows the output from Holt-Winters exponential smoothing; the black line represents the actual observed values, and the red line the fitted values.

Fig. 5 Holt-Winters forecast

We forecast fifty future values using the Holt-Winters and ARIMA models. To assess the adequacy of a model, we look at the residual values (the forecast errors). If there are correlations between forecast errors for successive predictions, it might be possible to improve the forecasting model using other methods.
In Fig. 6 and Fig. 7, we plot the correlograms of the forecast errors for the two methods. From the correlograms, we can see that the autocorrelation at lag 1 is within the significance bounds.

Fig. 3 Moving Average
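A correlogram plots the sample autocorrelation of the residuals at successive lags. A minimal sketch of that computation (illustrative, with made-up residuals; not R's acf implementation):

```python
def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

residuals = [0.5, -0.2, 0.1, -0.4, 0.3, -0.1, 0.2, -0.3]  # made-up errors
for lag in range(1, 4):
    print(lag, round(autocorr(residuals, lag), 3))
```

Values falling inside the significance bounds at every lag, as in Fig. 6 and Fig. 7, suggest the residuals behave like white noise.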
Fig. 6 Correlogram of forecast errors of Holt-Winters
Fig. 7 Correlogram of forecast errors of ARIMA

We also perform the Ljung-Box test on the residuals to check whether there is any evidence of non-zero autocorrelations in the residuals. This test statistic was explained in Section IV. Tables 4 and 5 show the Ljung-Box test results for the forecasts using the Holt-Winters and ARIMA models respectively.

Box-Ljung test
data: speed.forecast.hw$residuals
X-squared = , df = 38, p-value < 2.2e-16

Table 4 Box test result for Holt-Winters residuals

Box-Ljung test
data: speed.forecast.arima$residuals
X-squared = , df = 38, p-value =

Table 5 Box test result for ARIMA residuals

Based on the test statistics and the p-values, we have little evidence of non-zero autocorrelations in the forecast errors. The source code for the forecasting steps is provided in Appendix A.3.

D. Correlation

The correlation analysis is performed to understand the relationship between various attributes. We extract the following values from the Hive tables for one sensor for one week: speed, length, height, distance between vehicles and snow depth. We perform correlation analysis on this data. Fig. 8 shows the correlation plot between the attributes.

Fig. 8 Correlation analysis of various attributes

From Fig. 8, we can infer that there is a positive correlation between speed and the distance between cars, and a negative correlation between speed and snow depth. The source code for the correlation analysis is provided in Appendix A.4.

E. Clustering

We perform clustering on the attributes length and height. The objective is to analyze whether the different vehicles that pass the traffic sensors have any common factors and natural groupings. We conduct our experiments using the k-means and Expectation-Maximization (EM) clustering methods. Visualizing the clustering results and understanding them is as important as the algorithm used. For k-means, we use two visualizations. In the first, we plot the data points and color code them by the cluster to which they have been assigned; Fig. 9 shows this plot. In the second, we use a discriminant projection plot, in which the cluster results are plotted using the discriminant coordinates of the data; Fig. 10 shows the discriminant projection plot.

Fig. 9 k-means clusters
Fig. 10 k-means discriminant plot
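The k-means grouping over two attributes can be sketched as follows (hypothetical length/height pairs; the project uses R's clustering functions on the real sensor data):

```python
def kmeans(points, k, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [list(p) for p in points[:k]]   # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(d) / len(members) for d in zip(*members)]
    return centroids, clusters

# Hypothetical (length in m, height in cm) pairs: small cars vs trucks.
vehicles = [(4.2, 150), (4.5, 148), (4.1, 145),
            (12.0, 380), (11.5, 390), (13.0, 400)]
centroids, clusters = kmeans(vehicles, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: two natural groups
```

Production implementations add smarter initialization (e.g. k-means++) and a convergence check, but the assign/update loop is the same heuristic mentioned above.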
The results from the EM clustering are plotted in Fig. 11. The source code for the clustering is provided in Appendix A.5.

Fig. 11 EM clusters

F. Dashboards

We developed various applications using Shiny to serve as dashboards for the analysis of the data. Fig. 2 provides a high-level workflow of the dashboards. In this section, we showcase a few of the dashboards developed. Fig. 12 shows the dashboard screen for the moving average and moving standard deviation computation. Fig. 13 shows the dashboard for forecasting the speed.

Fig. 12 Shiny dashboard for moving average & standard deviation
Fig. 13 Shiny dashboard for forecasting speed

The source code for the Shiny applications can be found in Appendix A.6.

VI. SUMMARY & CONCLUSION

In this project, we have designed and implemented a method to process huge volumes of time series data and perform time series analysis and machine learning on it. In Section II, we provided the necessary background on the various technologies used and outlined the methods by which they were brought together to form a single ecosystem for big data machine learning. In Section III, we discussed the data collection and processing strategies and provided the details of the data model. In Section IV, we discussed the various use cases and described the implementation strategy for each of them. In Section V, we provided the details of the experiments conducted and the results. There are various paths along which this project can be taken further. One planned piece of future work is to extend this big data time
series analysis model to real-time streaming data. Another aspect is to replace MapReduce with Apache Spark and/or Apache Tez. Data infusion from multiple sources is another aspect we would like to look into in the future.

REFERENCES

[1] Press Release, "Gartner Says 4.9 Billion Connected "Things" Will Be in Use in 2015", Gartner Inc, November.
[2] TIOBE Software, "TIOBE Index for May 2015", May 2015.
[3] Apache Software Foundation, Apache Hadoop.
[4] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI'04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Volume 6.
[5] Jeremy Kun, "On the Computational Complexity of MapReduce", October.
[6] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy, "Hive: A Warehousing Solution Over a MapReduce Framework", VLDB.
[7] The R Foundation, "The R Project for Statistical Computing".
[8] The Comprehensive R Archive Network, "RHive: R and Hive", NexR.
[9] The Comprehensive R Archive Network, "shiny: Web Application Framework for R", Winston Chang et al.
[10] The Comprehensive R Archive Network, "TSA: Time Series Analysis", Kung-Sik Chan, Brian Ripley.
[11] The Comprehensive R Archive Network, "forecast: Forecasting Functions for Time Series and Linear Models", Rob J Hyndman et al.
[12] The Comprehensive R Archive Network, "tseries: Time Series Analysis and Computational Finance", Adrian Trapletti, Kurt Hornik, Blake LeBaron.
[13] The Comprehensive R Archive Network, "cluster: Cluster Analysis Extended", Peter Rousseeuw et al.
[14] The Comprehensive R Archive Network, "R statistical functions".
[15] Noptel Oy.
[16] Leonard Richardson, Beautiful Soup Documentation.
[17] Dave Lewis, "SerDes Architectures and Applications", DesignCon 2004.
[18] Rob J Hyndman, "Moving Averages", November 2009, unpublished.
[19] R Core Team, R Installation and Administration manual, April 2015.
[20] James D Hamilton, "Time Series Analysis" (Population Moments), Princeton University Press, 1994.
[21] James D Hamilton, "Time Series Analysis" (Exponential Smoothing), Princeton University Press, 1994.
[22] James D Hamilton, "Time Series Analysis" (ARMA Models), Princeton University Press, 1994, p. 43-64.
[23] G. M. Ljung and G. E. P. Box, "On a measure of lack of fit in time series models", Biometrika.
[24] James D Hamilton, "Time Series Analysis" (Correlation), Princeton University Press, 1994.
[25] Ethem Alpaydin, "Introduction to Machine Learning" (k-means Clustering), Second Edition, 2010.
[26] Ethem Alpaydin, "Introduction to Machine Learning" (Expectation-Maximization Algorithm), Second Edition, 2010.

APPENDIX

To be added in the final report.
Peter Reimann b Michael Reiter a Holger Schwarz b Dimka Karastoyanova a Frank Leymann a SIMPL A Framework for Accessing External Data in Simulation Workflows Stuttgart, March 20 a Institute of Architecture
More informationUse of Social Media Data to Predict Retail Sales Performance. Li Zhang, Ph.D., Alliance Data Systems, Inc., Columbus, OH
Paper BI112014 Use of Social Media Data to Predict Retail Sales Performance Li Zhang, Ph.D., Alliance Data Systems, Inc., Columbus, OH ABSTRACT Big data in terms of unstructured social media data is
More informationFebruary 2009. Seeding the Clouds: Key Infrastructure Elements for Cloud Computing
February 2009 Seeding the Clouds: Key Infrastructure Elements for Cloud Computing Page 2 Table of Contents Executive summary... 3 Introduction... 4 Business value of cloud computing... 4 Evolution of cloud
More informationSolving Big Data Challenges for Enterprise Application Performance Management
Solving Big Data Challenges for Enterprise Application Performance Management Tilmann Rabl Middleware Systems Research Group University of Toronto, Canada tilmann@msrg.utoronto.ca Sergio Gómez Villamor
More informationDetecting LargeScale System Problems by Mining Console Logs
Detecting LargeScale System Problems by Mining Console Logs Wei Xu Ling Huang Armando Fox David Patterson Michael I. Jordan EECS Department University of California at Berkeley, USA {xuw,fox,pattrsn,jordan}@cs.berkeley.edu
More informationDesigning and Deploying Online Field Experiments
Designing and Deploying Online Field Experiments Eytan Bakshy Facebook Menlo Park, CA eytan@fb.com Dean Eckles Facebook Menlo Park, CA deaneckles@fb.com Michael S. Bernstein Stanford University Palo Alto,
More informationBig Data: present and future
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania trifumircearadu@yahoo.com, ivanmihaela88@gmail.com
More informationA Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package
A Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package William J Browne, Mousa Golalizadeh Lahi* & Richard MA Parker School of Clinical Veterinary
More informationCRISPDM 1.0. Stepbystep data mining guide
CRISPDM 1.0 Stepbystep data mining guide Pete Chapman (NCR), Julian Clinton (SPSS), Randy Kerber (NCR), Thomas Khabaza (SPSS), Thomas Reinartz (DaimlerChrysler), Colin Shearer (SPSS) and Rüdiger Wirth
More informationBusiness Intelligence for Small Enterprises
THE ROYAL INSTITUTE OF TECHNOLOGY Business Intelligence for Small Enterprises An Open Source Approach Rustam Aliyev May 2008 Master thesis at the Department of Computer and Systems Sciences at the Stockholm
More informationAnomalous System Call Detection
Anomalous System Call Detection Darren Mutz, Fredrik Valeur, Christopher Kruegel, and Giovanni Vigna Reliable Software Group, University of California, Santa Barbara Secure Systems Lab, Technical University
More informationDesign and Evaluation of a RealTime URL Spam Filtering Service
Design and Evaluation of a RealTime URL Spam Filtering Service Kurt Thomas *, Chris Grier *, Justin Ma *, Vern Paxson *, Dawn Song * {kthomas, grier, jtma, vern, dawnsong}@cs.berkeley.edu * University
More informationTowards Autonomic Detection of SLA Violations in Cloud Infrastructures
Towards Autonomic Detection of SLA Violations in Cloud Infrastructures Vincent C. Emeakaroha a, Marco A. S. Netto b, Rodrigo N. Calheiros c, Ivona Brandic a, Rajkumar Buyya c, César A. F. De Rose b a Vienna
More informationMAD Skills: New Analysis Practices for Big Data
Whitepaper MAD Skills: New Analysis Practices for Big Data By Jeffrey Cohen, Greenplum; Brian Dolan, Fox Interactive Media; Mark Dunlap, Evergreen Technologies; Joseph M. Hellerstein, U.C. Berkeley and
More informationAutomatically Detecting Vulnerable Websites Before They Turn Malicious
Automatically Detecting Vulnerable Websites Before They Turn Malicious Kyle Soska and Nicolas Christin, Carnegie Mellon University https://www.usenix.org/conference/usenixsecurity14/technicalsessions/presentation/soska
More informationTrends in Cloud Computing and Big Data
Trends in Cloud Computing and Big Data Nikita Bhagat, Ginni Bansal, Dr.Bikrampal Kaur nikitabhagat.nb@gmail.com, ginnibansal2101@gmail.com, cecm.infotech.bpk@gmail.com Abstract  BIG data refers to the
More informationAn Oracle White Paper June 2013. Oracle: Big Data for the Enterprise
An Oracle White Paper June 2013 Oracle: Big Data for the Enterprise Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure
More informationGetting Started with Big Data Analytics in Retail
Getting Started with Big Data Analytics in Retail Learn how Intel and Living Naturally* used big data to help a health store increase sales and reduce inventory carrying costs. SOLUTION BLUEPRINT Big Data
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke
More information