Big Data Time Series Analysis


Big Data Time Series Analysis Integrating MapReduce and R

Visakh C R
Department of Information and Computer Science
Aalto University
Espoo, Finland

Abstract
The explosion of data, especially from sensor devices, calls for new and innovative methods of storage, processing and analysis. In this paper, we look at the integration of big data technologies with a statistical computing framework, R. We tackle various time series analysis use cases and also develop a proof-of-concept frontend application which can interact with the data and perform the computations.

Keywords
time series, machine learning, big data, Apache Hadoop, MapReduce, Apache Hive, R

I. INTRODUCTION
We live in the era of data. Web logs, internet media, social networks and sensor devices are generating petabytes of data every day. The advent of Internet of Things (IoT) devices has been heralded as a game changer. It has been estimated that there will be more than 25 billion IoT devices by 2020 [1]. These devices generate a huge amount of data, and it is imperative to have a proven and scalable mechanism to store, process and analyze this data.

In this paper, we propose and develop a framework for storing and analyzing sensor data. Sensor data is predominantly time series data. Therefore, we integrate a proven big data platform, Apache Hive, with one of the most powerful open source statistical tools, R. We develop an architecture for the combined system and discuss how to integrate the technologies seamlessly. We discuss various machine learning use cases which are implemented using the combined power of R and Hive. We also provide a proof-of-concept dashboard developed in Shiny, which allows users to perform various analytical and machine learning computations through a web-based frontend.

This paper is organized as follows: Section II introduces the various technologies and frameworks used. Section III details the data, collection methods, data models and preprocessing steps.
In Section IV, we discuss selected use cases and their implementation. Section V details the experiments and the results. We also discuss the dashboard implementation of the use cases using Shiny. Section VI summarizes the project and discusses the takeaways. We also briefly outline the future direction for this project.

II. TOOLS & TECHNOLOGIES
In this section, we discuss the various tools and technologies used in this project. We use Python scripts for collecting the data. The data is then loaded into Apache Hive tables. Hive queries spawn MapReduce jobs; MapReduce forms the core programming model of Apache Hadoop. We will discuss the architecture and implementation of Hive. For implementing the time series analyses and machine learning, we use R and various R libraries. For developing the dashboard, we use Shiny.

A. Python
Python is a general-purpose, high-level programming language. First released in 1991, Python has been widely accepted in the programming world. It has been ranked among the ten most widely used programming languages in the past few years [2]. In this project, we use Python to collect the sensor data from an HTTP server and to generate the source data files needed for loading the Hive tables.

B. Apache Hadoop
Apache Hadoop is an open-source framework that allows the distributed processing of large datasets across clusters of computers. The essence of Hadoop lies in two components: a storage component known as the Hadoop Distributed File System (HDFS) and a processing component known as MapReduce [3].

HDFS is a scalable and distributed file system. It stores large files across multiple machines. The data is replicated, which provides high reliability.

MapReduce is the programming paradigm which drives Hadoop jobs. A MapReduce job consists of two components: a Map component which performs filtering and sorting, and a Reduce component which performs the summary operation. To elaborate further, the Map component takes an input pair and produces key-value pairs. These key-value pairs are collected and provided as input to the Reduce component. The Reduce component merges the values that share the same key [4]. Fig. 1 shows a high-level flow of a MapReduce job [5].
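The map, group-by-key and reduce phases described for MapReduce can be illustrated with a small single-process sketch in plain Python. This is not Hadoop code: the function names (`map_record`, `shuffle`, `reduce_mean`, `run_job`) and the sample records are invented for illustration, and a real job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_record(record):
    """Map step: turn one input record into (key, value) pairs."""
    sensor_id, speed = record
    yield sensor_id, speed


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_mean(key, values):
    """Reduce step: merge all values that share a key into one summary value."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of records."""
    pairs = [pair for record in records for pair in map_record(record)]
    return dict(reduce_mean(key, values) for key, values in shuffle(pairs).items())


# Made-up (sensor_id, speed) records, used only to exercise the sketch.
records = [(4, 60.0), (5, 80.0), (4, 70.0), (5, 90.0)]
mean_speed = run_job(records)
```

Here `mean_speed` maps each sensor id to the mean of its speed values, which is the kind of summary operation a Reduce component performs.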
Fig. 1 MapReduce job flow

C. Apache Hive
Apache Hive is a data warehouse platform built on top of Hadoop. It facilitates the querying of large datasets stored in HDFS. Hive provides a declarative, SQL-like query language called Hive Query Language (HQL), which is translated internally into MapReduce programs, thereby providing a much simpler way to query, analyze and process the data [6].

Hive supports User Defined Functions (UDFs), which make it possible to implement custom logic for processing the data. We use a few Hive UDFs in this project, for data processing as well as for computing the moving average and the moving standard deviation. These UDFs are traditionally written in Java, though support for Python has been improving recently.

D. R
R is a programming language predominantly used for statistical computing. It is possible to perform time series analysis, classification, clustering, linear and nonlinear modeling and various kinds of visualization in R [7]. The R platform includes thousands of packages which extend it with additional functionality. We now briefly mention the major packages used in this project.

a) RHive: RHive is used to connect R with Hive and thereby perform distributed computing using Hive queries in R [8]. We connect to Hive tables in R scripts using HiveServer, extract the relevant data and process it using R functions.
b) shiny: Shiny makes it possible to build web applications with R [9]. Shiny forms the core of the web-based dashboard that we develop in this project and will be explained in detail later.
c) TSA: This is a collection of time series analysis functions [10].
d) forecast: This package provides functions for univariate time series forecasting [11].
e) tseries: This package provides functionality for time series analysis [12].
f) cluster: This package provides methods for cluster analysis [13].
g) stats: The stats package provides a wide array of statistical functions [14].

E. Shiny
Shiny is a web application framework for R. Shiny makes it easier to develop web applications, in particular without requiring any HTML, CSS or JavaScript [9]. In this project, we use Shiny to create various dashboards which help users interact with the time series data and perform analysis and machine learning without touching any code. A selected few of these dashboards are showcased in Section V.

III. DATA & DATA MODELS
As discussed in the introductory section, this project is about analyzing a huge amount of time series data from sensors. In this section, we expand upon the type of sensor data, the data collection methods and the data preprocessing steps. We will also look at the data models developed for storing the data in Hive.

A. The Data
A time series can be defined as a sequence of data points measured over a time interval. There is a multitude of sensor devices around us generating different types of data. Since these devices measure various phenomena, they provide a sequence of data values over a time period. In other words, sensor devices produce time series data.

In this project, we use data from traffic sensors. A partner company [15] has installed sensors on a five-lane road in a city in Northern Finland. Each lane has its own sensor. Three lanes carry traffic towards the South, while two lanes carry traffic towards the North. The sensors measure the following values:

a) Sensor ID: Each sensor has been given a distinct sensor id. The ids range from 4 to 8.
b) Timestamp: The timestamp at which the vehicle information was captured by the sensor.
c) Speed: The speed of the vehicle in km/h.
d) Height: The height of the vehicle in centimeters.
e) Length: The length of the vehicle in meters.
f) Snow Depth: The amount of snow on the road in millimeters.

B. Data Collection
The data from the sensors is accessed through an HTTP interface. The data collection is performed using a Python script.
To avoid overloading the server, the script is executed once every hour, and each execution collects the data from the previous hour. The job is scheduled using the cron daemon in Linux. The outline of the script is as follows: the script queries the HTTP server and reads the data using the Python library urllib2. Using another library, BeautifulSoup [16], the relevant data is extracted, cleaned and transformed into a Python list. This data is then written to a CSV file.
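The extract-clean-write part of such a collection script might look roughly like the sketch below. It is a stand-in, not the authors' script: it uses the standard-library `html.parser` module instead of BeautifulSoup so that it runs without third-party packages, it skips the HTTP request, and the HTML layout (one table row per vehicle), the class name `RowParser`, the function `html_to_csv` and the sample values are all assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            # Cleaning step: drop surrounding whitespace from the cell text.
            self._row[-1] += data.strip()


def html_to_csv(html, out):
    """Parse table rows out of an HTML page and write them to a CSV stream."""
    parser = RowParser()
    parser.feed(html)
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return parser.rows


# Made-up page content: sensor id, timestamp, speed (hypothetical layout).
sample_html = (
    "<table>"
    "<tr><td>4</td><td>2000-01-01 10:00:01</td><td>62.5</td></tr>"
    "<tr><td>5</td><td>2000-01-01 10:00:03</td><td> 71.0 </td></tr>"
    "</table>"
)
buffer = io.StringIO()
rows = html_to_csv(sample_html, buffer)
```

In an hourly cron job, `sample_html` would be replaced by the body of the HTTP response and `buffer` by a file opened for writing.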
C. Data Model
Apache Hive serves as the repository for storing the traffic sensor data. Hive provides an encapsulation of the data and serves it as a table, similar to a database table. We use staging tables to load the raw data from the CSV files. The data is transformed using SQL functions and loaded into a final target table. We now describe the data model and the transformations performed.

a) Staging table
To read the CSV file, we use a Hive Serializer/Deserializer (SerDe). The Deserializer reads a record and translates it into a Java object. The Serializer takes the Java object and converts it into a format which can be written to HDFS [17]. We also perform type conversions on the various attributes and cast them to the correct data types. In addition, we generate a derived column, the day of the week. Based on the timestamp, we use Hive UDFs to obtain an integer which specifies the day of the week. Monday to Sunday are represented by 1 to 7.

b) Target table
The data from the staging table is loaded into the target table. The data definition of the relevant columns is provided in Table 1.

Column Name    Data Type
ts             TIMESTAMP
sensor_id      INTEGER
speed          DOUBLE
height         DOUBLE
length         DOUBLE
snow_depth     DOUBLE
day_of_week    INTEGER

Table 1 Data Definition of target table

IV. USE CASES & IMPLEMENTATION
In this section, we discuss the various use cases of time series analysis and machine learning which we implement in this project. We divide them into six categories, namely moving average, moving standard deviation, forecasting, clustering, correlation analysis and dashboards. For each use case, we first provide the mathematical and scientific background and then briefly summarize the techniques used and how they were implemented in this project.

A. Moving Average
The moving average is a statistical calculation which creates a series of averages of subsets of the original time series.
For a series of $n$ values, the moving average of order $k$ is computed by taking the arithmetic mean of $k$ subsequent terms of the series. The first element of the moving average is the arithmetic mean of the first $k$ elements of the series. The window is then shifted forward, excluding the first element and including the next element, and the mean is computed again. This is repeated for the entire series. For the series defined as $X = \{x_1, x_2, \ldots, x_n\}$, the moving average can be written as [18]:

$$m_t = \frac{1}{k} \sum_{i=t}^{t+k-1} x_i, \quad t = 1, \ldots, n-k+1 \tag{1}$$

The moving average is used to smooth out the fluctuations in the data. The resulting series helps to identify trends and cycles in the data. The moving average is also known as the rolling average or running average.

In this project, we compute the moving average of the speed of the vehicles. From the resulting series, it becomes easier to understand the trends in the speed of vehicles in each lane at different times of the day, on different days of the week or in different months.

The moving average can be computed using R functions. But since R holds data objects in memory, the size of the data on which we can perform computations is limited by the memory of the system [19]. Since we already have the data in HDFS, we implement a process to compute the moving average using MapReduce. The outline of the process is as follows:

i. Get the attribute name (in this case, the column of the Hive table) and the moving average time window.
ii. A Hive query extracts the data from HDFS and passes it on as a Java object.
iii. Based on the time window provided, we compute the averages using a Hive UDF.
iv. The computed values are returned to R and then visualized.

By using a UDF, we are able to perform the computation using the MapReduce paradigm. The value returned from the UDF is a time series of moving averages, which can then be plotted or analyzed using various R functions.

B. Moving Standard Deviation
The standard deviation is a measure of the spread of the values in a series. It measures the amount of variation of the values in a dataset [20].
In other words, the standard deviation describes how tightly the values are clustered around the mean. It is calculated by taking the square root of the average of the squared deviations of the values from their mean.

In this project, we compute the moving standard deviation of the speed of vehicles for various time periods. The objective is to understand how much the speed observations vary around the mean value. It also helps to understand the level of uncertainty in the speed forecasts. The moving standard deviation is also computed using a Java UDF in Hive, similar to the moving average process.
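The two windowed statistics described above can be sketched in a few lines of plain Python. This is an in-memory illustration of the arithmetic only, not the Java Hive UDF used in the project; the function names and the sample speeds are invented, and the standard deviation divides by the window size, following the definition given in the text.

```python
import math


def moving_average(series, k):
    """Mean of every window of k consecutive values."""
    if k < 1 or k > len(series):
        raise ValueError("window size must be between 1 and len(series)")
    return [sum(series[t:t + k]) / k for t in range(len(series) - k + 1)]


def moving_std(series, k):
    """Square root of the average squared deviation within every window of size k."""
    if k < 1 or k > len(series):
        raise ValueError("window size must be between 1 and len(series)")
    result = []
    for t in range(len(series) - k + 1):
        window = series[t:t + k]
        mean = sum(window) / k
        result.append(math.sqrt(sum((x - mean) ** 2 for x in window) / k))
    return result


# Made-up speed values in km/h, used only to exercise the functions.
speeds = [60.0, 62.0, 64.0, 70.0, 66.0]
ma = moving_average(speeds, 3)
sd = moving_std(speeds, 3)
```

For a series of length n and window k, both functions return n - k + 1 values, so the output series is shorter than the input by k - 1 points.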
C. Forecasting
Time series forecasting involves the creation of a model to predict future values based on previous values. There are various techniques by which a forecasting model can be created and applied to the data. In this project, we use exponential smoothing and the autoregressive integrated moving average (ARIMA) model. In this section, we provide a summary of these models and how they are implemented in this project. We also provide the details of the statistical tests used to validate the forecast models.

i. Forecast models
We introduce the two forecasting models used, exponential smoothing and ARIMA.

a. Exponential Smoothing
Exponential smoothing is generally used to create a smoothed time series. The concept is similar to the moving average, with the difference that in exponential smoothing the weights of the previous values in the time series decrease exponentially over time. For a series $x_t$, the exponential smoothing output $s_t$ can be calculated using the following formula:

$$s_t = \alpha x_t + (1 - \alpha) s_{t-1} \tag{2}$$

where the smoothing factor, $\alpha$, is between 0 and 1.

The model described by (2) is suitable for forecasting data with no trend or seasonal characteristic. Since we are dealing with traffic sensor data, especially the speed of vehicles, we use higher-order exponential smoothing, namely Holt-Winters double exponential smoothing and triple exponential smoothing. These methods take trend and seasonality into account by introducing extra terms in the model [21].

b. ARIMA
The Autoregressive Integrated Moving Average (ARIMA) model is a generalization of the Autoregressive Moving Average (ARMA) model. ARMA is a forecasting model in which both autoregressive (AR) and moving average (MA) methods are applied to the time series. We will first define the AR and MA parts and then define the combined models.

An autoregressive model of order $p$ can be defined as:

$$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t \tag{3}$$

where $\varphi_1, \ldots, \varphi_p$ are the model parameters, $c$ is the constant and $\varepsilon_t$ is the white noise.
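As a rough illustration of the two recurrences introduced so far, the sketch below implements simple exponential smoothing (each output is alpha times the current value plus one minus alpha times the previous output) and the expected next value of an autoregressive model (constant plus weighted sum of the last p values, with the zero-mean noise term dropped). It is plain Python rather than the R functions used in the project; the function names are invented, and initializing the smoothed series with the first observation is an assumption, since the text does not specify a starting value.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing with a smoothing factor strictly between 0 and 1."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    smoothed = [series[0]]  # assumption: start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def ar_one_step(history, phi, c=0.0):
    """Expected next value of an AR(p) model, with p = len(phi)."""
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past values")
    recent = history[-p:][::-1]  # most recent value first, aligned with phi[0]
    return c + sum(coef * value for coef, value in zip(phi, recent))


# Made-up numbers, used only to exercise the functions.
smoothed = exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5)
prediction = ar_one_step([1.0, 2.0, 3.0], phi=[0.5, 0.25], c=1.0)
```

A larger `alpha` makes the smoothed series follow recent observations more closely, while a smaller value gives more weight to the accumulated history.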
A moving average model of order $q$ can be defined as:

$$X_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} \tag{4}$$

where $\theta_1, \ldots, \theta_q$ are the model parameters, $\mu$ is the expectation of $X_t$, and $\varepsilon_t$ is the white noise.

From (3) and (4), we can define an ARMA($p$, $q$) model with $p$ autoregressive terms and $q$ moving average terms as:

$$X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} \tag{5}$$

ARIMA introduces an additional parameter $d$ to the model. It refers to the integrated part, which corresponds to the degree of differencing. Thus, an ARIMA model can be applied to a time series even if it is not stationary, which is not the case with ARMA models.

The following steps outline the general implementation of the exponential smoothing and ARIMA models in this project.

i. Using HQL queries, we extract the relevant data from the Hive table.
ii. The extracted data is stored as a data frame in R.
iii. We perform time series transformations on the data frame using the TSA R library.
iv. The model is generated using the corresponding R function.
v. Using the forecast R library, we generate the forecasted values as needed.

ii. Statistical testing
We use the Ljung-Box test to check whether the autocorrelations of the residuals of the forecast differ from zero. The null and alternate hypotheses are defined as:

Null Hypothesis: $H_0$: The data are independently distributed.
Alternate Hypothesis: $H_a$: The data are not independently distributed.

The alternate hypothesis can be rephrased to say that the data exhibit serial correlation [23].

D. Correlation Analysis
Correlation can be defined as the tendency of two values to change together. A correlation value of 0 implies no relationship, -1 indicates a perfect negative linear relationship and +1 implies a perfect positive linear relationship [24]. We compute the correlation coefficient between various attributes of the time series using standard R functions.

E. Clustering
Clustering is a machine learning technique in which objects that are similar to each other are grouped together. It is an unsupervised learning technique. There are various methods to perform clustering, out of which we consider two: a centroid model (k-means) and a distribution model (Expectation-Maximization).

a. k-means
The objective in k-means clustering is to partition n objects into k clusters. Each observation is assigned to the cluster that has the closest mean. Though the underlying problem is NP-hard, most implementations use heuristic methods which converge to a local optimum. For brevity, we refer the readers to [25] for a detailed explanation of k-means clustering.

b. Expectation-Maximization (EM)
The EM algorithm is an iterative method to find the maximum likelihood estimates of parameters. The expectation step (E) creates a function for the expectation of the log-likelihood, and the maximization step (M) computes the parameters by maximizing the expected log-likelihood computed in the E step. We again refer the readers to [26] for a deeper analysis.

The objective of clustering in this project is to identify whether the vehicles can be divided into separate groups based on length and height. We implement the clustering algorithms using various R packages.

F. Dashboards
The success of a data science project depends on how well the models and results can be presented to the end user. In the previous sections, we looked at the technology used and the models and use cases developed. We will now describe how we develop a web-based dashboard which allows a user to select the data, choose modeling techniques and visualize the results.

Shiny is a web application framework for R. An application developed in Shiny consists of two components: a user interface script, which controls the layout and appearance, and a server script, which contains the logic necessary to build the application. Fig. 2 shows the flow of a Shiny application.

1. The user selects input attributes and analysis functions.
2. A Hive query extracts the data and processes or aggregates it as needed.
3. The Hive output is saved as a data frame in R.
4. The analysis is performed in R.
5. The output is shown to the user through the Shiny interface.

Fig. 2 Process flow of a Shiny application

We will now describe the details of the various dashboards created.

i. Moving Average & Moving Standard Deviation
We have already discussed how the moving average and the moving standard deviation are computed. We created an application which serves as an interface to choose various attributes and perform the calculations. The attributes which can be selected are:

a. Date: The user can select a date range for which the data needs to be pulled from Hive.
b. Sensor id: The user can choose the particular sensor id for which the data needs to be selected.
c. Day of week: This gives the user an option to choose data for a particular day of the week alone (e.g. Monday or Wednesday).
d. Order: The order of the moving average or moving standard deviation can be provided.

Once the options are selected, a connection is made to Hive and the requested data is selected. The Hive UDFs which compute the moving average or moving standard deviation are executed, which in turn invoke MapReduce jobs. The final result set is saved as a data frame in R, which is then plotted and displayed in the application.

ii. Forecasting
This dashboard serves as an interface for the various forecast methods. As in the moving average dashboard, the user selects the values for various attributes. The model as well as the number of forecast steps is also chosen. Based on the values provided, data from the Hive table is extracted and saved as a data frame in R. Time series computations are performed in R and the final forecast result is displayed in the application.

iii. Clustering
This application provides an interface to choose the date, the sensor id and the columns on which the clustering needs to be performed. The number of clusters is also provided as input. As in the other applications, the data from Hive is extracted and saved as a data frame in R. Clustering is performed on this data and the results are displayed in the application.

iv. Speed Comparison
The speed comparison application serves as a tool to visualize the time series, especially the speed. The following values can be provided in the application:

a. Dates: Two different dates can be chosen.
b. Sensor id
c. Order: This value defines the smoothing order for the time series.

Based on the two dates provided, we select two sets of data from Hive.
Each set is smoothed based on the order value and then displayed in the application. Comparing the speed in plots helps in understanding the data better.

The purpose of the dashboards is to serve as a proof-of-concept application which shows the seamless integration of HDFS, MapReduce and R. The data is stored in HDFS, MapReduce jobs are invoked to select the relevant data, and R functions are used for processing and plotting the data.
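The step in which the dashboard turns the user's selections (date range, sensor id, optional day of week) into a Hive query could be sketched as follows. This is a hypothetical helper written in Python for illustration, not the project's Shiny/R code: the function name `build_speed_query` and the table name `traffic_target` are assumptions, while the column names follow Table 1. The inputs are validated before being placed in the query text, since user-provided values should not be pasted into SQL unchecked.

```python
from datetime import date


def build_speed_query(table, start, end, sensor_id, day_of_week=None):
    """Build an HQL query string selecting speed values for the chosen filters."""
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected characters in table name")
    if start > end:
        raise ValueError("start date must not be after end date")
    conditions = [
        "sensor_id = %d" % int(sensor_id),
        "to_date(ts) BETWEEN '%s' AND '%s'" % (start.isoformat(), end.isoformat()),
    ]
    if day_of_week is not None:
        if day_of_week not in range(1, 8):  # Monday to Sunday are 1 to 7
            raise ValueError("day_of_week must be between 1 and 7")
        conditions.append("day_of_week = %d" % day_of_week)
    return "SELECT ts, speed FROM %s WHERE %s ORDER BY ts" % (
        table,
        " AND ".join(conditions),
    )


# Hypothetical table name and made-up dates.
query = build_speed_query(
    "traffic_target", date(2000, 1, 3), date(2000, 1, 9), sensor_id=4, day_of_week=1
)
```

The resulting string would then be passed to whatever Hive client the application uses, and the result set loaded into a data frame for analysis.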
V. EXPERIMENTS & RESULTS
So far, we have explained the details of the technology, the various use cases and the implementation steps and strategy followed in this project. In this section, we first discuss the experimental setup. We then discuss the results for each of the use cases.

Table 2 shows the technology stack and the releases used.

Technology       Release details
Apache Hadoop
Apache Hive
R
Python 2.7

Table 2 Technology stack

The configuration details of the server on which the experiments were run are given in Table 3.

OS            Ubuntu LTS 64-bit
CPU Cores     8
Processors    2 x Intel(R) Xeon(R) 2.67GHz
Memory        8 x 8 GB DIMM DDR3

Table 3 Server details

We will now present the results for the various use cases.

A. Moving Average
We compute the moving average for a sensor for a particular date. The moving average computation is handled by a Hive UDF. The results are saved as a data frame in R, which is then visualized. Fig. 3 shows the moving average for sensor 4 and sensor 5 for March 20. The source code is provided in Appendix A.1.

Fig. 3 Moving Average

B. Moving Standard Deviation
Similar to the moving average, we compute the moving standard deviation for sensors 4 and 5 for March 20. The results are shown in Fig. 4. The source code is provided in Appendix A.2.

Fig. 4 Moving Standard Deviation

C. Forecasting
In Section IV, we explained the two methods used in forecasting, ARIMA and exponential smoothing. We now present the speed forecast results using these two methods.

Fig. 5 shows the output from Holt-Winters exponential smoothing. The black line represents the actual observed values, and the red line represents the fitted values.

Fig. 5 Holt-Winters forecast

We forecast fifty future values using the Holt-Winters and ARIMA models. To assess the adequacy of each model, we look at the residual values (the forecast errors). If there are correlations between the forecast errors of successive predictions, then it might be possible to improve the forecasting model using other methods. In Fig. 6 and Fig. 7, we plot the correlogram of the forecast errors for the two methods. From the correlogram, we can see that the autocorrelation at lag 1 is within the significance bounds.
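To make the residual diagnostics more concrete, the sketch below computes the quantities behind a correlogram and the Ljung-Box statistic in plain Python: the sample autocorrelation at a given lag, the usual approximate 95% significance bound of 1.96 divided by the square root of n, and the statistic Q = n(n+2) times the sum over lags k of the squared autocorrelation divided by (n - k). It is an illustration only, not the R code used in the project; the function names and the toy residuals are invented.

```python
import math


def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((r - mean) ** 2 for r in residuals)
    num = sum(
        (residuals[t] - mean) * (residuals[t - lag] - mean) for t in range(lag, n)
    )
    return num / denom


def significance_bound(n):
    """Approximate 95% bound for the autocorrelation of white noise."""
    return 1.96 / math.sqrt(n)


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Toy residuals, far too few for a real test; used only to exercise the functions.
residuals = [1.0, -1.0, 1.0, -1.0]
r1 = autocorrelation(residuals, 1)
bound = significance_bound(len(residuals))
q = ljung_box_q(residuals, 1)
```

In practice, Q is compared against a chi-squared distribution whose degrees of freedom depend on the number of lags tested, which is what the p-value reported by the test summarizes.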
Fig. 6 Correlogram of forecast errors of Holt-Winters

Fig. 7 Correlogram of forecast errors of ARIMA

We also perform the Ljung-Box test on the residuals to check whether there is any evidence of nonzero autocorrelations in the residuals. This test statistic has been explained in Section IV. Tables 4 and 5 show the Ljung-Box test results for the forecasts using the Holt-Winters and ARIMA models respectively.

    Box-Ljung test
    data: speed.forecast.hw$residuals
    X-squared = , df = 38, p-value < 2.2e-16

Table 4 Box test result for Holt-Winters residuals

    Box-Ljung test
    data: speed.forecast.arima$residuals
    X-squared = , df = 38, p-value =

Table 5 Box test result for ARIMA residuals

Based on the test statistic and the p-value, we have little evidence of nonzero autocorrelations in the forecast errors. The source code for the forecasting steps performed is provided in Appendix A.3.

D. Correlation
The correlation analysis is performed to understand the relationship between the various attributes. We extract the following values from the Hive tables for one sensor for one week: speed, length, height, distance between vehicles and snow depth. We perform correlation analysis on this data. Fig. 8 shows the correlation plot between the attributes.

Fig. 8 Correlation analysis of various attributes

From Fig. 8, we can infer that there is a positive correlation between speed and distance between cars, and a negative correlation between speed and snow depth. The source code for the correlation analysis is provided in Appendix A.4.

E. Clustering
We perform clustering on the attributes length and height. The objective is to analyze whether the different vehicles that pass the traffic sensors have any common factors and natural grouping. We conduct our experiments using the k-means and Expectation-Maximization (EM) clustering methods.

Visualizing and understanding the clustering results is as important as the algorithm used. For k-means, we use two visualizations. In the first one, we plot the data points and color code them based on the cluster to which they have been assigned. Fig. 9 shows this plot. In the second one, we use a discriminant projection plot, in which we plot the cluster results using the discriminant coordinates of the data. Fig. 10 shows the discriminant projection plot.

Fig. 9 k-means clusters

Fig. 10 k-means discriminant plot
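The assign-and-update loop behind k-means can be sketched in plain Python as below. This is a minimal illustration of the standard heuristic, not the R packages used in the experiments: the function names, the choice of passing the initial centroids explicitly, and the vehicle measurements are all made up for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=20):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid unchanged
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return labels, centroids


# Made-up (length in m, height in cm) measurements for six vehicles.
vehicles = [
    (4.0, 150.0), (4.5, 145.0), (4.2, 155.0),
    (12.0, 380.0), (16.0, 400.0), (14.0, 390.0),
]
labels, centroids = kmeans(vehicles, centroids=[vehicles[0], vehicles[3]])
```

Because length and height are on different scales here, the height dominates the distance; in a real analysis the attributes would typically be standardized before clustering.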
The results from EM clustering are plotted in Fig. 11. The source code for clustering is provided in Appendix A.5.

Fig. 11 EM clusters

F. Dashboards
We developed various applications using Shiny to serve as dashboards for the analysis of the data. Fig. 2 provides a high-level workflow of the dashboard. In this section, we showcase a few of the dashboards developed.

Fig. 12 shows the dashboard screen for the moving average and moving standard deviation computation. Fig. 13 shows the dashboard for forecasting the speed. The source code for the Shiny applications can be found in Appendix A.6.

Fig. 12 Shiny dashboard for moving average & standard deviation

Fig. 13 Shiny dashboard for forecasting speed

VI. SUMMARY & CONCLUSION
In this project, we have designed and implemented a method to process a huge volume of time series data and perform time series analysis and machine learning on it. In Section II, we provided the necessary background on the various technologies used. We also outlined the methods by which these technologies were brought together to perform as a single ecosystem for big data machine learning. In Section III, we discussed the data collection and processing strategies and provided the details of the data model. In Section IV, we discussed various use cases and described the implementation strategy for each of them. In Section V, we provided the details of the experiments conducted and the results.

There are various paths which this project can take further. One of the planned future works is to extend this big data time series analysis model to real-time streaming data. Another is to replace MapReduce with Apache Spark and/or Apache Tez. Incorporating data from multiple sources is another aspect we would like to look into in the future.

REFERENCES
[1] Press Release, "Gartner Says 4.9 Billion Connected "Things" Will Be in Use in 2015", Gartner Inc, November
[2] Tiobe Software, "TIOBE Index for May 2015", May
[3] Apache Software Foundation, Apache Hadoop
[4] Jeffrey Dean & Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, Volume 6
[5] Jeremy Kun, "On the Computational Complexity of MapReduce", October
[6] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy, "Hive - A Warehousing Solution Over a MapReduce Framework", VLDB
[7] "The R project for statistical computing", The R Foundation
[8] The Comprehensive R Archive Network, "RHive: R and Hive", NexR
[9] The Comprehensive R Archive Network, "shiny: Web Application Framework for R", Winston Chang et al.
[10] The Comprehensive R Archive Network, "TSA: Time Series Analysis", Kung-Sik Chan, Brian Ripley
[11] The Comprehensive R Archive Network, "forecast: Forecasting Functions for Time Series and Linear Models", Rob J Hyndman et al.
[12] The Comprehensive R Archive Network, "tseries: Time Series Analysis and Computational Finance", Adrian Trapletti, Kurt Hornik, Blake LeBaron
[13] The Comprehensive R Archive Network, "cluster: Cluster Analysis Extended", Peter Rousseeuw et al.
[14] The Comprehensive R Archive Network, "R statistical functions"
[15] Noptel Oy
[16] Leonard Richardson, Beautiful Soup Documentation
[17] Dave Lewis, SerDes architectures and applications, DesignCon 2004
[18] Rob J Hyndman, "Moving Averages", November 2009, Unpublished
[19] R Core Team, R installation and administration manual, April 2015
[20] Population Moments, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p
[21] Exponential Smoothing, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p
[22] ARMA models, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p4364
[23] Ljung, G. M. and G. E. P. Box, "On a measure of lack of fit in time series models", Biometrika , p
[24] Correlation, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p
[25] k-means clustering, Ethem Alpaydin, Introduction to Machine Learning, Second Edition, 2010, ISBN , p
[26] Expectation-Maximization Algorithm, Ethem Alpaydin, Introduction to Machine Learning, Second Edition, 2010, ISBN , p

APPENDIX
To be added in the final report
More informationComparison of Different Implementation of Inverted Indexes in Hadoop
Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationPerformance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming  A Comparative Analysis
Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming  A Comparative Analysis Prabin R. Sahoo Tata Consultancy Services Yantra Park, Thane Maharashtra, India ABSTRACT Hadoop Distributed
More informationClient Overview. Engagement Situation. Key Requirements
Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision
More informationHIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team
HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team Why Another Data Warehousing System? Problem: Data, data and more data 200GB per day in March 2008 back to
More informationAdvanced Forecasting Techniques and Models: ARIMA
Advanced Forecasting Techniques and Models: ARIMA Short Examples Series using Risk Simulator For more information please visit: www.realoptionsvaluation.com or contact us at: admin@realoptionsvaluation.com
More informationHadoop Job Oriented Training Agenda
1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module
More informationThe Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC  MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationEnergy Load Mining Using Univariate Time Series Analysis
Energy Load Mining Using Univariate Time Series Analysis By: Taghreed Alghamdi & Ali Almadan 03/02/2015 Caruth Hall 0184 Energy Forecasting Energy Saving Energy consumption Introduction: Energy consumption.
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Singlenode architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationManifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
More informationCREDIT CARD DATA PROCESSING AND ESTATEMENT GENERATION WITH USE OF HADOOP
CREDIT CARD DATA PROCESSING AND ESTATEMENT GENERATION WITH USE OF HADOOP Ashvini A.Mali 1, N. Z. Tarapore 2 1 Research Scholar, Department of Computer Engineering, Vishwakarma Institute of Technology,
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationSARAH Statistical Analysis for Resource Allocation in Hadoop
SARAH Statistical Analysis for Resource Allocation in Hadoop Bruce Martin Cloudera, Inc. Palo Alto, California, USA bruce@cloudera.com Abstract Improving the performance of big data applications requires
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: HandsOn You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationINTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS)
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS) International Journal of Information Technology & Management Information System (IJITMIS), ISSN 0976 ISSN 0976
More informationLavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs
1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be
More informationBig data blue print for cloud architecture
Big data blue print for cloud architecture COGNIZANT Image Area Prabhu Inbarajan Srinivasan Thiruvengadathan Muralicharan Gurumoorthy Praveen Codur 2012, Cognizant Next 30 minutes Big Data / Cloud challenges
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationClick Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationA Study of Data Management Technology for Handling Big Data
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationTime Series Analysis: Basic Forecasting.
Time Series Analysis: Basic Forecasting. As published in Benchmarks RSS Matters, April 2015 http://web3.unt.edu/benchmarks/issues/2015/04/rssmatters Jon Starkweather, PhD 1 Jon Starkweather, PhD jonathan.starkweather@unt.edu
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIGDATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Ecosystem Hadoop
More informationFirebird meets NoSQL (Apache HBase) Case Study
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract Bigdata is
More informationApache Flink Nextgen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Nextgen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationBig Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing
Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:
More informationAn Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 20152025. Reproduction or usage prohibited without permission of
More informationAnalysis of algorithms of time series analysis for forecasting sales
SAINTPETERSBURG STATE UNIVERSITY Mathematics & Mechanics Faculty Chair of Analytical Information Systems Garipov Emil Analysis of algorithms of time series analysis for forecasting sales Course Work Scientific
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GSASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GSASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationHadoop Ecosystem Overview. CMSC 491 HadoopBased Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 HadoopBased Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationOn a Hadoopbased Analytics Service System
Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 20748523 On a Hadoopbased Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology
More informationMapReduce Design of KMeans Clustering Algorithm
MapReduce Design of KMeans Clustering Algorithm Prajesh P Anchalia Department of CSE, Email: prajeshanchalia@yahoo.in Anjan K Koundinya Department of CSE, Email: annjank@gmail.com Srinath N K Department
More informationBig Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
More informationCreating a universe on Hive with Hortonworks HDP 2.0
Creating a universe on Hive with Hortonworks HDP 2.0 Learn how to create an SAP BusinessObjects Universe on top of Apache Hive 2 using the Hortonworks HDP 2.0 distribution Author(s): Company: Ajay Singh
More informationManaging Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
More informationBayesian networks  Timeseries models  Apache Spark & Scala
Bayesian networks  Timeseries models  Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup  November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, QuaidEMillath Govt College for Women, Chennai, (India)
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationIMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAPREDUCE
IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAPREDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More informationEM Clustering Approach for MultiDimensional Analysis of Big Data Set
EM Clustering Approach for MultiDimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationComparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Bigdata Analytics / MapReduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationThe Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by : Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress
More informationAxibase Time Series Database
Axibase Time Series Database Axibase Time Series Database Axibase TimeSeries Database (ATSD) is a clustered nonrelational database for the storage of various information coming out of the IT infrastructure.
More informationSpring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE
Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working
More informationChapter 11 MapReduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 MapReduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationBig Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will
More informationBiDAl: Big Data Analyzer for Cluster Traces
BiDAl: Big Data Analyzer for Cluster Traces Alkida Balliu, Dennis Olivetti, Ozalp Babaoglu, Moreno Marzolla, Alina Sirbu Department of Computer Science and Engineering University of Bologna, Italy BigSys
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationHadoop and Hive. Introduction,Installation and Usage. Saatvik Shah. Data Analytics for Educational Data. May 23, 2014
Hadoop and Hive Introduction,Installation and Usage Saatvik Shah Data Analytics for Educational Data May 23, 2014 Saatvik Shah (Data Analytics for Educational Data) Hadoop and Hive May 23, 2014 1 / 15
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationHadoop and MapReduce. Swati Gore
Hadoop and MapReduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why MapReduce not MPI Distributed sort Why Hadoop? Existing Data
More informationBig Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools
More informationApache MRQL (incubating): Advanced Query Processing for Complex, LargeScale Data Analysis
Apache MRQL (incubating): Advanced Query Processing for Complex, LargeScale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am
More informationL1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
More informationCOSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
More informationTIME SERIES ANALYSIS
TIME SERIES ANALYSIS L.M. BHAR AND V.K.SHARMA Indian Agricultural Statistics Research Institute Library Avenue, New Delhi0 02 lmb@iasri.res.in. Introduction Time series (TS) data refers to observations
More informationWhat is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationIntroduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.
Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 23197242 Volume 4 Issue 8 Aug 2015, Page No. 1363513639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationAlternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK OVERVIEW ON BIG DATA SYSTEMATIC TOOLS MR. SACHIN D. CHAVHAN 1, PROF. S. A. BHURA
More informationIBM SPSS Forecasting 22
IBM SPSS Forecasting 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release 0, modification
More informationProgramming Hadoop 5day, instructorled BD106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5day, instructorled BD106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationHadoop Introduction. Olivier Renault Solution Engineer  Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer  Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationMaking big data simple with Databricks
Making big data simple with Databricks We are Databricks, the company behind Spark Founded by the creators of Apache Spark in 2013 Data 75% Share of Spark code contributed by Databricks in 2014 Value Created
More informationBackground on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros
David Moses January 2014 Paper on Cloud Computing I Background on Tools and Technologies in Amazon Web Services (AWS) In this paper I will highlight the technologies from the AWS cloud which enable you
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationA CostBenefit Analysis of Indexing Big Data with MapReduce
A CostBenefit Analysis of Indexing Big Data with MapReduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
More informationBig Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel
Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined
More information