Building OLAP cubes on a Cloud Computing environment with MapReduce

Billel ARRES, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, Bron, France, Billel.Arres@univ-lyon2.fr
Nadia KABBACHI, Université Claude Bernard Lyon 1, 43 Boulevard du 11 Novembre, Villeurbanne, France, Nadia.Kabachi@univ-lyon1.fr
Omar BOUSSAID, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, Bron, France, Omar.Boussaid@univ-lyon2.fr

Abstract — Large-scale data analysis has become increasingly important for many enterprises, and Cloud Computing, under the impulse of large companies, has recently received special attention in both industry and academic research. Hadoop, based on a new distributed computing paradigm called MapReduce, has facilitated access to such environments, thanks to its impressive scalability and its flexibility in handling structured as well as unstructured data. The goal of our work is to develop a Cloud Computing environment for exploiting data warehouses and performing online analysis. It consists of handling large non-relational databases and supporting data warehouses with a new generation of database management systems (DBMS) such as Hive. To set up such an environment, we implemented a data warehouse under Hadoop and Hive and used the Map and Reduce functions of this environment; we then compared the cost of loading the warehoused data and constructing OLAP cubes between a virtual and a physical cluster, as well as the scaling up of data loading on a physical cluster. The obtained results allow MapReduce developers to compare performance fully, help in the choice of a platform on which a client application can be developed to translate SQL requests into HQL (Hive-QL) requests, and check whether a non-relational model is adequate or not.

I. INTRODUCTION

Data warehouses and OLAP (Online Analytical Processing) systems represent decision aid technologies that allow online analysis of large volumes of data [18].
While these two systems were originally dedicated to organizing, storing and optimally exploiting simple data, they were not designed for the new orders of magnitude of data accumulated since. Notice that it is not uncommon to see petabyte-scale data warehouses nowadays [3] [11]. On the other hand, high-performance architectures are designed to meet the growing computation and storage needs of scientific and industrial applications. Among these architectures is Cloud Computing [9], which globally consists of outsourcing data processing and storage. Popularized by Google, an innovative model of parallel data processing, called MapReduce, was presented in a now famous article [6]. The breakthrough of this model made it possible to solve, with ordinary equipment, in one minute a problem that previously required an hour, on condition of multiplying the number of machines by 60 [15]. In addition, the arrival of the Hadoop project, based on MapReduce, made these high-performance architectures easy to exploit. The construction of data warehouses and online analysis on the Cloud is a new research field. Indeed, if, on the one hand, the University of California has identified research areas for the Cloud (availability, quality of service, security, etc.) [9], on the other, little attention has been paid to what is done, technically, for the construction of data warehouses and OLAP analysis on the Cloud [1]. Hence the necessity to move in this direction. To do so, we set ourselves the goal of building a development environment based on Hadoop and Hive, to test this new generation of DBMS. In this paper, we propose an original architecture to, first, set up a data warehouse on the Cloud for performing online analysis, and second, evaluate the performance of different variants of the architecture set up with Hadoop and Hive.
We decided to use the SSB [22] data sample, a decision-support benchmark designed to measure the performance of a star schema data warehouse. We propose to realize such an environment by implementing a data warehouse under Hive, in order to exploit the Map and Reduce functions of this environment and compare the data warehouse loading time and the OLAP cube building time. This was done, on the one hand, on a virtual cluster with a virtual machine (VM), then on a physical cluster. On the other hand, we evaluate the scalability of Hive on a physical cluster with four and six nodes. The results will serve as a basis for future benchmarking work, help in the choice of a platform on which client applications can be developed to translate SQL queries into Hive-QL, and finally validate the adequacy of non-relational models for warehousing complex data. This paper is organized as follows. Section II describes the background of this work. Section III is devoted to the proposed approach, with an overview of our data warehouse system and schema design. We have performed several experiments to test this environment, which we develop in section IV. The performance evaluation results are presented in section V. We conclude in section VI.

II. STATE OF THE ART AND RELATED WORK

To understand the rest of the paper, the reader may need some basic knowledge of the tools we are using. In this section we summarize the main concepts of Cloud Computing and of MapReduce with its open source implementation Hadoop, as they were presented by Google in [6] and [16], respectively.

A. Cloud Computing

Cloud Computing is a set of services deployed across a network. It is a concept which consists of moving to distant servers the storage and computer processing usually located

on local servers [8]. The architecture of a Cloud Computing environment is generally based on a layered organization, as shown in Fig.1.

Fig. 1. Data management architecture on the cloud.

The first level corresponds to Infrastructure as a Service (IaaS). In general, it is composed of data centers made available by cloud providers. Amazon EC2 [20] and Microsoft Azure [21] are examples of such infrastructures. The second level is dedicated to Platform as a Service (PaaS). It provides a quickly available runtime environment. The best-known example in this regard is MapReduce [6], and its open source implementation Hadoop [16]. The third level is the execution environment (SaaS, Software as a Service). It offers software solutions as hosted services and aims to allow easier use and total transparency of the other layers of the architecture. Hive from Facebook [3], or Scope [12] from Microsoft, are examples of this last layer, based on particular data models such as the column-oriented model or models extending the relational model such as NoSQL (Not Only SQL). Cloud Computing provides elasticity to adjust resources according to applications, increasing computing power and storage at peak usage times and decreasing them during slack periods, while allowing the parallelization of storage and data processing. Depending on the approach, there are two main deployment models for Cloud Computing services [9]: the Public Cloud, accessible to a wide public and belonging to a service provider; and the Private Cloud, where the infrastructure is completely dedicated to a single organization. In our work, a private cloud architecture based on Hadoop and Hive is implemented. It allows the compilation of Hive-QL instructions to set up a data warehouse and construct OLAP cubes in a parallel environment (section III).

B. MapReduce and Hadoop

1) The MapReduce paradigm: MapReduce is a programming model suited to massively parallel processing of very large amounts of data [6].
This programming model is based on two main steps, Map and Reduce. In the Map step, the node (machine) that is submitted a problem cuts it into subproblems and delegates them to other nodes (which can do the same recursively). In the Reduce step, the lowest nodes return their results to the parent nodes that solicited them. At the end of the process, the original node can reconstruct a solution to the problem it was asked to solve. Introduced by Google, MapReduce was used, in 2007, to process more than 400 TB of data in 6 minutes with 436 machines [15]. The most popular implementation of the MapReduce paradigm is the Hadoop framework, which allows applications to be run on large clusters deployed on low-cost machines. Other implementations of the MapReduce paradigm are available for different architectures, such as multicore architectures [13], multiple virtual machine architectures [14], Grid Computing environments [4] or even mobile environments [2].

2) Hadoop: Hadoop is an open source project based on the MapReduce paradigm and the Google File System. It can be considered a processing system for scalable data storage and batch processing of large quantities of data. It is perfectly suited for ad hoc storage and analysis of very large volumes of data [16]. With Hadoop, since it involves managing very large volumes of data, we must also optimize the use of network bandwidth. That is why MapReduce is generally used in combination with a distributed file management system; in the case of Hadoop, this is HDFS (Hadoop Distributed File System). HDFS has a master/slave architecture [17]. In this logic, a Hadoop cluster consists of a single master server, named NameNode, which manages the file system and access rights, and of servers that are both computation and storage tools, named DataNodes, in general one per node.
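To make the two steps concrete, the canonical word-count example can be simulated in a few lines of Python. This is a minimal single-process sketch of the paradigm, not Hadoop code; the function names and the in-memory shuffle are our own illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    # Map: split the input and emit one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: sum the partial counts collected for one word.
    return word, sum(counts)

lines = ["map reduce map", "reduce map"]

# Shuffle: group the intermediate pairs by key, as the framework
# does between the Map and Reduce phases.
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

counts = dict(reduce_phase(w, c) for w, c in groups.items())
print(counts)  # {'map': 3, 'reduce': 2}
```

In a real cluster, the map calls run in parallel on the DataNodes holding the input blocks, and the shuffle moves intermediate pairs over the network; the logic, however, is exactly this.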
Hadoop has been widely adopted by the decision-support community, and has made the field of data warehousing in the cloud more accessible.

III. PROPOSED APPROACH

We have implemented a private cloud architecture, limited in terms of infrastructure. The objective is to test the feasibility of our approach of building a data warehouse in the Cloud. It does not allow us to study the transition to very large scale. However, it helps us deploy the parallelization of the storage and processing of the data warehouse. This allows us to observe the behavior of this approach under scaling, although it remains relatively limited. The goal is to verify that performance does not degrade when we increase the size of the data warehouse. Our work consists, first, in implementing the parallel data processing paradigm MapReduce across multiple distributed clusters, using Hadoop and Hive. Then, we propose a performance evaluation of Hive's data warehouse loading time and OLAP cube construction time. This evaluation was performed using HiveQL queries, under different variants of the architecture. We propose an architecture (Fig.2) that allows the partitioning of the data warehouse over different clusters (nodes), and the construction and querying of the OLAP cube by the user.

Fig. 2. The proposed architecture.

As can be seen in Fig.2, the user can issue an HQL query through the Web UI, the command line interface (CLI), or Java code via JDBC. The query is sent to the node that runs the Query Driver. The main task of the Query Driver is to translate the query into a MapReduce job, which includes the map phase plan and the reduce phase plan.
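To give an idea of what such a translation produces, the sketch below simulates, in plain Python, a map phase plan (filter the fact rows and emit a grouping key with the measure) and a reduce phase plan (aggregate the measure per key) for a revenue aggregation of the kind used later for the OLAP cube. The row layout and names are illustrative assumptions, not the actual plans generated by Hive's Query Driver.

```python
from collections import defaultdict

# Illustrative LINEORDER-like rows: (year, brand, region, revenue).
rows = [
    (1992, "MFGR#12", "ASIA", 100),
    (1993, "MFGR#12", "ASIA", 40),
    (1992, "MFGR#12", "EUROPE", 70),   # filtered out by the map plan
    (1993, "MFGR#21", "ASIA", 25),
]

def map_plan(row):
    """Map phase plan: filter rows, then emit ((year, brand), revenue)."""
    year, brand, region, revenue = row
    if region == "ASIA" and year >= 1992:
        yield (year, brand), revenue

def reduce_plan(key, values):
    """Reduce phase plan: sum the revenue for one (year, brand) group."""
    return key, sum(values)

# The framework's shuffle, simulated in memory.
groups = defaultdict(list)
for row in rows:
    for key, value in map_plan(row):
        groups[key].append(value)

cube_slice = dict(reduce_plan(k, v) for k, v in groups.items())
print(cube_slice)
# {(1992, 'MFGR#12'): 100, (1993, 'MFGR#12'): 40, (1993, 'MFGR#21'): 25}
```

This is exactly the shape of job the Query Driver produces from a HiveQL GROUP BY: predicates are pushed into the map phase, the GROUP BY columns become the intermediate key, and the aggregate function becomes the reduce phase.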

In the following, we introduce the data warehouse used for the study (section III.A). Then, we explain the HiveQL requests used for building the OLAP cube (section III.B).

A. Construction phase of the data warehouse

To build the data warehouse, we opted for the SSB [22] data model. It is a test bench designed to measure the performance of a star schema data warehouse [10]. Using the classic star schema model of sales stores, the implemented data warehouse (Fig.3) consists of a fact table named LINEORDER. It has seventeen attributes providing information about an order, including a primary key, composed of ORDERKEY and LINENUMBER, and the foreign keys of the dimension tables CUSTOMER, PART, DDATE and SUPPLIER.

Fig. 3. Data warehouse schema.

B. Loading phase and the OLAP cube construction

The majority of the operations supported by HiveQL are very similar to SQL. However, the architectural difference between the two systems on which these languages are based, and especially the use of HDFS as the file management system by Hive, imposes additional operations that require adaptation by the user. Thus, to measure the performance of Hive on data warehouse loading time and OLAP cube construction time, we developed a set of HiveQL queries. In this paper, we consider an OLAP query as a query that involves two levels of aggregation. Indeed, a classic OLAP query [7] consists of a set of queries on different levels of aggregation joined by a union operator, which is also implemented in HiveQL. We chose to build a data cube answering the following decision query: "What is the sum of sales revenues by year and brand in Asia since 1992?". Here, we look for the total income (REVENUE) as a measure according to the dimensions PRODUCT (PART) and TIME (DATE).

IV. EXPERIMENTATION

In this section, we evaluate the performance of data warehouse loading and OLAP cube construction with Hive, according to different variants of the defined architecture. The first two experiments (sections IV.A and IV.B) compare the performance of our platform between virtual and physical environments, by fixing the number of nodes (1 node) and the size of the implemented data warehouse (1 GB). The third experiment (section IV.C) addresses scaling up in a physical environment, varying the number of nodes (4 and 6) and the size of the data warehouse (from 1 GB to 1 TB).

A. Experiments on a virtual cluster (1 node)

For evaluation and familiarization purposes with the Hadoop system, a virtual machine is a good starting choice. It allows the deployment of Hadoop with a minimum of time and material resources. We chose to install the Cloudera-CDH4 development package [23] for virtual machines (VM), which provides a pre-installation of Hadoop and its components, including Hive, on Ubuntu. The package was installed on a physical machine (Fig.4) with standard features: 4 GB of RAM and an Intel 3.10 GHz x4 processor, running Windows 7. The configuration is equivalent to a virtual cluster with a single node, where the machine operates as NameNode and DataNode at the same time. In this first part of the experiment, we used a data warehouse of 1 GB, which is sufficient to test the feasibility of the proposed architecture in a virtual environment.

Fig. 4. Cloudera development environment on Linux.

B. Experiments on a physical cluster (1 node)

The second part of the experiment aimed to compare the performance of Hive deployed on a physical cluster against the virtual cluster already implemented. Thus, the same platform was deployed, this time on a physical machine, with 4 GB of RAM, an Intel 3.10 GHz x4 processor and Ubuntu as operating system. We used the version of Hadoop provided by Apache. The data warehouse is the same as in the first part of the experiment (1 GB).
The configuration set up is equivalent to a cluster with a single node, where the machine operates as NameNode and DataNode at the same time. Hive was installed from stable version 0.9.0, available on Apache's website [19].

C. Scalability on the physical cluster (4-6 nodes)

This part of the experiment aims to measure the response time of the Hadoop/Hive platform under load. It allows us to exploit the heart of the Hadoop system, namely the advanced use of HDFS and the distribution of data and loads over the network. All this with the implementation of a physical cluster according to the proposed architecture, and thus to test, modestly, what is currently done

by the big actors of the web with more resources, while trying to evaluate, at our level, the performance of such infrastructures. The first evaluation of Hive on scalability was performed on a cluster of four identical nodes, including the master machine. The second evaluation was performed, this time, on a configuration of six identical nodes, including the master machine. The machines all have the same features, i.e. 4 GB of RAM and an Intel 3.10 GHz x4 processor, running Ubuntu. This allows assessing the performance of the Hadoop platform while increasing both the size of the data warehouse and the number of nodes.

V. RESULTS PRESENTATION AND ANALYSIS

In this section we present the experimental results in two phases. The first phase consisted in fixing the size of the data warehouse (1 GB) and the number of nodes (1 node) to compare performance between a physical cluster and a virtual cluster. The second phase consists in comparing performance on a physical cluster, this time varying the size of the data warehouse (from 1 GB to 1 TB) and the number of nodes (4 and 6).

A. Configuration with a virtual cluster and a physical cluster

The results of the tests performed on both virtual and physical platforms are illustrated in the charts of Fig.5. The loading time of each table is measured in seconds. The sum of the loading times of all tables equals the total time of loading the data warehouse with Hive. The OLAP cube construction time is the time required for Hive to execute all the MapReduce jobs (tasks) created by compiling the group of HiveQL cube-construction queries.

Fig. 5. Results of data warehouse loading time and OLAP cube construction time between the virtual and the physical cluster (1 node).

The results clearly show the superiority of the physical environment for loading the data warehouse (39.09 s) and building the OLAP cube (17.91 min), consuming half of the time required by the virtual cluster for loading the data warehouse (2.38 min) and building the OLAP cube (35.8 min). The virtual cluster provides a quick and easy installation, especially by exploiting the Cloudera package. It facilitates the handling of the Hadoop system with all its components, and allows the creation of an easy-to-use environment to become familiar with cloud computing and data warehousing tools such as Hive. However, wider use with larger-scale data shows the limits of this virtual environment and favors the physical environment.

B. Configuration on a physical cluster with 4 and 6 nodes

The results of the scalability experiments with Hive are shown in the table below (Fig.6). The size of the data warehouse varies from 1 GB to 1 TB. Times are measured in seconds.

Fig. 6. Results of the scalability on physical clusters with 4 and 6 nodes.

We note that the loading time of the data warehouse grows with the size of the warehouse. Thus, if we take the results of loading and building the OLAP cube for a cluster of 6 nodes, the time clearly increases with the size of the warehouse, ranging from a matter of seconds to load a 1 GB data warehouse to more than 29 minutes for 1 TB; and from a little less than 12 minutes for the OLAP cube construction on a 1 GB data warehouse to less than 4 hours for a 1 TB one, which is at first sight huge. This is due to the fact that the machines used for testing are ordinary low-capacity machines compared to what is commonly used. However, these results remain encouraging, because it should be noted that there is a big difference in the transition from four to six nodes, especially in the OLAP cube construction results. These require the execution of a block of HQL queries and a treatment of results detailed below. Fig.7 compares the results of the tests performed on clusters with 4 and 6 nodes.

Fig. 7.
Results of the scaling up on the physical clusters with 4 and 6 nodes.

The loading time of the data warehouse is almost the same for both clusters with 4 and 6 nodes. These results follow a straight line with a relatively low slope, because the loading phase does not require the creation or execution of MapReduce jobs by the Hive compiler. The data is divided into small units and mounted in the same way

on the different nodes (DataNodes) of the cluster via HDFS. Thus, the larger the warehouse grows, the more time Hive needs; however, the evolution of this time remains relatively low, ranging from 2 minutes for loading a 10 GB data warehouse to 20 minutes for a 100 GB one. The variation in the number of nodes (from 4 to 6) is probably not significant enough to highlight differences in loading time that more specific tests, at the scale of Hadoop with dozens or hundreds of nodes, may reveal. In terms of OLAP cube construction time, test results are almost the same for data warehouses that Hadoop and Hive consider relatively small (less than 10 GB). For example, the OLAP cube construction time for a 1 GB warehouse is 12 minutes 30 seconds with 4 nodes and 12 minutes with 6 nodes, i.e., a difference of 30 seconds. The gap between the OLAP cube building times of the two clusters appears from a 10 GB data warehouse onward, and the benefit of adding two nodes to the cluster becomes appreciable as the size of the warehouse increases. For example, the OLAP cube construction for a 100 GB warehouse takes a little over 3 hours with 6 nodes and almost 5 hours with 4 nodes, i.e., a difference of almost 2 hours. OLAP cube construction times with Hive on the 6-node cluster are clearly better than those obtained with the 4-node cluster. In this case, the construction of the cube is done by executing the group of HiveQL queries previously defined. MapReduce jobs are created automatically by the Hive compiler and executed by the different nodes of the cluster, which also hold the data partitions. Thus, the more the number of nodes increases, the smaller the data partitions are, and the more quickly the MapReduce jobs return their results.
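The relation between node count and partition size can be made concrete: HDFS splits a table into fixed-size blocks spread over the DataNodes, so adding nodes shrinks the share of data each node must scan. The sketch below models this with assumed figures (a 100 GB table and the classic 64 MB HDFS default block size); it is a back-of-the-envelope model, not a measurement from our experiments.

```python
import math

BLOCK_MB = 64           # classic HDFS default block size (assumed here)
TABLE_MB = 100 * 1024   # a 100 GB fact table, as in the experiments

def blocks_per_node(table_mb, nodes, block_mb=BLOCK_MB):
    """Blocks each DataNode scans if the table's blocks are spread evenly."""
    total_blocks = math.ceil(table_mb / block_mb)
    return math.ceil(total_blocks / nodes)

# Going from 4 to 6 DataNodes cuts each node's share by about a third,
# which is why the MapReduce jobs building the cube finish sooner.
share_4 = blocks_per_node(TABLE_MB, 4)
share_6 = blocks_per_node(TABLE_MB, 6)
print(share_4, share_6)  # 400 267
```

The model ignores replication, stragglers and shuffle costs, which is why the measured speedup is smaller than the ideal 4/6 ratio for small warehouses and only approaches it as the warehouse grows.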
In the end, the execution plan remains the same between the two clusters, but the number of nodes increases, which reduces the size of the different data partitions and accelerates the execution of the different jobs, thereby also accelerating the construction of the OLAP cube.

VI. CONCLUSION

The size of the data sets collected and analyzed in industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. In this paper we have shown how to benefit from cloud computing technologies to build OLAP cubes by using MapReduce and Hive. Our motivation was to manipulate these concepts by creating an environment with a project around these technologies. The aim was the creation of a model environment on Hadoop, with the establishment of a private cloud computing environment for creating data warehouses and performing online analysis. The experiments carried out have allowed handling and mounting various data warehouses on Hive and evaluating the performance of different variants of the architecture implemented with Hadoop. In addition, these experiments allowed us to exploit the important part of the Hadoop system which is its file management system (HDFS), and especially to understand the logic of its operation. Also, the environment of Hadoop on Ubuntu (in addition to the Cloudera Virtual Machine distribution) was used to validate new versions of the Hadoop ecosystem projects such as Hive. These are more and more stable and relatively easy to deploy in order to exploit and benefit from the power of MapReduce in managing large data warehouses. The perspectives of this project are, in the short term, to continue these experimentations by increasing the number of nodes and the size of the data to several terabytes or even petabytes, in order to better evaluate the performance of these systems.
In the medium term, we plan to develop and implement new algorithms with the Map and Reduce functions of this environment, and to exploit the Cloud architecture at the PaaS level. Finally, in the long term, we plan to develop BI (Business Intelligence) tool solutions in this type of environment.

REFERENCES

[1] A. Abello, J. Ferrarons, O. Romero. Building Cubes with MapReduce. DOLAP'11, October 28, 2011, Glasgow, Scotland, UK.
[2] A. Dou, et al. Misco: a MapReduce framework for mobile systems. In Proceedings of PETRA'10, ACM, New York, NY, USA, 2010.
[3] A. Thusoo, et al. Hive: a warehousing solution over a map-reduce framework. Facebook Data Infrastructure Team.
[4] C. Miceli, et al. Programming abstractions for data intensive computing on clouds and grids. In Proceedings of CCGRID'09, IEEE Computer Society, Washington, DC, USA, 2009.
[5] E. Capriolo, D. Wampler, J. Rutherglen. Programming Hive. O'Reilly.
[6] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
[7] J. Gray, et al. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In Proceedings of the International Conference on Data Engineering, New Orleans, USA.
[8] L. d'Orazio, S. Bimonte. Intégration des tableaux multidimensionnels en Pig pour l'entreposage de données sur les nuages.
[9] M. Armbrust, A. Fox, R. Griffith. Above the Clouds: a Berkeley view of Cloud Computing. Technical Report UCB/EECS, 2009.
[10] P. O'Neil, B. O'Neil, X. Chen. Star Schema Benchmark, 2009. Web page. poneil/starschemab.pdf
[11] Q. Wang, et al. On the correctness criteria of fine-grained access control in relational databases. In Proceedings of VLDB.
[12] R. Chaiken, et al. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2).
[13] R. Chen, et al. Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling. In Proceedings of the 19th PACT'10, ACM, New York, NY, USA, 2010.
[14] S. Ibrahim, et al. Cloudlet: towards MapReduce implementation on virtual machines. In Proceedings of the 18th ACM HPDC'09, ACM, New York, NY, USA, 2009.
[15] S. Genaud. MapReduce : un cadre de programmation parallèle pour l'analyse de grandes données. Université de Strasbourg, 2011.
[16] T. White. Hadoop: The Definitive Guide. O'Reilly.
[17] V. Guana and J. Davidson. On Comparing Inverted Index Parallel Implementations Using MapReduce. University of Alberta.
[18] W. Inmon. Building the Data Warehouse. Wiley, New York, USA.
[19] Apache Hive Releases. Web page.
[20] Amazon EC2. Web page. http://aws.amazon.com/ec2/
[21] Microsoft Azure. Web page.
[22] Star Schema Benchmark. Web page. poneil/StarSchemaB.pdf
[23] Cloudera Enterprise and CDH4.0. Web page. /blog/2012/06/cdh4-and-cloudera-enterprise-4-0-now-available/


Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least

More information

Big Data Explained. An introduction to Big Data Science.

Big Data Explained. An introduction to Big Data Science. Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05 Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970

More information

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data Hive vs. JavaScript for Processing Big Data For some time Microsoft didn t offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Big Data Technologies Compared June 2014

Big Data Technologies Compared June 2014 Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

Big Data and Market Surveillance. April 28, 2014

Big Data and Market Surveillance. April 28, 2014 Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part

More information

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform... Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Discovery 2015: Cloud Computing Workshop June 20-24, 2011 Berkeley, CA Introduction to Cloud Computing Keith R. Jackson Lawrence Berkeley National Lab What is it? NIST Definition Cloud computing is a model

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Research Article Hadoop-Based Distributed Sensor Node Management System

Research Article Hadoop-Based Distributed Sensor Node Management System Distributed Networks, Article ID 61868, 7 pages http://dx.doi.org/1.1155/214/61868 Research Article Hadoop-Based Distributed Node Management System In-Yong Jung, Ki-Hyun Kim, Byong-John Han, and Chang-Sung

More information

From Wikipedia, the free encyclopedia

From Wikipedia, the free encyclopedia Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud) Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.

More information

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

An Approach to Analyze Large Scale Wireless Sensors Network Data

An Approach to Analyze Large Scale Wireless Sensors Network Data An Approach to Analyze Large Scale Wireless Sensors Network Data Soufiane FARRAH * Hanane El Manssouri El Houssaine Ziyati Mohamed Ouzzif IT Department IT Department IT Department IT Department High School

More information

On a Hadoop-based Analytics Service System

On a Hadoop-based Analytics Service System Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Wienand Omta Fabiano Dalpiaz 1 drs. ing. Wienand Omta Learning Objectives Describe how the problems of managing data resources

More information

Virtualizing Apache Hadoop. June, 2012

Virtualizing Apache Hadoop. June, 2012 June, 2012 Table of Contents EXECUTIVE SUMMARY... 3 INTRODUCTION... 3 VIRTUALIZING APACHE HADOOP... 4 INTRODUCTION TO VSPHERE TM... 4 USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP... 4 MYTHS ABOUT RUNNING

More information

Data Warehouse Optimization

Data Warehouse Optimization Data Warehouse Optimization Embedding Hadoop in Data Warehouse Environments A Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy September 2013 Sponsored by Copyright

More information

Adobe Deploys Hadoop as a Service on VMware vsphere

Adobe Deploys Hadoop as a Service on VMware vsphere Adobe Deploys Hadoop as a Service A TECHNICAL CASE STUDY APRIL 2015 Table of Contents A Technical Case Study.... 3 Background... 3 Why Virtualize Hadoop on vsphere?.... 3 The Adobe Marketing Cloud and

More information

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Enhancing Massive Data Analytics with the Hadoop Ecosystem www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3, Issue 11 November, 2014 Page No. 9061-9065 Enhancing Massive Data Analytics with the Hadoop Ecosystem Misha

More information

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Jongwook Woo Computer Information Systems Department California State University Los Angeles jwoo5@calstatela.edu Abstract As the web, social networking,

More information

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research

More information

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Forecast of Big Data Trends Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014 Big Data transforms Business 2 Data created every minute Source http://mashable.com/2012/06/22/data-created-every-minute/

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Cloud computing - Architecting in the cloud

Cloud computing - Architecting in the cloud Cloud computing - Architecting in the cloud anna.ruokonen@tut.fi 1 Outline Cloud computing What is? Levels of cloud computing: IaaS, PaaS, SaaS Moving to the cloud? Architecting in the cloud Best practices

More information

Applied research on data mining platform for weather forecast based on cloud storage

Applied research on data mining platform for weather forecast based on cloud storage Applied research on data mining platform for weather forecast based on cloud storage Haiyan Song¹, Leixiao Li 2* and Yuhong Fan 3* 1 Department of Software Engineering t, Inner Mongolia Electronic Information

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Big Data Analytics OverOnline Transactional Data Set

Big Data Analytics OverOnline Transactional Data Set Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

Modernizing Your Data Warehouse for Hadoop

Modernizing Your Data Warehouse for Hadoop Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist Audie.Wright@Microsoft.com O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

More information

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here> s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information