Multidimensional Arrays for Warehousing Data on Clouds


Laurent d'Orazio (1) and Sandro Bimonte (2)

(1) Blaise Pascal University - LIMOS, France, dorazio@isima.fr
(2) Cemagref, France, sandro.bimonte@cemagref.fr

Abstract. Data warehouses and OLAP systems are business intelligence technologies. They allow decision-makers to analyze, on the fly, huge volumes of data represented according to the multidimensional model. Cloud computing, driven by major ICT companies such as Google, Microsoft and Amazon, has recently attracted considerable attention. OLAP querying and data warehousing in such a context is a major issue. Indeed, the problems to be tackled are the basic ones of large-scale distributed OLAP systems (querying large amounts of data, semantic and structural heterogeneity), seen from a new point of view that takes into account the specificities of these architectures (pay-as-you-go rule, elasticity, and user-friendliness). In this paper we address the pay-as-you-go rule for warehouse data storage. We propose to use multidimensional array storage techniques for clouds. First experiments validate our proposal.

1 Introduction

Data warehouses and OLAP systems are business intelligence technologies that aim at the analysis of huge volumes of data modeled according to the multidimensional model [13]. In typical architectures, OLAP systems are deployed using relational DBMSs to store and analyze data. This approach is suitable for sparse data warehouses. When data is dense, the MOLAP approach can be used [28]. It stores data in a multidimensional data structure, such as multidimensional arrays, in order to reduce the size of stored data.

High Performance Computing architectures aim at meeting the increasing computing and storage needs of both scientific and industrial applications [6].
Among these architectures, cloud computing, driven by companies like Google, Microsoft and Amazon, attracts particular interest because of its low costs and because it provides good out-of-the-box solutions, even if its performance is below that of current parallel DBMSs [23].

Data warehouses and OLAP systems on clouds raise several problems related to storage and query computation performance. In particular, the problems to consider include the basic ones of large-scale distributed systems (querying large amounts of data, semantic and structural heterogeneities), seen from a new point of view regarding the specific behaviors of these architectures: pay-as-you-go model, elasticity and user-friendliness [6]. Some works support complex queries such as spatial and OLAP queries on clouds [17], [25], [27]. However, to the best of our knowledge, no work defines a particular data model to store multidimensional data on clouds while trying to respect the pay-as-you-go model.

Therefore, in this paper we provide the first step towards the implementation of a multidimensional array-based architecture on clouds, in order to reduce data storage costs. In particular, we present an algorithm that transforms data stored using multidimensional arrays into Pig data [17]. This allows us to perform OLAP queries using the MapReduce paradigm [9] while saving storage costs. The second contribution of this paper is the description of open research issues in using cloud databases for OLAP analysis.

This paper is organized as follows. Section 2 presents the context of our work. Section 3 introduces our proposition of multidimensional array storage on clouds. Section 4 validates our approach. Section 5 lists research opportunities. Finally, section 6 concludes this paper.

A. Hameurlain, F. Morvan, and A. Min Tjoa (Eds.): Globe 2010, LNCS 6265, © Springer-Verlag Berlin Heidelberg 2010

2 Context and Research Motivation

This section briefly presents a case study that will be used as an illustration in subsection 2.1, data warehouses and OLAP in subsection 2.2, and data management in clouds in subsection 2.3, then introduces our research motivation in subsection 2.4.

2.1 Case Study

In order to present our work, we introduce a simulated case study that concerns the OLAP analysis of sales for the stores of a supply chain, which are located in each French department. It presents two dimensions: a spatial dimension that groups departments into regions, and a temporal dimension (day < month < year). The measure is the profit. An example of data is shown in table 1.

Table 1. Case study data

Year  Month  Day  Country  Region       Department   Profit
                  France   Auvergne     Puy-de-Dôme
                  France   Auvergne     Allier
                  France   Rhône-Alpes  Isère

2.2 Data Warehouse and OLAP

Data warehouses model data according to the multidimensional model. Such a model defines the concepts of dimensions and measures. A dimension is composed of hierarchies and represents an analysis axis. A hierarchy organizes data into a hierarchical structure, allowing decision-makers to analyze measures at different granularities. Measures are numerical indicators which describe the analysis subject. OLAP operators such as roll-up and drill-down allow decision-makers to navigate through hierarchies, aggregating data using SQL aggregation functions [13]. Other operators have been defined to select a part of the data warehouse and to permute dimensions [20].

MOLAP systems use multidimensional data structures, such as multidimensional arrays, constructed from the original data, which are typically stored in relational databases. MOLAP systems improve storage performance for dense data warehouses through their particular storage data model [27]. Indeed, multidimensional arrays allow storing only the measure values, as these are indexed by the positions of the dimension members. For example, in the MOLAP representation of our case study illustrated by figure 1, the measure value at position ARRAY[2][1] is associated with the second member of the first dimension and the first member of the second dimension (Allier department).

Dimensions
  Time
    Time Dim[0]=
    Time Dim[1]=
    ...
    Time Dim[34121]=
  Location
    Location Dim[0]=France,Auvergne,Puy-de-Dôme
    Location Dim[1]=France,Auvergne,Allier
    ...
    Location Dim[99]=France,Rhône-Alpes,Isère
Measures
  Facts
    Profit Fact[0]=2000
    Profit Fact[1]=

Fig. 1. MOLAP representation of data

In order to store a multidimensional array as a unidimensional array, a simple formula is used. Let $d$ be the number of dimensions and $N_k$ the number of members of the $k$-th dimension. The position of a measure value in the unidimensional array is then

$$p(i_1, \ldots, i_d) = \sum_{j=1}^{d} \left( i_j \cdot \prod_{k=j+1}^{d} N_k \right)$$

where $i_j$ is the position of the member of the $j$-th dimension.
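The position formula of subsection 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are taken to be zero-based, the function name `linear_position` is hypothetical, and the sizes used in the example (3 time members, 100 location members) are made-up values rather than figures from the case study.

```python
from math import prod


def linear_position(indices, sizes):
    """Return the offset of a cell in the flattened (unidimensional) array.

    indices: zero-based member positions i_1..i_d (assumption: zero-based).
    sizes:   number of members N_1..N_d of each dimension.
    """
    if len(indices) != len(sizes):
        raise ValueError("one index is required per dimension")
    position = 0
    for j, (i_j, n_j) in enumerate(zip(indices, sizes)):
        if not 0 <= i_j < n_j:
            raise IndexError(f"index {i_j} out of range for dimension {j}")
        # i_j is weighted by the product of the sizes of the following dimensions;
        # prod() of an empty sequence is 1, which covers the last dimension.
        position += i_j * prod(sizes[j + 1:])
    return position


# Hypothetical cube: 3 time members x 100 location members.
print(linear_position((2, 1), (3, 100)))  # 2 * 100 + 1 = 201
```

Because only the offset is computed, the dimension members never need to be stored next to each measure value, which is where the storage saving of the array representation comes from.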
2.3 Data Management on Cloud

In order to obtain scalable and efficient database management systems able to deal with data volumes that cannot be managed by classical relational or object DBMSs (for example, Facebook manages more than three hundred million users, more than two billion pictures uploaded and more than three million events added per day), new data management architectures on clouds have been developed.

Data management in clouds usually follows a layered architecture, as illustrated by figure 2. The first level is the infrastructure tier. Typically, such a tier consists of one or several data centers that are used to run large data analysis processes [4] [1]. The main characteristic of this level is the associated pay-as-you-go model.

Fig. 2. Cloud data management architecture

The second tier is the storage tier. Its main objective is to provide a highly scalable and fault-tolerant system. In clouds, data are stored in files managed by such systems [11] [2].

The third tier is the execution environment tier. The best-known example of a cloud computing execution environment is probably Google MapReduce [9] and its open source version Hadoop [3]. Such an execution tier aims at providing elasticity by making it possible to adjust resources according to the application. On the one hand, this property avoids the large investments needed for applications to sustain usage peaks, investments that would otherwise lead to a global under-use of the infrastructure. On the other hand, it ensures the proper functioning of an application whose popularity was not correctly foreseen, by increasing resources if necessary.

The last tier is the high-level query language tier. It aims at providing user-friendliness, transparency with respect to the other tiers of the architecture, and as much parallelism as possible. Several query languages have been proposed, such as Facebook Hive [25], Microsoft Scope [7], Google Sawzall [19] and Map-Reduce-Merge [26], which are based on particular data models such as the column-oriented model [22] or extensions of the relational model [17] [25]. In particular, the Pig Latin language [17] has been designed to offer a trade-off between the declarative style of SQL and the low-level, procedural style of MapReduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over a parallel execution environment.

2.4 Research Motivations

According to the pay-as-you-go principle, users of clouds only pay for the resources (CPU, storage, bandwidth consumption) they use.
For example, with Microsoft Windows Azure [4], CPU costs 0.12 $ per hour of execution, storage costs 0.15 $ per GB per month, and bandwidth consumption costs 0.10 $ per GB in upload and 0.15 $ per GB in download. While all query languages for data in clouds support OLAP queries (only indirectly, since no ad hoc operator [12] has been introduced), none of them addresses multidimensional data storage. Our idea is therefore to provide a particular organization of multidimensional data on clouds in order to reduce the storage and computation costs of OLAP queries, and at the same time to take advantage of the characteristics of cloud data management systems: scalability and performance.

3 Multidimensional Arrays in Clouds

In this section, we give an overview of the querying process on multidimensional arrays in clouds (subsection 3.1). Then, we present storage and data processing in more detail (subsection 3.2), followed by our optimization of Pig OLAP queries (subsection 3.3).

3.1 Overview of Querying Multidimensional Arrays in Clouds

The querying process is composed of two steps, as illustrated by figure 3:

1. Data are structured as arrays. This reduces the size of the stored files and, as a consequence, the price to be paid by clients. When a query (or a set of queries) is posed, the arrays are translated into Pig data in a temporary file, using the algorithm presented in subsection 3.2. This file is removed after the analysis.

2. OLAP queries are formulated and optimized into an efficient execution plan of Pig Latin instructions. Note that these queries can be executed in a parallel fashion, using the MapReduce paradigm, which enables elasticity.

Fig. 3. Overview of the querying process

3.2 OLAP Queries Using Pig Latin and Multidimensional Arrays

This section presents how the multidimensional arrays we propose can be used with Pig storage capabilities. The Pig data model is an extension of the relational model, with the following atomic concepts: Bag (set of values), Map (hash function), nested tables and UDF (User Defined Functions).

Data are stored in logical multidimensional arrays, which are physically stored as a unidimensional array using the formula presented in subsection 2.2. Figure 4(a) illustrates the multidimensional arrays for the data of our case study. For example, the first fact (first line of the measure part) is the measure value associated with the first time member (first line of the time part) and France,Auvergne,Puy-de-Dôme (first line of the location part).

When queries are posed, data are converted into Pig data in a temporary file. Each line represents a tuple, the values of a tuple being separated by semicolons. The conversion of the data of our case study is shown in figure 4(b).

France,Auvergne,Puy-de-Dôme
France,Auvergne,Allier
...
France,Rhône-Alpes,Isère
2000;500;400;...
(a) Multidimensional arrays

2000;01;01;France;Auvergne;Puy-de-Dôme;
;01;01;France;Auvergne;Allier;
;04;20;France;Rhône-Alpes;Isère;2500
(b) Pig data

Fig. 4. Data representation

Conversion from multidimensional arrays to Pig data is done via algorithm 1. The inputs of this algorithm are the files which store the arrays of dimensions and measures (figure 4(a)). The output is the file which represents the warehouse data using the Pig data model (figure 4(b)). The idea of the algorithm is to build the Cartesian product of the first n-1 dimensions. These data are then joined in a Cartesian product with the n-th dimension, and the values of the measure array are added to the generated tuples as follows: the i-th tuple receives the i-th value of the measure array. When the analysis is complete, the temporary file is removed in order to save storage costs.

Algorithm 1. Algorithm for conversion of multidimensional arrays data to Pig data

Require: Tables files
Ensure: Pig file
  int i ← 1;
  int n;
  file cartprodfile; {initialized by the cartesian product of the two last dimensions}
  file pigfile;
  file mafile;
  array dimensions; {set of dimensions}
  while i <= n-1 do
    cartprodfile ← cartproduct(dimensions[i], cartprodfile);
    i ← i+1;
  end while
  i ← 0;
  while mafile not end do
    pigfile.insert(cartproduct(dimensions[n], cartprodfile) + "," + mafile(i));
    i ← i+1;
  end while
  return pigfile;
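The conversion idea of algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it works on in-memory lists instead of files, the function name `arrays_to_pig_lines` is hypothetical, time members are assumed to be written as comma-separated levels in the same way as location members, and the sample members and profit values are made up for illustration. It relies on `itertools.product`, which enumerates combinations with the last dimension varying fastest, so the i-th generated tuple matches the i-th value of a measure array laid out with the position formula of subsection 2.2.

```python
from itertools import product
from math import prod


def arrays_to_pig_lines(dimensions, measures, level_sep=","):
    """Turn dimension arrays plus a flat measure array into Pig-style lines.

    dimensions: list of lists of members, e.g. "France,Auvergne,Allier".
    measures:   flat list with one value per combination of members
                (assumption: a dense cube with no missing cells).
    """
    expected = prod(len(members) for members in dimensions)
    if len(measures) != expected:
        raise ValueError(f"expected {expected} measure values, got {len(measures)}")
    lines = []
    # product() builds the Cartesian product of all dimensions in row-major order.
    for combination, value in zip(product(*dimensions), measures):
        fields = []
        for member in combination:
            fields.extend(member.split(level_sep))  # one field per hierarchy level
        fields.append(str(value))
        lines.append(";".join(fields))  # tuple values are separated by semicolons
    return lines


# Hypothetical sample data, not taken from the case study.
time = ["2000,01,01", "2000,01,02"]
location = ["France,Auvergne,Puy-de-Dôme", "France,Auvergne,Allier"]
for line in arrays_to_pig_lines([time, location], [2000, 500, 400, 300]):
    print(line)
```

In a deployment following the paper, the resulting lines would be written to the temporary file that Pig loads, and that file would be deleted once the analysis is finished.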

3.3 Optimization of Pig Latin Instructions for OLAP Queries

A classical OLAP query on these data is: "What is the total profit per region during 1990 and 1991?". Such a query can easily be expressed in our case study using Pig statements, as illustrated by figure 5. An OLAP query in Pig is defined using three instructions. One selects the dimension members (1990 and 1991 in our example). One groups the data (by year and region), whereas the last one performs the aggregation (a sum in this particular use case).

  s9091 = FILTER sales BY year == 1990 OR year == 1991;
  groups = GROUP s9091 BY (year, region);
  results = FOREACH groups GENERATE FLATTEN(group), SUM(s9091.profit);

Fig. 5. OLAP query with Pig: FILTER and (GROUP, Aggregation)

Note that, unlike SQL, where the DBMS chooses an execution plan through its optimizer, Pig Latin queries consist of a sequence of instructions whose order is left to the user, and the Pig system only provides a logical optimizer, enabling for example logical optimizations such as projection pushdown [10]. That is why we propose a simple yet efficient optimization of OLAP queries by rewriting Pig statements. Since Pig queries consist of a sequence of instructions, OLAP queries can be formulated in two ways: (i) FILTER, then GROUP and Aggregation (see figure 5), or (ii) GROUP and Aggregation, then FILTER (see figure 6). The chosen sequence greatly influences response times. In this example, if the data source is large, aggregating all the data and only then selecting the dimension members may be costly. Therefore, one intuitive optimization of OLAP queries on Pig is to use the query pattern FILTER, GROUP, Aggregation.

  groups = GROUP sales BY (year, region);
  sumprof = FOREACH groups GENERATE FLATTEN(group) AS (year, region), SUM(sales.profit) AS profit;
  results = FILTER sumprof BY year == 1990 OR year == 1991;

Fig. 6. OLAP query with Pig: (GROUP, Aggregation) and FILTER

4 Validation

Our proposal has been validated with simulated data. All experiments were conducted on a 2.2 GHz Intel Core 2 Duo with 4 GB of RAM. The main objective of these experiments was to show that our proposals enable, on the one hand, a great reduction of storage cost with a negligible overhead and, on the other hand, a performance improvement. Subsection 4.1 focuses on storage and subsection 4.2 on the data conversion process, whereas subsection 4.3 studies the impact of our proposed optimization on response time.
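The two instruction orderings of subsection 3.3, whose response times are compared in subsection 4.3, can be mimicked outside Pig with a small Python sketch. The function names and the sample rows are hypothetical, and the code only illustrates why both orderings return the same answer while the second one aggregates rows that are later discarded; it says nothing about actual Pig or MapReduce execution costs.

```python
from collections import defaultdict


def total_profit(rows):
    """Sum the profit per (year, region); rows are (year, region, profit) tuples."""
    totals = defaultdict(int)
    for year, region, profit in rows:
        totals[(year, region)] += profit
    return dict(totals)


def filter_then_group(rows, years):
    """Pattern (i): select the years first, then group and aggregate."""
    return total_profit(row for row in rows if row[0] in years)


def group_then_filter(rows, years):
    """Pattern (ii): aggregate every row, then keep only the requested years."""
    return {key: value for key, value in total_profit(rows).items() if key[0] in years}


# Hypothetical sample rows: (year, region, profit).
sales = [
    (1990, "Auvergne", 100),
    (1990, "Auvergne", 50),
    (1991, "Rhône-Alpes", 70),
    (1992, "Auvergne", 30),
]
wanted = {1990, 1991}
print(filter_then_group(sales, wanted) == group_then_filter(sales, wanted))  # True
```

Pattern (i) only aggregates the three rows that survive the selection, whereas pattern (ii) aggregates all four rows before dropping the 1992 group, which is the extra work the rewriting is meant to avoid.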

4.1 Storage Consumption

Figure 7 presents the storage consumption in GB according to the data model used, that is to say Pig or multidimensional arrays. Results clearly show that multidimensional array-based storage leads to a dramatic reduction (about 90%) in the amount of storage used by data sources. We can then conclude that our system is cheaper in terms of storage. For example, with Amazon EC2 pricing (0.15 $ per GB per month), a one-TB data source based on a relational model would cost approximately 1850 $ per year, whereas with our approach the cost would be around 230 $.

Fig. 7. Storage consumption

4.2 Data Conversion

Figure 8 presents the mean response time of the conversion from multidimensional arrays to Pig data according to the size of the source, given in number of tuples. This experiment highlights the additional cost, in particular in CPU, induced by our proposal. Results show that the process takes less than one minute to execute for a data source containing up to ten million tuples.

Fig. 8. Data conversion

As a consequence, the additional cost can be considered negligible. In fact, with Amazon EC2 pricing (0.12 $ per hour for a standard instance), this additional process would cost approximately 0.001 $ on a pro rata temporis basis for a data source containing ten million tuples.

4.3 Query Optimization

Finally, table 2 illustrates the impact of the optimization of Pig statements for OLAP queries. It presents the response time for the evaluation of a naive query and of the corresponding optimized query, on a data source containing about half a million tuples. Results clearly show that the optimization accelerates the evaluation process (a 30% reduction in the considered experiment).

Table 2. Impact of optimization on the response time

Query                                               Mean response time (seconds)
Optimized query: FILTER, GROUP and Aggregate        130
Naive query: GROUP, Aggregate, and FILTER

5 Research Opportunities

This section lists the research opportunities that we consider particularly important in order to support OLAP queries and data warehouses in clouds. We divide these opportunities into two categories: performance optimization (subsection 5.1), then modeling and querying (subsection 5.2).

5.1 Performance Optimization

In order to improve the performance of OLAP queries, the following aspects should be considered:

1. Definition of an OLAP Pig Latin query optimizer. Indeed, as previously described, Pig only provides a logical optimizer and no full query plan optimizer. We think that very important query improvements are possible by adapting classical database management system optimizers to the Pig data model and Pig Latin.

2. Data warehouse index implementation using the MapReduce paradigm. Indexes such as bitmap indexes are used in data warehouses in order to optimize computationally expensive queries such as joins [15] or aggregations [24].
As a consequence, parallel indexes implemented according to the MapReduce paradigm would allow us to exploit the large computation capacities offered by cloud infrastructures. Thus, we consider the extension and/or adaptation of these indexes to cloud databases mandatory.

3. Definition of materialized view algorithms based on the pay-as-you-go model. Materialized views are a fundamental technique for OLAP query optimization in ROLAP architectures. Several works have proposed an intelligent selection of the materialized views that should be computed [5]. These approaches do not take into account the principles of the pay-as-you-go model. We therefore believe it is necessary to define materialization and dematerialization techniques that adapt to changes in user query patterns and to the storage costs of cloud computing providers.

4. Integration of caches to improve the quality of service and reduce costs. Caching is crucial to improve performance in many computing systems, and particularly in business intelligence systems [21]. Our objective is to provide sophisticated caching techniques, and more precisely semantic caching [14], [8], to enhance the quality of service (reduced response time, increased availability) and to reduce costs. Such mechanisms would keep copies of the results of frequently posed queries and, as a consequence, save CPU and bandwidth consumption.

5.2 Modeling and Querying

Regarding the modeling and querying of data warehouses in clouds, the following aspects should be tackled:

1. Integration of OLAP SQL operators as native Pig Latin operators. Indeed, one of the most important characteristics of cloud infrastructures is user-friendliness. Since, as shown in this paper, the definition of OLAP queries with Pig Latin is not direct, we consider integrating the principles of the Cube operator [12] into Pig Latin. This would allow us to introduce some typical OLAP server functionalities directly in the cloud data management system [10], facilitating the use and the definition of OLAP systems in clouds.

2. Implementation of advanced modeling properties of multidimensional models using the Pig data model. Multidimensional applications can present advanced modeling properties, such as many-to-many relations between facts and dimensions, complex measures, etc. [18], which are difficult to implement in relational DBMSs [16].
Exploring the power of the Pig data model (as regards the concepts of bag, map, and nested query) for multidimensional modeling remains an important open research issue.

6 Conclusion

This paper presents initial work aiming at providing a multidimensional array-based architecture to be deployed on clouds. We have proposed to use multidimensional arrays to store data in order to optimize storage costs. Then, we have presented an algorithm to convert these structures into Pig data. In this way, OLAP queries can easily be performed using the Pig Latin query language. We have also presented a simple and intuitive OLAP query optimization based on the ordering of Pig Latin statements. Experiments have shown the relevance of such a solution with simulated data. Results clearly show that our proposal saves storage consumption and, as a consequence, enables users of clouds to reduce their costs. Finally, we have listed research opportunities to consider in order to efficiently integrate OLAP queries and data warehouses in clouds. We are currently working on the introduction of the MapReduce paradigm in the algorithm for conversion of multidimensional array data to Pig data, on the introduction of the Cube operator in Pig, and on further experiments on real cloud infrastructures.

Acknowledgment

Thanks to Boussad Mebarki, Ilyas Brahmia and Abdelaziz Merabet, in addition to the APIS team of the LIMOS laboratory and the COPAIN team from the Cemagref, for useful discussions on data warehouses and cloud computing.

References

1. Amazon EC2
2. Amazon S3
3. Hadoop
4. Microsoft Azure
5. Aouiche, K., Darmont, J.: Data mining-based materialized view and index selection in data warehouses. Journal of Intelligent Information Systems 33(1) (2009)
6. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A., Lee, G., Patterson, D.A., Rabkin, A., Stoica, I., Zaharia, M.: Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS, Berkeley (2009)
7. Chaiken, R., Jenkins, B., Larson, P.-Å., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: Scope: easy and efficient parallel processing of massive data sets. PVLDB 1(2) (2008)
8. Dar, S., Franklin, M.J., Jonsson, B.T., Srivastava, D., Tan, M.: Semantic data caching and replacement. In: VLDB, Bombay, India (1996)
9. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1) (2008)
10. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level dataflow system on top of MapReduce: The Pig experience. PVLDB 2(2) (2009)
11. Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: SOSP, Bolton Landing, USA (2003)
12. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In: ICDE, New Orleans, USA (1996)
13. Inmon, W.: Building the Data Warehouse. Wiley, New York (1996)
14. Keller, A.M., Basu, J.: A predicate-based caching scheme for client-server database architectures. VLDB Journal 5(1) (1996)
15. Kimball, R.: The data warehouse toolkit: practical techniques for building dimensional data warehouses. John Wiley & Sons, Inc., Chichester (1996)
16. Malinowski, E., Zimányi, E.: Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications (Data-Centric Systems and Applications). Springer Publishing Company, Incorporated, Heidelberg (2008)
17. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD (2008)
18. Pedersen, T.B., Jensen, C.S., Dyreson, C.E.: A foundation for capturing and querying complex multidimensional data. Information Systems 26(5) (2001)
19. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: Parallel analysis with Sawzall. Scientific Programming 13(4) (2005)
20. Rafanelli, M.: Operators for multidimensional aggregate data. In: Multidimensional Databases: problems and solutions (2003)
21. Savary, L., Gardarin, G., Zeitouni, K.: Geocache: A cache for GML geographical data. IJDWM 3(1) (2007)
22. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E.J., O'Neil, P.E., Rasin, A., Tran, N., Zdonik, S.B.: C-store: A column-oriented DBMS. In: VLDB (2008)
23. Stonebraker, M., Abadi, D.J., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Communications of the ACM 53(1) (2010)
24. Tao, Y., Papadias, D.: Historical spatio-temporal aggregation. ACM Transactions on Information Systems 23(1) (2005)
25. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive - a warehousing solution over a map-reduce framework. PVLDB 2(2) (2009)
26. Yang, H.-C., Dasdan, A., Hsiao, R.-L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: SIGMOD, Beijing, China (2007)
27. Zhang, S., Han, J., Liu, Z., Wang, K., Feng, S.: Spatial queries evaluation with MapReduce. In: GCC (2009)
28. Zhao, Y., Deshpande, P., Naughton, J.F.: An array-based algorithm for simultaneous multidimensional aggregates. In: Peckham, J. (ed.) SIGMOD, Tucson, USA (1997)


DATA WAREHOUSING AND OLAP TECHNOLOGY DATA WAREHOUSING AND OLAP TECHNOLOGY Manya Sethi MCA Final Year Amity University, Uttar Pradesh Under Guidance of Ms. Shruti Nagpal Abstract DATA WAREHOUSING and Online Analytical Processing (OLAP) are

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Cooperative Database Caching within Cloud Environments

Cooperative Database Caching within Cloud Environments Cooperative Database Caching within Cloud Environments Andrei Vancea 1, Guilherme Sperb Machado 1, Laurent d Orazio 2, and Burkhard Stiller 1 1 Department of Informatics (IFI), University of Zürich, Zürich,

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

HANDLING IMPRECISION IN QUALITATIVE DATA WAREHOUSE: URBAN BUILDING SITES ANNOYANCE ANALYSIS USE CASE

HANDLING IMPRECISION IN QUALITATIVE DATA WAREHOUSE: URBAN BUILDING SITES ANNOYANCE ANALYSIS USE CASE International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XL-2/W1, 213 8th International Symposium on Spatial Data Quality, 3 May - 1 June 213, Hong Kong HANDLING

More information
