INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS)

International Journal of Information Technology & Management Information System (IJITMIS), ISSN 0976-6405 (Print), ISSN 0976-6413 (Online), Volume 5, Issue 2, May - August (2014), pp. 51-58
IAEME: http://www.iaeme.com/ijitmis.asp
Journal Impact Factor (2014): 6.2217 (Calculated by GISI) www.jifactor.com

LIMITATIONS OF DATAWAREHOUSE PLATFORMS AND ASSESSMENT OF HADOOP AS AN ALTERNATIVE

KULDEEP DESHPANDE 1, Dr. BHIMAPPA DESAI 2
1 (Ellicium Solutions, Pune, Maharashtra)
2 (Capgemini Consulting, Pune, Maharashtra)

ABSTRACT

The volume and complexity of data collected in datawarehouse systems are growing rapidly. This poses challenges to traditional datawarehouse platforms. At the same time, the Hadoop ecosystem has opened new avenues for implementing datawarehouse systems on Hadoop and overcoming these challenges. In this paper we survey previous studies on the limitations of traditional datawarehouse platforms and discuss the opportunities offered by Hadoop for datawarehouse implementation. This paper can give direction to future research on datawarehouse implementation on the Hadoop platform.

Keywords: Datawarehouse, Hadoop, Hive, Analytical, ETL

I. INTRODUCTION

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive [5]. Data collected from web logs and social media has become an important component of analytical systems. At the same time, these data sources have added complexity to datawarehouses. Since 2005, a richer set of analytical database management systems has been introduced [2]. However, the rate of growth of data volume and complexity is posing challenges to these analytical database systems as well. Companies like Yahoo and Facebook have been using Hadoop for processing large datasets [5].
More recently, there has been increased interest in evaluating Hadoop as a datawarehouse platform. In this paper we study the challenges facing currently available datawarehouse platforms and the opportunities opened up by Hadoop, and we explore areas that need research.

This work is organized as follows. In section 2 we conduct a meta-analysis of two surveys about issues with current datawarehouse platforms and summarize the top issues encountered by industry practitioners. Section 3 gives an overview of Hadoop and of approaches for building a datawarehouse on Hadoop. In section 4, we discuss various ways in which Hadoop can be used for building a datawarehouse, including data archival, data staging and ETL processing. Hadoop is a new technology and has many limitations that need to be overcome to make it a complete datawarehouse platform. These shortcomings are discussed in section 5. Finally, in section 6, we discuss areas of research needed to make Hadoop a full-fledged datawarehouse platform.

II. LIMITATIONS OF DATAWAREHOUSE PLATFORMS

A datawarehouse platform is the most important component of a datawarehouse / analytical system. A datawarehouse platform is defined as a collection of hardware servers, an operating system, a database management system (DBMS) and data storage [1]. The different categories of datawarehouse platforms are as follows:

1. Traditional RDBMS databases
2. Column oriented databases
3. In memory databases
4. Software only appliances
5. Software and hardware appliances
6. Cloud based datawarehouse systems
7. Hadoop based datawarehouse platforms

To define the limitations of traditional datawarehouse platforms and the challenges posed by growth in data volume and variety, we refer to two previous studies. In [1], a survey of 417 business and technical executives was conducted by The Datawarehousing Institute (TDWI). The objective of this survey was to help respondents understand the options available for datawarehouse platforms. From the responses to this survey it is clear that poor query response, lack of support for advanced analytics and inadequate load speed are critical challenges faced by the industry with existing datawarehouse platforms.
In [2], a survey on the usage of analytical platforms was conducted. This study surveyed 223 respondents about their satisfaction with traditional RDBMS as a datawarehouse platform and their plans for migration from RDBMS to analytical platforms. It found that 75% of respondents have migrated to analytical platforms or are in the process of migrating. The survey also asked respondents which issues led them to migrate away from RDBMS.

The following table shows the weighted average response of these two studies.

TABLE 1: Meta-Analysis of Datawarehouse platform surveys

| Survey 1 (2009): Problem | % of respondents | Survey 2 (2010): Problem | % of respondents | Weighted average response |
|---|---|---|---|---|
| Poor query response | 45.0% | Query performance / response times | 60.7% | 50.5% |
| Can't support advanced analytics | 40.0% | Need for complex analysis | 61.3% | 47.4% |
| Inadequate data load speed | 39.0% | Load times | 31.5% | 36.4% |
| Cost of scaling up is too expensive | 33.0% | Hardware growth / cost | 22.0% | 29.2% |
| Poorly suited to real-time or on demand workloads | 29.0% | Need for on demand capacity | 49.4% | 36.1% |
| Can't support large concurrent user count | 20.0% | Growth in number of concurrent users | 38.1% | 26.3% |
| Inadequate high availability | 19.0% | Availability and fault tolerance | 19.0% | 19.0% |
| Other | 4.0% | Other | 6.0% | 4.7% |

From the above analysis we conclude that the following are the main challenges with existing datawarehouse platforms.

2.1 Poor query performance / response

Poor query response is the most frequently cited challenge in Survey 1 and in the weighted average, and it is a close second in Survey 2 (60.7%, against 61.3% for complex analysis). SQL is the most common language used for analysis, so poor query response reflects slow execution of SQL. Over the last 10 years, various approaches such as 64 bit computing, increased memory, MPP systems and columnar databases have been implemented to address this challenge, yet poor query response remains the number one challenge for datawarehouses.

2.2 No support for advanced analytics

Lack of advanced analytics capabilities is cited as an important challenge for datawarehouse platforms. However, there is a debate over the exact definition of the term. From various studies it can be concluded that support for various forms of predictive algorithms, statistical analysis and geographic visualization can be grouped under advanced analytics. Traditional RDBMS based datawarehouse platforms generally do not support these advanced analytic functions using SQL.
The majority of organizations perform this kind of advanced analytics outside the RDBMS, using hand coded platforms or tools [2].

2.3 Slow data load speed

Traditional datawarehouses are batch oriented and are loaded weekly, daily or multiple times a day. However, the trend is to integrate the datawarehouse with transactional and operational applications such as fraud detection [1]. This is making traditional daily load oriented datawarehouse applications obsolete. The majority of large corporations have datawarehouse load processes that run throughout the night and consume 10-12 hours. With increased data volumes and complexity, data load times will keep increasing, and this challenge needs a solution.
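
As a check on Table 1, the weighted averages can be reproduced if each survey is weighted by its number of respondents (417 for [1], 223 for [2]). The paper does not state the weighting scheme, so this is an assumption; the sketch below only tests that assumption against the published figures. The names `weighted_average` and `TABLE_1` are illustrative.

```python
# Assumption: each survey is weighted by its respondent count.
N_SURVEY_1 = 417  # respondents in survey [1]
N_SURVEY_2 = 223  # respondents in survey [2]

# (problem, survey 1 %, survey 2 %, published weighted average %)
TABLE_1 = [
    ("Query response", 45.0, 60.7, 50.5),
    ("Advanced analytics", 40.0, 61.3, 47.4),
    ("Data load speed", 39.0, 31.5, 36.4),
    ("Cost of scaling up", 33.0, 22.0, 29.2),
    ("On demand workloads", 29.0, 49.4, 36.1),
    ("Concurrent users", 20.0, 38.1, 26.3),
    ("High availability", 19.0, 19.0, 19.0),
    ("Other", 4.0, 6.0, 4.7),
]


def weighted_average(p1: float, p2: float,
                     n1: int = N_SURVEY_1, n2: int = N_SURVEY_2) -> float:
    """Average two percentages, weighting each by its sample size."""
    return (p1 * n1 + p2 * n2) / (n1 + n2)


# Compare the recomputed value with the published one for every row.
mismatches = []
for problem, p1, p2, published in TABLE_1:
    computed = round(weighted_average(p1, p2), 1)
    if abs(computed - published) > 1e-9:
        mismatches.append((problem, computed, published))
    print(f"{problem:22s} computed={computed:5.1f} published={published:5.1f}")
```

Under this assumption, every row agrees with the published column to one decimal place, so `mismatches` stays empty.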

2.4 High hardware cost

Growth in data volume, data complexity and number of users requires adding hardware (disk space or processors) to the datawarehouse infrastructure. Additional hardware also requires additional cooling, space, power and management effort [2]. As per survey [1], 57% of respondents said that, due to the recession, organizations have reduced the budget for datawarehousing. Thus a lower cost of hardware and support per additional unit of volume, number of users and complexity of analysis is an important requirement that datawarehouse platforms have to satisfy.

2.5 No support for on demand workload

With increased dependency on data driven decisions, the need for ad hoc, one time, on demand analysis is increasing. Such on demand analysis often requires analyzing very large volumes of data, combining datawarehouse data with external data and using archived data. All of this requires the datawarehouse to scale up and make more memory and capacity available to a specific analysis. With traditional RDBMS based datawarehouse platforms, scaling up on demand requires adding costly hardware and memory, with long lead times to obtain the hardware. The lack of ability to scale up on demand with minimal cost and ramp up time is a major challenge to existing datawarehouse platforms.

III. HADOOP AS AN ALTERNATIVE DATAWAREHOUSE PLATFORM

In this section we discuss the fundamentals of Hadoop and early usage of Hadoop as a datawarehouse platform.

What is Hadoop?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets [14]. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. HDFS applications need a write-once-read-many access model for files.
A file once created, written, and closed need not be changed. A computation requested by an application is much more efficient if it is executed near the data it operates on. HDFS provides interfaces for applications to move themselves closer to where the data is located [14]. MapReduce is a programming model on top of HDFS for processing and generating large data sets [13].

MapReduce for Datawarehouse

MapReduce was developed as an abstraction of the map and reduce primitives present in many functional languages [10]. In [8], a datawarehouse framework developed at Turn Inc., called Cheetah, is discussed. This framework uses MapReduce for processing the data and can therefore benefit from the massively parallel execution and scalability of MapReduce. The ability of MapReduce to process non-relational and unstructured data can also add value to the datawarehouse. Thus problems such as text tokenization, indexing and search, data mining and machine learning, and handling of a high number of hops can be handled easily in MapReduce. In [8], a specialized data model for a Hadoop based datawarehouse is proposed. As per this design, a virtual schema in which fact and dimension tables are pre-joined is created. This schema hides table joins from users and simplifies the query language. Since the implementation of the JOIN operator is complex in MapReduce, this data model also simplifies the MapReduce program. The paper demonstrated that the Cheetah framework can process data at 1 GB/sec.

Hive

One of the first usages of Hadoop for a datawarehouse was reported by the Facebook team. Hive is an open source datawarehousing solution built on top of Hadoop. Hive was built in 2007 by the Facebook team and was open sourced in 2008 [5]. MapReduce is programming intensive and not very suitable for end users who want to query and analyze data. The purpose of Hive was to bring the concepts of tables, columns and SQL to Hadoop while retaining the flexibility of Hadoop [5]. In 2010, the Facebook Hive implementation had tens of thousands of tables, contained 700 TB of data and supported 200 users [5]. The Hive query language (HiveQL) is a subset of SQL. This makes it easy for SQL oriented users to analyze data in Hadoop. Advanced analysis in the form of MapReduce programs can be plugged into HiveQL. This enables complex analysis such as text mining and pattern matching to be done in MapReduce, with the results explored through SQL in Hive. Hive compiles SQL into MapReduce batch jobs. Thus Hive was designed for complex, batch oriented jobs and not for low latency queries [13]. Hive works well for complex datawarehouse analysis scenarios but does not perform well for dashboards and real time analysis.

Other examples of Hadoop based DW implementation

In the previous sections we discussed usage of Hadoop for data storage in datawarehouse applications where data volumes are very large or data is unstructured. A few other studies have reported usage of Hadoop for other specific use cases. Sensors have been widely applied in various fields. Time series data generated by sensors places large demands on data storage and analysis.
Most of the time series datawarehouse solutions currently available use RDBMS systems. Usage of HBase, a NoSQL database on Hadoop, for a time series datawarehouse is reported in [9]. This experimental study conducted a stress test on 400 million time series records and concluded that HBase has good read performance for time series data.

Usage of Hadoop for a high performance datawarehouse and OLAP has been demonstrated in [6]. In this experiment, a Hadoop cluster consisting of 18 nodes with 36 cores was constructed. The experiment used MapReduce for cube construction. It also provided an XMLA API so that standard BI tools can access cube data.

All studies described so far have used Hadoop for very large scale datawarehouses. However, midsized organizations may also benefit from Hadoop by reducing data storage costs for the datawarehouse. In [13], an evaluation of Hadoop for small and medium sized datawarehouse systems has been performed. This study compared MySQL, Hadoop + MapReduce and Hive for data sizes ranging from 200 MB to 10 GB. It found that up to 1 GB of data, MySQL performs better than Hadoop and Hive. Between 1 and 2 GB, Hadoop + MapReduce outperforms Hive and MySQL. Beyond 2 GB, Hive outperforms Hadoop + MapReduce and MySQL. The work done in this study needs to be extended further. Firstly, the study did not run extensive analytic queries to confirm the conclusion. Secondly, low latency queries were not tested. Despite these issues, this study opens new opportunities for small and medium industries to leverage Hadoop for a low cost datawarehouse.
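
To make the programming model discussed in this section concrete, the following is a minimal single-process sketch of a map step and a reduce step in Python. It is not Hadoop code and does not use any Hadoop API; the sample rows, the field names and the helper `run_mapreduce` are hypothetical. The rows are assumed to be already denormalized, in the spirit of the pre-joined virtual schema of [8], so the job needs no JOIN.

```python
from collections import defaultdict

# Hypothetical pre-joined fact rows: the dimension attribute "region"
# is already stored with each fact, so no join is required.
ROWS = [
    {"region": "EMEA", "product": "A", "revenue": 120.0},
    {"region": "APAC", "product": "A", "revenue": 80.0},
    {"region": "EMEA", "product": "B", "revenue": 45.5},
    {"region": "APAC", "product": "B", "revenue": 20.0},
]


def map_revenue(row):
    """Map step: emit one (key, value) pair per input record."""
    yield row["region"], row["revenue"]


def reduce_sum(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


totals = run_mapreduce(ROWS, map_revenue, reduce_sum)
print(totals)  # {'APAC': 100.0, 'EMEA': 165.5}
```

On a real cluster the map and reduce functions would run in parallel on many nodes, close to the HDFS blocks they read. A SQL layer such as Hive lets the same aggregation be written as a grouped query, which is then compiled into batch jobs of this general shape.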

IV. DIFFERENT WAYS IN WHICH HADOOP CAN BE USED FOR DATAWAREHOUSE

In the previous section we discussed how Hadoop can be used as a data storage platform for a datawarehouse. There are other ways of using Hadoop in a datawarehouse implementation.

A lot of data processing happens in the data staging area to prepare the data to be loaded into the datawarehouse [7]. The size of the staging area is often many times that of the datawarehouse. Usage of Hadoop for data staging has been reported in the literature, taking advantage of its low cost, linear scalability, facility with file-based data and ability to manage unstructured data [7]. With this approach, datawarehouse servers are used only for loading cleaned data and for end user access to data.

An enormous amount of data is gathered over time in the datawarehouse system. Not all of this data may be required for reporting and analytics. However, legal requirements often oblige organizations to retain data beyond its useful life. In many cases source data also has to be retained as is, along with the cleaned datawarehouse data. Retaining older datawarehouse data and raw source data for long periods can be expensive [7] and can put a burden on the datawarehouse server. As with datawarehouse staging, data archival can be migrated to Hadoop.

Migration of ETL processing to Hadoop can reduce both cost and processing times. Cheap storage, massive scalability, and the ability to handle complex logic and manage unstructured data are the reasons why Hadoop can be an ideal ETL platform. It is recommended to identify the top 20% of ETL workloads for migration to Hadoop to achieve maximum benefit [4]. The following ETLs qualify for migration to Hadoop [4]:

- Relatively high elapsed processing times
- Very complex scripts, such as change data capture, joins and cursors
- File processing and semi structured data
- ETLs causing high impact on resource utilization
- Unstable or error prone code

Thus, apart from serving as a datawarehouse storage platform, Hadoop can also be used for data archival, data staging and processing ETL programs.

V. LIMITATIONS OF HADOOP TO BE A DATAWAREHOUSE PLATFORM

Despite the advantages of Hadoop as a datawarehouse platform stated in the previous section, it has certain limitations. Most of these limitations are features that Hadoop lacks compared to a mature RDBMS. In this section we discuss the limitations of Hadoop as a datawarehouse platform.

1. Low latency data access and queries
MapReduce is a batch oriented programming paradigm. Hence MapReduce is not well suited for real time and fast queries. However, newer query engines like Impala and Apache Drill provide faster query processing over data stored in Hadoop [16]. Future usage of Hadoop as a datawarehouse platform will depend to a great extent on the maturity of these query engines and on how fast they acquire RDBMS functionality.

2. Inserts and updates
Hadoop does not support ACID compliant insert and update queries. Even Hive does not currently support row-level insert or update queries [5]. This makes it difficult to use Hadoop for dimension tables in a datawarehouse that require updates, such as slowly changing dimensions.

3. Granular security
Row level or field level security of the kind offered by an RDBMS is absent in Hadoop [7]. Only basic checks such as file permission checks are present in Hadoop [7].

4. SQL based analytics
In any mature datawarehouse solution, end users write complex SQL for data analysis. Hadoop based databases like Hive and Impala have limited support for ANSI standard SQL [7]. Hive does not support correlated sub queries, which are commonly used in traditional warehouse queries. However, Hadoop based databases such as IBM BigSQL and Greenplum HAWQ aim to support ANSI SQL. The ability of Hadoop based query engines to match ANSI SQL capabilities will speed up the adoption of Hadoop as a datawarehouse platform.

VI. DIRECTIONS FOR FUTURE RESEARCH

As discussed in the previous section, certain areas need further research to make Hadoop a viable alternative platform for datawarehouse implementations. In this direction, the following research is necessary:

1. Adoption of Hadoop for datawarehouse implementations will depend to a great extent on the maturity of SQL engines on Hadoop and their compliance with ANSI SQL. Further research is required to identify which ANSI SQL features are lacking in Hive (or Impala) and would be needed to make these SQL engines mature datawarehouse platforms.

2. The suitability of Hadoop for managing large datasets is well established. [13] has discussed the suitability of Hadoop for mid and small sized datasets. This aspect needs to be explored further, and detailed guidelines need to be developed to analyze the suitability of Hadoop for small datasets. This will make Hadoop a more feasible alternative to traditional DW platforms for smaller organizations.

3. Application of traditional dimensional or ER modeling techniques to datawarehouses on Hadoop needs to be studied. If these approaches are found unsuitable, an alternative methodology needs to be developed for modeling a datawarehouse on Hadoop.

4. Various studies have proposed usage of Hadoop for processing ETL transformations such as lookups and joins. However, detailed benchmarks showing when these transformations benefit from a Hadoop based implementation need to be developed.

5. A comprehensive set of guidelines / framework needs to be developed for evaluating whether a datawarehouse will benefit from Hadoop.

VII. CONCLUSION

In this paper we analyzed the shortcomings of traditional datawarehouse platforms. We analyzed two surveys and conducted a meta-analysis to report problems with current datawarehouse platforms. We also reported a survey of various experiments on Hadoop based datawarehouses.

The main objective of this paper is to analyze the possibility of using Hadoop as a datawarehouse platform and to identify areas of research that will make Hadoop a strong alternative to traditional datawarehouse platforms. As reported in section 6, further research is required to make SQL engines on Hadoop more mature. There is also a need for a comprehensive framework to identify datawarehouses that will benefit from implementation on the Hadoop platform.

VIII. REFERENCES

[1] Philip Russom, Next generation Datawarehouse platforms, The Datawarehousing Institute, 2009.
[2] Merv Adrian and Colin White, Analytic Platforms: Beyond the Traditional Data Warehouse, BeyeNETWORK Custom Research Report, 2010.
[3] Philip Russom, Analytic Databases for Big Data, The Datawarehousing Institute, 2012.
[4] Offload your Datawarehouse with Hadoop, Syncsort publication, 2014.
[5] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy, Hive - A Petabyte Scale Data Warehouse Using Hadoop, IEEE, 2010.
[6] Jinguo You, Jianqing Xi, Chuan Zhang, Gengqi Guo, HDW: A High Performance Large Scale Data Warehouse, Computer and Computational Sciences IMSCCS '08, 2008.
[7] Philip Russom, Evolving Data Warehouse Architectures, The Datawarehousing Institute, 2014.
[8] Songting Chen, Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce, 36th International Conference on Very Large Data Bases, 2010.
[9] Wen-Yuan Ku, Tien-Yin Chou, Lan-Kun Chung, The Cloud-Based Sensor Data Warehouse, International Symposium on Grids & Clouds, 2011.
[10] T.K. Das and Arati Mohapatro, A Study on Big Data Integration with Data Warehouse, International Journal of Computer Trends and Technology (IJCTT), Volume 9, Number 4, Mar 2014.
[11] Charles Loboz, Slawek Smyl, Suman Nath, DataGarage: Warehousing Massive Performance Data on Commodity Servers, 36th International Conference on Very Large Data Bases, 2010.
[12] Sanjeev Khatiwada, Architectural Issues in Real-time Business Intelligence, 2012.
[13] Marissa Rae Hollingsworth, Hadoop and Hive as Scalable Alternatives to RDBMS: A Case Study, Boise State University, 2012.
[14] Dhruba Borthakur, The Hadoop Distributed File System: Architecture and Design, Apache foundation, 2007.
[15] Donald Feinberg, DBMS Infrastructure for the Modern Data Warehouse, Business Intelligence Summit, 2010.
[16] http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.
[17] Kuldeep Deshpande and Dr. Bhimappa Desai, A Critical Study of Requirement Gathering and Testing Techniques for Datawarehousing, International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 5, Issue 1, 2014, pp. 60-71, ISSN Print: 0976-6405, ISSN Online: 0976-6413.