
MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Hadoop as an Extension of the Enterprise Data Warehouse

MASTER THESIS

Bc. Aleš Hejmalíček

Brno, 2015

Declaration

I hereby declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during its elaboration are properly cited and listed in complete reference to the due source.

Bc. Aleš Hejmalíček

Advisor: doc. RNDr. Vlastislav Dohnal, Ph.D.

Acknowledgement

First of all, I would like to thank my supervisor Vlastislav Dohnal for his time, advice and feedback along the way. I would also like to thank my family for their continuous support throughout my university studies. I would like to apologize to my girlfriend for waking her late at night with loud typing. Last but not least, thanks go to my colleagues in the AVG BI team for respecting my time inflexibility, especially while I was finishing this thesis.

Abstract

The goal of this thesis is to describe the issues of processing big data and to propose and explain an enterprise data warehouse architecture capable of processing large volumes of structured and unstructured data. The thesis explains the integration of the Hadoop framework, as part of the proposed architecture, into existing enterprise data warehouses.

Keywords

Hadoop, Data warehouse, Kimball, Analytic platform, OLAP, Hive, ETL, Analytics

Contents

Introduction
1 Analytic platforms introduction
  1.1 New data sources
  1.2 Data warehouse
    1.2.1 Analytic platform
    1.2.2 Extract, transform and load
    1.2.3 Kimball's architecture
      1.2.3.1 Dimensional modelling
      1.2.3.2 Conformed dimensions
      1.2.3.3 Surrogate keys
      1.2.3.4 Fact tables
    1.2.4 Business intelligence
      1.2.4.1 Online analytical processing
      1.2.4.2 Analytics
      1.2.4.3 Reporting
  1.3 Existing technologies in DW
2 Proposed architecture
  2.1 Architecture overview
  2.2 Multi-Platform DW environment
  2.3 Hadoop
    2.3.1 HDFS
    2.3.2 YARN
    2.3.3 MapReduce
    2.3.4 Hive
      2.3.4.1 HiveQL
      2.3.4.2 Hive ETL
    2.3.5 Impala
    2.3.6 Pig
    2.3.7 Kylin
    2.3.8 Sqoop
3 Hadoop integration
  3.1 Star schema implementation
    3.1.1 Dimensions implementation
    3.1.2 Facts implementation
    3.1.3 Star schema performance optimization
  3.2 Security
  3.3 Data management
    3.3.1 Master data
    3.3.2 Metadata
    3.3.3 Orchestration
    3.3.4 Data profiling
    3.3.5 Data quality
    3.3.6 Data archiving and preservation
  3.4 Extract, transform and load
    3.4.1 Hand coding
    3.4.2 Commercial tools
  3.5 Business intelligence
    3.5.1 Online analytical processing
    3.5.2 Reporting
    3.5.3 Analytics
  3.6 Real time processing
  3.7 Physical implementation
    3.7.1 Hadoop SaaS
  3.8 Hadoop commercial distributions
4 Other Hadoop use cases
  4.1 Data lakes
  4.2 ETL offload platform
  4.3 Data archive platform
  4.4 Analytic sandboxes
5 Conclusion

Introduction

Companies want to leverage emerging new data sources to gain market advantage. However, traditional technologies are not sufficient for processing large volumes of data or streaming real-time data. Hence, in the last few years many companies have invested in the development of new technologies capable of processing such data. These data processing technologies are generally expensive; Hadoop, an open source framework for distributed computing, was developed as an affordable alternative. Lately, companies have adopted and integrated the Hadoop framework to improve their data processing capabilities, despite the fact that they already use some form of data warehouse. However, adopting Hadoop and other similar technologies brings new challenges for everyone involved in data processing, reporting and data analytics. The main issue is integration into an already running data warehouse environment: the Hadoop technology is relatively new, not many business use cases and successful implementations have been published, and therefore no established best practices or guidelines exist. This is the reason why I chose this topic for my master's thesis.

The goal of my thesis is to suggest a data warehouse architecture that allows processing of large amounts of data in a batch manner as well as streaming data, and to explain techniques and processes for integrating the new system into an existing enterprise data warehouse. This includes an explanation of data integration using Kimball's architecture best practices, data management processes and the connection to business intelligence systems such as reporting or online analytical processing. The proposed architecture also needs to be accessible the same way as existing data warehouses, so that data consumers can access the data in a familiar manner.

The first chapter introduces the problems of data processing and explains basic terms such as data warehousing, business intelligence and analytics. The main goal is to present new issues, challenges and currently used technologies for data processing. The second chapter presents the requirements on the new architecture and proposes an architecture that meets them; it then describes the individual technologies and their characteristics, advantages and disadvantages. The third chapter focuses on Hadoop integration into an existing enterprise data warehouse environment. It explains data integration in Hadoop following Kimball's best practices from general data warehousing, such as star schema implementation, and the individual parts of a data management plan and its processes. It further describes the implementation of the extract, transform and load process and the usage of business intelligence and reporting tools, and then focuses on the physical Hadoop implementation and Hadoop cluster location. The last chapter explains other specific business use cases for Hadoop in data processing and data warehousing, as it is a well-rounded technology that can be used for different purposes.

1 Analytic platforms introduction

In the past few years, the amount of data available to organizations of all kinds has increased exponentially. Businesses are under pressure to retrieve information that could be leveraged to improve business performance and to gain competitive advantage. However, processing and data integration are getting more complex due to data variety, volume and velocity. This is a challenge for most organizations, as internal changes are necessary in organization, data management and infrastructure.

Due to the adoption of new technologies, companies hire people with specific experience and skills in these technologies. As most of the technologies and tools are relatively young, the skill shortage gap is expected to grow. The ideal candidate should have a mix of analytic skills, statistics and coding experience, and such people are difficult to find. By 2018, the United States alone is going to face a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on an analysis of big data [1].

However, the data are not new for organizations. In most cases, data from mobile devices, log data or sensor data have been available for a long time. Much of this data was not previously processed and analyzed, even though the sources of massive amounts of data existed in the past. A major change in business perspective happened in the last several years with the development of distributed analytic platforms. Nowadays, the technologies are available to anyone at an affordable price, and many companies have started to look for business cases to build and use analytical platforms. Due to the demand, many vendors have created products to help businesses solve these issues. Companies such as Cloudera, IBM, Microsoft or Teradata offer solutions including software, hardware and enterprise support all together.

However, the initial price is too high for smaller companies, as the price of the product itself does not take into consideration the company's existing systems, additional data sources, processing or data integration. Data integration itself is a major expense in data processing projects. Most technological companies already integrate data in some way and build data warehouses to ensure simple and unified access to data [2]. However, these data warehouses are built to secure precise reporting and usually use technologies such as relational database management systems (RDBMS) that are not designed for processing petabytes of data.

Companies build data warehouses to achieve fast, simple and unified reporting. This is mostly achieved by aggregating the data. This approach improves processing speeds and reduces the needed storage space. On the other hand, aggregated data do not allow complex analytics due to a lack of detail. For analytic purposes, data should have the same detail as the raw source data.

When building an analytic platform, or building a solution complementary to an already existing data warehouse, a business must decide whether it prefers a commercial

third party product with enterprise support or whether it is able to build the solution in-house. Both approaches have advantages and disadvantages. The main advantages of an in-house solution are the price and modifiability. On the other hand, it can be difficult to find experts with enough experience to develop and deliver an end-to-end solution and to provide continuous maintenance. The major disadvantage of buying a complete analytical platform is the price, as it can hide additional costs in data integration, maintenance and support.

1.1 New data sources

As new devices connect to the network, various data become available. Such large amounts of data are hard to manage with common technologies and tools, and hard to process within a tolerable time. The new data sources include mobile phones, tablets, sensors and wearable devices, whose number has grown significantly in recent years. All these devices interact with their surroundings or with web sites such as social media, and every action can be recorded and logged. New data sources have the following characteristics:

Volume - As mentioned before, the amount of data is rapidly increasing every year. According to The Economist [3], the amount of digital information increases tenfold every five years.

Velocity - Interaction with social media or with mobile applications usually happens in real time, causing a continuous data flow. Processing real-time data can help a business make valuable decisions.

Variety - Data structure can be dynamic and can change significantly with any record. Such data include XML or nested JSON and JSON arrays. Unstructured or semi-structured data are harder to process with traditional technologies, although they can contain valuable information.

Veracity - With different data sources, it is getting more difficult to maintain data certainty, and this issue becomes more challenging with higher volume and velocity.
With increasing volume and complexity of data, integration and cleaning processes get more difficult, and companies are adopting new platforms and processes to deal with data processing and analytics [4, 5].

1.2 Data warehouse

A data warehouse is a system that processes data so that they are easily analysable and queryable. Bill Inmon defined a data warehouse in 1990 as follows: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." [6]

He defined the terms as follows:

Subject Oriented - Data that give information about a particular subject instead of about a company's ongoing operations.

Integrated - Data that are gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant - All data in the data warehouse are identified with a particular time period.

Non-volatile - Data in a data warehouse are stable. More data can be added, but historic data are never removed or modified. This enables management to gain a consistent picture of the business.

An enterprise data warehouse integrates and unifies all the business information of an organization and makes it accessible across the whole company without compromising security or data integrity. It allows complex reporting and analytics across different systems and business processes.

1.2.1 Analytic platform

An analytic platform is a set of tools and technologies that allow storing, processing and analysis of data. Most companies want to focus on processing higher amounts of data, therefore a distributed system is the base of an analytical platform. Often newer technologies are used, such as NoSQL and NewSQL databases, advanced statistical tools and machine learning. Many companies providing analytic platforms focus on offering as many integrated tools as possible in order to make the adoption of a new platform easy. These platforms are designed for processing large amounts of data in order to make advanced analytics possible. Among others, advanced analytics includes the following cases:

Search ranking
Ad tracking
User experience analysis
Big science data analysis
Location and proximity tracking
Causal factor discovery
Social CRM
Document similarity testing
Customer churn or wake-up analysis
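One of the use cases listed above, document similarity testing, can be sketched with a simple token-based Jaccard measure. This is only an illustrative sketch (the sample documents are hypothetical), not a technique prescribed by this thesis:

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity of the word sets of two documents (0.0 to 1.0)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    # Ratio of shared words to all distinct words in either document.
    return len(a & b) / len(a | b)

print(jaccard_similarity("big data needs new tools", "new tools for big data"))
```

In production, analytic platforms apply such measures at scale, e.g. over shingled documents with MapReduce, but the core idea stays the same.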

1.2.2 Extract, transform and load

Extract, transform and load (ETL) [7] is the process of moving data across systems and databases in order to make them easily analysable. ETL is mostly used in data warehousing, as data being loaded into a data warehouse are often transformed and cleansed to ensure the data quality needed for analysis. ETL describes three steps of moving data:

Extract - The process of extracting data from a source system, either directly from a database or through an API. Extraction can implement complex mechanisms to extract only the changes from a database; this process is called change data capture and is one of the most efficient mechanisms for data extraction. Extraction also often includes archiving the extracts for audit purposes.

Transform - Transformation can implement any algorithm or transformation logic. Data are transformed so that they satisfy the data model and data quality needs of the data warehouse. This can include data type conversion, data cleansing or even fuzzy lookups. Usually, data from several data sources are integrated together within this step.

Load - The process of loading transformed data into the data warehouse. It usually includes loading the transformed dimension and fact tables.

1.2.3 Kimball's architecture

Ralph Kimball designed the individual processes and tasks within a data warehouse in order to simplify its development and maintenance [8]. Kimball's architecture includes the processes used in the end-to-end delivery of a data warehouse. The whole data warehouse development process starts with gathering user requirements. The primary goal is to gather the metrics which need to be analyzed in order to identify data sources and design the data model. It then describes an incremental development method that focuses on continuous delivery, as data warehouse projects tend to be rather big and it is important to start delivering business value early in the development.
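The three ETL steps described in section 1.2.2 can be sketched as a minimal in-memory pipeline. The source table, column names and cleansing rules below are hypothetical and serve only to illustrate the extract-transform-load flow:

```python
import sqlite3

# Hypothetical source and warehouse databases (in-memory for illustration).
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "  Alice ", "120.50"), (2, "BOB", "80")])

warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount REAL)")

# Extract: read raw rows from the source system.
rows = source.execute("SELECT order_id, customer, amount FROM orders").fetchall()

# Transform: cleanse customer names and convert amounts to the warehouse data type.
clean = [(oid, cust.strip().title(), float(amount)) for oid, cust, amount in rows]

# Load: insert the transformed rows into the warehouse table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)
warehouse.commit()

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
# [(1, 'Alice', 120.5), (2, 'Bob', 80.0)]
```

A real ETL process would add change data capture on the extract side, dimension lookups in the transform step and audit archiving, as described above.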
Key features of Kimball's architecture are dimensional modelling and identifying fact tables, which are described further in this thesis. Regarding the ETL process, Kimball described the steps of extraction, data transformation, data cleaning and loading, first into a stage area serving as temporary storage and then into the final data warehouse dimension and fact tables. All other processes around the data warehouse, such as security, data management and data governance, are included as well. Altogether, Kimball's architecture is a quite simple and understandable framework for data warehouse development.

1.2.3.1 Dimensional modelling

Dimensional modelling [8] is a technique for designing data models that are simple and understandable. As most business users tend to describe the world in entities such as product, customer or date, it is reasonable to model the data the same way. It is an intuitive implementation of a data cube that has edges labelled, for example, product, customer and date. This implementation allows users to easily slice the data and break them down by different dimensions. Inside the data cube are the measured metrics. When the cube is sliced, metrics are shown depending on how many dimensions are sliced.

The implementation of a dimensional model is a star schema [8]. In a star schema, all dimensions are tied to fact tables [8]; it is therefore easily visible which dimensions can be used for slicing.

Figure 1.1: Star schema example.

Dimensions can also contain different hierarchies and additional attributes. As a dimension needs to track history, Kimball defines several types [8] of dimensions. The most common are:

Type one - Does not track historical changes. When a record is changed in the source system, it is updated in the dimension; therefore only one record is stored for each natural key.

Type two - Uses additional attributes, such as date effective from and date effective to, to track different versions of a record. When a record is changed in the source system, a new record is inserted into the dimension and the date effective to attribute of the old record is updated.

Type three - Used to track changes in defined attributes. If history tracking is needed, two separate attributes are created: one holds the current value and the other the previous value.

1.2.3.2 Conformed dimensions

One of the key features in data integration are conformed dimensions [8]. These are dimensions that describe one entity the same way across all integrated data

sources. The main reason for implementing conformed dimensions is that CRM, ERP or billing systems can have different attributes and different ways of describing business entities such as a customer. Dimension conforming is the process of taking all information about an entity and designing the transformation process so that data about the entity from all data sources are merged into one dimension. A dimension created using this process is called a conformed dimension. Using conformed dimensions significantly simplifies business data, as all people involved in the business use the same view of the customer and the same definition. This allows simple data reconciliation.

Figure 1.2: Dimension table example.

1.2.3.3 Surrogate keys

Surrogate keys [8] ensure the identification of individual entities in a dimension. Usually, surrogate keys are implemented as an incremental sequence of integers. Surrogate keys are used because duplication of natural keys is expected in dimension tables, as changes in time need to be tracked; the surrogate key therefore identifies a specific version of the record. In addition, more than one natural key would have to be used in a conformed dimension, as the data may come from several data sources. Surrogate keys are predictable and easily manageable, as they are assigned within the data warehouse.

1.2.3.4 Fact tables

Fact tables contain specific events or measures of a business process. A fact table typically has two types of columns: foreign keys of dimension

tables and numeric measures. In the ETL process, a lookup on the dimension tables is performed, and the values or keys describing an entity are replaced by surrogate keys from the particular dimensions. A fact table is defined by its granularity and should always contain only one level of granularity; mixing granularities in a fact table could cause issues when aggregating measures. An example of fact tables with specific granularities is a table of sales orders and another one of order items.

Figure 1.3: Fact table example.

While designing a fact table, it is important to identify the business processes that users want to analyze in order to specify the data sources needed. Then follows a definition of measures, such as sale amount or tax amount, and a definition of dimensions that make sense within the business process context.

1.2.4 Business intelligence

Business intelligence (BI) is a set of tools and techniques whose goal is to simplify querying and analysing data sets. Commonly, BI tools are used as the top layer of a data warehouse, accessible to a wide spectrum of users, as many BI techniques do not require advanced technological skills [9]. Business intelligence uses a variety of techniques; depending on the business requirements, a different tool or technique is chosen. Therefore, BI in companies usually comprises various tools from different providers. Examples of BI techniques:

Online analytical processing (OLAP)
Reporting

Dashboards
Predictive analytics
Data mining

1.2.4.1 Online analytical processing

Online analytical processing (OLAP) [10, 9] is an approach that achieves a faster response when querying multidimensional data, and it is therefore one of the key features of a company's decision system. The main advantage of OLAP is that it leverages the star or snowflake schema structure. Three types of OLAP exist. If the OLAP tool stores the data on the OLAP server in a special structure, such as a hash table (SSAS) or a multidimensional array, it is called multidimensional OLAP (MOLAP). MOLAP provides quick responses to operations such as slice, dice, roll-up or drill-down, as the tool is able to simply navigate through the precalculated aggregations down to the lowest level. Besides MOLAP, two other types of OLAP tools exist. The first is relational OLAP (ROLAP), which is based on querying data in a relational structure (e.g. in an RDBMS); ROLAP is not very common, as it does not achieve the query response times of MOLAP. The third type is hybrid OLAP (HOLAP), a combination of MOLAP and ROLAP. One popular implementation precalculates aggregations into MOLAP while keeping the underlying data stored in ROLAP; the underlying data are therefore not queried when only an aggregation is requested.

OLAP is often queried via a multidimensional query language such as MDX, either directly or through an analytic tool such as Excel¹ or Tableau². The user then works only with a pivot table and is able to query or filter aggregated and underlying data, depending on the OLAP definition. Some of the OLAP tools include:

SSAS by Microsoft³
Mondrian, developed by Pentaho⁴
SAS OLAP Server⁵
Oracle OLAP⁶

1. https://products.office.com/en-us/excel
2. http://www.tableausoftware.com/
3. http://www.microsoft.com/en-us/server-cloud/solutions/businessintelligence/analysis.aspx
4. http://community.pentaho.com/projects/mondrian
5. http://www.sas.com/resources/factsheet/sas-olap-server-factsheet.pdf
6. http://www.oracle.com/technetwork/database/options/olap/index.html
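The ROLAP operations described above, roll-up and slice over a star schema, boil down to SQL that aggregates a fact table joined to its dimensions. The following sketch uses a hypothetical product dimension and sales fact table; it illustrates the kind of query an OLAP tool generates behind a pivot table, not any specific product's behaviour:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical star schema: one dimension table and one fact table
# referencing it through a surrogate key.
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT, name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, sale_amount REAL);

INSERT INTO dim_product VALUES (1, 'Books', 'Novel'), (2, 'Books', 'Atlas'), (3, 'Games', 'Chess');
INSERT INTO fact_sales VALUES (1, 10.0), (2, 25.0), (3, 40.0), (1, 5.0);
""")

# Roll-up: aggregate the fact measures up the product hierarchy to category level.
rollup = con.execute("""
    SELECT d.category, SUM(f.sale_amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rollup)  # [('Books', 40.0), ('Games', 40.0)]

# Slice: fix one dimension member ('Books') and drill into the underlying detail.
slice_books = con.execute("""
    SELECT d.name, SUM(f.sale_amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    WHERE d.category = 'Books'
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(slice_books)  # [('Atlas', 25.0), ('Novel', 15.0)]
```

A MOLAP engine would answer the first query from a precalculated aggregate instead of scanning the fact table; the result is the same, only the storage and response time differ.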

1.2.4.2 Analytics

Analytics is the process of discovering patterns in data and providing insight. As a multidimensional discipline, analytics includes methodologies from mathematics, statistics and predictive modeling to retrieve valuable knowledge. Considering data requirements, analyses such as initial estimate calculations or data profiling often do not require transformed data; they can be performed on lightly cleaned or even raw data. The chosen subset of data usually has a bigger impact on the analysis result. Therefore, the data quality processes are usually less strict than for data that are integrated into an EDW for reporting purposes. In general, however, it is better to perform analyses on cleansed and transformed data.

1.2.4.3 Reporting

Reporting is one of the last parts of a process that starts with discovering useful data sets within the company and continues through their integration into the EDW with an ETL process. This process, together with the reports, has to be well designed, tested and audited, as reports are often used for reconciliation of manual business processes. Reports can also be used as official data for the stock market. Therefore, the cleansing and transformation process has to be well documented and precise. Due to the significant requirements on data quality, ETL processing gets significantly more complex with increasing volume of data, and a large amount of data needed for reporting can cause significant delivery issues.

1.3 Existing technologies in DW

There are many architectural approaches to building a data warehouse. For the last 20 years, many experts have been improving methodologies and processes that can be followed. These methodologies are well known and applicable with traditional business intelligence and data warehouse technologies. As the methodologies are common, companies developing RDBMSs, such as Oracle or Microsoft, have integrated functionality to make data warehouse development easier.
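The data profiling mentioned in section 1.2.4.2 can run directly on raw data. A minimal profiling pass (the records and column names below are hypothetical) computes per-column counts, null ratios and distinct values, which is enough for an initial estimate of data quality:

```python
from collections import Counter

# Hypothetical raw extract: a list of records with possible missing values.
rows = [
    {"country": "CZ", "amount": "10"},
    {"country": None, "amount": "25"},
    {"country": "CZ", "amount": None},
    {"country": "DE", "amount": "25"},
]

def profile(rows, column):
    """Return basic profile statistics for one column of a raw data set."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_ratio": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "top_value": Counter(non_null).most_common(1)[0][0],
    }

print(profile(rows, "country"))
# {'count': 4, 'null_ratio': 0.25, 'distinct': 2, 'top_value': 'CZ'}
```

Statistics like these help decide which cleansing rules the later ETL process needs, before any transformation is built.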
There are also many ETL frameworks and data warehouse IDEs (such as WhereScape) that provide a higher level of abstraction for data warehouse development [7, 9]. Thanks to more than 20 years of continuous development and community support, the technology has matured, so developers can focus on business needs rather than on the technology itself. Many companies develop or maintain data warehouses built to integrate data from various internal systems such as ERP, CRM or back-end systems. These data sources are usually built on RDBMSs, so the data are structured and well defined; an ETL process is nevertheless still necessary. The volumes are usually no bigger than a few gigabytes a day, hence processing the data transformations is feasible. After the data are transformed and loaded into the data warehouse, they are usually accessed via BI tools, which make data manipulation easier for data

consumers. Data from different sources are integrated either in the data warehouse, in data marts or in the BI layer, depending on the specific architecture. TDWI carried out relevant research on the architectural components used in data warehousing, with the following results.

Figure 1.4: Currently used components and plans for the next three years [2].

From the results, the following statements can be deduced. EDWs and data marts are commonly used and will remain in use in the future. OLAP and tabular data are among the key components of BI. The dimensional star schema is the preferred data modelling method. RDBMSs are commonly used, but they are expected to be used less in the next few years.

Usage of in-memory analytics, columnar databases and Hadoop is expected to grow. However, not all companies are planning to adopt these technologies, as they are used mostly for specific purposes. Usage of new technologies is therefore expected to grow significantly, but the existing basic principles of data warehousing will prevail. Hence, it is important that new technologies can be fully integrated into an existing data warehouse architecture, covering both its logical and physical layers. Regarding data modelling, most current data warehouses use some form of hybrid architecture [2] that originates either in Kimball's approach, often called bottom-up and based on dimensional modeling, or in Inmon's top-down approach, which prefers building the data warehouse in third normal form [6]. The modelling technique depends heavily on the development method used. Inmon's third normal form or Data Vault is preferable for agile development; Data Vault is a combination of both approaches. Kimball, on the other hand, is more suitable for iterative data warehouse development. While integrating new data sources, a conventional data warehouse faces several issues:

Scalability - Common platforms for building data warehouses are RDBMSs such as Microsoft SQL Server or Oracle Database. The databases themselves are not distributed, so adding new data sources can mean adding another server with a new database instance. In addition, new processes have to be designed to support such a distributed environment.

Price - An RDBMS can run on almost any hardware. However, to process gigabytes of data while simultaneously serving reporting and analytics, it is necessary to buy a powerful server, buy software licenses, or significantly optimize the DW processes - and usually all three are necessary. Unfortunately, licenses, servers and people are all expensive resources.
Storage space - Storage is one of the most expensive parts of a DW. For the best performance an RDBMS needs fast-access storage. In a data warehouse environment, where data are stored more than once in the RDBMS (e.g. in a persistent stage or an operational data store) and disks are set up in RAID for fault tolerance, the storage plan needs to be designed carefully to keep costs as low as possible. Data also need to be backed up regularly and archived.

In addition, historically not many sources were able to provide data in real time or close to real time, so the traditional batch processing approach was very convenient. Today, however, the business needs output with the lowest latency possible, especially operational intelligence, because the business needs to respond and act. The most suitable process for batch processing is ETL. With real-time streaming data, on the other hand, it is more convenient to use an extract, load and transform (ELT) process, so that data can be analyzed before long-running transformations start. Still, both ETL and ELT are implementable for real-time processing with the right set of tools.
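The ETL/ELT distinction above can be sketched in a few lines of Python. This is a toy illustration only: the function names and the trivial "transformation" (upper-casing a field) are made up for the sketch and do not correspond to any real tool.

```python
# Toy contrast between ETL and ELT ordering of the same steps.

def transform(record):
    # Stands in for a long-running transformation (cleansing, conforming).
    return {**record, "country": record["country"].upper()}

def etl(source, warehouse):
    # ETL: records are transformed *before* they reach the warehouse.
    for rec in source:
        warehouse.append(transform(rec))

def elt(source, lake, warehouse):
    # ELT: raw records are landed first, so they can be analyzed
    # immediately; the transformation runs afterwards on the target.
    lake.extend(source)                # raw data, queryable right away
    for rec in lake:
        warehouse.append(transform(rec))

source = [{"email": "a@b.c", "country": "cz"}]
dw1, lake, dw2 = [], [], []
etl(source, dw1)
elt(source, lake, dw2)
assert dw1 == dw2                      # same final result either way
assert lake[0]["country"] == "cz"      # but ELT keeps the raw data accessible
```

The point of the sketch is the ordering: both paths end with the same transformed data, but only ELT leaves an untransformed copy available for analysis before the transformations finish.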

2 Proposed architecture

In order to effectively tackle the issues and challenges of current data warehouses and of processing new data sources, the DW architecture has to be adjusted. The main requirements on the proposed architecture are:

- Ability to process large amounts of data in batch as well as in real time.
- Scalability up to petabytes of data stored in the system.
- Distributed computing and linear performance scalability.
- Ability to process structured and unstructured data.
- Linear storage scalability.
- Support for ETL tools and BI tools.
- Relatively low price.
- Data accessibility similar to RDBMSs.
- Support for a variety of analytical tools.
- Star schema support.

2. PROPOSED ARCHITECTURE

2.1 Architecture overview

The proposed architecture uses Hadoop [11] as a system complementary to the RDBMS, bringing new capabilities for tasks that are expensive on an RDBMS. Logically, Hadoop is integrated alongside the RDBMS, but it should be used mainly for processing large amounts of data and for streaming real-time data.

Figure 2.1: Diagram describing the proposed high-level architecture.

Adopting the Hadoop framework into an existing data warehouse environment brings several benefits:

Scalability - Hadoop scales linearly. By adding new nodes to the Hadoop cluster we can linearly scale both performance and storage space [12].

Price - As an open-source framework, Hadoop is free to use. This does not mean it is inexpensive: maintaining the cluster, developing applications on it, the hardware, and experienced people with knowledge of this specific technology are all costly. In general, however, Hadoop is significantly cheaper per terabyte of storage than an RDBMS, as it runs on commodity hardware.

Modifiability - Another advantage of an open-source project is modifiability. If some significant functionality is not available, it is possible to develop it in-house.

Community - Many companies, such as Yahoo, Facebook or Google, contribute significantly to the Hadoop source code, either by developing and improving new tools or by publishing their libraries.

The usage of individual Hadoop tools is described in the following diagram.

Figure 2.2: Diagram of Hadoop tools usage in the EDW architecture.

Hadoop                                   RDBMS
Open source                              Proprietary
Structured and unstructured data         Structured only, mostly
Less expensive                           Expensive
Better for massive full data scans       Uses index lookups
Support for unstructured data            Indirect support for unstructured data
No support for transaction processing    Support for transaction processing

Table 2.1: Hadoop and RDBMS comparison.

2.2 Multi-Platform DW environment

The existence of a core data warehouse is still crucial for reporting and dashboards. It is also mostly used as a data source for online analytical processing [2]. In order to solve the issues mentioned before, it is convenient to integrate a new data platform that

will support massive volumes and a wide variety of data. What makes this task difficult is the precise integration of the new platform into an existing DW architecture. As data warehousing focuses on data consumers, they should be able to access the new platform the same way as the old one and should feel confident using it. Using Kimball's approach to integrate data on both platforms gives users the same view of the data and also unifies other internal data warehouse processes.

2.3 Hadoop

Hadoop is an open-source software framework for distributed processing which allows vast amounts of structured and unstructured data to be stored and processed cheaply. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. High availability is handled at the application layer, so Hadoop does not rely on hardware to secure data and processing; it delivers its service on top of a cluster of computers, each of which may be prone to failures. The framework itself consists of tens of different tools for various purposes, and the number of available tools is growing fast. The three major components of Hadoop 2.x are the Hadoop distributed file system (HDFS), Yet Another Resource Negotiator (YARN) and the MapReduce programming model. The tools most relevant to data warehousing and analytics include:

- Hive
- Pig
- Sqoop
- Impala

Some other tools are not part of Hadoop but are well integrated with the Hadoop framework (such as Storm).

2.3.1 HDFS

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce and runs on commodity hardware. "HDFS has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets."
[13] An HDFS cluster consists of a NameNode, which manages metadata, and DataNodes, which store the data. Typically each file is split into large blocks of 64 or 128 megabytes that are distributed to the DataNodes. HDFS ensures high availability by replicating blocks and distributing the replicas to other nodes. When a block is lost due to a failure, the NameNode creates another replica of the block and automatically distributes it to a different DataNode.

2.3.2 YARN

YARN is a cluster management technology combining a central resource manager, which reconciles the way applications use Hadoop, with node manager agents that monitor processing on individual DataNodes. The main purpose of YARN is to allow parallel access to a Hadoop system and its resources; until Hadoop 2.x, processing parallel queries was not possible due to the lack of resource management. YARN thus opens Hadoop to wider usage.

2.3.3 MapReduce

MapReduce is a software framework that allows developers to write programs that process massive amounts of structured or unstructured data in parallel across a distributed cluster. MapReduce is divided into two major parts:

Map - The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can differ from each other.

Reduce - The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's sorting function. The number of Reducers does not depend on the number of Map functions.

The MapReduce framework works closely with Hadoop; the MapReduce programming paradigm itself, however, can be used with any programming language.

2.3.4 Hive

Hive [11] is data warehouse software for querying and managing large data sets stored in HDFS. Developers can specify the structure of tables the same way as in an RDBMS and then query the underlying data using the SQL-like language HiveQL [14]. Hive gives a developer the power to create tables over data in HDFS or over external data sources and to specify how these tables are stored. Hive metadata are stored in HCatalog, specifically in the Metastore database.
A major advantage of the metastore is that its database can reside outside the Hadoop cluster. A metastore database located this way can be used by other services and survives a cluster failure. One of the most distinguishing features of Hive is that it validates the schema on read, not on write as RDBMSs do. Due to this behaviour, it is not possible to define referential integrity using foreign keys, or even to enforce uniqueness.
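The map/shuffle/reduce flow described in section 2.3.3 can be simulated in plain Python. The sketch below runs in a single process (a real Hadoop cluster distributes each phase across machines); it is the classic word-count example, not any particular Hadoop API.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) key/value pair for every word in the input.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key before the reduce phase.
    # In Hadoop the framework does this, including pulling the pairs
    # from the machines where the map tasks ran.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: zero or more output pairs per key; here, one sum per word.
    return {word: sum(values) for word, values in groups.items()}

docs = ["big data", "big fast data"]
counts = reduce_phase(shuffle(map_phase(docs)))
assert counts == {"big": 2, "data": 2, "fast": 1}
```

Because map emits independent pairs and reduce only sees one key's values at a time, both phases parallelize naturally, which is exactly what lets Hadoop scale the same program across a cluster.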

Clients can connect to Hive using two different drivers, ODBC and JDBC. ODBC is a standard written in C and C++ and is supported by the majority of client tools. JDBC, on the other hand, is based on the Java programming language, so some technologies, especially those from companies not developing in Java, lack native support. However, many tools have separate JDBC drivers that can be installed; for example, Microsoft SQL Server has a downloadable JDBC driver supporting SQL Server 2005 and later, and the Oracle and MySQL databases support JDBC drivers natively. Driver performance depends heavily on the implementation of the driver used to connect to Hive [15, 16].

Figure 2.3: Hive connection diagram [11].

Tables in Hive can be internal or external. An internal table is completely managed by Hadoop, whereas an external table can be located elsewhere, with only its metadata stored in the Hive metastore. For querying data outside Hadoop, Hive uses Storage Handlers. Currently, support for a JDBC storage handler is not included in the official Hive release, but the code can be downloaded and compiled from an open-source project [17]. This feature gives Hive the ability to query data stored in databases through JDBC drivers; however, such external tables cannot be modified from Hive. Besides the JDBC driver, Hive supports external tables over HBase, Cassandra, BigTable and others.

Hive uses the HiveQL language to query the data [14]. Every query is first translated into Java MapReduce jobs and then executed. In general, Hive has not been built for quick iterative analysis, but mostly for long-running jobs; the translation and query distribution alone take around 20 seconds to finish.
This makes Hive disadvantageous in customer-facing BI tools such as reporting services, dashboards or OLAP over data stored in Hadoop, as every interaction, such as a refresh or a change of parameters, generates a new Hive query and forces users to wait at least 20 seconds for any results.

Another Hive feature used in data warehousing and analytics is support for different file types and compressed files, including text files, binary sequence files, the columnar storage format Parquet, and JSON objects. A related feature is compression: compressing files can save a significant amount of storage space, trading reduced read and write time for CPU time. In some cases compression can improve both disk usage and query performance [18]. Text files containing CSV, XML or JSON can be parsed using different SerDe (serialisation/deserialisation) functions. Natively, Hive offers several basic SerDe functions and a RegEx SerDe for regular expressions. Open-source libraries for parsing JSON exist, although they are not included in official Hive releases, and they often have issues with nested JSON or JSON arrays.

In data warehousing, data are mostly stored as time series: typically every hour, day or week, new data for the particular period are exported from a source system. To make appending data to Hive tables easy, Hive supports table partitioning. Each partition is defined by a meta column which is not part of the data files. After a new partition is added to an existing table, all queries automatically cover the new partition as well. Performance-wise, a specific partition can be named in the WHERE clause of a HiveQL statement to reduce the number of redundant reads, as Hive reads only the partitions that are needed. This is a useful feature in an ETL process, which usually processes only a small set of data. Typically, a table would be partitioned by a date or a datetime, depending on the period of the data exports. Hive also supports indices, similar to those in RDBMSs. An index is a sorted structure that increases read performance: with the right query conditions and index, it is possible to decrease the number of reads, as only a portion of the table or partition needs to be loaded and processed.
Hive supports an index rebuild function on a table or on individual partitions.

2.3.4.1 HiveQL

HiveQL is the Hive query language, developed by Facebook to simplify data querying in Hadoop. As most developers and analysts are used to SQL, developing a similar language for Hadoop was very reasonable. HiveQL gives users SQL-like access to data stored in Hadoop. It does not follow the full SQL standard, but the syntax is familiar from SQL. Among others, HiveQL supports the following features:

- Advanced SQL features such as window functions (e.g. RANK, LAG, LEAD).
- Querying JSON objects (e.g. using the get_json_object() function).
- User defined functions.
- Indexes.

On the other hand, HiveQL does not support DELETE and UPDATE statements.
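The partition-pruning behaviour described above can be mimicked with a tiny Python model: the dict below stands in for Hive's one-directory-per-partition layout in HDFS, and the partition filter plays the role of the WHERE clause on the partition column. All names here are illustrative.

```python
# Toy model of a partitioned table: one "folder" of rows per value of
# the partition meta column (here a date), as in Hive's layout.
fact_visit_log = {
    "2014-10-24": [("a@b.c", "cz")] * 3,
    "2014-10-25": [("d@e.f", "de")] * 2,
}

def query(table, partition_filter=None):
    """Return matching rows, reading only partitions that pass the filter."""
    scanned = 0
    rows = []
    for partition, data in table.items():
        if partition_filter and partition != partition_filter:
            continue          # pruned: this partition is never read at all
        scanned += 1
        rows.extend(data)
    return rows, scanned

# A filter on the partition column touches a single partition...
rows, scanned = query(fact_visit_log, "2014-10-25")
assert scanned == 1 and len(rows) == 2
# ...while an unfiltered query has to scan every partition.
_, scanned_all = query(fact_visit_log)
assert scanned_all == 2
```

This is why an ETL job that loads one day's export and filters on the partition date touches only that day's files, no matter how large the rest of the table grows.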

2.3.4.2 Hive ETL

This is an example of hand-coded ETL in Hive. The loading of the browser and country dimensions is not shown, as it is identical to the email dimension. Individual parts of the code are commented.

1   -- PREPARATION
2   -- add third-party JSON SerDe library
3   add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
4
5   -- stage for storing temporary data
6   CREATE TABLE stage_visit_log
7   (
8     id_dim_email int,
9     id_dim_country int,
10    id_dim_browser int,
11    logtime timestamp
12  );
13
14  -- final fact table
15  CREATE TABLE fact_visit_log
16  (
17    id_dim_email int COMMENT 'Surrogate key to Email dimension',
18    id_dim_country int COMMENT 'Surrogate key to Country dimension',
19    id_dim_browser int COMMENT 'Surrogate key to Browser dimension',
20    logtime timestamp COMMENT 'Time when user logged into our service'
21  )
22  COMMENT 'Fact table containing all logins to our service'
23  PARTITIONED BY (etl_timestamp timestamp)
24  -- individual partition for each ETL iteration
25  STORED AS SEQUENCEFILE
26  -- storing as a sequence file to decrease file size
27  ;
28
29  -- final email dimension table
30  CREATE TABLE dim_email
31  (
32    id_dim_email int COMMENT 'Surrogate key',
33    email string COMMENT 'Full email address'
34  )
35  COMMENT 'Email dimension table'
36  STORED AS SEQUENCEFILE
37  -- storing as a sequence file to decrease file size
38  ;
39
40  -- stage email union table
41  CREATE TABLE stage_email_all
42  (
43    id_dim_email int COMMENT 'Surrogate key',
44    email string COMMENT 'Full email address'
45  )
46  COMMENT 'Email dimension table'
47  STORED AS SEQUENCEFILE
48  -- storing as a sequence file to decrease file size
49  ;
50
51  -- DATA LOADING
52  -- EXAMPLE OF ONE ITERATION
53
54  -- create external table over an S3 folder
55  -- all files in the folder and underlying folders are parsed and queried
56  -- allows simple ETL job rerun
57  DROP TABLE IF EXISTS source_visit_log_20141025;
58  CREATE EXTERNAL TABLE source_visit_log_20141025
59  (
60    email string,
61    country string,
62    browser string,
63    logtime string
64  )
65  ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
66  WITH SERDEPROPERTIES ('paths' = 'email, country, browser, logtime')
67  LOCATION 's3://Incoming-data/visit-log/2014-10-25/'
68  ;
69
70  -- get maximum id from the email dimension
71  -- fill the email dimension
72  INSERT OVERWRITE TABLE stage_email
73  SELECT
74    ROW_NUMBER() OVER (ORDER BY email) + max_id as id_dim_email
75    , email
76  FROM
77  (
78    SELECT
79      DISTINCT log.email as email
80    FROM source_visit_log_20141025 log
81    LEFT JOIN dim_email em
82      ON log.email = em.email
83    WHERE em.email is null
84  ) dist_em
85  LEFT JOIN
86    (SELECT max(id_dim_email) as max_id FROM dim_email) max_id_tab
87  ;
88
89  -- union stage and dimension data
90  INSERT OVERWRITE TABLE stage_email_all
91  SELECT id_dim_email, email FROM dim_email
92  UNION ALL
93  SELECT id_dim_email, email FROM stage_email
94  ;
95
96  -- switch stage and dimension table
97  ALTER TABLE dim_email RENAME TO dim_email_old;
98  ALTER TABLE stage_email_all RENAME TO dim_email;
99  ALTER TABLE dim_email_old RENAME TO stage_email_all;
100
101 -- perform dimension lookups and load transformed data into the stage table
102 -- allows simple ETL job rerun
103 INSERT OVERWRITE TABLE stage_visit_log
104 SELECT
105   COALESCE(em.id_dim_email, -1) as id_dim_email
106   , COALESCE(cnt.id_dim_country, -1) as id_dim_country
107   , COALESCE(br.id_dim_browser, -1) as id_dim_browser
108   , CAST(logtime as timestamp) as logtime
109 FROM source_visit_log_20141025 log
110 LEFT JOIN dim_email em
111   ON COALESCE(log.email, 'unknown') = em.email
112 LEFT JOIN dim_country cnt
113   ON COALESCE(log.country, 'unknown') = cnt.country
114 LEFT JOIN dim_browser br
115   ON COALESCE(log.browser, 'unknown') = br.browser
116 ;
117
118 -- partition swap into the fact table
119 ALTER TABLE fact_visit_log DROP IF EXISTS PARTITION (etl_timestamp = '2014-10-25');
120
121 ALTER TABLE fact_visit_log EXCHANGE PARTITION (etl_timestamp = '2014-10-25') WITH TABLE stage_visit_log;
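The dimension load in rows 70-100 of the listing above boils down to: keep the existing surrogate keys and assign max_id + row_number to members not seen before, then union the old and new rows. A hypothetical Python rendering of that logic (the function and variable names are illustrative, not part of Hive):

```python
def load_dim_email(dim_email, incoming_emails):
    """Recreate the email dimension, preserving existing surrogate keys
    and giving each new member max_id + n -- the same idea as the
    ROW_NUMBER() + max_id pattern in the HiveQL listing."""
    max_id = max(dim_email.values(), default=0)   # SELECT max(id_dim_email)
    new_members = sorted(set(incoming_emails) - set(dim_email))  # anti-join, ORDER BY email
    staged = {email: max_id + n for n, email in enumerate(new_members, 1)}
    return {**dim_email, **staged}                # UNION ALL of old and staged rows

dim = {"a@b.c": 1, "d@e.f": 2}
dim = load_dim_email(dim, ["d@e.f", "x@y.z", "x@y.z"])
assert dim == {"a@b.c": 1, "d@e.f": 2, "x@y.z": 3}   # old keys unchanged
```

Note that, as in the HiveQL version, correctness depends on no second load running concurrently between reading max_id and writing the result, which is why the lack of transactions in Hive matters here.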

2.3.5 Impala

Impala is an open-source distributed processing framework developed by Cloudera. It provides low-latency SQL queries on data stored in Hadoop; the low latency is achieved with in-memory computing. Impala stores data in memory in the Parquet columnar format, which makes it more suitable for querying large fact tables with a smaller number of columns. Another feature distinguishing Impala from Hive is its query engine: Impala does not compile queries into MapReduce but uses its own processing model. Impala is closely tied to Hive because it uses the Hive metastore to store table metadata; it can run on a Hadoop cluster without Hive, but the Hive metastore has to be installed. Impala accepts queries sent via the impala shell, JDBC, ODBC or the Hue console.

2.3.6 Pig

Pig [19] is a simple scripting language for data processing and transformations, developed mainly by Yahoo. The power of Pig and Hive is similar; the main difference is the audience for which each tool was developed. Pig is more intuitive for developers with experience in procedural languages, while Hive leverages knowledge of SQL. Pig's advantages:

- Supports structured, semi-structured and unstructured data sets.
- Procedural, simple, easy to learn.
- Supports user defined functions.

One of its powerful features is the capability to work with a dynamic schema, meaning that Pig can parse unstructured data such as JSON without a predefined schema. This is helpful when working with dynamic nested JSON or JSON arrays.

DataJson = LOAD 's3://pig-test/test-data.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Figure 2.4: Example of a Pig script for loading a complex JSON structure.

The choice between Pig and Hive depends largely on the audience. As most data warehouses are built on RDBMSs and the people involved in their development know SQL, Hive is the preferable tool for data transformation and data querying.
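The "dynamic schema" idea behind Figure 2.4 - parsing nested JSON without declaring a table structure first - can be illustrated with Python's standard json module (a stand-in here, not related to Pig or elephant-bird; the record content is made up):

```python
import json

# A nested JSON record with an array, parsed without any predeclared
# schema: the shape of the data is discovered only when it is read.
raw = '{"email": "a@b.c", "visits": [{"browser": "firefox"}, {"browser": "chrome"}]}'
record = json.loads(raw)

# The nested array can then be flattened on the fly.
browsers = [visit["browser"] for visit in record["visits"]]
assert browsers == ["firefox", "chrome"]
```

This is exactly the situation that is awkward in Hive, whose SerDe needs the paths up front, and comfortable in Pig, which tolerates records whose structure varies from row to row.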
2.3.7 Kylin

A relatively new project that brings OLAP to Hadoop is Kylin [20]. Kylin originally started as an in-house project at eBay, but it was published as open source a few months ago.

Kylin works in the following manner. Aggregations are calculated from Hive using HiveQL and stored in HBase. When data are queried, the Kylin query engine checks whether the requested data are precalculated in HBase; if so, it returns them from HBase with sub-second latency, otherwise it routes the query to Hive. Kylin supports JDBC, ODBC and a REST API for client tools, so it is possible to connect from analytic tools such as Tableau, SAS or Excel.

Figure 2.5: Kylin high-level architecture [20].

2.3.8 Sqoop

Sqoop [11] is a Hadoop tool used for importing and exporting data. It supports different export modes, such as a full export, an incremental export, or limiting the size of an export using a WHERE clause. Data can be exported from HDFS, Hive or any RDBMS that supports JDBC; Sqoop works bi-directionally, so it likewise supports importing into HDFS, Hive or an RDBMS. Data can also be exported into a delimited text file with specific field and row terminators, or into a sequence file. As part of a distributed system, Sqoop also supports distributed and parallel processing. Sqoop includes additional functionality for Hive export and import in order to simplify data transfers with RDBMSs. The additional Hive support includes:

- Incremental imports.
- CREATE TABLE statements within imports.
- Data import into a specific table partition.
- Data compression.

For user interaction, Sqoop has a simple command line interface.
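The Kylin routing described in section 2.3.7 - answer from precalculated HBase cubes when possible, otherwise fall back to Hive - is essentially a cache in front of a slow store. A toy sketch, with dicts standing in for HBase cubes and a Python function standing in for Hive (all names and data are illustrative):

```python
from collections import Counter

# Detail rows as they would sit in Hive.
detail_rows = [
    {"country": "cz", "browser": "firefox"},
    {"country": "cz", "browser": "chrome"},
    {"country": "de", "browser": "firefox"},
]

# Only the cube on "country" was precalculated and stored ("in HBase").
cubes = {"country": {"cz": 2, "de": 1}}

def query_hive(group_by):
    # Slow path: aggregate the detail data (a full MapReduce job in reality).
    return dict(Counter(row[group_by] for row in detail_rows))

def kylin_query(group_by):
    # Fast path first: serve from the cube when it exists,
    # otherwise route the query to Hive.
    return cubes.get(group_by) or query_hive(group_by)

assert kylin_query("country") == {"cz": 2, "de": 1}           # served from the cube
assert kylin_query("browser") == {"firefox": 2, "chrome": 1}  # routed to Hive
```

The consequence for BI tools is that only queries whose aggregations were chosen for precalculation get the sub-second latency; anything else pays Hive's full query cost.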

3 Hadoop integration

The following chapters explain how to integrate Hadoop into an enterprise data warehouse, including the implementation of the star schema, data management and the physical implementation.

3.1 Star schema implementation

A logical star schema model captures the business point of view and is highly independent of the physical platform implementation. Hive provides almost the same SQL-like interface as common RDBMSs, so a data model built for an RDBMS data warehouse can be implemented in Hive with some adjustments. This thesis therefore focuses on the physical rather than the logical implementation. The main advantages of implementing a star schema in Hive:

- A simple and understandable view of the data.
- Easy support for master data management.
- Dimensions support user-defined hierarchies.
- Performance improvement [18].
- Conformed dimensions can be shared across different platforms.

3.1.1 Dimensions implementation

One of the most challenging parts of implementing dimension tables is that Hive does not support the DML functions UPDATE and DELETE at the row level. An append operation is available, but it creates a new file with every append, which can cause significant NameNode performance issues: the number of small files can grow rapidly and exhaust its memory. Also, without UPDATE and DELETE support, only type 1 slowly changing dimensions can be implemented. Because a dimension needs to keep the same surrogate keys and an update is not available, the table has to be recreated with the same keys on every ETL run. Although auto-increment functionality has not been developed yet, other approaches exist. When inserting new rows into a table, we need to merge the existing dimension data with the new data, keeping the old keys and generating new keys for the new records.
One way to generate a sequence is to get the maximal key value from the existing data, use the function UDFRowSequence to generate a sequence starting at one, and then add the maximum key value to all generated keys. The same result can be achieved using the ROW_NUMBER() window function. Due to the lack of transactions in Hive, it is recommended to first stage data from all data sources and load them into the final dimension table at once,

3. HADOOP INTEGRATION or use different techniques to disable possibility of concurrent runs of filling a dimensional table. Example of filling dimension table is at rows 70-100 in chapter 2.3.4.2. However, as the Hive does not support primary keys, it is better to develop a job that validates uniqueness of surrogate keys in dimension tables. As a duplication of surrogate keys can lead to a record duplication in a fact tables. If a dimension is conformed or at least shared across different data sources on different platforms, more suitable solution is to generate keys on platform such as RDBM that have auto-increment functionality, supports transactions and ensures uniqueness. Basically RDBM keeps the original dimension table and a Hive has only copy for read. One of the possibilities how to load a dimension using additional RDBM is to take records that should be inserted into dimension and export them into staging table on RDBM. For export from a Hive Sqoop tool can be used. Figure 3.1: Diagram describing using RDBM to assign surrogate keys. sqoop import --connect jdbc:sqlserver://edw.server.com;database=dw; username=sqoop;password=sqoop --table dim_email --hive-import --schema dbo --incremental append --check-column id_dim_email --last-value 500 --hive-import Figure 3.2: Sqoop imports all records from dim_email with id_d_email greater than 500. Loading into dimension in RDBM runs in a transaction, therefore concurrent inserts are excluded. When dimension is loaded it can be exported back to Hive. Depending on size of the table, incremental of full export can be chosen. This export needs to be performed after each insertion into dimension as Hive has only copy and it needs to be synchronized with the original. 28