MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Hadoop as an Extension of the Enterprise Data Warehouse

MASTER THESIS

Bc. Aleš Hejmalíček

Brno, 2015

Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Aleš Hejmalíček

Advisor: doc. RNDr. Vlastislav Dohnal, Ph.D.

Acknowledgement

First of all, I would like to thank my supervisor Vlastislav Dohnal for his time, advice and feedback along the way. I would also like to thank my family for their continuous support throughout my university studies, and to apologize to my girlfriend for waking her late at night with loud typing. Last but not least, I thank my colleagues in the AVG BI team for respecting my time inflexibility, especially while I was finishing this thesis.

Abstract

The goal of this thesis is to describe the issues involved in processing big data and to propose and explain an enterprise data warehouse architecture that is capable of processing large volumes of structured and unstructured data. The thesis also explains how the Hadoop framework is integrated, as a part of the proposed architecture, into existing enterprise data warehouses.

Keywords

Hadoop, Data warehouse, Kimball, Analytic platform, OLAP, Hive, ETL, Analytics

Contents

Introduction
Analytic platforms introduction
New data sources
Data warehouse
Analytic platform
Extract, transform and load
Kimball's architecture
Dimensional modelling
Conformed dimensions
Surrogate keys
Fact tables
Business intelligence
Online analytical processing
Analytics
Reporting
Existing technologies in DW
Proposed architecture
Architecture overview
Multi-Platform DW environment
Hadoop
HDFS
YARN
MapReduce
Hive
HiveQL
Hive ETL
Impala
Pig
Kylin
Sqoop
Hadoop integration
Star schema implementation
Dimensions implementation
Facts implementation
Star schema performance optimization
Security
Data management
Master data
Metadata
Orchestration
Data profiling
Data quality
Data archiving and preservation
Extract, transform and load
Hand coding
Commercial tools
Business intelligence
Online analytical processing
Reporting
Analytics
Real time processing
Physical implementation
Hadoop SaaS
Hadoop commercial distributions
Other Hadoop use cases
Data lakes
ETL offload platform
Data archive platform
Analytic sandboxes
Conclusion

Introduction

Companies want to leverage emerging new data sources to gain a market advantage. However, traditional technologies are not sufficient for processing large volumes of data or streaming real-time data. Hence, in the last few years many companies have invested in the development of new technologies capable of processing such data. These data processing technologies are generally expensive. Therefore Hadoop, an open source framework for distributed computing, was developed. Lately, companies have adopted and integrated the Hadoop framework to improve their data processing capabilities, despite the fact that they already use some form of data warehouse.

However, adopting Hadoop and other similar technologies brings new challenges for everyone involved in data processing, reporting and data analytics. The main issue is integration into an already running data warehouse environment, as the Hadoop technology is relatively new, not many business use cases and successful implementations have been published, and therefore no established best practices or guidelines exist. This is the reason why I chose this topic for my master's thesis.

The goal of my thesis is to suggest a data warehouse architecture that allows processing of large amounts of data in batch as well as processing of streaming data, and to explain techniques and processes for integrating the new system into an existing enterprise data warehouse. This includes an explanation of data integration using the best practices of Kimball's architecture, data management processes and the connection to business intelligence systems such as reporting or online analytical processing. The proposed architecture also needs to be accessible in the same way as existing data warehouses, so that data consumers can access the data in a familiar manner.

The first chapter introduces the problems of data processing and explains basic terms such as data warehousing, business intelligence and analytics. Its main goal is to present new issues, challenges and currently used technologies for data processing. The second chapter presents the requirements on the new architecture and proposes an architecture that meets them; it then describes the individual technologies and their characteristics, advantages and disadvantages. The third chapter focuses on Hadoop integration into an existing enterprise data warehouse environment. It explains data integration in Hadoop following Kimball's best practices from general data warehousing, such as star schema implementation, and the individual parts of the data management plan and processes. It also describes the implementation of the extract, transform and load process and the usage of business intelligence and reporting tools, and then focuses on the physical Hadoop implementation and the Hadoop cluster location. The last chapter explains other specific business use cases for Hadoop in data processing and data warehousing, as it is a well-rounded technology that can be used for different purposes.

1 Analytic platforms introduction

In the past few years, the amount of data available to organizations of all kinds has increased exponentially. Businesses are under pressure to retrieve information that could be leveraged to improve business performance and to gain a competitive advantage. However, processing and data integration are getting more complex due to data variety, volume and velocity. This is a challenge for most organizations, as internal changes are necessary in organization, data management and infrastructure.

Due to the adoption of new technologies, companies hire people with specific experience and skills in these technologies. As most of the technologies and tools are relatively young, the skill shortage gap is expected to grow. An ideal candidate should have a mix of analytic skills, statistics and coding experience, and such people are difficult to find. By 2018 the United States alone is going to face a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on an analysis of big data [1].

However, the data themselves are not new for organizations. In most cases data from mobile devices, log data or sensor data have been available for a long time. A lot of those data were not previously processed and analyzed, even though the sources of massive amounts of data existed. The major change in business perspective happened in the last several years with the development of distributed analytic platforms. Nowadays the technologies are available to anyone for an affordable price, and a lot of companies have started to look for business cases to build and use analytical platforms.

Due to the demand, many vendors have created products to help businesses solve these issues. Companies such as Cloudera, IBM, Microsoft or Teradata offer solutions including software, hardware and enterprise support all together. However, the initial price is too high for smaller companies, as the price for the product itself does not take into consideration a company's existing systems, additional data sources, processing or data integration. Data integration itself is a major expense in data processing projects.

Most technological companies already integrate data in some way and build data warehouses to ensure simple and unified access to data [2]. However, these data warehouses are built to secure precise reporting and usually use technologies such as relational database management systems (RDBMS) that are not designed for processing petabytes of data. Companies build data warehouses to achieve fast, simple and unified reporting, mostly by aggregating the data. This approach improves processing speed and reduces the needed storage space. On the other hand, aggregated data do not allow complex analytics due to the lack of detail. For analytic purposes data should have the same detail as the raw source data.

When building an analytic platform or a complementary solution to an already existing data warehouse, a business must decide whether it prefers a commercial

third-party product with enterprise support or whether it is able to build the solution in-house. Both approaches have advantages and disadvantages. The main advantages of an in-house solution are the price and modifiability. On the other hand, it can be difficult to find experts with enough experience to develop and deliver an end-to-end solution and to provide continuous maintenance. The major disadvantage of buying a complete analytical platform is the price, as it can hide additional costs in data integration, maintenance and support.

1.1 New data sources

As new devices are connecting to the network, various data become available. Such large amounts of data are hard to manage and to process within a tolerable time with common technologies and tools. The new data sources include mobile phones, tablets, sensors and wearable devices, whose numbers have grown significantly in recent years. All these devices interact with their surroundings or with web sites such as social media, and every action can be recorded and logged. New data sources have the following characteristics:

Volume - As mentioned before, the amount of data is rapidly increasing every year. According to The Economist [3], the amount of digital information increases tenfold every five years.

Velocity - Interaction with social media or with mobile applications usually happens in real time, causing a continuous data flow. Processing real-time data can help a business make valuable decisions.

Variety - Data structure can be dynamic and it can change significantly with any record. Such data include XML or nested JSON and JSON arrays. Unstructured or semi-structured data are harder to process with traditional technologies, although they can contain valuable information.

Veracity - With different data sources it is getting more difficult to maintain data certainty, and this issue becomes more challenging with higher volume and velocity.

With the increasing volume and complexity of data, integration and cleaning processes get more difficult, and companies are adopting new platforms and processes to deal with data processing and analytics [4, 5].

1.2 Data warehouse

A data warehouse is a system that allows data to be processed in order to be easily analysable and queryable. Bill Inmon defined a data warehouse in 1990 as follows: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." [6]

He defined the terms as follows:

Subject-oriented - Data that give information about a particular subject instead of about a company's ongoing operations.

Integrated - Data that are gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant - All data in the data warehouse are identified with a particular time period.

Non-volatile - Data are stable in a data warehouse. More data can be added, but historic data are never removed or modified. This enables management to gain a consistent picture of the business.

An enterprise data warehouse integrates and unifies all the business information of an organization and makes it accessible across the whole company without compromising security or data integrity. It allows complex reporting and analytics across different systems and business processes.

Analytic platform

An analytic platform is a set of tools and technologies that allows storing, processing and analysis of data. Most companies want to focus on processing a higher amount of data, therefore a distributed system is the base of an analytical platform. Often newer technologies are used, such as NoSQL and NewSQL databases, advanced statistical tools and machine learning. Many companies providing analytic platforms focus on bundling as many integrated tools as possible in order to make adoption of a new platform seemingly easy. These platforms are designed for processing a large amount of data in order to make advanced analytics possible. Among others, advanced analytics includes the following cases:

Search ranking
Ad tracking
User experience analysis
Big science data analysis
Location and proximity tracking
Causal factor discovery
Social CRM
Document similarity testing
Customer churn or wake-up analysis

Extract, transform and load

Extract, transform and load (ETL) [7] is a process of moving data across systems and databases in order to make them easily analysable. ETL is mostly used in data warehousing, as data being loaded into a data warehouse are often transformed and cleansed to ensure the data quality needed for analysis. ETL describes three steps of moving data:

Extract - The process of extracting data from a source system, either directly from a database or through some API. The extraction can implement complex mechanisms to extract only the changes from a database. This process is called change data capture and is one of the most efficient mechanisms for data extraction. Extraction also often includes archiving the extracts for audit purposes.

Transform - A transformation can implement any algorithm or transformation logic. Data are transformed so that they satisfy the data model and data quality needs of the data warehouse. This can include data type conversion, data cleansing or even fuzzy lookups. Usually data from several data sources are integrated together within this step.

Load - The process of loading transformed data into the data warehouse. It usually includes loading transformed dimension and fact tables.

Kimball's architecture

Ralph Kimball designed the individual processes and tasks within a data warehouse in order to simplify its development and maintenance [8]. Kimball's architecture includes the processes used in the end-to-end delivery of a data warehouse. The whole data warehouse development process starts with gathering user requirements. The primary goal is to gather the metrics which need to be analyzed in order to identify data sources and design the data model. It then describes an incremental development method that focuses on continuous delivery, as data warehouse projects are rather big and it is important to start delivering business value early in the development. Key features of Kimball's architecture are dimensional modelling and identifying fact tables, which are described further in this thesis. Regarding the ETL process, Kimball described the steps of extraction, data transformation, data cleaning and loading, both into a stage as temporary storage and into the final data warehouse dimension and fact tables. All other processes around the data warehouse, such as data warehouse security, data management and data governance, are included as well. All together, Kimball's architecture is a quite simple and understandable framework for data warehouse development.

Dimensional modelling

Dimensional modelling [8] is a technique for designing data models that are simple and understandable. As most business users tend to describe the world in

entities such as product, customer or date, it is reasonable to model the data the same way. It is an intuitive implementation of a data cube that has edges labeled, for example, product, customer and date. This implementation allows users to easily slice the data and break them down by different dimensions. Inside the data cube are measured metrics. When the cube is sliced, the metrics are shown, depending on how many dimensions are sliced. The implementation of a dimensional model is a star schema [8]. In a star schema, all dimensions are tied to fact tables [8], therefore it is easily visible which dimensions can be used for slicing.

Figure 1.1: Star schema example.

Dimensions can also contain different hierarchies and additional attributes. As a dimension needs to track history, Kimball defines several types of dimensions [8]. The most common are listed below; a small schema sketch follows the list.

Type one - It does not track historical changes. When a record is changed in the source system, it is updated in the dimension. Therefore only one record is stored for each natural key.

Type two - It uses additional attributes, such as date effective from and date effective to, to track different versions of a record. When a record is changed in the source system, a new record is inserted into the dimension and the date effective to attribute of the old record is updated.

Type three - It is used to track changes in defined attributes. If history tracking is needed, two separate attributes are created: one holds the current state and the second one the previous value.
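As a sketch only, the type two mechanism can be pictured with the following SQL table layout; the dim_customer table and its columns are hypothetical and serve purely to illustrate the effective-date attributes described above.

-- Hypothetical type two slowly changing dimension.
-- Every version of a customer is a separate row with its own surrogate key;
-- the effective-date pair records when that version was the current one.
CREATE TABLE dim_customer (
  id_dim_customer     INT       COMMENT 'Surrogate key, one per record version',
  customer_code       STRING    COMMENT 'Natural key from the source system',
  customer_name       STRING,
  customer_segment    STRING,
  date_effective_from TIMESTAMP COMMENT 'Version valid from',
  date_effective_to   TIMESTAMP COMMENT 'Version valid to; open-ended for the current version'
);

A change in, for example, customer_segment closes the current row by setting date_effective_to and inserts a new row with a new surrogate key.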

Conformed dimensions

One of the key features in data integration is conformed dimensions [8]. These are dimensions that describe one entity the same way across all integrated data sources. The main reason for implementing conformed dimensions is that CRM, ERP or billing systems can have different attributes and different ways of describing business entities such as a customer. Dimension conforming is the process of taking all information about an entity and designing the transformation process so that data about the entity from all data sources are merged into one dimension. A dimension created by this process is called a conformed dimension. Using conformed dimensions significantly simplifies business data, as all people involved in the business use the same view of the customer and the same definition. This allows simple data reconciliation.

Figure 1.2: Dimension table example.

Surrogate keys

Surrogate keys [8] ensure identification of individual entities in a dimension. Usually surrogate keys are implemented as an incremental sequence of integers. Surrogate keys are used because duplication of natural keys is expected in dimension tables, as changes over time need to be tracked; the surrogate key therefore identifies a specific version of the record. In addition, more than one natural key would have to be used in a conformed dimension, as the data may come from several data sources. Surrogate keys are predictable and easily manageable as they are assigned within the data warehouse.

Fact tables

Fact tables are tables containing specific events or measures of a business process. A fact table typically has two types of columns: foreign keys of dimension

tables and numeric measures. In the ETL process a lookup on the dimension tables is performed, and the values or keys describing an entity in a dimension are replaced by surrogate keys from the particular dimensions. A fact table is defined by its granularity and should always contain only one level of granularity. Having different granularities in a fact table could cause issues when aggregating measures. An example of fact tables with specific granularities is a table of sales orders and another one of order items.

Figure 1.3: Fact table example.

While designing a fact table it is important to identify the business processes that users want to analyze in order to specify the data sources needed. Then follows a definition of measures, such as sale amount or tax amount, and a definition of the dimensions that make sense within the business process context.

Business intelligence

Business intelligence (BI) is a set of tools and techniques whose goal is to simplify querying and analysing data sets. Commonly, BI tools are used as the top layer of a data warehouse which is accessible to a wide spectrum of users, as a lot of BI techniques do not require advanced technological skills [9]. Business intelligence uses a variety of techniques; depending on the business requirements a different tool or technique is chosen. Therefore BI in companies usually consists of various tools from different providers. Examples of BI techniques:

Online analytical processing (OLAP)
Reporting

Dashboards
Predictive analytics
Data mining

Online analytical processing

Online analytical processing (OLAP) [10, 9] is an approach that achieves faster responses when querying multidimensional data, therefore it is one of the key features of a company's decision system. The main advantage of OLAP is that it leverages the star or snowflake schema structure. Three types of OLAP exist. If an OLAP tool stores data in a special structure, such as a hash table (SSAS) or a multidimensional array on an OLAP server, it is called multidimensional OLAP (MOLAP). MOLAP provides quick responses to operations such as slice, dice, roll-up or drill-down, as the tool is able to simply navigate through precalculated aggregations down to the lowest level. Besides MOLAP, two other types of OLAP tools exist. The first is relational OLAP (ROLAP), which is based on querying data in a relational structure (e.g. in an RDBMS). ROLAP is not very common as it does not achieve the query response times of MOLAP. The third type is hybrid OLAP (HOLAP), which is a combination of MOLAP and ROLAP. One popular implementation is precalculating aggregations into MOLAP and keeping the underlying data stored in ROLAP; therefore, when only an aggregation is requested, the underlying data are not queried. OLAP is often queried via a multidimensional query language such as MDX, either directly or through an analytic tool such as Excel or Tableau. The user then only works with a pivot table and is able to query or filter aggregated and underlying data depending on the OLAP definition. Some of the OLAP tools include:

SSAS by Microsoft
Mondrian, developed by Pentaho
SAS OLAP Server
Oracle OLAP

Analytics

Analytics is the process of discovering patterns in data and providing insight. As a multidisciplinary field, analytics includes methodologies from mathematics, statistics and predictive modeling to retrieve valuable knowledge. Considering data requirements, analyses such as initial estimate calculations or data profiling often do not require transformed data; they can be performed on lightly cleaned or even raw data. The chosen subset of data usually has a bigger impact on the result of an analysis. Therefore the data quality processes are usually less strict than for data that are integrated into an EDW for reporting purposes. In general, however, it is better to perform analyses on cleansed and transformed data.

Reporting

Reporting is one of the last parts of a process that starts with discovering useful data sets within the company and continues through their integration with the ETL process into the EDW. This process, together with the reports, has to be well designed, tested and audited, as reports are often used for reconciliation of manual business processes. Reports can also be used as official data, for example for the stock market. Therefore the process of cleansing and transformation has to be well documented and precise. Due to the significant requirements on data quality, ETL processing gets significantly more complex with increasing volume of data. Having a large amount of data needed for reporting can cause significant delivery issues.

1.3 Existing technologies in DW

There are many architectural approaches to building a data warehouse. For the last 20 years many experts have been improving methodologies and processes that can be followed. These methodologies are well known and applicable with traditional business intelligence and data warehouse technologies. As the methodologies are common, companies developing RDBMSs, such as Oracle or Microsoft, have integrated functionality to make data warehouse development easier. There are also many ETL frameworks and data warehouse IDEs (such as WhereScape) that provide a higher level of abstraction for data warehouse development [7, 9]. Thanks to more than 20 years of continuous development and community support, the technology has been refined, so developers can focus on business needs rather than on the technology.

A lot of companies are developing or maintaining data warehouses built to integrate data from various internal systems such as ERP, CRM or back-end systems. These data sources are usually built on RDBMSs and therefore the data are structured and well defined. However, the ETL process is still necessary. These data are usually no bigger than a few gigabytes a day, hence processing the data transformations is feasible. After the data are transformed and loaded into the data warehouse, they are usually accessed via BI tools for easier data manipulation by data

consumers. Data from different sources are integrated either in the data warehouse, in data marts or in BI, depending on the specific architecture.

TDWI performed relevant research about the architectural components used in data warehousing, with the following results.

Figure 1.4: Currently used components and plans for the next three years [2].

The following statements can be deduced from the results:

EDWs and data marts are commonly used and will be used in the future as well.

OLAP and tabular data are among the key components of BI.

A dimensional star schema is the preferred method of data modelling.

RDBMSs are commonly used, but it is expected that they will be used less in the next few years.

The usage of in-memory analytics, columnar databases and Hadoop is expected to grow.

Even though not all companies are planning to adopt these technologies, as they are mostly used for specific purposes, it is expected that the usage of new technologies will grow significantly. Nonetheless, the existing basic principles of data warehousing will prevail. Hence, it is important that new technologies can be completely integrated into the existing data warehouse architecture. This includes both the logical and the physical architecture of the data warehouse.

Regarding data modelling, most data warehouses currently use some form of hybrid architecture [2] that has its origin either in Kimball's architectural approach, which is often called the bottom-up approach and is based on dimensional modelling, or in Inmon's top-down approach, which prefers building the data warehouse in third normal form [6]. The modelling technique highly depends on the development method used. Inmon's third normal form or Data Vaults are preferable for agile development; Data Vault is a combination of both approaches. Kimball, on the other hand, is more suitable for iterative data warehouse development.

While integrating new data sources, a conventional data warehouse faces several issues:

Scalability - Common platforms for building data warehouses are RDBMSs such as SQL Server from Microsoft or Oracle Database. The databases alone are not distributed, therefore adding new data sources can mean adding another server with a new database instance. In addition, new processes have to be designed to support such a distributed environment.

Price - An RDBMS can run on almost any hardware. However, for processing gigabytes of data while at the same time accessing the data for reporting and analytics, it is necessary to either buy a powerful server and software licenses or significantly optimize the DW processes. Usually all three steps are necessary. Unfortunately, licenses, servers and people are all expensive resources.

Storage space - One of the most expensive parts of the DW is storage. For the best performance an RDBMS needs fast-access storage. Considering a data warehouse environment, where data are stored more than once in the RDBMS (e.g. in a persistent stage or an operational data store) and disks are set up in RAID for fault tolerance, the storage plan needs to be designed precisely to keep costs as low as possible. Data also need to be backed up regularly and archived.

In addition, historically not many sources were able to provide data in real time or close to real time, therefore the traditional batch processing approach was very convenient. However, the business needs output with the lowest latency possible, especially information for operational intelligence, as the business needs to respond and act. The most suitable process for batch processing is ETL. Nonetheless, with real-time streaming data it is more convenient to use extract, load and

transform (ELT) processes in order to be able to analyze the data before long-running transformations start. However, both ETL and ELT are implementable for real-time processing with the right set of tools.
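As a minimal sketch of the ELT idea, written in HiveQL (the SQL dialect introduced in chapter 2) with hypothetical table names, the raw data are landed first and the long-running transformation is run afterwards.

-- Load: land the raw records untouched so that they are queryable immediately.
CREATE TABLE raw_events (payload STRING);
LOAD DATA INPATH '/data/incoming/events/' INTO TABLE raw_events;

-- Transform: the expensive cleansing and integration step runs later,
-- reading from the raw table instead of delaying the initial load.
CREATE TABLE clean_events AS
SELECT get_json_object(payload, '$.user')    AS user_id,
       get_json_object(payload, '$.country') AS country
FROM raw_events
WHERE payload IS NOT NULL;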

2 Proposed architecture

In order to effectively tackle the issues and challenges of current data warehouses and the processing of new data sources, the DW architecture has to be adjusted. The main requirements on the proposed architecture are:

Ability to process large amounts of data in batch as well as in real time.

Scalability up to petabytes of data stored in the system.

Distributed computing and linear performance scalability.

Ability to process structured and unstructured data.

Linear storage scalability.

Support for ETL tools and BI tools.

Relatively low price.

Data accessibility similar to RDBMSs.

Support for a variety of analytical tools.

Star schema support.

2.1 Architecture overview

The proposed architecture uses Hadoop [11] as a system additional to the RDBMS, bringing new features that are useful for specific tasks that are expensive on an RDBMS. Hadoop is logically integrated in the same way as the RDBMS; however, it should be used mainly for processing large amounts of data and streaming real-time data.

Figure 2.1: Diagram describing the proposed high-level architecture.

Adopting the Hadoop framework into an existing data warehouse environment brings several benefits:

Scalability - Hadoop supports linear scalability. By adding new nodes into the Hadoop cluster we can linearly scale performance and storage space [12].

Price - As an open source framework, Hadoop is free to use. This, however, does not mean that it is inexpensive, as maintaining and developing applications on a Hadoop cluster, the hardware and experienced people with knowledge of this specific technology are all expensive. Nevertheless, Hadoop is generally significantly cheaper per terabyte of storage than an RDBMS, as it runs on commodity hardware.

Modifiability - Another advantage of an open source project is modifiability. In case some significant functionality is not available, it is possible to develop it in-house.

Community - A lot of companies such as Yahoo, Facebook or Google are contributing significantly to the Hadoop source code, either by developing and improving new tools or by publishing their libraries.

The usage of individual Hadoop tools is described in the following diagram.

Figure 2.2: Diagram of Hadoop tools usage in the EDW architecture.

Hadoop                                | RDBMS
Open source                           | Proprietary
Structured and unstructured data      | Structured only, mostly
Less expensive                        | Expensive
Better for massive full data scans    | Usage of index lookups
Support for unstructured data         | Indirect support for unstructured data
No support for transaction processing | Support for transaction processing

Table 2.1: Hadoop and RDBMS comparison.

2.2 Multi-Platform DW environment

The existence of a core data warehouse is still crucial for reporting and dashboards. It is also mostly used as a data source for online analytical processing [2]. In order to solve the issues mentioned before, it is convenient to integrate a new data platform that

will support massive volumes and a variety of data. What makes this task difficult is the precise integration of the new platform into the existing DW architecture. As data warehousing focuses on data consumers, they should be able to access the new platform the same way and should feel confident using it. Using Kimball's approach for integrating data on both platforms gives users the same view of the data and also unifies other internal data warehouse processes.

2.3 Hadoop

Hadoop is an open-source software framework for distributed processing which allows vast amounts of structured and unstructured data to be stored and processed cheaply. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. High availability is handled at the application layer, therefore Hadoop does not rely on hardware to secure data and processing. Hadoop delivers its service on top of a cluster of computers, each of which may be prone to failures. The framework itself consists of tens of different tools for various purposes, and the number of tools available is growing fast. The three major components of Hadoop 2.x are the Hadoop distributed file system (HDFS), Yet Another Resource Negotiator (YARN) and the MapReduce programming model. The tools most relevant to data warehousing and analytics include:

Hive
Pig
Sqoop
Impala

Some other tools are not part of Hadoop but are well integrated with the Hadoop framework (such as Storm).

HDFS

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce and runs on commodity hardware. "HDFS has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets." [13]

An HDFS cluster consists of a NameNode, which manages metadata, and DataNodes, which store the data. Typically each file is split into large blocks of 64 or 128

megabytes and then distributed to the DataNodes. HDFS secures high availability by replicating blocks and distributing them to other nodes. When a block is lost due to a failure, the NameNode creates another replica of the block and automatically distributes it to a different DataNode.

YARN

YARN is a cluster management technology that combines a central resource manager, which reconciles the way applications use Hadoop, with node manager agents that monitor processing on individual DataNodes. The main purpose of YARN is to allow parallel access to and usage of a Hadoop system and its resources, as until Hadoop 2.x processing of parallel queries was not possible due to the lack of resource management. YARN opens Hadoop up for wider usage.

MapReduce

MapReduce is a software framework that allows developers to write programs that process massive amounts of structured or unstructured data in parallel across a distributed cluster. MapReduce is divided into two major parts:

Map - The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be different from each other.

Reduce - The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's sorting function. The number of Reducers does not depend on the number of Map functions.

The MapReduce framework works closely with Hadoop; however, the MapReduce programming paradigm can be used with any programming language.

Hive

Hive [11] is data warehouse software combining querying and management of large data sets stored in HDFS. Developers can specify the structure of tables the same way as in an RDBMS and then query the underlying data using the SQL-like language HiveQL [14]. Hive gives a developer the power to create tables over data in HDFS or over external data sources and to specify how these tables are stored. Hive metadata are stored in HCatalog, specifically in the Metastore database. A major advantage of the metastore is that its database can reside outside the Hadoop cluster; a metastore database located this way can be used by other services and it survives in case of a cluster failure. One of the most distinguishing features of Hive is validation of the schema on read and not on write as in RDBMSs. Due to this behaviour, it is not possible to define referential integrity using foreign keys or even to define uniqueness.
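As an illustration of schema-on-read, the following is a minimal sketch of a Hive table defined over files that already reside in HDFS; the path and column names are hypothetical, only the shape of the statement reflects standard HiveQL.

-- A hypothetical external table over delimited log files already stored in HDFS.
-- Hive only records the schema and the location in the metastore;
-- the schema is applied when the data are read, not when they are written.
CREATE EXTERNAL TABLE raw_visit_log (
  email   STRING,
  country STRING,
  browser STRING,
  logtime STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/incoming/visit_log/';

Dropping an external table removes only its metadata from the metastore; the underlying files remain in place.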

Clients can connect to Hive using two different drivers, ODBC and JDBC. ODBC is a standard written in C and C++ and is supported by the majority of client tools. JDBC, on the other hand, is based on the Java programming language, therefore some technologies, especially the ones from companies not developing in Java, lack native support. However, a lot of tools have separate JDBC drivers that can be installed. For example, Microsoft SQL Server has a downloadable JDBC driver that supports SQL Server 2005 and up. Oracle and MySQL databases have native support for JDBC drivers. Driver performance is highly dependent on the implementation of the driver used to connect to Hive [15, 16].

Figure 2.3: Hive connection diagram [11].

Tables in Hive can be internal or external. An internal table is completely managed by Hadoop. An external table, however, can be located elsewhere, and then only the metadata are stored in the Hive metastore. For querying data outside Hadoop, Hive uses storage handlers. Currently, support for a JDBC storage handler is not included in the official Hive release, but the code can be downloaded and compiled from an open-source project [17]. This feature gives Hive the ability to query data stored in databases through JDBC drivers. However, external tables cannot be modified from Hive. Besides the JDBC driver, Hive supports external tables for HBase, Cassandra, BigTable and others.

Hive uses the HiveQL language to query the data [14]. Every query is first translated into Java MapReduce jobs and then executed. In general, Hive has not been built for quick iterative analysis, but mostly for long-running jobs. The translation and query distribution itself takes around 20 seconds to finish. This puts Hive at a disadvantage in customer-facing BI tools, such as reporting services, dashboards or OLAP with data stored in Hadoop, as every interaction, such as a refresh or a change of parameters, generates a new query to Hive and forces users to wait at least 20 seconds for any results.

Another Hive feature used in data warehousing and analytics is support for different file types and compressed files. This includes text files, binary sequence files, the columnar storage format Parquet or JSON objects. Another feature related to file storage is compression. Compressing files can save a significant amount of storage space, effectively trading reduced read and write time for CPU time. In some cases compression can improve both disk usage and query performance [18]. Text files containing CSV, XML or JSON can be parsed using different SerDe functions, which are serialisation and deserialisation functions. Natively, Hive offers several basic SerDe functions and a RegEx SerDe for regular expressions. Open-source libraries for parsing JSON exist, although they are not included in official Hive releases, and those libraries often have issues with nested JSON or JSON arrays.

In data warehousing, data are mostly stored as a time series. Typically, every hour, day or week new data are exported from a source system for the particular period. In order to easily append data to Hive tables, Hive supports table partitioning. Each partition is defined by a meta column which is not part of the data files. By adding a new partition into an existing table, all queries automatically include the new partition. Performance-wise, a specific partition can be specified in the WHERE clause of a HiveQL statement in order to reduce the number of redundant reads, as Hive reads only the partitions that are needed. This is a useful feature in an ETL process, as an ETL iteration usually processes only a small set of data. Typically, a table would be partitioned by a date or a datetime, depending on the period of the data exports.

Hive also supports indices, which are similar to indices in RDBMSs. An index is a sorted structure that increases read performance. Using the right query conditions and index, it is possible to decrease the number of reads, as only a portion of the table or partition needs to be loaded and processed. Hive supports an index rebuild function on a table or on individual partitions.

HiveQL

HiveQL is the Hive query language that was developed by Facebook to simplify data querying in Hadoop. As most developers and analysts are used to the SQL language, developing a similar language for Hadoop was very reasonable. HiveQL gives users SQL-like access to data stored in Hadoop. It does not follow the full SQL standard, however the syntax is familiar to SQL users. Among others, HiveQL supports the following features (a short query combining some of them follows the list):

Advanced SQL features such as window functions (e.g. RANK, LAG, LEAD).

Querying JSON objects (e.g. using the get_json_object() function).

User defined functions.

Indexes.

On the other hand, HiveQL does not support DELETE and UPDATE statements.
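The following sketch combines several of the features listed above, assuming a hypothetical raw_events table partitioned by day and holding a JSON payload column; the table and column names are illustrative only.

-- The predicate on the partition column prunes the scan to a single day.
-- get_json_object() extracts a field from the JSON payload,
-- and RANK() is one of the supported window functions.
SELECT country,
       visits,
       RANK() OVER (ORDER BY visits DESC) AS country_rank
FROM (
  SELECT get_json_object(payload, '$.country') AS country,
         COUNT(*) AS visits
  FROM raw_events
  WHERE day = '2015-01-01'
  GROUP BY get_json_object(payload, '$.country')
) per_country;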

28 2. PROPOSED ARCHITECTURE Hive ETL This is an example of hand coded ETL in Hive. Browser and country dimensions are not included as they are identical to country. Individual parts of the code are commented. 1 PREPARATION 2 add t h i r d p a r t y JSON SerDe l i b r a r y 3 add j a r s3 :// elasticmapreduce/samples/hive ads/ l i b s /jsonserde. j a r ; 4 5 s t a g e f o r s t o r i n g temporary d a t a 6 CREATE TABLE s t a g e _ v i s i t _ l o g 7 ( 8 id_dim_ int, 9 id_dim_country int, 10 id_dim_browser int, 11 logtime timestamp 12 ) ; f i n a l f a c t t a b l e 15 CREATE TABLE f a c t _ v i s i t _ l o g 16 ( 17 id_dim_ i n t COMMENT Surrogate key to dimension, 18 id_dim_country i n t COMMENT Surrogate key to dimension, 19 id_dim_browser i n t COMMENT Surrogate key to dimension, 20 logtime timestamp COMMENT Time when user loged i n t o our s e r v i c e 21 ) 22 COMMENT Fact t a b l e containing a l l l o g i n s to our s e r v i c e 23 PARTITIONED BY( etl_timestamp timestamp ) 24 i n d i v i d u a l p a r t i t i o n f o r e a c h e t l i t e r a t i o n 25 STORED AS SEQUENCEFILE 26 s t o r i n g as a s e q u e n c e f i l e t o d e c r e a s e f i l e s i z e 27 ; f i n a l dimension e m a i l t a b l e 30 CREATE TABLE dim_ 31 ( 32 id_dim_ i n t COMMENT Surrogate key, 33 s t r i n g COMMENT F u l l address 34 ) 35 COMMENT dimension t a b l e 36 STORED AS SEQUENCEFILE 37 s t o r i n g as a s e q u e n c e f i l e t o d e c r e a s e f i l e s i z e 38 ; 21

29 2. PROPOSED ARCHITECTURE s t a g e e m a i l union t a b l e 41 CREATE TABLE s t a g e _ e m a i l _ a l l 42 ( 43 id_dim_ i n t COMMENT Surrogate key, 44 s t r i n g COMMENT F u l l address 45 ) 46 COMMENT dimension t a b l e 47 STORED AS SEQUENCEFILE 48 s t o r i n g as a s e q u e n c e f i l e t o d e c r e a s e f i l e s i z e 49 ; DATA LOADING 52 EXAMPLE OF ONE ITERATION c r e a t e e x t e r n a l t a b l e t o s3 f o l d e r 55 a l l f i l e s in t h e f o l d e r and u n d e r l y i n g f o l d e r s a r e p a r s e d and q u e r i e d 56 a l l o w s s i m p l e ETL j o b rerun 57 DROP TABLE IF EXISTS s o u r c e _ v i s i t _ l o g _ ; 58 CREATE EXTERNAL TABLE s o u r c e _ v i s i t _ l o g _ ( 60 s t r i n g, 61 country s t r i n g, 62 browser s t r i n g, 63 logtime s t r i n g 64 ) 65 row format serde com. amazon. elasticmapreduce. JsonSerde 66 with s e r d e p r o p e r t i e s ( paths = , country, browser, logtime ) 67 LOCATION s3 :// Incoming data/ v i s i t log / / 68 ; g e t maximum i d from e m a i l dimension 71 f i l l dimension e m a i l 72 INSERY OVERWRITE TABLE stage_ 73 SELECT 74 ROW_NUMBER( ) OVER ( PARTITION BY ORDER BY ) + max_id as id_dim_ 75, 76 FROM 77 ( 78 SELECT 79 DISTINCT l o g. as 80 FROM s o u r c e _ v i s i t _ l o g _ log 81 LEFT JOIN dim_ em 82 ON log. = em. 22

30 2. PROPOSED ARCHITECTURE 83 WHERE log. i s null 84 ) dist_ em 85 LEFT JOIN 86 (SELECT max( id_dim_ ) as max_id FROM dim_ ) max_id_tab 87 ; union s t a g e and dimension d a t a 90 INSERT OVERRIDE s t a g e _ e m a i l _ a l l 91 SELECT id_dim_ , FROM dim_ 92 UNION ALL 93 SELECT id_dim_ , FROM stage_ 94 ; s w i t c h s t a g e and dimension t a b l e 97 ALTER TABLE dim_ RENAME TO dim_ _old ; 98 ALTER TABLE s t a g e _ e m a i l _ a l l RENAME TO dim_ ; 99 ALTER TABLE dim_ _old RENAME TO s t a g e _ e m a i l _ a l l ; p e r f o r m s dimension l o o k u p s and l o a d t r a n s f o r m e d d a t a i n t o s t a g e t a b l e 102 a l l o w s s i m p l e ETL j o b rerun 103 INSERT OVERWRITE TABLE s t a g e _ v i s i t _ l o g 104 SELECT 105 ISNULL(em. id_dim_ , 1) as id_dim_ 106, ISNULL( c n t. id_dim_country, 1) as id_ dim_ country 107, ISNULL( br. id_dim_browser, 1) as id_dim_browser 108,CAST( logtime as timestamp ) as logtime 109 FROM s o u r c e _ v i s i t _ l o g _ log 110 LEFT JOIN dim_ em 111 ON ISNULL( log. , unknown ) = em LEFT JOIN dim_country c n t 113 ON ISNULL( log. country, unknown ) = cnt. country 114 LEFT JOIN dim_browser br 115 ON ISNULL( log. browser, unknown ) = cnt. browser 116 ; p a r t i t i o n swap i n t o f a c t t a b l e 119 ALTER TABLE f a c t _ v i s i t _ l o g DROP IF EXITS PARTITION ( etl_timestamp = ) ; ALTER TABLE f a c t _ v i s i t _ l o g EXCHANGE PARTITION ( etl_timestamp = ) WITH TABLE s t a g e _ v i s i t _ l o g ; 23

Impala

Impala is an open source distributed processing framework developed by Cloudera. Impala provides low-latency SQL queries on data stored in Hadoop. The low latency is achieved with in-memory computing. Impala stores data in memory in the Parquet columnar format, therefore it is more suitable for querying large fact tables with a smaller number of columns. Another feature that distinguishes Impala from Hive is the query engine: Impala does not compile queries into MapReduce but rather uses its own programming model. Impala is closely tied to Hive because it uses the Hive metastore for storing data metadata. Impala can run on a Hadoop cluster without Hive, however the Hive metastore has to be installed. Impala accepts queries sent via the impala shell, JDBC, ODBC or the Hue console.

Pig

Pig [19] is a simple scripting language for data processing and transformations. It has been developed mainly by Yahoo. The power of Pig and Hive is similar; the main difference is the audience for which each tool was developed. Pig is more intuitive for developers with experience in procedural languages. Hive, on the other hand, leverages knowledge of SQL. Pig's advantages:

Supports structured, semi-structured and unstructured data sets.

Procedural, simple, easy to learn.

Supports user defined functions.

One of its powerful features is the capability to work with a dynamic schema, meaning that Pig can parse unstructured data such as JSON without a predefined schema. This is helpful while working with dynamic nested JSON objects or JSON arrays.

DataJson = LOAD 's3://pig-test/test-data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Figure 2.4: Example of a Pig script for loading a complex JSON structure.

The choice between Pig and Hive highly depends on the audience. As most data warehouses are built on RDBMSs and the people involved in their development know SQL, Hive is the preferable tool for data transformation and data querying.

Kylin

A relatively new project that implements OLAP on Hadoop is Kylin [20]. Kylin originally started as an in-house project at eBay, but it has been published as open source a few months ago.

Kylin works in the following manner. Aggregations are calculated from Hive using HiveQL and stored in HBase. When data are queried, the Kylin query engine checks whether the requested data are precalculated in HBase and, if so, returns them from HBase with sub-second latency. Otherwise the Kylin query engine routes the query to Hive. Kylin supports JDBC, ODBC and a REST API for client tools, therefore it is possible to connect from analytic tools such as Tableau, SAS or Excel.

Figure 2.5: Kylin high-level architecture [20].

Sqoop

Sqoop [11] is a tool provided by Hadoop that is used for importing and exporting data. Different modes of exporting are supported, such as a full export, an incremental export or limiting the size of an export using a WHERE clause. Data can be exported from HDFS, Hive or any RDBMS that supports JDBC. Sqoop works bi-directionally and therefore also supports importing into HDFS, Hive or an RDBMS. Data can also be exported into a delimited text file with specific field and row terminators, or into a sequence file. As a tool that is part of a distributed system, Sqoop also supports distributed and parallel processing. Sqoop includes additional functionality for Hive export and import in order to simplify data transfers with RDBMSs. Sqoop's additional Hive support includes:

Incremental imports.

CREATE TABLE statements within imports.

Data import into a specific table partition.

Data compression.

For user interaction, Sqoop has a simple command line interface.

3 Hadoop integration

The following chapters explain how to integrate Hadoop into an enterprise data warehouse, including the implementation of a star schema, data management and the physical implementation.

3.1 Star schema implementation

A logical star schema model captures the business point of view and is highly independent of the physical implementation platform. Hive provides almost the same SQL-like interface as common RDBMSs, therefore a data model built for an RDBMS data warehouse can be implemented in Hive with some adjustments. The thesis therefore focuses on the physical rather than the logical implementation. The main advantages of implementing a star schema in Hive are:

A simple and understandable view of the data.

Easy support for master data management.

Dimensions support user defined hierarchies.

Performance improvement [18].

Conformed dimensions can be shared across different platforms.

Dimensions implementation

One of the most challenging parts of implementing dimension tables is that the DML operations UPDATE and DELETE are not supported at the row level in Hive. Even though an append operation is available, it creates a new file with every append. This can cause significant issues with NameNode performance, as the number of small files can grow significantly and cause issues with insufficient memory. Also, without support for UPDATE and DELETE only the first type of slowly changing dimension can be implemented. Because a dimension needs to keep the same surrogate keys and an update is not available, it is necessary to recreate the table with the same keys in every ETL run. Although auto-increment functionality is not available yet, other ways exist. While inserting new rows into the table, we need to merge the existing dimension data with the new data, keeping the old keys and generating new keys for the new records. One way to generate a sequence is to get the maximal key value from the existing data, use the function UDFRowSequence to generate a sequence starting with one, and then add the maximum key value to all generated keys. The same result can be achieved using the ROW_NUMBER() window function. Due to the lack of transactions in Hive, it is recommended to stage data from multiple data sources first and load them into the final dimension table at once,

or use other techniques to prevent concurrent runs of the dimension load. An example of filling a dimension table is part of the Hive ETL example in chapter 2. However, as Hive does not support primary keys, it is better to develop a job that validates the uniqueness of surrogate keys in dimension tables, as a duplication of surrogate keys can lead to record duplication in the fact tables.

If a dimension is conformed, or at least shared across different data sources on different platforms, a more suitable solution is to generate the keys on a platform such as an RDBMS that has auto-increment functionality, supports transactions and ensures uniqueness. Basically, the RDBMS keeps the original dimension table and Hive has only a read-only copy. One of the possibilities for loading a dimension using an additional RDBMS is to take the records that should be inserted into the dimension and export them into a staging table on the RDBMS. The Sqoop tool can be used for the export from Hive.

Figure 3.1: Diagram describing the use of an RDBMS to assign surrogate keys.

sqoop import --connect "jdbc:sqlserver://edw.server.com;database=DW;username=sqoop;password=sqoop" --table dim_email --hive-import --schema dbo --incremental append --check-column id_dim_email --last-value 500

Figure 3.2: Sqoop imports all records from dim_email with id_dim_email greater than 500.

Loading into the dimension in the RDBMS runs in a transaction, therefore concurrent inserts are excluded. When the dimension is loaded, it can be exported back to Hive. Depending on the size of the table, an incremental or a full export can be chosen. This export needs to be performed after each insertion into the dimension, as Hive has only a copy and it needs to be synchronized with the original.
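As a sketch of the validation job mentioned above, the following query, run against the dim_email table from the Hive ETL example, returns any surrogate key that occurs more than once; an empty result means the dimension is consistent.

-- Surrogate key uniqueness check for the email dimension.
-- Any returned row indicates a duplicated surrogate key that has to be resolved
-- before fact tables are loaded against this dimension.
SELECT id_dim_email, COUNT(*) AS key_count
FROM dim_email
GROUP BY id_dim_email
HAVING COUNT(*) > 1;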

Regarding slowly changing dimensions, they can be implemented the same way as in an RDBMS. The second type uses two additional columns of date or timestamp format which specify the time period in which the record is valid, and the third type can be implemented by using additional columns holding the previous values [8]. If the data are in JSON or a different unstructured format, it is recommended to split the JSON into a structured table. Even though a JSON record can contain dynamic arrays, it is important to decide whether the data in such a dynamic structure are necessary. Splitting JSON into a structured table has four main advantages:

Performance.

Simple synchronization with the RDBMS.

Dimensions are simple and understandable for business users.

Simple usage of dimensions in additional tools such as ETL and OLAP.

Otherwise the JSON needs to be parsed with every query, and JSON parsing is an expensive operation [21].

Facts implementation

Fact tables contain snapshots of data or log data. Naturally, data stored in a data warehouse are tied to the day the snapshot was taken or the event occurred. This creates a time series defined mostly by the time period between individual data exports (such as hourly, daily or weekly). Crucial to a fact table implementation is the ability to work effectively with individual dates as well as with the whole history. Due to the limited DML operations, a few possible options are available.

The first and least efficient way is to load new data into the same non-partitioned fact table. This causes a read and write of the whole table, as the delete operation is not supported. A delete operation is needed to ensure the ability to rerun an ETL job: when incorrect data are loaded into a fact table, some easy way to fix the issue and rerun the ETL process has to exist, which is usually achieved by deleting the data from the current ETL iteration. The second option is to create individual tables for each hour, day or week, depending on how often an ETL iteration runs. Using separate tables and then creating a view, we can query the individual tables as well as the whole history. The most effective option is to use the partitioning that is available in Hive. Each ETL iteration works with one partition that is then appended to the existing table. A partition in Hive provides benefits similar to a clustered index. A partition is defined by an ETL iteration ID, which can be a timestamp or an int. Using partitioning it is possible to query the whole table, or to filter data by the partition attribute so that only specific partitions of the table are read. An example of a partitioned fact table is part of the Hive ETL example in chapter 2; it uses a partition swap to move data into the fact table more efficiently, as the swap only affects metadata.
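The following is a minimal sketch of how the partition attribute is used in practice, reusing the fact_visit_log table from the Hive ETL example; the partition value is illustrative.

-- Rerun safety: remove only the partition of one ETL iteration,
-- leaving every other partition of the fact table untouched.
ALTER TABLE fact_visit_log DROP IF EXISTS PARTITION (etl_timestamp = '2015-01-01 00:00:00');

-- Partition pruning: the predicate on the partition column means that
-- only the matching partition is read, not the whole table.
SELECT id_dim_country, COUNT(*) AS visits
FROM fact_visit_log
WHERE etl_timestamp = '2015-01-01 00:00:00'
GROUP BY id_dim_country;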

In the case of unstructured data such as JSON, the same rule applies as for dimensions: it is better to split the data, for easier querying, easier data understanding, easier access via ETL and BI tools, and better performance. Natural keys can be substituted with surrogate keys by a dimension lookup using a join operation.

Star schema performance optimization

Several ways to improve star schema performance exist. However, in the end the performance highly depends on the specific Hadoop cluster and the data characteristics [18]. Generally, queries to a data warehouse cause a lot of reads and need computational power, as a large fact table is being joined to several dimension tables. Implementing a star schema improves performance by reducing the size of the fact tables and reduces the query time for most of the queries [18].

Performance can be improved by using sequence files. A sequence file is a flat file format consisting of key-value pairs. A sequence file can be compressed, decreasing the file size, and it is afterwards split into blocks on HDFS, hence improving the workload. A sequence file can also be stored as a binary file, therefore decreasing the size of the files on disk and improving read performance [18, 22]. An additional format for consideration is Parquet, a columnar storage format for the Hadoop ecosystem. Columnar storage is the most beneficial if the fact table is used for aggregations of large amounts of data and for queries that load lots of rows but use a limited number of columns. An example of a table stored as a sequence file is part of the Hive ETL example in chapter 2.

For dimensions that are small enough to fit into memory, Hive has map join functionality [21]. Map joins are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as it is streamed through. Another similar approach is to use the Hadoop distributed cache [11]. Hive can also benefit from building indexes on fact tables. However, due to the differences between Hive and RDBMSs, indexes do not have to be as beneficial as in RDBMSs. For further query optimization, Hive supports the EXPLAIN function, used to display the generated MapReduce execution plan.
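As a sketch of the map join technique, using the fact and dimension tables from the Hive ETL example, the MAPJOIN hint asks Hive to hold the small dimension in memory; newer Hive versions can also convert such joins automatically when hive.auto.convert.join is enabled.

-- The small dim_country table is loaded into an in-memory hash map and the
-- large fact table is streamed through it, avoiding a reduce-side join.
SELECT /*+ MAPJOIN(cnt) */
  cnt.country,
  COUNT(*) AS visits
FROM fact_visit_log log
JOIN dim_country cnt
  ON log.id_dim_country = cnt.id_dim_country
GROUP BY cnt.country;

-- EXPLAIN prints the generated execution plan for a query.
EXPLAIN
SELECT COUNT(*) FROM fact_visit_log WHERE etl_timestamp = '2015-01-01 00:00:00';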

3.2 Security

By default Hadoop runs in a non-secure mode in which no actual authentication is required. By configuring Hadoop to run in secure mode, each user and service has to be authenticated by Kerberos in order to use Hadoop services [11]. Organizations that already have an Active Directory to manage user accounts are not keen on managing another set of user accounts separately in MIT Kerberos. Nonetheless, Kerberos can be configured to connect to a company LDAP. Even so, each user has to have a Kerberos principal created, and it is then necessary to link the principals to the users in Active Directory. Hadoop already provides fine-grained authorization via file permissions in HDFS, resource-level access control for YARN and MapReduce, and coarser-grained access control at the service level. Hive provides GRANT and REVOKE access control on tables, similarly to RDBMSs.

3.3 Data management

3.3.1 Master data

Typically, the attribute names and data definitions used to describe master data entities are the standard names and definitions for the whole enterprise. Master data are closely tied to the conformed dimensions in the data warehouse, as some of the conformed dimensions are based on enterprise master data. It is therefore important to be able to match new records against the records that already exist in the conformed dimensions. There are several approaches to matching dimensional records:

Simple matching - Using a GROUP BY or DISTINCT statement and then performing a lookup on the dimension table. This is easy to implement in Hive, as both operations are supported, and it is naturally auditable because it is clear exactly how the records are matched. An example of simple matching is among the code examples later in the thesis, and a sketch appears at the end of this section.

Complex matching - Often implemented in commercial master data management systems, usually with complex matching algorithms that are hard to scale or impossible to distribute. One of the disciplines used is machine learning, which can be used to predict matches; a Hadoop library focused on machine learning is Mahout. However, complex matching significantly increases ETL development complexity and makes the process harder to audit [23].

In an already existing data warehouse it can be efficient to use the existing master data management system. However, loading a lot of data from Hadoop into such a system can easily become a bottleneck of the whole ETL process. When considering this option, it is important to transfer the smallest possible amount of data and to keep as much of the data processing in Hadoop as possible.
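As a minimal sketch of simple matching, assuming hypothetical staging and dimension tables (the names below are illustrative, not the ones used in the thesis examples), the new natural keys can be deduplicated and checked against the conformed dimension with a left join:

-- Natural keys present in the staging data but missing from the dimension.
SELECT s.customer_id
FROM (
  SELECT DISTINCT customer_id
  FROM stg_sales
) s
LEFT OUTER JOIN dim_customer d
  ON s.customer_id = d.customer_id
WHERE d.customer_id IS NULL;

The result is the set of members that still need surrogate keys assigned before they are appended to the dimension; because the matching rule is an explicit join condition, every match can be audited later.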

3.3.2 Metadata

Metadata contain detailed information about the individual steps of an ETL process, about reporting and about the data sources. Without metadata it is difficult to keep track of data flows and data transformations. Hadoop does not contain any tool that directly supports metadata collection and management, so a metadata repository and the collection process have to be implemented manually.

Metadata about the data themselves can be stored in the Hive metastore using the HiveQL features TBLPROPERTIES, which allows key-value pairs of commentary to be attached to a table, and COMMENT, which can be added to columns or to the whole table [11]. A problem arises with regularly run queries such as an ETL process: commercial tools such as Informatica support ETL documentation and metadata creation, but if the ETL is hand coded in Hive, it is necessary to store the query metadata outside of the Hive metastore, because Hive does not support stored procedures. The design described in the orchestration section can be extended with a metadata collection process, which can be automated based on query parsing. An example of Hive table metadata is among the code examples later in the thesis, and a small sketch follows at the end of this section. Metadata stored in a database can easily be visualised using any reporting tool; reporting tools usually have their own system for storing metadata. The metadata for the individual parts of the data warehouse can therefore be stored as follows:

Table details - stored in Hive table properties.
Procedure details and query dependencies - a manually created data structure for storing queries in an RDBMS.
ETL orchestration and data flow - the orchestration tool.
Reporting and BI tools - their own databases.
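The table-level metadata mentioned above can be recorded directly in HiveQL; the following sketch uses illustrative names, and the free-form property keys are an assumption rather than a fixed convention.

-- Table and column comments plus free-form key-value properties
-- stored in the Hive metastore (names and keys are illustrative).
CREATE TABLE dim_product (
  product_key BIGINT COMMENT 'Surrogate key generated by the ETL',
  product_id  STRING COMMENT 'Natural key from the source system',
  category    STRING COMMENT 'Product category name'
)
COMMENT 'Conformed product dimension'
TBLPROPERTIES ('data_owner' = 'BI team', 'source_system' = 'ERP');

-- Existing tables can be annotated later as well.
ALTER TABLE dim_product SET TBLPROPERTIES ('last_profiled' = '2015-04-01');

-- The stored comments and properties can be read back with:
DESCRIBE FORMATTED dim_product;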

3.3.3 Orchestration

One of the things Hadoop does not do well at the moment is orchestration. Orchestration is a central function of the whole data warehouse environment and the source of operational information. An EDW environment can have several thousand unique jobs running every day, and with near real time and real time processing the number of jobs run every day can reach tens of thousands. In such an environment it is crucial to be able to define individual tasks, their dependencies and their runtime environment. An EDW orchestration tool should provide:

Agent support - An EDW usually uses several servers and different environments such as development, production or UAT, and it must be easy to switch a job to a different environment or server.
Complex dependency support - Dependencies between individual jobs need to be set properly; a job often waits for another project to finish, or for just one job in that project. The ability to set dependencies correctly can significantly decrease the wait times of EDW processes, and dependencies also protect the correctness of the data.
Simple operability - With thousands of jobs running every day, an easily operated orchestration tool shortens problem resolution times and helps meet SLAs.
Simple user interface - Closely related to the previous point.
Simple manageability - Planning EDW processes should not take a significant amount of time, as they change often.

For orchestration purposes Hadoop offers Oozie, a simple Hadoop job scheduler. Considering the requirements listed above, Oozie itself does not meet most of them. The main problems are that it is accessible only through a command line interface or a simple console and that jobs are defined in XML; any job definition change therefore has to be made via the CLI, and the job listing in the console does not provide enough information to simplify operational tasks. Another disadvantage is the lack of support for non-Hadoop technologies. In general, it is better to use a general commercial scheduling system such as Cisco Tidal or Control-M, especially if the data warehouse is already running and such a scheduling system is already in use. Running automated queries on Hive is then a matter of connecting to the Hadoop master node via an SSH tunnel, or directly to Hive using one of the supported drivers; this can easily be implemented in most scripting languages and scheduled in any job scheduler. To make Oozie a more suitable tool, an open source project called Cloumon Oozie has been started [24]; it aims to implement a job definition and workflow designer and a job management system.

3.3.4 Data profiling

Data profiling [25] is the process of examining the data in the available data sets and collecting information and statistics about them. The main goal is to discover the characteristics of the data and whether the data can be used for other purposes. Data profiling consists of basic statistics such as minimum, maximum, frequency and standard deviation, and of aggregations such as counts or sums. Additional information about data attributes can include the data type, uniqueness or length [26]. Data profiling in Hadoop can be performed by:

A commercial data profiling tool such as SAS or Informatica.
R or another statistical language.
A native Hadoop tool such as Hive or Pig.

With Hive already running on Hadoop, using Hive and HiveQL for data profiling is the easiest choice. HiveQL supports the basic statistical functions known from SQL, e.g. min, max and avg, as well as aggregation functions such as count and sum. Hive also contains a variety of built-in user defined functions [27] that calculate standard deviation, the Pearson coefficient, percentiles and many others. Therefore, after setting up the basic table structure in Hive, data profiling can be done using the built-in functions alone, as the sketch below illustrates. The process can easily be automated and used as a basic indicator of data quality and of emerging data issues. However, some data profiling tasks can be difficult to implement in HiveQL, just as they are difficult to implement even in SQL.
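A minimal profiling query of this kind might look as follows; the table and column names are hypothetical and the chosen statistics are only examples of the built-in functions mentioned above.

-- Basic column profile of a hypothetical staging table.
SELECT
  COUNT(*)                                        AS row_cnt,
  COUNT(DISTINCT customer_id)                     AS distinct_customers,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
  MIN(amount)                                     AS min_amount,
  MAX(amount)                                     AS max_amount,
  AVG(amount)                                     AS avg_amount,
  STDDEV_POP(amount)                              AS stddev_amount,
  PERCENTILE_APPROX(amount, 0.5)                  AS median_amount
FROM stg_sales;

Stored as a regular ETL step, the output of such a query forms a simple time series of profile metrics that can be charted or checked against thresholds.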

3.3.5 Data quality

Cleansing data and conforming data quality [28] is a challenging process, especially in Hadoop. In an RDBMS, many basic data quality validations are enforced by the table structure and constraints and are checked when data are inserted into a table. These basic constraints are data types, null and not null, or a requirement that a value be greater than, equal to or less than a predefined value. More complex constraints can be defined as well, such as uniqueness, foreign keys or functional dependencies on other attributes; however, constraints such as foreign keys can significantly decrease ETL performance. Hadoop, in contrast, validates data structure on read rather than on write. With Hive, all of these constraints have to be validated during the transformations and while loading data into the star schema. This can significantly affect ETL performance on Hadoop, especially with big log fact tables or dimensions. It is therefore better to identify the important data quality validations and issues and to consider combining several validations into one query to reduce the number of disk reads. Data that are loaded into a conformed dimension, or used for sensitive business reporting such as accounting figures or information that is later published externally, should have a higher priority and should be validated very carefully.

Many different methods can be implemented in Hive as part of an ETL process that regularly validates the quality of the data. One method that is easy to implement in HiveQL is Bollinger bands [29], which can automatically validate the data distribution and capture unexpected distribution changes; a sketch follows. Other methods are based on built-in Hive UDFs such as standard deviation. At this point there are no open source data quality tools that support Hadoop or Hive. On the other hand, commercial tools such as Informatica are gradually adding Hadoop support, allowing Hadoop and RDBMS data quality to be managed with one tool, while other commercial tools from companies such as Talend or Ataccama focus only on Hadoop.
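A Bollinger-band style check can be approximated in HiveQL by comparing each daily value against the historical average plus or minus a multiple of the standard deviation. The sketch below assumes a hypothetical table of daily load statistics and a two-standard-deviation band; both the table and the threshold are illustrative choices, not prescriptions from the thesis.

-- Flag days whose row count falls outside avg +/- 2 * stddev
-- computed over the recorded history (simplified, non-moving band).
SELECT d.load_date, d.row_cnt
FROM daily_load_stats d
CROSS JOIN (
  SELECT AVG(row_cnt)        AS avg_cnt,
         STDDEV_POP(row_cnt) AS std_cnt
  FROM daily_load_stats
) s
WHERE d.row_cnt > s.avg_cnt + 2 * s.std_cnt
   OR d.row_cnt < s.avg_cnt - 2 * s.std_cnt;

A proper Bollinger band uses a moving window; the fixed window above is a simplification that is still useful as an automated sanity check after each load.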

3.3.6 Data archiving and preservation

Part of data management is a plan that describes when and how data are archived and which data are preserved in the data warehouse. One of the major issues of an RDBMS data warehouse is expensive storage space, so historic data are either aggregated, or archived and deleted from the data warehouse. Data removed from the data warehouse are usually moved to tapes as a cheap passive storage, where they reside for years because of legal obligations or for audit purposes. The major disadvantage is difficult and slow access to the data when they are needed; because of this, it is not common to retrieve and query data stored on tapes for analytical purposes.

A Hadoop cluster, on the other hand, is built on regular commodity disks, so it can be used as long term storage with the advantage of simple data accessibility. Data that are a few years old can be moved from the RDBMS into Hadoop, leaving the expensive RDBMS storage for current data that are queried more often. When the historic data are needed, they can be queried from Hadoop using Hive. It is not likely that the structure of historic data will change, so storing them in Hadoop does not require intensive maintenance. Even though data in Hadoop can easily be queried, the queries can run for a long time because of the data volume. If reporting or analyses are needed on a regular basis, it is worth precalculating aggregated tables that can be stored in the RDBMS; a sketch of such a precalculation follows. Depending on the amount of data and their usage, Hadoop can serve as the archiving platform itself or as an intermediate tier between the RDBMS and tapes.
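As an illustration of precalculating an aggregate over archived detail, the following HiveQL creates a small monthly summary that could then be exported to the RDBMS; the table names and the chosen grain are assumptions made for the example.

-- Monthly aggregate over archived detail rows (illustrative names).
CREATE TABLE agg_sales_monthly
STORED AS PARQUET
AS
SELECT SUBSTR(sale_date, 1, 7) AS sale_month,
       product_key,
       SUM(amount)             AS total_amount,
       COUNT(*)                AS sale_cnt
FROM fact_sales_archive
GROUP BY SUBSTR(sale_date, 1, 7), product_key;

The small result set can be transferred to the relational data warehouse, for example with Sqoop, while the detailed history stays in Hadoop.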

3.4 Extract, transform and load

The choice of an ETL development method depends on the complexity of the ETL process, the budget and the developers' experience with the Hadoop platform and with ETL tools. Generally, an ETL process is divided into three steps:

Extract - Thanks to Hadoop's support for a variety of file formats and its schema validation on read, Hadoop is able to ingest many different export formats. If the data can be exported directly from a database, the Sqoop tool can be used to set up repeated incremental exports; Sqoop can transfer data directly into a Hive table or into HDFS. If data are exported into a text file, it is important to define a field terminator that cannot appear in any attribute, so the data can be processed correctly later. An example of a data import using Sqoop is shown in figure 3.2. For unstructured data, JSON is preferable thanks to the wide support of JSON SerDes and its smaller storage footprint compared to formats such as XML. Most systems whose export functions are accessible via an API support flat file export, so the data can be loaded directly into HDFS.

Transform - Depends on the chosen ETL development method. The transformation process can be either hand coded or developed in one of the commercial ETL tools; more details are given in the following sections.

Load - Like the transform step, the load step can be a part of an ETL developed in an ETL tool or it can be hand coded. If the data warehouse is located outside of Hadoop, Sqoop can be used to export and import the data in an incremental fashion.

3.4.1 Hand coding

One option in ETL design is to write all transformation logic in one of the Hadoop tools such as Hive or Pig, or to write the transformations in Java or Scala with MapReduce, or in almost any other programming language using Hadoop streaming. Because many existing ETL processes use SQL stored procedures for individual data transformation tasks, the most portable solution is to use Hive, thanks to the similarity of HiveQL and SQL. SQL is also one of the most common languages for data analytics, which makes an ETL written in SQL understandable for developers as well as for data analysts. Such an ETL can also leverage Hive features such as table partitioning and indexes for optimization, table caching, the growing number of available third party user defined functions and SerDes, and the growing community.

Another option is the Hadoop tool Pig. For developers who are not familiar with SQL, Pig can be more suitable, as it uses a simple scripting language. Another useful feature is the possibility of storing every step onto disk, so an ETL written in Pig can use staging more efficiently.

The third option is writing MapReduce jobs in Java or another language. The major advantage is that the jobs can be optimized for a specific purpose; however, a MapReduce-based ETL tends to be more time consuming to develop, more complex, and the code is not likely to be portable. It is therefore better to develop MapReduce jobs only when an ETL in Hive or Pig is not sufficient. An example of a hand-coded ETL is among the code examples later in the thesis, and a short sketch follows.
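A minimal hand-coded ETL step in HiveQL might look like the sketch below: an external table is declared over files landed in HDFS, and a transformation query performs the surrogate key lookup while loading one partition of the fact table. The path, table and column names are hypothetical.

-- Staging: external table over delimited files already landed in HDFS.
CREATE EXTERNAL TABLE stg_sales (
  customer_id STRING,
  product_id  STRING,
  sale_date   STRING,
  amount      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/sales/2015-04-01';

-- Transform and load: replace natural keys with surrogate keys
-- and write the result into one partition of the fact table.
INSERT OVERWRITE TABLE fact_sales PARTITION (etl_batch_id = 20150401)
SELECT c.customer_key,
       p.product_key,
       s.sale_date,
       s.amount
FROM stg_sales s
JOIN dim_customer c ON s.customer_id = c.customer_id
JOIN dim_product  p ON s.product_id  = p.product_id;

An inner join drops facts whose natural keys are missing from the dimensions; depending on the chosen policy, a left join with a default "unknown" surrogate key may be preferable.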

3.4.2 Commercial tools

The main advantage of commercial ETL tools is their built-in data integration functionality and a high level of development abstraction. Some tools also support data transformations on Hadoop and on RDBMSs alike, and using a familiar ETL tool can save development time and maintenance costs. Recently, companies developing ETL tools have invested into Hadoop support. Some examples of ETL tools for Hadoop are:

Pentaho Business Analytics Platform (PBAP) - A combination of tools for data integration, visual analytics and predictive analytics developed by Pentaho. PBAP supports working with HDFS, HBase and Hive; the connection to Hive uses the JDBC driver. Transformations can be translated into Pig, HiveQL or directly into MapReduce, depending on the transformation and the data source used. Thanks to this translation into natively supported languages, PBAP uses the distributed cluster to process data in parallel.

SSIS - A data integration tool developed by Microsoft. After installing an additional ODBC driver for Hive, an SSIS package can connect to Hive and use it as a data flow source or destination. Within a data source task, data can be pulled using a HiveQL query, leveraging the power of Hadoop's distributed cluster. Other transformations, such as fuzzy lookups, are processed outside of Hadoop, depending on where the SSIS package runs. SSIS data flows can be processed in parallel to optimize work with large volumes of data.

Oracle Data Integrator (ODI) - One of several data warehouse and ETL tools developed by Oracle. With version 12c, Oracle introduced improved support for Hadoop. ODI uses special adapters for HDFS and Hive and can transfer data between Hive tables and HDFS files. Transformations are written in HiveQL, so they are processed on the Hadoop cluster. For better cooperation with an Oracle data warehouse, ODI also includes Oracle Loader for Hadoop, a high performance connector for loading data into an Oracle Database from Hadoop; transformations and pre-processing are performed within Hadoop [30].

DMX-h - An ETL tool for Hadoop developed by Syncsort. DMX-h supports various transformations, including joins, change data capture and hash aggregation. All transformations are translated directly into MapReduce jobs without going through Hive or Pig first.

3.5 Business intelligence

BI uses a variety of tools for different purposes, and each tool targets a specific group of users. Each user wants to access data differently, depending on their responsibilities and skills: a data scientist prefers direct access to the data, while a manager prefers a simple dashboard or a report with KPIs [9]. Because Hadoop is based on HDFS, users are able to access and query individual files located in HDFS, depending on their privileges. Other users and BI tools access structured tables in Hive using the JDBC or ODBC driver. Most commonly, a user accesses only the BI tool, as it is user friendly; security is then often defined in the BI tools themselves, such as reporting portals or OLAP. Depending on their purpose, BI tools can be divided into three main categories: OLAP, reporting tools and tools for analytics.

3.5.1 Online analytical processing

OLAP tools are based on structured data, so in the case of Hadoop an OLAP tool needs to access data stored in Hive or in another database tool such as Impala.

An OLAP tool uses the data as well as their metadata to define its own data structure, so it requires structured underlying data. When designing OLAP, one of the key decisions is whether MOLAP, ROLAP or HOLAP will be used. OLAP usually serves as a self-service analytical tool, so varied data are queried in the shortest time possible and users do not want to wait for results [10]. Hive was not developed for fast interactive queries but rather for batch processing, which makes it unsuitable for ROLAP or HOLAP. On the other hand, Hive is a perfect data source for MOLAP: newly processed data are regularly loaded into the MOLAP cube and all user queries are then answered by MOLAP rather than by Hive. Not all data loaded into MOLAP have to be stored in Hadoop; the architecture can leverage the fact that MOLAP can load dimensions from the RDBMS and only the large fact tables from Hive in order to improve processing speed. Impala, with its in-memory processing, is more suitable for ROLAP; however, the cluster has to be designed for this type of workload, otherwise it can suffer from insufficient memory, forcing Impala to repeatedly reload data into memory and significantly degrading response times [22].

Despite the fact that Hive supports ODBC and JDBC drivers, not all OLAP tools can easily connect to Hive. A common workaround for SSAS, for example, is to create a linked server from a SQL Server instance to Hive and query the data from SSAS through the linked server. If Hadoop contains all the dimension and fact tables necessary for the OLAP implementation, Kylin can be used, which is an ideal solution for large fact tables thanks to its support for incremental cube refresh. However, problems can occur with a growing number of dimensions and dimension members, because Kylin precalculates all combinations of all dimension members, and with every added dimension the complexity grows exponentially. It is therefore important to select only the needed dimensions, ideally with smaller cardinality. Kylin also expects the star schema to be stored in Hive, which makes it impossible to combine several fact tables into one cube.

3.5.2 Reporting

Reporting consists of two distinct parts. Reports are usually longer running queries over aggregated or detailed data; they are often generated automatically and sent to an email address or a file share. Because the user does not observe the real latency, Hive is suitable for this type of usage; even though Hive needs approximately 20 seconds to translate a query into MapReduce, this delay is usually not crucial for reports. The other part of reporting are dashboards presenting visualized KPIs. Dashboards are designed to be responsive, as they show a small amount of critical information and are often viewed from mobile phones or tablets. This is an issue for Hive, so the solution is either to pull the data from OLAP instead of Hive, or to prepare smaller aggregated tables that can reside in the memory allocated by Impala; a sketch of such a table follows. Depending on the workload, Hadoop caching can also be used for commonly running reports and dashboards.
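One way to prepare such an aggregate is a small HiveQL summary table that Impala (which shares the Hive metastore) or a dashboard query can read quickly; the names and the chosen KPIs are illustrative assumptions.

-- Small daily KPI table intended for dashboards (illustrative).
CREATE TABLE agg_daily_kpi
STORED AS PARQUET
AS
SELECT sale_date,
       COUNT(*)                     AS transactions,
       SUM(amount)                  AS revenue,
       COUNT(DISTINCT customer_key) AS active_customers
FROM fact_sales
GROUP BY sale_date;

Because the result contains only one row per day, it stays small enough to be cached or rebuilt with every ETL iteration.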

3.5.3 Analytics

Analytics can process large amounts of data or work with just a small subset. Hive is a great tool for long running ad hoc analytical queries or for preparing smaller data sets for further analysis with more suitable tools; a sketch of such a preparation follows. One of the most popular analytical languages is R, and the RHadoop project [31] implements the integration of R with Hadoop. It includes a collection of R packages for connecting to Hadoop and uses the streaming API; the main advantages of the project are the ability to access files directly in HDFS and functions for data querying and data transfer. Several other open source projects were created to ease data analysis in Hadoop, such as Spark, Shark (shark.cs.berkeley.edu), Tez or Mahout. Besides the open source projects, many commercial tools exist as well, such as SAS, Alteryx or Pentaho Analytics, and companies often offer a whole analytical solution bundling Hadoop with other analytical tools.
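Preparing such a smaller analytical data set can be as simple as the hedged sketch below, which materializes a filtered sample of a fact table for analysts to pull into R or another tool; the filter, the sampling rate and all names are assumptions for the example.

-- Roughly 1% random sample of one year of detail, kept as a separate table
-- that analysts can query or export without touching the full fact table.
CREATE TABLE sandbox_sales_sample
STORED AS PARQUET
AS
SELECT *
FROM fact_sales
WHERE sale_date >= '2014-01-01'
  AND sale_date <  '2015-01-01'
  AND RAND() <= 0.01;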

3.6 Real time processing

Batch and real time processing are two distinct processes. In general, batch processing is simpler to implement than real time processing, because real time processing uses a different approach and therefore different technologies. Currently, real time processing is not implemented very often, because many source systems are not capable of providing data in real time and because the implementation is expensive. Compared to near real time processing with micro-batches, true real time processing does not necessarily add any business value, while batch processing is usually easier to develop and maintain. Real time data processing is therefore often replaced by micro-batch processing in near real time.

For real time processing, Storm [32] can be used. Storm is a framework for distributed, fault-tolerant stream processing. Storm is not a part of the Hadoop framework, as it was developed separately and for different purposes, but its integration with Hadoop is well established. Another tool for real time processing is Spark Streaming [33]. It provides in-memory processing and does not process each record or message separately like Storm, but rather in micro-batches. Both technologies guarantee exactly-once processing; Spark Streaming ensures it by using Resilient Distributed Datasets (RDD). Basically, Spark Streaming takes a stream of data and converts it into micro-batches that are then processed by Spark's batch engine. In case of a node failure, Spark stores the data lineage and can restore the data from logs.

Spark can be the easier technology to adopt, as it natively includes libraries for working with aggregation functions. Storm has a framework similar to Spark Streaming called Storm Trident, which provides aggregation functions and micro-batch processing. The main disadvantage of micro-batch processing is higher latency. Both technologies have built-in YARN support for resource management within a cluster. Storm and Spark Streaming can be used in several ETL designs:

As Storm supports JDBC and ODBC, it can connect to almost any database. Storm can read data from a source table in short intervals, perform transformations and load the transformed data back into the database or into HDFS.

Storm can be integrated with Kafka [34], a distributed messaging system. Data are buffered in Kafka so that they do not get lost, and Storm periodically reads them from it. As both technologies are fault tolerant, this ensures that all data loaded into Kafka are read and processed. With Kafka and Storm the ETL is offloaded into a distributed system, and data are loaded into the database or Hadoop only once they are transformed or at least preprocessed, which decreases the number of reads and offloads the ETL processing mostly into memory; Kafka, however, stores its messages on disk.

A Spark Streaming implementation is similar to the Storm one. The main advantage of Spark Streaming is that queries written for it are easily portable to general-purpose Spark [35].

Figure 3.3: Diagram describing the data flow in an architecture with Kafka and Storm or Spark Streaming.

While designing a real time ETL, data visualisation and access should be considered as well. It is not worth implementing a real time solution if the user has to wait more than 30 seconds to get the data; Hive, for example, is not a suitable tool for real time processing due to its query translation delay. Other things to consider, depending on the purpose of the data processing:

OLAP - The only feasible solution is ROLAP or HOLAP. However, Hive naturally has a delay of approximately 20 seconds due to HiveQL translation and distribution, and it is not worth reprocessing MOLAP data every few seconds, as this would cause a lot of disk reads.

Reporting - As mentioned before, reporting can be built on tools other than Hive; Impala or Tez, for example, give much better query times thanks to avoiding the MapReduce paradigm and keeping data in memory.

Dashboards - The preferable choice of visualisation. Storm can send aggregated data in parallel into dashboards to show various time lines and KPIs. The data are refreshed with every cycle of the real time ETL, which enables the shortest possible information delivery.

3.7 Physical implementation

Several aspects come into consideration when deciding on the physical implementation of a Hadoop cluster. When choosing the cluster hardware, it is important to keep in mind the cluster workload and the tools that are going to be used. The main components to consider:

RAM - To use the processors effectively, the amount of RAM should not be low. RAM is one of the cheaper cluster components, so it is better not to save too much on it. More parallel processes can run with a higher amount of RAM, and memory not used by currently running jobs can still be used by Hadoop to cache data (e.g. smaller dimension tables). A higher amount of memory is recommended if the cluster runs in-memory analytical tools such as Impala, Tez or Spark.

CPU - Highly depends on the workload. No more than two sockets are recommended, as more would increase upfront costs and power requirements significantly.

Storage - Depending on cluster size and workload, four to eight 2 TB or 3 TB disks, usually 7200 RPM. HDFS stores copies of files on other nodes and racks, depending on the configuration and replication level, to ensure that data are not lost during a single node failure; because of this, a Hadoop cluster usually requires more storage space.

Network - Similarly to the previous aspects, the network design highly depends on the workload and cluster size. It is important to allow all nodes within a cluster to communicate with each other, and the network design also significantly affects cluster costs. Smaller clusters work with 1 Gbit all-to-all bandwidth; larger clusters may need dual 1 Gbit links for all nodes in a 20 node rack and 2 x 10 Gbit interconnect links per rack going to a pair of central switches.

Power - One of the major concerns. It is important to think about the Hadoop configuration and workload in advance in order to keep costs reasonable; with more powerful nodes and more CPUs, power consumption can grow significantly.

The node configuration differs slightly between the name node and the data nodes, due to their different roles in the cluster [36]. When the Hadoop cluster is implemented as a part of a data warehouse, it is important to provide sufficient network connectivity between the Hadoop cluster and the rest of the data warehouse servers. The amount of data transferred highly depends on how Hadoop is integrated into the data warehouse environment. Besides the regular traffic between Hadoop and the data warehouse in both directions, ad hoc data transfers have to be considered as well: cases such as recalculating a few years of historical fact tables into OLAP servers place a high demand on bandwidth, and if they are not considered in the design, they can cause issues with the SLAs of other services.

3.7.1 Hadoop SaaS

Regarding the physical location, one of the most crucial decisions is whether to run the Hadoop cluster on-premise or as SaaS. A Hadoop cluster located in the cloud has several advantages:

Costs - There is no need to build own infrastructure, which saves a lot of money; companies do not need to buy their own hardware, expand their data centres or improve their network.

Resizability - The Hadoop cluster can easily be adjusted and resized in terms of the number of nodes or the node hardware.

Support - Providers supply tested versions of Hadoop and its tools, among other additional software, to ease maintenance and development.

However, SaaS also has a major disadvantage, and that is the location of the data centres. As Hadoop clusters are built for massive data processing, it can be expected that a lot of data will be transferred between the local data centres and the Hadoop cluster in the cloud. This causes several issues: in most cases, data transfer out of the cloud is charged, the transfer speed is not always sufficient, and connectivity issues can occur during a transfer. SaaS is also disjoint from the company's other security processes, so security has to be defined specifically for it. The main aspects to consider when deciding about the Hadoop cluster location are the locations of the existing data sources, the data warehouse and the other BI servers. As clusters in the cloud are usually very easy to start, they can also be used for a proof of concept of a Hadoop project. The major competitors in Hadoop SaaS are:

Amazon Web Services (AWS) - AWS is a cloud computing platform developed by Amazon. Among other services, such as elastic compute (EC2), S3 storage and support for RDBMSs, AWS also offers a modified version of Hadoop called Elastic MapReduce (EMR). EMR is priced per hour and per node, depending on the node size, and supports CLI, API or AWS Console access.

Microsoft Azure - Microsoft, together with Hortonworks, modified Hadoop to run on Windows machines. Within Azure, Microsoft offers HDInsight, a Hadoop-based cloud service; the pricing model is similar to that of AWS.

3.8 Hadoop commercial distributions

Hadoop was developed as an open source project with contributions from several leading technology companies, and it still has a growing, active community of developers who add new functionality and work on new versions. Even though all the information in this thesis is based on open source Hadoop or third party open source libraries, it is useful to know that commercial distributions exist. The most popular ones are:

EMR - EMR runs on predefined EC2 nodes in AWS. Depending on the workload, the nodes can have more RAM, CPU or storage. Amazon improves the functionality of Hadoop by integrating it with other AWS services; among other EMR-specific features there are Hive connectors to DynamoDB, S3 and Redshift, and a JSON SerDe. EMR is suitable for ad hoc workloads as well as for non-stop processing.

Hortonworks Data Platform (HDP) - HDP is an open source analytical solution. It simplifies Hadoop adoption by integrating the necessary tools into one easy-to-install package. Hortonworks also develops an HDP branch installable on Windows servers. One of its key features is improved security integration.

Cloudera Distribution Including Hadoop (CDH) - CDH is an open source, easy-to-install package of Hadoop tools with the addition of Cloudera's own components such as a user interface and improved security.

4 Other Hadoop use cases

Hadoop can be used for various specific business use cases; some of them are explained in this chapter.

4.1 Data lakes

Data lakes are a new trend in data warehousing and analytics. A data lake is a place where data are stored in raw form and can be used for analyses. Because different data sources can have significantly different characteristics and structure, it is challenging to analyze them and to discover useful information and unseen relations. Nonetheless, especially in agile data warehouse development, it is important to deliver business value as soon as possible, and to do so the data sources first need to be investigated and the data analysed in order to identify what can be used for further integration or analyses. Thanks to its support for many different file formats, dynamic data structures and a variety of tools for data access and querying, Hadoop represents an ideal platform for data lakes. The data are simply copied into HDFS and can then be queried with different tools; a small sketch of querying raw data in place follows. One of the first steps, and one of the most meaningful uses of a data lake during development, is data profiling, which can easily be performed using Hive or Pig. More advanced steps include discovering relations between different data sets and other possible uses of the data. Having the data from all available sources on one platform can significantly ease data profiling and reduce the time needed, especially if the loading processes are set up so that every new data source is added into the data lake semi-automatically.
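As a hedged illustration of querying raw data in the lake, an external Hive table can be declared over JSON files kept as plain text, and individual fields can be extracted with the built-in get_json_object function; the HDFS path, file layout and field names are assumptions made for the example.

-- Raw JSON events, one JSON document per line, left in place in the lake.
CREATE EXTERNAL TABLE raw_events (
  json_line STRING
)
STORED AS TEXTFILE
LOCATION '/data/lake/events';

-- Lightweight profiling directly over the raw files.
SELECT GET_JSON_OBJECT(json_line, '$.event_type') AS event_type,
       COUNT(*)                                   AS cnt
FROM raw_events
GROUP BY GET_JSON_OBJECT(json_line, '$.event_type');

Once the useful fields are known, the same data can be split into a properly typed, structured table, as recommended earlier for dimensions and facts.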

4.2 ETL offload platform

One of the common Hadoop use cases in a data warehouse environment is ETL offloading. This use case takes advantage of the linear and relatively cheap scalability: computing time is usually cheaper on the Hadoop platform than on an RDBMS. If the ETL is offloaded from the main RDBMS data warehouse, the RDBMS can focus on serving user queries. The whole ETL process is then performed on Hadoop, and once the data are cleansed and transformed, they are moved into the RDBMS. There are several ways to implement this use case:

Figure 4.1: Diagram describing the steps of the ETL offloading use case.

Dimensions stored on both platforms - Hadoop keeps a local copy of all dimensions. The main advantage of this approach is that it reduces the number of reads on the RDBMS while performing lookups; the whole ETL is independent of the RDBMS and uses only Hadoop resources. However, it is then necessary to synchronize the dimensions on both platforms, because issues can occur and data integrity can be corrupted during the transformation process.

Dimensions stored only in the RDBMS - Dimensions exist only in the RDBMS. The advantage is minimal synchronization and easier implementation; however, this causes a full dimension read with every lookup. In this case it is better not to use external tables in Hive: with 10 fact tables of 20 dimension keys each, Hive would have to perform 200 full scans of dimension tables on the RDBMS and store the data into HDFS. The number of scans can be reduced by importing the needed tables only once before the lookups, or by caching the smaller dimensions. This is not suitable if big dimension tables have to be used; dimensions can have hundreds of thousands of records and take hundreds of gigabytes of storage, and importing such a table with each ETL run is not feasible.

When implementing the solution, one has to weigh the size of the tables against the implementation difficulty. Probably the best solution is a hybrid approach: small dimensions containing hundreds of rows can be imported with every lookup, while big dimensions are better synchronized. More details about the dimension implementation in Hive are given in the section on star schema implementation. If the data warehouse team already uses a commercial ETL tool with Hadoop support, the whole ETL migration is simplified, as some parts of the ETL process are already implemented.

4.3 Data archive platform

As mentioned in chapter 3.3.6, Hadoop is a great platform for storing massive amounts of data. Usually the most current data are accessed most often, and the older the data are, the less likely they are to be needed. However, this is true mainly for detailed data; aggregated data are queried regularly as parts of time lines, which is one of the reasons why Inmon suggests having aggregation plans in the data warehouse [6].

When historical data are aggregated, it is not necessary to delete the underlying detail, as it can still be used for audit purposes and ad hoc historical analyses, such as analyzing changes in customer behavioural patterns. Storing several years of data can be challenging, and keeping them somewhere they can actually be queried is even more challenging. Despite the fact that data located on tapes are difficult to access, data are commonly archived there because tape is one of the cheapest storage media. Since a Hadoop cluster usually runs on commodity hardware, its storage is cheap compared to the storage attached to enterprise RDBMSs. If historical data are moved into Hadoop, they can therefore still be analysed, used for reporting or processed into OLAP, with the advantage of keeping the data structured and accessible via HiveQL. If necessary, aggregated data can be placed in the RDBMS for quicker response times.

4.4 Analytic sandboxes

As the research in [2] shows, analytic sandboxes are one of the components expected to be used in the future. An analytic sandbox is a platform designed for a specific group of users or for individual users. Individual sandboxes contain a specific subset of the EDW data, depending on the needs of the users, and are used for analyses, designing data models or even testing new tools for data analysis and visualisation. A platform for analytic sandboxes has to meet the following requirements:

Low price - The number of sandboxes can be high if every individual user has their own, so the most important factor is the price per terabyte of storage. Price is closely related to the next point.

Storage scalability - Each sandbox can store large fact tables and even several copies of them. It is important to design the sandboxes with this in mind and to expect significant storage needs.

Support for a variety of tools - Each user may be familiar with different analytic languages and tools, such as SQL, R, Python, MapReduce, Shark, SAS or Excel.

Extended security - Sandboxes have to be separated, and users should not be allowed to access each other's sandbox, as some sandboxes can contain personally identifiable information (PII) or other sensitive data.

Parallel processing - It is expected that many concurrent queries will run on the platform.

With its support for a variety of tools, linear storage and performance scalability, YARN and fine-grained HDFS security, Hadoop is a suitable platform for building analytic sandboxes. On the other hand, working with Hadoop requires advanced knowledge of it, so it is important to identify the analytic sandbox users and discuss their needs with them.

Otherwise, choosing Hadoop can cause problems with the usage of the platform. If processing a large amount of data is required and the users are familiar with RDBMSs, a commercial distributed database platform such as Teradata can be more suitable.

5 Conclusion

The goal of this thesis was to identify the problems of enterprise data warehouses when processing large volumes of data, to propose a complementary architecture that solves the identified problems, and to sketch an implementation of the key parts of Kimball's architecture and other EDW processes in the proposed architecture.

I have identified the problems of processing big data with traditional technologies and summarized the requirements on a new enterprise data warehouse architecture, such as linear storage and performance scalability, the ability to process structured and unstructured data, a relatively low price, and support for extract, transform and load tools and business intelligence tools. Then the key features of a traditional data warehouse based on Kimball's architecture were identified. Those features are explained and accompanied by code examples, including a hand-coded ETL process implementable in Hadoop, an implementation of the star schema, and data management processes such as data quality validation, data archiving or data profiling. Based on these requirements, I have proposed an architecture that exploits Hadoop and related tools, namely Hive, Impala, Kylin or Sqoop. Such an architecture allows implementing data reporting, analysis and dashboards, including online analytical processing over large amounts of data. The thesis concludes with an explanation of several business use cases of Hadoop in an enterprise data warehouse environment.

This thesis can be used as a guide for Hadoop tool selection, for Hadoop integration into a data warehouse, and for a stand-alone implementation. It also shows the integrability of Hadoop with the common architectural components of enterprise data warehouses. Hadoop as an open source project is still in continuous development, and it is hard to define best practices for data integration. Even with a growing community of developers, the maturity of Hadoop is still several years behind commercial RDBMSs as far as data warehousing and business intelligence are concerned.

Future work related to this topic could be a complete Hadoop integration into a data warehouse in an audited environment, to discover whether Hadoop can meet all the necessary requirements on security and other business processes in a publicly traded company. Further optimization of the existing algorithms in Hadoop is also necessary. As agile development methodologies are on the rise even in data warehousing, it would also be interesting to investigate how Hadoop can be used in agile data warehouse development.

Bibliography

[1] MANYIKA, J.; CHUI, M.; BROWN, B.; et al.: Big data: The next frontier for innovation, competition, and productivity [online]. URL < Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx>

[2] RUSSOM, P.: Evolving Data Warehouse Architectures In the Age of Big Data [online].

[3] Economist: Data, data everywhere [online].

[4] MERV, A.: Analytic Platforms: Beyond the Traditional Data Warehouse [online]. URL <\url{ Analytics pdf}>

[5] MAYER-SCHONBERGER, V.; CUKIER, K.: Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray, 2013, ISBN

[6] INMON, W. H.: Building the Data Warehouse, 4th Edition. Indianapolis: WILEY, 2005, ISBN

[7] KIMBALL, R.; CASERTA, J.: The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Indianapolis: WILEY, 2005, ISBN

[8] KIMBALL, R.; ROSS, M.: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition. Indianapolis: WILEY, 2013, ISBN

[9] KIMBALL, R.; ROSS, M.; THORNTHWAITE, W.; et al.: The Data Warehouse Lifecycle Toolkit, 2nd Edition. New York: WILEY, 2010, ISBN

[10] THOMSEN, E.: OLAP Solutions: Building Multidimensional Information Systems, 2nd Edition. Indianapolis: WILEY, 2002, ISBN

[11] WHITE, T.: Hadoop: The Definitive Guide, 3rd Edition. Massachusetts: O'REILLY, 2012, ISBN

[12] SHVACHKO, K. V.: HDFS scalability: the limits to growth [online]. URL < /openpdfs/shvachko.pdf>

[13] Apache: HDFS Design [online]. URL < design.html>

[14] CAPRIOLO, E.; WAMPLER, D.; RUTHERGLEN, J.: Programming Hive. Massachusetts: O'REILLY, 2012, ISBN

[15] Technologies, D.: Designing Performance-Optimized ODBC Applications [online]. URL < Documents/ODBC/Tutorial/odbcdesignperformance.pdf>

[16] Technologies, D.: Designing Performance-Optimized JDBC Applications [online]. URL < Documents/JDBC/Tutorials/jdbcdesign2.pdf>

[17] Qubit: Hive jdbc storage handler [online].

[18] BRADLEY, C.; HOLLINSHEAD, R.; KRAUS, S.; et al.: Data Modeling Considerations in Hadoop and Hive [online].

[19] GATES, A.: Programming Pig. Massachusetts: O'REILLY, 2011, ISBN

[20] EBAY: Kylin: Hadoop OLAP Engine - Tech deep dive [online].

[21] ALLOUCHE, G.: 10 Best Practices for Apache Hive [online].

[22] Berkeley: Big Data benchmark [online].

[23] KUCERA, R.: Using Hadoop as a platform for Master Data Management [online].

[24] cloumon-oozie contributors: cloumon-oozie [online].

[25] LINDSEY, E.: Three-Dimensional Analysis - Data Profiling Techniques. Data Profiling LLC, 2008, ISBN

[26] LOSHIN, D.: Master Data Management. The MK/OMG Press, 2008, ISBN

[27] Contributors, C.: LanguageManual UDF [online]. URL < LanguageManual+UDF>

[28] LOSHIN, D.: The Practitioner's Guide to Data Quality Improvement (The Morgan Kaufmann Series on Business Intelligence). San Francisco: Morgan Kaufmann, 2008, ISBN

[29] BOLLINGER, J.: Bollinger on Bollinger Bands. New York: McGraw Hill, 2002, ISBN

[30] Oracle: ORACLE BIG DATA CONNECTORS [online]. URL < database-technologies/bdc/big-data-connectors/overview/ds-bigdata-connectors pdf?sssourcesiteid=ocomen>

[31] RevolutionAnalytics: RHadoop [online]. URL < wikil>

[32] Apache: Storm Documentation [online].

[33] Apache: Spark Streaming Programming Guide [online].

[34] Apache: Apache Kafka: a high-throughput distributed messaging system [online].

[35] Apache: Spark Overview [online].

[36] Hortonworks: Cluster planning guide [online]. URL < /bk_cluster-planning-guide/bk_cluster-planningguide pdf>

A CD content

The CD contains the following:

The thesis in PDF format.


More information

Ramesh Bhashyam Teradata Fellow Teradata Corporation [email protected]

Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com Challenges of Handling Big Data Ramesh Bhashyam Teradata Fellow Teradata Corporation [email protected] Trend Too much information is a storage issue, certainly, but too much information is also

More information

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. [email protected], twicer: @awadallah

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. [email protected], twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013 Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Data warehousing with PostgreSQL

Data warehousing with PostgreSQL Data warehousing with PostgreSQL Gabriele Bartolini http://www.2ndquadrant.it/ European PostgreSQL Day 2009 6 November, ParisTech Telecom, Paris, France Audience

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Datenverwaltung im Wandel - Building an Enterprise Data Hub with Datenverwaltung im Wandel - Building an Enterprise Data Hub with Cloudera Bernard Doering Regional Director, Central EMEA, Cloudera Cloudera Your Hadoop Experts Founded 2008, by former employees of Employees

More information

SAS BI Course Content; Introduction to DWH / BI Concepts

SAS BI Course Content; Introduction to DWH / BI Concepts SAS BI Course Content; Introduction to DWH / BI Concepts SAS Web Report Studio 4.2 SAS EG 4.2 SAS Information Delivery Portal 4.2 SAS Data Integration Studio 4.2 SAS BI Dashboard 4.2 SAS Management Console

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

Business Intelligence for Big Data

Business Intelligence for Big Data Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information

SQL Server Administrator Introduction - 3 Days Objectives

SQL Server Administrator Introduction - 3 Days Objectives SQL Server Administrator Introduction - 3 Days INTRODUCTION TO MICROSOFT SQL SERVER Exploring the components of SQL Server Identifying SQL Server administration tasks INSTALLING SQL SERVER Identifying

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

Data Integration Checklist

Data Integration Checklist The need for data integration tools exists in every company, small to large. Whether it is extracting data that exists in spreadsheets, packaged applications, databases, sensor networks or social media

More information

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

Bringing the Power of SAS to Hadoop. White Paper

Bringing the Power of SAS to Hadoop. White Paper White Paper Bringing the Power of SAS to Hadoop Combine SAS World-Class Analytic Strength with Hadoop s Low-Cost, Distributed Data Storage to Uncover Hidden Opportunities Contents Introduction... 1 What

More information

Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization

Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization A Technical Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy February 2015 Sponsored

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

BUILDING OLAP TOOLS OVER LARGE DATABASES

BUILDING OLAP TOOLS OVER LARGE DATABASES BUILDING OLAP TOOLS OVER LARGE DATABASES Rui Oliveira, Jorge Bernardino ISEC Instituto Superior de Engenharia de Coimbra, Polytechnic Institute of Coimbra Quinta da Nora, Rua Pedro Nunes, P-3030-199 Coimbra,

More information

Big Data and Market Surveillance. April 28, 2014

Big Data and Market Surveillance. April 28, 2014 Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Data Warehousing Systems: Foundations and Architectures

Data Warehousing Systems: Foundations and Architectures Data Warehousing Systems: Foundations and Architectures Il-Yeol Song Drexel University, http://www.ischool.drexel.edu/faculty/song/ SYNONYMS None DEFINITION A data warehouse (DW) is an integrated repository

More information

Big Data Can Drive the Business and IT to Evolve and Adapt

Big Data Can Drive the Business and IT to Evolve and Adapt Big Data Can Drive the Business and IT to Evolve and Adapt Ralph Kimball Associates 2013 Ralph Kimball Brussels 2013 Big Data Itself is Being Monetized Executives see the short path from data insights

More information

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?

More information

Data Warehouse Overview. Srini Rengarajan

Data Warehouse Overview. Srini Rengarajan Data Warehouse Overview Srini Rengarajan Please mute Your cell! Agenda Data Warehouse Architecture Approaches to build a Data Warehouse Top Down Approach Bottom Up Approach Best Practices Case Example

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK [email protected] Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Big Data at Cloud Scale

Big Data at Cloud Scale Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For

More information

Big Data and Apache Hadoop Adoption:

Big Data and Apache Hadoop Adoption: Expert Reference Series of White Papers Big Data and Apache Hadoop Adoption: Key Challenges and Rewards 1-800-COURSES www.globalknowledge.com Big Data and Apache Hadoop Adoption: Key Challenges and Rewards

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Big Data for the Rest of Us Technical White Paper

Big Data for the Rest of Us Technical White Paper Big Data for the Rest of Us Technical White Paper Treasure Data - Big Data for the Rest of Us 1 Introduction The importance of data warehousing and analytics has increased as companies seek to gain competitive

More information

Monitoring Genebanks using Datamarts based in an Open Source Tool

Monitoring Genebanks using Datamarts based in an Open Source Tool Monitoring Genebanks using Datamarts based in an Open Source Tool April 10 th, 2008 Edwin Rojas Research Informatics Unit (RIU) International Potato Center (CIP) GPG2 Workshop 2008 Datamarts Motivation

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt Big Data Analytics in LinkedIn by Danielle Aring & William Merritt 2 Brief History of LinkedIn - Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/) - 2005: Introduced first business lines

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information