MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Hadoop as an Extension of the Enterprise Data Warehouse

MASTER THESIS

Bc. Aleš Hejmalíček

Brno, 2015

Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during its elaboration are properly cited and listed in complete reference to the due source.

Bc. Aleš Hejmalíček

Advisor: doc. RNDr. Vlastislav Dohnal, Ph.D.

Acknowledgement

First of all I would like to thank my supervisor Vlastislav Dohnal for his time, advice and feedback along the way. I would also like to thank my family for their continuous support throughout my university studies, and to apologize to my girlfriend for waking her late at night with loud typing. Last but not least, I thank my colleagues in the AVG BI team for respecting my time inflexibility, especially while I was finishing this thesis.

Abstract

The goal of this thesis is to describe the issues involved in processing big data and to propose and explain an enterprise data warehouse architecture capable of processing large volumes of structured and unstructured data. The thesis also explains how to integrate the Hadoop framework, as a part of the proposed architecture, into existing enterprise data warehouses.

Keywords

Hadoop, Data warehouse, Kimball, Analytic platform, OLAP, Hive, ETL, Analytics

Contents

Introduction
Analytic platforms introduction
New data sources
Data warehouse
Analytic platform
Extract, transform and load
Kimball's architecture
Dimensional modelling
Conformed dimensions
Surrogate keys
Fact tables
Business intelligence
Online analytical processing
Analytics
Reporting
Existing technologies in DW
Proposed architecture
Architecture overview
Multi-Platform DW environment
Hadoop
HDFS
YARN
MapReduce
Hive
HiveQL
Hive ETL
Impala
Pig
Kylin
Sqoop
Hadoop integration
Star schema implementation
Dimensions implementation
Facts implementation
Star schema performance optimization
Security
Data management
Master data
Metadata
Orchestration
Data profiling
Data quality
Data archiving and preservation
Extract, transform and load
Hand coding
Commercial tools
Business intelligence
Online analytical processing
Reporting
Analytics
Real time processing
Physical implementation
Hadoop SaaS
Hadoop commercial distributions
Other Hadoop use cases
Data lakes
ETL offload platform
Data archive platform
Analytic sandboxes
Conclusion

Introduction

Companies want to leverage emerging new data sources to gain market advantage. However, traditional technologies are not sufficient for processing large volumes of data or streaming real-time data. Hence, in the last few years many companies have invested in the development of new technologies capable of processing such data. These data processing technologies are generally expensive; therefore Hadoop, an open source framework for distributed computing, was developed. Lately, companies have adopted and integrated the Hadoop framework to improve their data processing capabilities, despite the fact that they already use some form of data warehouse. However, adopting Hadoop and other similar technologies brings new challenges for everyone involved in data processing, reporting and data analytics. The main issue is integration into an already running data warehouse environment: Hadoop is a relatively new technology, few business use cases and successful implementations have been published, and therefore no established best practices or guidelines exist. This is the reason why I chose this topic for my master's thesis.

The goal of my thesis is to propose a data warehouse architecture that allows processing of large amounts of data in a batch manner as well as processing of streaming data, and to explain techniques and processes for integrating the new system into an existing enterprise data warehouse. This includes an explanation of data integration following the best practices of Kimball's architecture, of the data management processes, and of the connection to business intelligence systems such as reporting or online analytical processing. The proposed architecture also needs to be accessible in the same way as existing data warehouses, so that data consumers can access the data in a familiar manner.

The first chapter introduces the problems of data processing and explains basic terms such as data warehousing, business intelligence and analytics. Its main goal is to present the new issues and challenges and the currently used technologies for data processing. The second chapter presents the requirements on the new architecture and proposes an architecture that meets them; it then describes the individual technologies and their characteristics, advantages and disadvantages. The third chapter focuses on Hadoop integration into an existing enterprise data warehouse environment. It explains data integration in Hadoop following Kimball's best practices from general data warehousing, such as star schema implementation, and the individual parts of the data management plan and processes. It further describes the implementation of the extract, transform and load process and the usage of business intelligence and reporting tools, and then focuses on the physical Hadoop implementation and the Hadoop cluster location. The last chapter explains other specific business use cases for Hadoop in data processing and data warehousing, as it is a well-rounded technology that can be used for different purposes.

1 Analytic platforms introduction

In the past few years, the amount of data available to organizations of all kinds has increased exponentially. Businesses are under pressure to retrieve information that could be leveraged to improve business performance and to gain competitive advantage. However, processing and data integration are getting more complex due to data variety, volume and velocity. This is a challenge for most organizations, as internal changes are necessary in the organization, data management and infrastructure.

Due to the adoption of new technologies, companies hire people with specific experience and skills in these technologies. As most of the technologies and tools are relatively young, the skills shortage gap is expected to grow. An ideal candidate should have a mix of analytic skills, statistics and coding experience, and such people are difficult to find. By 2018, the United States alone is going to face a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on an analysis of big data [1].

However, the data themselves are not new to organizations. In most cases, data from mobile devices, log data or sensor data have been available for a long time. A lot of those data were not previously processed and analyzed, even though the sources of massive amounts of data have existed for years. The major change in business perspective happened in the last several years with the development of distributed analytic platforms. Nowadays, the technologies are available to anyone at an affordable price, and many companies have started to look for business cases to build and use analytical platforms. Due to the demand, many vendors have created products to help businesses solve these issues. Companies such as Cloudera, IBM, Microsoft or Teradata offer solutions combining software, hardware and enterprise support. However, the initial price is too high for smaller companies, as the price of the product itself does not take into account a company's existing systems, additional data sources, processing or data integration. Data integration itself is a major expense in data processing projects.

Most technology companies already integrate data in some way and build data warehouses to ensure simple and unified access to data [2]. However, these data warehouses are built to secure precise reporting and usually use technologies such as relational database management systems (RDBMSs) that are not designed for processing petabytes of data. Companies build data warehouses to achieve fast, simple and unified reporting. This is mostly achieved by aggregating the data, which improves processing speed and reduces the required storage space. On the other hand, aggregated data do not allow complex analytics due to the lack of detail; for analytic purposes, data should have the same level of detail as the raw source data.

When building an analytic platform, or a solution complementary to an already existing data warehouse, a business must decide whether it prefers a commercial third-party product with enterprise support or whether it is able to build the solution in-house.

Both approaches have advantages and disadvantages. The main advantages of an in-house solution are the price and modifiability. On the other hand, it can be difficult to find experts with enough experience to develop and deliver an end-to-end solution and to provide continuous maintenance. The major disadvantage of buying a complete analytical platform is the price, as it can hide additional costs in data integration, maintenance and support.

1.1 New data sources

As new devices connect to the network, various data become available, and such a large amount of data is hard to manage and process within a tolerable time using common technologies and tools. The new data sources include mobile phones, tablets, sensors and wearable devices, whose numbers have grown significantly in recent years. All these devices interact with their surroundings or with web sites such as social media, and every action can be recorded and logged. New data sources have the following characteristics:

Volume - As mentioned before, the amount of data is rapidly increasing every year. According to The Economist [3], the amount of digital information increases tenfold every five years.

Velocity - Interaction with social media or with mobile applications usually happens in real time, causing a continuous data flow. Processing real-time data can help the business make valuable decisions.

Variety - The data structure can be dynamic and can change significantly with any record. Such data include XML, nested JSON and JSON arrays. Unstructured or semi-structured data are harder to process with traditional technologies, although they can contain valuable information.

Veracity - With different data sources, it is getting more difficult to maintain data certainty, and this issue becomes more challenging with higher volume and velocity.

With the increasing volume and complexity of data, integration and cleaning processes get more difficult, and companies are adopting new platforms and processes to deal with data processing and analytics [4, 5].

1.2 Data warehouse

A data warehouse is a system that processes and organizes data so that they are easily analysable and queryable. Bill Inmon defined a data warehouse in 1990 as follows: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." [6]

He defined the terms as follows:

Subject Oriented - Data that give information about a particular subject instead of about a company's ongoing operations.

Integrated - Data that are gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant - All data in the data warehouse are identified with a particular time period.

Non-volatile - Data are stable in a data warehouse. More data can be added, but historic data are never removed or modified. This enables management to gain a consistent picture of the business.

An enterprise data warehouse integrates and unifies all the business information of an organization and makes it accessible across the whole company without compromising security or data integrity. It allows complex reporting and analytics across different systems and business processes.

Analytic platform

An analytic platform is a set of tools and technologies that allow storing, processing and analysing data. Most companies want to focus on processing larger amounts of data, therefore a distributed system is the base of an analytical platform. Newer technologies are often used, such as NoSQL and NewSQL databases, advanced statistical tools and machine learning. Many companies providing analytic platforms focus on delivering as many integrated tools as possible in order to make the adoption of a new platform as smooth as possible. These platforms are designed for processing large amounts of data in order to make advanced analytics possible. Among others, advanced analytics includes the following cases:

Search ranking
Ad tracking
User experience analysis
Big science data analysis
Location and proximity tracking
Causal factor discovery
Social CRM
Document similarity testing
Customer churn or wake-up analysis

Extract, transform and load

Extract, transform and load (ETL) [7] is the process of moving data across systems and databases in order to make them easily analysable. ETL is mostly used in data warehousing, as data loaded into a data warehouse are typically transformed and cleansed to ensure the data quality needed for analysis. ETL describes three steps of moving data:

Extract - The process of extracting data from a source system, either directly from a database or through an API. Extraction can implement complex mechanisms that extract only the changes from a database; this is called change data capture and is one of the most efficient mechanisms for data extraction. Extraction also often includes archiving the extracts for audit purposes.

Transform - The transformation can implement any algorithm or transformation logic. Data are transformed so that they satisfy the data model and data quality needs of the data warehouse. This can include data type conversion, data cleansing or even fuzzy lookups. Usually data from several data sources are integrated together within this step.

Load - The process of loading transformed data into the data warehouse, usually into dimensional and fact tables.

Kimball's architecture

Ralph Kimball designed the individual processes and tasks within a data warehouse in order to simplify its development and maintenance [8]. Kimball's architecture covers the processes used in the end-to-end delivery of a data warehouse. The development process starts with gathering user requirements; the primary goal is to gather the metrics that need to be analyzed in order to identify data sources and design the data model. It then describes an incremental development method that focuses on continuous delivery, as data warehouse projects tend to be large and it is important to start delivering business value early in the development. The key features of Kimball's architecture are dimensional modelling and the identification of fact tables, which are described further in this thesis. Regarding the ETL process, Kimball described the steps of extraction, data transformation, data cleaning and loading, both into the stage as a temporary storage and into the final data warehouse dimensional and fact tables. All other processes around a data warehouse, such as security, data management and data governance, are included as well. All together, Kimball's architecture is a fairly simple and understandable framework for data warehouse development.

Dimensional modelling

Dimensional modeling [8] is a technique for designing simple and understandable data models. As most business users tend to describe the world in terms of entities such as product, customer or date, it is reasonable to model the data the same way.

A dimensional model is an intuitive implementation of a data cube whose edges are labeled, for example, product, customer and date. This implementation allows users to easily slice the data and break them down by different dimensions. Inside the data cube are the measured metrics; when the cube is sliced, metrics are shown, depending on how many dimensions are sliced. The implementation of a dimensional model is a star schema [8]. In a star schema, all dimensions are tied to fact tables [8], therefore it is easily visible which dimensions can be used for slicing.

Figure 1.1: Star schema example.

Dimensions can also contain different hierarchies and additional attributes. As a dimension needs to track history, Kimball defines several types [8] of dimensions. The most common are:

Type one - Does not track historical changes. When a record is changed in the source system, it is updated in the dimension. Therefore only one record is stored for each natural key.

Type two - Uses additional attributes, such as date effective from and date effective to, to track different versions of a record. When a record is changed in the source system, a new record is inserted into the dimension and the date effective to attribute of the old record is updated.

Type three - Used to track changes in defined attributes. If history tracking is needed, two separate attributes are created: one holds the current value and the other the previous value.
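As an illustration, a type-two dimension could be declared as in the following sketch, written in HiveQL-style SQL; the table and column names are hypothetical and only illustrate the attributes described above.

-- Hypothetical type-two customer dimension: each version of a customer
-- is a separate row identified by a surrogate key and validity dates.
CREATE TABLE dim_customer
(
  id_dim_customer      INT       COMMENT 'Surrogate key',
  customer_natural_key STRING    COMMENT 'Business key from the source system',
  customer_name        STRING,
  date_effective_from  TIMESTAMP COMMENT 'Start of validity of this version',
  date_effective_to    TIMESTAMP COMMENT 'End of validity; open-ended for the current version'
);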

Conformed dimensions

One of the key features of data integration are conformed dimensions [8]. These are dimensions that describe one entity the same way across all integrated data sources. The main reason for implementing conformed dimensions is that CRM, ERP or billing systems can have different attributes and different ways of describing business entities such as a customer. Dimension conforming is the process of taking all information about an entity and designing the transformation process so that data about the entity from all data sources are merged into one dimension. A dimension created using this process is called a conformed dimension. Using conformed dimensions significantly simplifies business data, as all people involved in the business use the same view of the customer and the same definition. This allows simple data reconciliation.

Figure 1.2: Dimension table example.

Surrogate keys

Surrogate keys [8] ensure the identification of individual entities in a dimension. Usually surrogate keys are implemented as an incremental sequence of integers. Surrogate keys are used because duplication of natural keys is expected in dimension tables, as changes over time need to be tracked; the surrogate key therefore identifies a specific version of a record. In addition, more than one natural key would have to be used in a conformed dimension, as the data may come from several data sources. Surrogate keys are predictable and easily manageable, as they are assigned within the data warehouse.

Fact tables

Fact tables are tables containing specific events or measures of a business process. A fact table typically has two types of columns: foreign keys of dimension tables and numeric measures.

In the ETL process, a lookup on the dimension tables is performed and the values or keys describing an entity are replaced by surrogate keys from the particular dimensions. A fact table is defined by its granularity and should always contain only one level of granularity; mixing granularities in a fact table could cause issues when aggregating measures. An example of fact tables with specific granularities is one table of sales orders and another of order items.

Figure 1.3: Fact table example.

While designing a fact table, it is important to identify the business processes that users want to analyze in order to specify the data sources needed. Then follows the definition of measures, such as sale amount or tax amount, and the definition of the dimensions that make sense within the business process context.
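A minimal fact table for the sales example above might look as follows. This is a HiveQL-style sketch; the table and column names are hypothetical.

-- Hypothetical fact table at order-item granularity:
-- one row per order item, with foreign keys to the dimensions
-- and numeric measures that can be aggregated.
CREATE TABLE fact_order_item
(
  id_dim_date     INT COMMENT 'Surrogate key of the date dimension',
  id_dim_customer INT COMMENT 'Surrogate key of the customer dimension',
  id_dim_product  INT COMMENT 'Surrogate key of the product dimension',
  sale_amount     DECIMAL(18,2),
  tax_amount      DECIMAL(18,2)
);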

Business intelligence

Business intelligence (BI) is a set of tools and techniques whose goal is to simplify querying and analysing data sets. Commonly, BI tools are used as the top layer of a data warehouse, accessible to a wide spectrum of users, as many BI techniques do not require advanced technological skills [9]. Business intelligence uses a variety of techniques; depending on the business requirements, a different tool or technique is chosen, so BI in companies usually combines various tools from different providers. Examples of BI techniques:

Online analytical processing (OLAP)
Reporting
Dashboards
Predictive analytics
Data mining

Online analytical processing

Online analytical processing (OLAP) [10, 9] is an approach that achieves faster responses when querying multidimensional data, which makes it one of the key features of a company's decision support system. The main advantage of OLAP is that it leverages the star or snowflake schema structure. Three types of OLAP exist. If the OLAP tool stores data in a special structure, such as a hash table (SSAS) or a multidimensional array, on the OLAP server, it is called multidimensional OLAP (MOLAP). MOLAP provides quick responses to operations such as slice, dice, roll-up or drill-down, as the tool is able to simply navigate through precalculated aggregations down to the lowest level. Besides MOLAP, two other types of OLAP tools exist. Relational OLAP (ROLAP) is based on querying data in a relational structure (e.g. in an RDBMS); ROLAP is not very common, as it does not achieve the query response times of MOLAP. The third type is hybrid OLAP (HOLAP), which is a combination of MOLAP and ROLAP. One popular implementation precalculates aggregations into MOLAP while keeping the underlying data stored in ROLAP; therefore, when a precalculated aggregation is requested, the underlying data do not have to be queried. OLAP is often queried via a multidimensional query language such as MDX, either directly or through an analytic tool such as Excel or Tableau. The user then works only with a pivot table and is able to query or filter aggregated and underlying data, depending on the OLAP definition. Some of the OLAP tools include:

SSAS by Microsoft
Mondrian, developed by Pentaho
SAS OLAP Server
Oracle OLAP

Analytics

Analytics is the process of discovering patterns in data and providing insight. As a multidisciplinary field, analytics includes methods from mathematics, statistics and predictive modeling to retrieve valuable knowledge. Considering data requirements, analyses such as initial estimate calculations or data profiling often do not require transformed data; they can be performed on slightly cleaned or even raw data. The chosen subset of data usually has a bigger impact on the result of an analysis. Therefore, the data quality processes are usually less strict than for data that are integrated into the EDW for reporting purposes. In general, however, it is better to perform analyses on cleansed and transformed data.

Reporting

Reporting is one of the last parts of a process that starts with discovering useful data sets within the company and continues through their integration into the EDW with an ETL process. This process, together with the reports, has to be well designed, tested and audited, as reports are often used for reconciliation of manual business processes. Reports can also serve as official data, for example for the stock market. Therefore, the cleansing and transformation process has to be well documented and precise. Due to the significant data quality requirements, ETL processing gets considerably more complex with increasing data volume, and a large amount of data needed for reporting can cause significant delivery issues.

1.3 Existing technologies in DW

There are many architectural approaches to building a data warehouse. For the last 20 years, many experts have been improving methodologies and processes that can be followed. These methodologies are well known and applicable with traditional business intelligence and data warehouse technologies. As the methodologies are common, companies developing RDBMSs, such as Oracle or Microsoft, have integrated functionality to make data warehouse development easier. There are also many ETL frameworks and data warehouse IDEs (such as WhereScape) that provide a higher level of abstraction for data warehouse development [7, 9]. Thanks to more than 20 years of continuous development and community support, the technology has been refined, so developers can focus on business needs rather than on the technology.

A lot of companies develop or maintain data warehouses built to integrate data from various internal systems such as ERP, CRM or back-end systems. Those data sources are usually built on RDBMSs, and the data are therefore structured and well defined; the ETL process is still necessary, however. These data are usually no larger than a few gigabytes a day, hence processing the data transformations is feasible. After the data are transformed and loaded into the data warehouse, they are usually accessed via BI tools for easier data manipulation by data consumers. Data from different sources are integrated either in the data warehouse, in data marts or in the BI layer, depending on the specific architecture.

TDWI performed relevant research on the architectural components used in data warehousing, with the following results.

Figure 1.4: Currently used components and plan for the next three years [2].

From the results, the following statements can be deduced:

EDWs and data marts are commonly used and will continue to be used in the future.
OLAP and tabular data are among the key components of BI.
The dimensional star schema is the preferred method of data modelling.
RDBMSs are commonly used, but it is expected that they will be used less in the next few years.

The usage of in-memory analytics, columnar databases and Hadoop is expected to grow.

However, not all companies are planning to adopt these technologies, as they are used mostly for specific purposes. It is nevertheless expected that the usage of new technologies will grow significantly, while the existing basic principles of data warehousing will prevail. Hence, it is important that new technologies can be fully integrated into the existing data warehouse architecture, both logical and physical.

Regarding data modelling, most current data warehouses use some form of hybrid architecture [2] that has its origin either in Kimball's architectural approach, often called the bottom-up approach and based on dimensional modeling, or in Inmon's top-down approach, which prefers building the data warehouse in third normal form [6]. The modeling technique highly depends on the development method used. Inmon's third normal form or Data Vaults are preferable for agile development; a Data Vault is a combination of both approaches. Kimball, on the other hand, is more suitable for iterative data warehouse development.

While integrating new data sources, a conventional data warehouse faces several issues:

Scalability - Common platforms for building data warehouses are RDBMSs such as Microsoft SQL Server or Oracle Database. The databases alone are not distributed, therefore adding new data sources can mean adding another server with a new database instance. In addition, new processes have to be designed to support such a distributed environment.

Price - RDBMSs can run on almost any hardware. However, for processing gigabytes of data while at the same time serving data for reporting and analytics, it is necessary to either buy a powerful server and software licenses or significantly optimize the DW processes; usually all three steps are needed. Unfortunately, licenses, servers and people are all expensive resources.

Storage space - One of the most expensive parts of a DW is storage. For the best performance, an RDBMS needs fast accessible storage. Considering a data warehouse environment where data are stored more than once in the RDBMS (e.g. in a persistent stage or an operational data store) and disks are set up in RAID for fault tolerance, the storage plan needs to be designed precisely to keep costs as low as possible. Data also need to be backed up regularly and archived.

In addition, historically not many sources were able to provide data in real time or close to real time, therefore the traditional batch processing approach was very convenient. However, the business needs output with the lowest latency possible, especially information for operational intelligence, as the business needs to respond and act. The most suitable process for batch processing is ETL. Nonetheless, with real-time streaming data it is more convenient to use extract, load and transform (ELT) processes in order to be able to analyze the data before long-running transformations start.

However, both ETL and ELT can be implemented for real-time processing with the right set of tools.

2 Proposed architecture

In order to effectively tackle the issues and challenges of current data warehouses and the processing of new data sources, the DW architecture has to be adjusted. The main requirements on the proposed architecture are:

Ability to process large amounts of data in batch as well as in real time.
Scalability up to petabytes of data stored in the system.
Distributed computing and linear performance scalability.
Ability to process structured and unstructured data.
Linear storage scalability.
Support for ETL tools and BI tools.
Relatively low price.
Data accessibility similar to RDBMSs.
Support for a variety of analytical tools.
Star schema support.

2.1 Architecture overview

The proposed architecture uses Hadoop [11] as a system complementary to the RDBMS, bringing new features that are useful for specific tasks that are expensive on an RDBMS. Hadoop is logically implemented in the same way as the RDBMS; however, it should be used mainly for processing large amounts of data and for streaming real-time data.

Figure 2.1: Diagram describing the proposed high-level architecture.

Adopting the Hadoop framework into an existing data warehouse environment brings several benefits:

Scalability - Hadoop supports linear scalability. By adding new nodes into the Hadoop cluster we can linearly scale both performance and storage space [12].

Price - As an open source framework, Hadoop is free to use. This does not mean it is inexpensive, as maintaining and developing applications on a Hadoop cluster, the hardware, and experienced people with knowledge of this specific technology are all costly. However, Hadoop is generally significantly cheaper per terabyte of storage than an RDBMS, as it runs on commodity hardware.

Modifiability - Another advantage of an open source project is modifiability. In case some significant functionality is not available, it is possible to develop it in-house.

Community - A lot of companies such as Yahoo, Facebook or Google contribute significantly to the Hadoop source code, either by developing and improving new tools or by publishing their libraries.

The usage of the individual Hadoop tools is described in the following diagram.

Figure 2.2: Diagram of Hadoop tools usage in the EDW architecture.

Hadoop                                   RDBMS
Open source                              Proprietary
Structured and unstructured data         Mostly structured data only
Less expensive                           Expensive
Better for massive full data scans       Uses index lookups
Support for unstructured data            Indirect support for unstructured data
No support for transaction processing    Support for transaction processing

Table 2.1: Hadoop and RDBMS comparison.

2.2 Multi-Platform DW environment

The existence of a core data warehouse is still crucial for reporting and dashboards, and it is also mostly used as a data source for online analytical processing [2]. In order to solve the issues mentioned before, it is convenient to integrate a new data platform that will support massive volumes and a wide variety of data.

What makes this task difficult is the precise integration of the new platform into an existing DW architecture. As data warehousing focuses on data consumers, they should be able to access the new platform the same way as the existing one and should feel confident using it. Using Kimball's approach for integrating data on both platforms gives users the same view of the data and also unifies the other internal data warehouse processes.

2.3 Hadoop

Hadoop is an open-source software framework for distributed processing which allows storing and processing vast amounts of structured and unstructured data cheaply. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. High availability is handled at the application layer, therefore Hadoop does not rely on hardware to secure data and processing. Hadoop delivers its service on top of a cluster of computers, each of which may be prone to failures. The framework itself consists of tens of different tools for various purposes, and the number of available tools is growing fast. The three major components of Hadoop 2.x are the Hadoop distributed file system (HDFS), Yet Another Resource Negotiator (YARN) and the MapReduce programming model. The tools most relevant to data warehousing and analytics include:

Hive
Pig
Sqoop
Impala

Some other tools are not part of Hadoop but are well integrated with the Hadoop framework (such as Storm).

HDFS

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce and runs on commodity hardware. "HDFS has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets." [13]

An HDFS cluster consists of a NameNode, which manages metadata, and DataNodes, which store the data. Typically each file is split into large blocks of 64 or 128 megabytes and then distributed to the DataNodes.

HDFS secures high availability by replicating blocks and distributing them to other nodes. When a block is lost due to a failure, the NameNode creates another replica of the block and automatically distributes it to a different DataNode.

YARN

YARN is a cluster management technology. It combines a central resource manager, which reconciles the way applications use Hadoop, with node manager agents that monitor processing on the individual DataNodes. The main purpose of YARN is to allow parallel access to and usage of a Hadoop system and its resources, as until Hadoop 2.x processing of parallel queries was not possible due to the lack of resource management. YARN opens Hadoop up for wider usage.

MapReduce

MapReduce is a software framework that allows developers to write programs that process massive amounts of structured or unstructured data in parallel across a distributed cluster. A MapReduce job is divided into two major parts:

Map - The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be different from each other.

Reduce - The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's sorting function. The number of Reducers does not depend on the number of Map functions.

The MapReduce framework works closely with Hadoop; however, the MapReduce programming paradigm can be used with any programming language.

Hive

Hive [11] is a data warehouse software for querying and managing large data sets stored in HDFS. Developers can specify the structure of tables the same way as in an RDBMS and then query the underlying data using the SQL-like language HiveQL [14]. Hive gives a developer the power to create tables over data in HDFS or over external data sources and to specify how these tables are stored. Hive metadata are stored in HCatalog, specifically in the Metastore database. A major advantage of the metastore is that its database can reside outside the Hadoop cluster; a metastore database located this way can be used by other services and survives a cluster failure. One of the most distinguishing features of Hive is that it validates the schema on read, not on write as RDBMSs do. Due to this behaviour, it is not possible to define referential integrity using foreign keys or even to enforce uniqueness.
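As a minimal illustration of schema-on-read, a table can simply be declared over files that already exist in HDFS; the declared column types are only applied when the data are read. The table name, file location and layout below are hypothetical.

-- Hypothetical table declared over tab-delimited log files already stored in HDFS.
-- Creating the table does not validate the files; values that do not match the
-- declared schema simply come back as NULL when queried.
CREATE EXTERNAL TABLE visit_log_raw
(
  email   STRING,
  country STRING,
  browser STRING,
  logtime STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/incoming/visit_log/';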

Clients can connect to Hive using two different drivers, ODBC and JDBC. ODBC is a standard written in C and C++ and is supported by the majority of client tools. JDBC, on the other hand, is based on the Java programming language, therefore some technologies, especially those from companies not developing in Java, lack native support. However, many tools have separate JDBC drivers that can be installed; for example, Microsoft SQL Server has a downloadable JDBC driver that supports SQL Server 2005 and up, and Oracle and MySQL databases have native support for JDBC drivers. Driver performance highly depends on the implementation of the driver used to connect to Hive [15, 16].

Figure 2.3: Hive connection diagram [11].

Tables in Hive can be internal or external. An internal table is completely managed by Hadoop, whereas an external table can be located elsewhere and only its metadata are stored in the Hive metastore. For querying data outside Hadoop, Hive uses storage handlers. Currently, support for a JDBC storage handler is not included in the official Hive release, but the code can be downloaded and compiled from an open-source project [17]. This feature gives Hive the ability to query data stored in databases through JDBC drivers; however, such external tables cannot be modified from Hive. Besides the JDBC driver, Hive supports external tables for HBase, Cassandra, BigTable and others.

Hive uses the HiveQL language to query the data [14]. Every query is first translated into Java MapReduce jobs and then executed. In general, Hive has not been built for quick iterative analysis but mostly for long-running jobs. The translation and query distribution itself takes around 20 seconds to finish. This makes Hive a poor fit for customer-facing BI tools such as reporting services, dashboards or OLAP with data stored in Hadoop, as every interaction, such as a refresh or a change of parameters, generates a new query to Hive and forces users to wait at least 20 seconds for any result.

Another Hive feature used in data warehousing and analytics is support for different file types and compressed files. This includes text files, binary sequence files, the columnar storage format Parquet and JSON objects. Another feature related to file storage is compression. Compressing files can save a significant amount of storage space, trading lower read and write times for CPU time; in some cases compression can improve both disk usage and query performance [18]. Text files containing CSV, XML or JSON can be parsed using different SerDe (serialisation and deserialisation) functions. Natively, Hive offers several basic SerDe functions and a RegEx SerDe for regular expressions. Open-source libraries for parsing JSON exist, although they are not included in official Hive releases, and those libraries often have issues with nested JSON or JSON arrays.

In data warehousing, data are mostly stored as a time series. Typically, every hour, day or week new data are exported from a source system for the particular period. In order to easily append data to Hive tables, Hive supports table partitioning. Each partition is defined by a meta column which is not part of the data files. After a new partition is added to an existing table, all queries automatically include the new partition. Performance-wise, a specific partition can be specified in the WHERE clause of a HiveQL statement in order to reduce the number of redundant reads, as Hive reads only the partitions that are needed. This is a useful feature in an ETL process, as an ETL run usually processes only a small set of data. Typically, a table would be partitioned by a date or a datetime, depending on the period of the data exports.

Hive also supports indices, which are similar to indices in RDBMSs. An index is a sorted structure that increases read performance. Using the right query conditions and index, it is possible to decrease the number of reads, as only a portion of a table or partition needs to be loaded and processed. Hive supports an index rebuild function on a table or on individual partitions.

HiveQL

HiveQL is the Hive query language, developed by Facebook to simplify data querying in Hadoop. As most developers and analysts are used to SQL, developing a similar language for Hadoop was very reasonable. HiveQL gives users SQL-like access to data stored in Hadoop. It does not follow the full SQL standard, but the syntax is familiar to SQL users. Among others, HiveQL supports the following features:

Advanced SQL features such as window functions (e.g. RANK, LAG, LEAD).
Querying JSON objects (e.g. using the get_json_object() function).
User defined functions.
Indexes.

On the other hand, HiveQL does not support DELETE and UPDATE statements.
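The following short queries sketch how partition pruning, JSON access and window functions might be combined in practice; the table and column names are hypothetical and serve only to illustrate the features listed above.

-- Scan only one daily partition of a hypothetical raw_events table
-- and pull a field out of a JSON payload column.
SELECT get_json_object(payload, '$.country') AS country,
       COUNT(*)                              AS events
FROM   raw_events
WHERE  load_date = '2015-04-01'   -- partition column: only this partition is read
GROUP BY get_json_object(payload, '$.country');

-- Rank countries by event count within each day using a window function.
SELECT load_date,
       country,
       events,
       RANK() OVER (PARTITION BY load_date ORDER BY events DESC) AS country_rank
FROM   daily_country_events;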

28 2. PROPOSED ARCHITECTURE Hive ETL This is an example of hand coded ETL in Hive. Browser and country dimensions are not included as they are identical to country. Individual parts of the code are commented. 1 PREPARATION 2 add t h i r d p a r t y JSON SerDe l i b r a r y 3 add j a r s3 :// elasticmapreduce/samples/hive ads/ l i b s /jsonserde. j a r ; 4 5 s t a g e f o r s t o r i n g temporary d a t a 6 CREATE TABLE s t a g e _ v i s i t _ l o g 7 ( 8 id_dim_ int, 9 id_dim_country int, 10 id_dim_browser int, 11 logtime timestamp 12 ) ; f i n a l f a c t t a b l e 15 CREATE TABLE f a c t _ v i s i t _ l o g 16 ( 17 id_dim_ i n t COMMENT Surrogate key to dimension, 18 id_dim_country i n t COMMENT Surrogate key to dimension, 19 id_dim_browser i n t COMMENT Surrogate key to dimension, 20 logtime timestamp COMMENT Time when user loged i n t o our s e r v i c e 21 ) 22 COMMENT Fact t a b l e containing a l l l o g i n s to our s e r v i c e 23 PARTITIONED BY( etl_timestamp timestamp ) 24 i n d i v i d u a l p a r t i t i o n f o r e a c h e t l i t e r a t i o n 25 STORED AS SEQUENCEFILE 26 s t o r i n g as a s e q u e n c e f i l e t o d e c r e a s e f i l e s i z e 27 ; f i n a l dimension e m a i l t a b l e 30 CREATE TABLE dim_ 31 ( 32 id_dim_ i n t COMMENT Surrogate key, 33 s t r i n g COMMENT F u l l address 34 ) 35 COMMENT dimension t a b l e 36 STORED AS SEQUENCEFILE 37 s t o r i n g as a s e q u e n c e f i l e t o d e c r e a s e f i l e s i z e 38 ; 21

29 2. PROPOSED ARCHITECTURE s t a g e e m a i l union t a b l e 41 CREATE TABLE s t a g e _ e m a i l _ a l l 42 ( 43 id_dim_ i n t COMMENT Surrogate key, 44 s t r i n g COMMENT F u l l address 45 ) 46 COMMENT dimension t a b l e 47 STORED AS SEQUENCEFILE 48 s t o r i n g as a s e q u e n c e f i l e t o d e c r e a s e f i l e s i z e 49 ; DATA LOADING 52 EXAMPLE OF ONE ITERATION c r e a t e e x t e r n a l t a b l e t o s3 f o l d e r 55 a l l f i l e s in t h e f o l d e r and u n d e r l y i n g f o l d e r s a r e p a r s e d and q u e r i e d 56 a l l o w s s i m p l e ETL j o b rerun 57 DROP TABLE IF EXISTS s o u r c e _ v i s i t _ l o g _ ; 58 CREATE EXTERNAL TABLE s o u r c e _ v i s i t _ l o g _ ( 60 s t r i n g, 61 country s t r i n g, 62 browser s t r i n g, 63 logtime s t r i n g 64 ) 65 row format serde com. amazon. elasticmapreduce. JsonSerde 66 with s e r d e p r o p e r t i e s ( paths = , country, browser, logtime ) 67 LOCATION s3 :// Incoming data/ v i s i t log / / 68 ; g e t maximum i d from e m a i l dimension 71 f i l l dimension e m a i l 72 INSERY OVERWRITE TABLE stage_ 73 SELECT 74 ROW_NUMBER( ) OVER ( PARTITION BY ORDER BY ) + max_id as id_dim_ 75, 76 FROM 77 ( 78 SELECT 79 DISTINCT l o g. as 80 FROM s o u r c e _ v i s i t _ l o g _ log 81 LEFT JOIN dim_ em 82 ON log. = em. 22

30 2. PROPOSED ARCHITECTURE 83 WHERE log. i s null 84 ) dist_ em 85 LEFT JOIN 86 (SELECT max( id_dim_ ) as max_id FROM dim_ ) max_id_tab 87 ; union s t a g e and dimension d a t a 90 INSERT OVERRIDE s t a g e _ e m a i l _ a l l 91 SELECT id_dim_ , FROM dim_ 92 UNION ALL 93 SELECT id_dim_ , FROM stage_ 94 ; s w i t c h s t a g e and dimension t a b l e 97 ALTER TABLE dim_ RENAME TO dim_ _old ; 98 ALTER TABLE s t a g e _ e m a i l _ a l l RENAME TO dim_ ; 99 ALTER TABLE dim_ _old RENAME TO s t a g e _ e m a i l _ a l l ; p e r f o r m s dimension l o o k u p s and l o a d t r a n s f o r m e d d a t a i n t o s t a g e t a b l e 102 a l l o w s s i m p l e ETL j o b rerun 103 INSERT OVERWRITE TABLE s t a g e _ v i s i t _ l o g 104 SELECT 105 ISNULL(em. id_dim_ , 1) as id_dim_ 106, ISNULL( c n t. id_dim_country, 1) as id_ dim_ country 107, ISNULL( br. id_dim_browser, 1) as id_dim_browser 108,CAST( logtime as timestamp ) as logtime 109 FROM s o u r c e _ v i s i t _ l o g _ log 110 LEFT JOIN dim_ em 111 ON ISNULL( log. , unknown ) = em LEFT JOIN dim_country c n t 113 ON ISNULL( log. country, unknown ) = cnt. country 114 LEFT JOIN dim_browser br 115 ON ISNULL( log. browser, unknown ) = cnt. browser 116 ; p a r t i t i o n swap i n t o f a c t t a b l e 119 ALTER TABLE f a c t _ v i s i t _ l o g DROP IF EXITS PARTITION ( etl_timestamp = ) ; ALTER TABLE f a c t _ v i s i t _ l o g EXCHANGE PARTITION ( etl_timestamp = ) WITH TABLE s t a g e _ v i s i t _ l o g ; 23

Impala

Impala is an open source distributed processing framework developed by Cloudera. Impala provides low-latency SQL queries on data stored in Hadoop. The low latency is achieved with in-memory computing. Impala stores data in memory in the Parquet columnar format, which makes it more suitable for querying large fact tables with a smaller number of columns. Another feature that distinguishes Impala from Hive is the query engine: Impala does not compile queries into MapReduce but uses its own execution model. Impala is closely tied to Hive because it uses the Hive metastore for storing table metadata. Impala can run on a Hadoop cluster without Hive, but the Hive metastore has to be installed. Impala accepts queries sent via the impala shell, JDBC, ODBC or the Hue console.

Pig

Pig [19] is a simple scripting language for data processing and transformations, developed mainly by Yahoo. The power of Pig and Hive is similar; the main difference is the audience for which each tool was developed. Pig is more intuitive for developers with experience in procedural languages, while Hive leverages knowledge of SQL. Pig's advantages:

Supports structured, semi-structured and unstructured data sets.
Procedural, simple, easy to learn.
Supports user defined functions.

One of its powerful features is the capability to work with a dynamic schema, meaning that Pig can parse unstructured data such as JSON without a predefined schema. This is helpful while working with dynamic nested JSON or JSON arrays.

DataJson = LOAD 's3://pig-test/test-data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Figure 2.4: Example of a Pig script for loading a complex JSON structure.

The choice between Pig and Hive highly depends on the audience. As most data warehouses are built on RDBMSs and the people involved in their development know SQL, Hive is the preferable tool for data transformation and data querying.

Kylin

A relatively new project that implements OLAP on Hadoop is Kylin [20]. Kylin originally started as an in-house project of ebay, but it was published as open source only a few months ago.

Kylin works in the following manner. Aggregations are calculated from Hive using HiveQL and stored in HBase. When data are queried, the Kylin query engine checks whether the requested data are precalculated in HBase and, if so, returns them from HBase with sub-second latency; otherwise the query engine routes the query to Hive. Kylin supports JDBC, ODBC and a REST API for client tools, therefore it is possible to connect from analytic tools such as Tableau, SAS or Excel.

Figure 2.5: Kylin high-level architecture [20].

Sqoop

Sqoop [11] is a Hadoop tool used for importing and exporting data. Different export modes are supported, such as a full export, an incremental export, or limiting the size of an export using a WHERE clause. Data can be exported from HDFS, Hive or any RDBMS that supports JDBC. Sqoop works bi-directionally, therefore it also supports importing into HDFS, Hive or an RDBMS. Data can also be exported into a delimited text file with specific field and row terminators, or into a sequence file. As a part of a distributed system, Sqoop also supports distributed and parallel processing. Sqoop includes additional functionality for Hive export and import in order to simplify data transfers with RDBMSs. The additional Hive support in Sqoop includes:

Incremental imports.
CREATE TABLE statements within imports.
Data import into a specific table partition.
Data compression.

For user interaction, Sqoop provides a simple command line interface.

3 Hadoop integration

The following chapters explain how to integrate Hadoop into an enterprise data warehouse, including the implementation of a star schema, data management and the physical implementation.

3.1 Star schema implementation

A logical star schema model captures the business point of view and is largely independent of the physical implementation platform. Hive provides almost the same SQL-like interface as common RDBMSs, therefore a data model built for an RDBMS data warehouse can be implemented in Hive with some adjustments. This thesis therefore focuses on the physical implementation rather than on the logical one. The main advantages of implementing a star schema in Hive are:

A simple and understandable view of the data.
Easy support for master data management.
Dimensions support user defined hierarchies.
Performance improvement [18].
Conformed dimensions can be shared across different platforms.

Dimensions implementation

One of the most challenging parts of implementing dimension tables in Hive is that the DML operations update and delete are not supported at the row level. Even though an append operation is available, it creates a new file with every append. This can cause significant issues with NameNode performance, as the number of small files can grow considerably and may lead to insufficient memory. Furthermore, only the first type of slowly changing dimension can be implemented without support for update and delete.

Because a dimension needs to keep the same surrogate keys and update is not available, the table has to be recreated in every ETL run with the same keys. Although auto-increment functionality has not been developed yet, other approaches exist. While inserting new rows into the table, we need to merge the existing dimension data with the new data, keeping the old keys and generating new keys for new records. One way to generate a sequence is to get the maximal key value from the existing data, use the UDFRowSequence function to generate a sequence starting at one, and then add the maximum key value to all generated keys. The same result can be achieved using the ROW_NUMBER() window function. Due to the lack of transactions in Hive, it is recommended to stage data from several data sources first and load them into the final dimension table at once, or to use other techniques that prevent concurrent runs of the dimension load.

An example of filling a dimension table this way is shown in the Hive ETL example in the previous chapter. However, as Hive does not support primary keys, it is better to develop a job that validates the uniqueness of surrogate keys in dimension tables (a small sketch of such a check is shown at the end of this section), as duplicated surrogate keys can lead to record duplication in the fact tables.

If a dimension is conformed, or at least shared across different data sources on different platforms, a more suitable solution is to generate the keys on a platform such as an RDBMS that has auto-increment functionality, supports transactions and ensures uniqueness. Basically, the RDBMS keeps the original dimension table and Hive holds only a read-only copy. One possibility for loading a dimension using an additional RDBMS is to take the records that should be inserted into the dimension and export them into a staging table on the RDBMS. The Sqoop tool can be used for the export from Hive.

Figure 3.1: Diagram describing the use of an RDBMS to assign surrogate keys.

sqoop import --connect "jdbc:sqlserver://edw.server.com;database=dw;username=sqoop;password=sqoop" --table dim_email --hive-import --incremental append --check-column id_dim_email --last-value 500 -- --schema dbo

Figure 3.2: Sqoop imports all records from dim_email with id_dim_email greater than 500.

Loading into the dimension in the RDBMS runs in a transaction, therefore concurrent inserts are excluded. When the dimension is loaded, it can be exported back to Hive. Depending on the size of the table, an incremental or a full export can be chosen. This export needs to be performed after each insertion into the dimension, as Hive has only a copy and it needs to be synchronized with the original.
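The surrogate-key validation job mentioned above can be as simple as the following HiveQL sketch, which assumes the dim_email table from the earlier example and reports any surrogate key that occurs more than once.

-- Any row returned here indicates a duplicated surrogate key
-- that must be resolved before the fact loads continue.
SELECT id_dim_email,
       COUNT(*) AS key_count
FROM   dim_email
GROUP BY id_dim_email
HAVING COUNT(*) > 1;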


More information

Traditional BI vs. Business Data Lake A comparison

Traditional BI vs. Business Data Lake A comparison Traditional BI vs. Business Data Lake A comparison The need for new thinking around data storage and analysis Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores Composite Software October 2010 TABLE OF CONTENTS INTRODUCTION... 3 BUSINESS AND IT DRIVERS... 4 NOSQL DATA STORES LANDSCAPE...

More information

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics

More information

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES Relational vs. Non-Relational Architecture Relational Non-Relational Rational Predictable Traditional Agile Flexible Modern 2 Agenda Big Data

More information

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence Appliances and DW Architectures John O Brien President and Executive Architect Zukeran Technologies 1 TDWI 1 Agenda What

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

SQL Server 2012 Business Intelligence Boot Camp

SQL Server 2012 Business Intelligence Boot Camp SQL Server 2012 Business Intelligence Boot Camp Length: 5 Days Technology: Microsoft SQL Server 2012 Delivery Method: Instructor-led (classroom) About this Course Data warehousing is a solution organizations

More information

Oracle Database 12c Plug In. Switch On. Get SMART.

Oracle Database 12c Plug In. Switch On. Get SMART. Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld Tapping into Hadoop and NoSQL Data Sources in MicroStrategy Presented by: Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop? Customer Case

More information

Agile Business Intelligence Data Lake Architecture

Agile Business Intelligence Data Lake Architecture Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step

More information

MySQL and Hadoop. Percona Live 2014 Chris Schneider

MySQL and Hadoop. Percona Live 2014 Chris Schneider MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

More information

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Luke Han Sr. Product Manager lukhan@ebay.com @lukehq Yang Li Architect & Tech Leader yangli9@ebay.com Agenda What s Apache Kylin? Tech Highlights Performance

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,

More information

Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com

Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com Challenges of Handling Big Data Ramesh Bhashyam Teradata Fellow Teradata Corporation bhashyam.ramesh@teradata.com Trend Too much information is a storage issue, certainly, but too much information is also

More information

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. aaa@cloudera.com, twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013 Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

Data warehousing with PostgreSQL

Data warehousing with PostgreSQL Data warehousing with PostgreSQL Gabriele Bartolini http://www.2ndquadrant.it/ European PostgreSQL Day 2009 6 November, ParisTech Telecom, Paris, France Audience

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Datenverwaltung im Wandel - Building an Enterprise Data Hub with Datenverwaltung im Wandel - Building an Enterprise Data Hub with Cloudera Bernard Doering Regional Director, Central EMEA, Cloudera Cloudera Your Hadoop Experts Founded 2008, by former employees of Employees

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

SAS BI Course Content; Introduction to DWH / BI Concepts

SAS BI Course Content; Introduction to DWH / BI Concepts SAS BI Course Content; Introduction to DWH / BI Concepts SAS Web Report Studio 4.2 SAS EG 4.2 SAS Information Delivery Portal 4.2 SAS Data Integration Studio 4.2 SAS BI Dashboard 4.2 SAS Management Console

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

Chapter 6. Foundations of Business Intelligence: Databases and Information Management Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Tap into Hadoop and Other No SQL Sources

Tap into Hadoop and Other No SQL Sources Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data

More information

Business Intelligence for Big Data

Business Intelligence for Big Data Business Intelligence for Big Data Will Gorman, Vice President, Engineering May, 2011 2010, Pentaho. All Rights Reserved. www.pentaho.com. What is BI? Business Intelligence = reports, dashboards, analysis,

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Big Data Introduction

Big Data Introduction Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights

More information

SQL Server Administrator Introduction - 3 Days Objectives

SQL Server Administrator Introduction - 3 Days Objectives SQL Server Administrator Introduction - 3 Days INTRODUCTION TO MICROSOFT SQL SERVER Exploring the components of SQL Server Identifying SQL Server administration tasks INSTALLING SQL SERVER Identifying

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

Data Integration Checklist

Data Integration Checklist The need for data integration tools exists in every company, small to large. Whether it is extracting data that exists in spreadsheets, packaged applications, databases, sensor networks or social media

More information

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to

More information

Bringing the Power of SAS to Hadoop. White Paper

Bringing the Power of SAS to Hadoop. White Paper White Paper Bringing the Power of SAS to Hadoop Combine SAS World-Class Analytic Strength with Hadoop s Low-Cost, Distributed Data Storage to Uncover Hidden Opportunities Contents Introduction... 1 What

More information

Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization

Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization Transparently Offloading Data Warehouse Data to Hadoop using Data Virtualization A Technical Whitepaper Rick F. van der Lans Independent Business Intelligence Analyst R20/Consultancy February 2015 Sponsored

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

BUILDING OLAP TOOLS OVER LARGE DATABASES

BUILDING OLAP TOOLS OVER LARGE DATABASES BUILDING OLAP TOOLS OVER LARGE DATABASES Rui Oliveira, Jorge Bernardino ISEC Instituto Superior de Engenharia de Coimbra, Polytechnic Institute of Coimbra Quinta da Nora, Rua Pedro Nunes, P-3030-199 Coimbra,

More information

Big Data and Market Surveillance. April 28, 2014

Big Data and Market Surveillance. April 28, 2014 Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem:

Chapter 6 8/12/2015. Foundations of Business Intelligence: Databases and Information Management. Problem: Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Chapter 6 Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

Data Warehousing Systems: Foundations and Architectures

Data Warehousing Systems: Foundations and Architectures Data Warehousing Systems: Foundations and Architectures Il-Yeol Song Drexel University, http://www.ischool.drexel.edu/faculty/song/ SYNONYMS None DEFINITION A data warehouse (DW) is an integrated repository

More information

Big Data Can Drive the Business and IT to Evolve and Adapt

Big Data Can Drive the Business and IT to Evolve and Adapt Big Data Can Drive the Business and IT to Evolve and Adapt Ralph Kimball Associates 2013 Ralph Kimball Brussels 2013 Big Data Itself is Being Monetized Executives see the short path from data insights

More information

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?

More information

Data Warehouse Overview. Srini Rengarajan

Data Warehouse Overview. Srini Rengarajan Data Warehouse Overview Srini Rengarajan Please mute Your cell! Agenda Data Warehouse Architecture Approaches to build a Data Warehouse Top Down Approach Bottom Up Approach Best Practices Case Example

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Big Data at Cloud Scale

Big Data at Cloud Scale Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For

More information

Big Data and Apache Hadoop Adoption:

Big Data and Apache Hadoop Adoption: Expert Reference Series of White Papers Big Data and Apache Hadoop Adoption: Key Challenges and Rewards 1-800-COURSES www.globalknowledge.com Big Data and Apache Hadoop Adoption: Key Challenges and Rewards

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Big Data for the Rest of Us Technical White Paper

Big Data for the Rest of Us Technical White Paper Big Data for the Rest of Us Technical White Paper Treasure Data - Big Data for the Rest of Us 1 Introduction The importance of data warehousing and analytics has increased as companies seek to gain competitive

More information

Monitoring Genebanks using Datamarts based in an Open Source Tool

Monitoring Genebanks using Datamarts based in an Open Source Tool Monitoring Genebanks using Datamarts based in an Open Source Tool April 10 th, 2008 Edwin Rojas Research Informatics Unit (RIU) International Potato Center (CIP) GPG2 Workshop 2008 Datamarts Motivation

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt

Big Data Analytics in LinkedIn. Danielle Aring & William Merritt Big Data Analytics in LinkedIn by Danielle Aring & William Merritt 2 Brief History of LinkedIn - Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/) - 2005: Introduced first business lines

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information