WHITE PAPER

Deriving Intelligence from Large Data Using Hadoop and Applying Analytics

Abstract

This white paper discusses the challenges of large-scale data processing, along with approaches and solutions for managing, structuring, and applying analytics to voluminous data in order to draw valuable insights and business intelligence. The paper also elaborates on using the Hadoop ecosystem to derive structure from large-scale data.

Impetus Technologies, Inc.
www.impetus.com
October 2010

Table of Contents

Introduction
Challenges facing large data management
Finding solutions to large data challenges
BI implementation approaches over large data
    Approach 1: Analytics over Hadoop with MPP DWs
    Approach 2: Indirect Analytics over Hadoop
    Approach 3: Direct Analytics over Hadoop
Case Study
Summary

Introduction

Large Data can be described as data that occupies very large storage space in the file system, which in the present context can range from one terabyte to petabytes. The world is currently witnessing an upsurge in digital data, and organizations require solutions that can help them extract valuable intelligence and insights from it. In this scenario, the role of Business Intelligence (BI) and analytics in drawing insights from large-scale data has increased exponentially.

Take the instance of financial data monitoring systems. Billions, if not trillions, of financial transactions take place every single hour. For an organization that provides a fraud detection system for financial transactions, detecting suspicious transactions across the huge amount of data generated would be nearly impossible without the right infrastructure.

Figure 1: Deriving Intelligence from Unstructured Data

For such enterprises, it is imperative to find the best solutions to:

- Store the data
- Cleanse and transform the data
- Apply analytics over this data set

The data has to be collected over a long period of time, maybe months or even years, from various distributed geographical locations. Enterprises therefore require an excellent and cost-effective framework for processing this data in a distributed manner on commodity hardware. The data has to be processed and summarized to the farthest possible limit, and then revisited and processed time and again. The system would also require mechanisms such as generating alerts on any dubious transactions. This is where BI and analytics come into the picture.

Organizations today are also driving their businesses based on the feelings and emotions of their customers and potential customers. While they have huge amounts of unstructured data available in the form of tweets, comments, postings, blogs, etc., it needs to be properly mined and analyzed to gauge the sentiments of customers and identify their compliments, comments, and grievances. Based on this feedback, organizations can decide the course of their business roadmaps. They can also carry out predictive analysis by identifying patterns in customer behavior as well as their preferences.

Clearly, every business domain can use the data that it already has, or wants to gather, in myriad useful ways and use it to create more business opportunities and ideas.

Challenges facing large data management

Large data is one of the biggest challenges facing organizations today, and applying analytics to it is an even more difficult job. Internet applications are showing nearly 100% growth every year, and enterprise applications are also gathering momentum. Thus, addressing the concerns related to large data has become a top priority.

Figure 2: Analytics Challenges

The main challenges associated with large enterprise data warehouses are as follows:

- Disparate sources: Data comes from many different sources; therefore, companies need to identify these data sources and the issues related to them. They have to perform cleansing operations to obtain sensible data and establish mechanisms to draw conclusions from it.

- Identification of useful data: The Extraction, Transformation, and Loading (ETL) process is very resource intensive. Companies need to throw away a lot of garbage data, correct missing information, and find replacements for certain fields that may be more useful for internal processing.

- Structuring data: A structure has to be derived out of the raw data so that analytics can happen.

- Storing data: One of the biggest challenges with large data is storing it. Deriving any sort of intelligence from data of this magnitude requires effective mechanisms for storage.

- Use of RDBMS: A lot of traditional data warehouses are built around RDBMS. With data warehouses growing to terabytes and beyond, conventional relational database management systems are not the best option for managing such large data sets.

- Minimizing response time: The time needed to perform analytics is very high. This time needs to be reduced for companies to derive results in the form of conclusions.

In order to overcome these challenges, organizations must choose the appropriate Business Intelligence strategy, based on factors such as ease and cost of implementation. Since gathering the right information from the data can be an expensive proposition, companies must select the right BI solution for their business needs to lower the impact on their IT budgets. They must also check how easily the solution can be implemented and start giving returns.

At the same time, organizations must identify whether the data being gathered and stored needs to be analyzed in real time, using high-touch queries, or whether it can be processed in batches, using a framework like Hadoop to produce results in an offline manner. Companies will additionally need to gauge the best strategy for storing their large data based on its size and usage. The data, available in different formats such as structured and unstructured forms, might intersect at some level and need to be combined to provide more useful business insights. The other challenges to consider are complex computations, data security, scalability, accessibility, and fault tolerance. Based on the business needs, companies have to weigh each of these challenges and identify the BI strategy that best suits them.

Properly utilized, large data can emerge as a differentiator for companies. Organizations can use it to gain an edge over the competition. By effectively managing and storing terabytes to petabytes of data, performing rich analysis on these massive data sets, and doing it all at ultrafast speeds, organizations can transform their voluminous data into a business asset.

Finding solutions to large data challenges

Although there are various solutions available for storing large amounts of data, Hadoop is one of the best options. Hadoop is a flexible infrastructure for large-scale computation and data processing on a network of commodity hardware. It is a common infrastructure pattern extracted from building distributed systems.

Hadoop takes a large piece of data, breaks it up into smaller pieces, and distributes them to the various nodes of the cluster. These nodes then process the pieces in parallel and independently, feeding the results back for aggregation. They follow the programming paradigm of MapReduce, a patented software framework introduced by Google to support distributed computing on large data sets on computer clusters. MapReduce is significant because it allows developers to create a large variety of parallel programs without having to worry about programming for intra-cluster communication, task monitoring, or task failure handling. Earlier, creating programs that handled these issues could consume literally months of programming time. MapReduce programs have been created for everything from text tokenization, indexing, and search to data mining and machine learning algorithms; a minimal word-count sketch appears after the list below.

The best part about Hadoop is that it is an open-source project initiated by Apache. It was created taking inspiration from Google's MapReduce and GFS papers. Today, Yahoo is one of the largest contributors to the evolution of Hadoop and is also responsible for getting Hadoop to its current state. Hadoop is used extensively by businesses because of its various advantages:

- Ability to scale linearly up to thousands of nodes
- Ability to use commodity hardware, leading to a huge cost advantage
- Flexibility, which gives Hadoop its real power and allows it to implement map and reduce processes to solve particular problems
- A simple programming and execution environment that accommodates failures
- Its distributed file system (HDFS) for storing data and the MapReduce execution paradigm
- The availability of tools built around Hadoop, such as Hive and Pig, that simplify writing MapReduce jobs
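To make the division of labor between the map and reduce phases concrete, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API. The input and output HDFS paths are hypothetical placeholders; the mapper emits (word, 1) pairs and the reducer sums them, while the framework handles distribution, shuffling, sorting, and failure recovery.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts for each word. The framework moves
        // all values for one key to the same reducer before this runs.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/raw"));      // hypothetical input
            FileOutputFormat.setOutputPath(job, new Path("/data/counts")); // hypothetical output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }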

Hive, for instance, is a database or data warehouse infrastructure that functions on top of Hadoop. It removes most of the complications of working with Hadoop directly and provides programmers with an easy interface to it. Hive is an effective tool for enabling easy ETL, generating summarizations, and putting structure on the data. It also has the capability to query and analyze large data sets stored in Hadoop files.

When Hadoop is used as an ETL tool, it enables the storage of huge amounts of unstructured data and also scales up massively over time as the inflow of data increases. Inputs arriving in raw format from various sources can be collected and transformed using Hadoop ETL. It is also possible to create a daisy chain of Hadoop nodes to crunch a huge data set. The structured data can then be utilized for performing analytics and reporting and for deriving conclusions (see the Hive sketch at the end of this section).

When choosing a BI strategy for your organization, there are a few things you should consider:

1. Overall ease and cost of implementation: Gathering the right information from the data can be an expensive proposition. The choice of BI solution will definitely impact IT budgets if not thought through prudently. Another consideration is the ease with which the solution can be implemented and start giving returns.

2. Real-time analysis vs. batch analysis: One needs to identify whether the data being gathered and stored needs to be analyzed in real time, using high-touch queries, or whether it can be processed in batches using a framework like Hadoop to produce results in an offline manner.

3. End-user ad hoc analytics: Every enterprise today has the ability to determine and define its requirements and needs. Service providers with a merely reactive approach to those needs can only play catch-up rather than acting as innovators. Supporting ad hoc analytics also gives a competitive advantage, wherein your smart customers have the option of gathering the data they need themselves.

Based on the business needs, you have to weigh each of these considerations and identify the right BI strategy for yourself. The next section elaborates on the popular approaches that are generally used for achieving BI implementations over large data.
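To illustrate the Hive-based ETL flow described above, the sketch below uses Hive's JDBC interface to impose a table structure on raw files already in HDFS and to materialize a summarized table. This is a minimal sketch, assuming a HiveServer2 endpoint; the host, credentials, table names, columns, and HDFS location are all hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveEtlSketch {
        public static void main(String[] args) throws Exception {
            // Assumes a HiveServer2 endpoint; host, database, and credentials
            // are hypothetical placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "etl_user", "");
                 Statement stmt = conn.createStatement()) {

                // Put structure on raw tab-delimited files already sitting in HDFS.
                stmt.execute(
                    "CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (" +
                    "  event_time STRING, user_id STRING, amount DOUBLE) " +
                    "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                    "LOCATION '/data/raw_events'");

                // Summarize the raw data into a compact table for analytics;
                // Hive compiles this into MapReduce jobs over the cluster.
                stmt.execute(
                    "CREATE TABLE IF NOT EXISTS daily_summary (" +
                    "  day STRING, user_id STRING, total DOUBLE)");
                stmt.execute(
                    "INSERT OVERWRITE TABLE daily_summary " +
                    "SELECT to_date(event_time), user_id, SUM(amount) " +
                    "FROM raw_events GROUP BY to_date(event_time), user_id");
            }
        }
    }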

BI implementation approaches over large data

Approach 1: Analytics over Hadoop with MPP DWs

Today, a lot of options are available in the market that allow the integration of MPP data warehouses (DWs) with Hadoop. This approach is worth considering for users who are left with a large amount of data even after applying summarization. Hadoop is used for cleaning and transforming the data into a structured form, which is then loaded into any of the available MPP DWs. While the data is being loaded, users can write UDFs to perform database-level analytics and then integrate the warehouse with BI solutions using ODBC/JDBC connectivity for end-user analytics and reporting.

Figure 3: Analytics over Hadoop with MPP DW

Using MPP DWs also allows companies to deploy various performance enhancement techniques such as index compression, materialized views, result-set caching, and I/O sharing. Alternatively, some MPP DWs may provide organizations with a robust framework that supports MapReduce job execution within their own clusters at MPP levels, providing two levels of parallel processing. This feature works well with high-touch queries and provides an excellent framework for end-user ad hoc analytics.
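The BI-facing half of this approach is standard ODBC/JDBC connectivity. The sketch below queries a summary table in an MPP DW over JDBC; the driver class, connection URL, credentials, and table are hypothetical placeholders to be replaced with the chosen warehouse vendor's specifics.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MppDwQuerySketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical driver and URL; substitute the vendor's JDBC driver
            // for whichever MPP DW is actually in use.
            Class.forName("com.example.mppdw.Driver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mppdw://dw-host:5480/analytics", "report_user", "secret");
                 Statement stmt = conn.createStatement();
                 // Ad hoc aggregate over the summarized data loaded from Hadoop.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT day, SUM(total) AS revenue " +
                     "FROM daily_summary GROUP BY day ORDER BY day")) {
                while (rs.next()) {
                    System.out.println(rs.getString("day") + "\t" + rs.getDouble("revenue"));
                }
            }
        }
    }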

Approach 2: Indirect Analytics over Hadoop

Another interesting approach is to use Hadoop for cleaning and transforming the data into a structured form and then loading it into RDBMS databases. Hadoop can efficiently move data between RDBMS data sources and Hadoop systems through its DBInputFormat and DBOutputFormat interfaces. Once the unstructured data is processed, it can then be pushed to an RDBMS database, which subsequently acts as a data source for any BI solution.

Figure 4: Indirect Analytics over Hadoop

This approach provides the end user with the parallel processing flexibility of Hadoop and an SQL interface at the summarized-data level. It works well when the summarized data is not too big to pose a challenge for the RDBMS database being used. This solution is not as expensive as the first approach. It is also suitable for high-touch queries where the user wants to perform real-time ad hoc analytics, as most RDBMS databases come with a comprehensive set of performance enhancement techniques.
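The export path described above can be sketched with Hadoop's DBOutputFormat, so that each record the reducer emits becomes an INSERT into an RDBMS table. This is a minimal sketch: the JDBC URL, credentials, and the word_summary table (which must already exist with matching columns) are hypothetical placeholders.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class HadoopToRdbms {

        // Record type that DBOutputFormat knows how to INSERT into the table.
        public static class SummaryRecord implements Writable, DBWritable {
            String word; int count;
            public SummaryRecord() {}
            SummaryRecord(String w, int c) { word = w; count = c; }
            public void write(PreparedStatement s) throws SQLException {
                s.setString(1, word); s.setInt(2, count);   // column order matches setOutput
            }
            public void readFields(ResultSet r) throws SQLException {
                word = r.getString(1); count = r.getInt(2);
            }
            public void write(DataOutput o) throws IOException {
                o.writeUTF(word); o.writeInt(count);
            }
            public void readFields(DataInput i) throws IOException {
                word = i.readUTF(); count = i.readInt();
            }
        }

        public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            protected void map(LongWritable k, Text v, Context ctx)
                    throws IOException, InterruptedException {
                for (String t : v.toString().split("\\s+"))
                    if (!t.isEmpty()) ctx.write(new Text(t), new IntWritable(1));
            }
        }

        // Each record the reducer emits becomes one row in the target table.
        public static class DbReducer
                extends Reducer<Text, IntWritable, SummaryRecord, NullWritable> {
            protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : vals) sum += v.get();
                ctx.write(new SummaryRecord(key.toString(), sum), NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical MySQL target; any JDBC-accessible RDBMS works the same way.
            DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://db-host:3306/analytics", "etl_user", "secret");
            Job job = Job.getInstance(conf, "summarize to RDBMS");
            job.setJarByClass(HadoopToRdbms.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(DbReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(SummaryRecord.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/raw"));  // hypothetical input
            // Target table "word_summary" with columns (word, count) must exist.
            DBOutputFormat.setOutput(job, "word_summary", "word", "count");
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }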

Approach 3: Direct Analytics over Hadoop

Analytics can also be applied directly over data in a Hadoop system, without moving it to any RDBMS database. Analyzing the data directly from the Hadoop file system can prove to be a very effective practice. Consider a scenario where even the processed and summarized data is very large and resides on the Hadoop system; in such a case, we do not want to get into the complications of moving the data out of Hadoop, either to an MPP DW or to an RDBMS. This can be done by using Hive as an interface to the data present on Hadoop systems.

Figure 5: Direct Analytics over Hadoop

This approach allows you to run batch and asynchronous analytics over the same data present on the Hadoop system. It is very cost effective, as it does not involve the expense of managing a separate data source beyond your existing Hadoop system. It also provides the flexibility of scaling to any level with your summarized data.
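In this approach, a BI client simply issues HiveQL, and Hive compiles it into MapReduce jobs that scan the HDFS files in place. Below is a minimal sketch, again assuming a HiveServer2 endpoint and reusing the hypothetical daily_summary table from the earlier ETL sketch.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DirectHiveAnalytics {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "analyst", "");
                 Statement stmt = conn.createStatement();
                 // Hive turns this query into MapReduce jobs over HDFS;
                 // no data leaves the Hadoop cluster.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT user_id, SUM(total) AS spend " +
                     "FROM daily_summary GROUP BY user_id " +
                     "ORDER BY spend DESC LIMIT 20")) {
                while (rs.next()) {
                    System.out.println(rs.getString("user_id") + "\t" + rs.getDouble("spend"));
                }
            }
        }
    }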

The Large Data Processing and BI Strategy Matrix below can be used as a guideline for choosing the right BI strategy for your business needs.

Figure 6: The Large-Scale BI Strategy

Case Study

One of the customers of Intellicus, a global leader in digital marketing intelligence, was facing the challenge of processing and aggregating statistical advertising data of more than GBs. It wanted to use this data for mining behavioral insights to help its clients better understand their own customers, and to leverage and profit from the rapidly evolving World Wide Web and mobile arena.

The problem was not limited to creating a storage system for this data; it also involved running analytics over it on a monthly basis. The intent was to extract specific patterns and then aggregate them based on complex parameters and weighting systems.

Impetus quickly realized that for deriving intelligence from data of this magnitude, a conventional relational database management system would be inadequate in the long run. We therefore developed and deployed an optimized Hadoop Java implementation of the product, tuning the Hadoop infrastructure configuration to get the maximum throughput.

Intellicus also used MapReduce and Tokyo Cabinet, a Berkeley DB-like file storage system, to handle the processing of large metadata (>1.2 GB). The Java-based implementation helped Intellicus achieve optimum results on the given small cluster.

Summary

Impetus believes that organizations have to choose the solutions that best suit their needs to solve their large data challenges. At the same time, they also have to select a BI strategy based on their business requirements. Nevertheless, it is desirable to have all the possible options available under the same hood, as this helps reduce the complications that arise when dealing with multiple alternatives to achieve a common goal. According to Impetus, the ideal solution is to provide easy interfacing of the data present on Hadoop, the RDBMS, and MPP DWs, and to create a common platform for doing BI and analytics over that data. When it comes to batch processing, having out-of-the-box scheduling and asynchronous execution of MR jobs and BI analytics components, such as dashboards and reports, is critical.

About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base, comprising large-scale ISVs and technology innovators, to deliver cutting-edge software products. Our expertise spans the domains of Data Analytics, Large Data Management, SaaS, Cloud Computing, Mobility Solutions, Testing, Performance Engineering, and Social Media, among others.

Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.252.7111
Email: inquiry@impetus.com
Regional Development Centers (INDIA): New Delhi, Indore, Hyderabad

Disclaimers

The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.