HADOOP BEST PRACTICES




TDWI RESEARCH
TDWI CHECKLIST REPORT

HADOOP BEST PRACTICES
For Data Warehousing, Data Integration, and Analytics

By Philip Russom

Sponsored by

tdwi.org

OCTOBER 2013

TDWI CHECKLIST REPORT
HADOOP BEST PRACTICES
For Data Warehousing, Data Integration, and Analytics
By Philip Russom

TABLE OF CONTENTS

FOREWORD
NUMBER ONE Plan how you will use Hadoop for business advantage.
NUMBER TWO Get to know the extended Hadoop ecosystem and what it can do for you.
NUMBER THREE Interface with HDFS data through Hadoop tools.
NUMBER FOUR Extend your data warehouse architecture with Hadoop.
NUMBER FIVE Embrace new best practices in data management as enabled by Hadoop.
NUMBER SIX Leverage Hadoop for relatively static, regularly repeated queries against massive data sets.
NUMBER SEVEN Get many uses from a few HDFS clusters.
NUMBER EIGHT Augment your Hadoop environment with special technologies for real-time data.
ABOUT OUR SPONSORS
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT THE TDWI CHECKLIST REPORT SERIES

555 S Renton Village Place, Ste. 700, Renton, WA 98057-3295
T 425.277.9126  F 425.687.2842  E info@tdwi.org  tdwi.org

© 2013 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

FOREWORD

According to a recent TDWI survey about Hadoop, only 10% of respondents report having the Hadoop Distributed File System (HDFS) in production today, while a whopping 63% expect to deploy HDFS within three years.[1] The survey shows that user organizations are aggressively adopting HDFS and other Hadoop technologies for data warehousing (DW), data integration, and analytics. A number of trends are driving Hadoop adoption.

Organizations want more business value from big data. Seventy percent of survey respondents say that Hadoop is a business opportunity, but only if big data is leveraged through analytics. In other words, business value takes the form of insights gained from analyzing big data managed on Hadoop. Other forms of value come from Hadoop's scalable data management, handling of diverse data types, and low cost compared to other data platforms.

Hadoop complements data warehouse platforms, data integration, and analytic tools. It handles massive data volumes and diverse, multi-structured data in scalable and cost-effective ways that traditional platforms cannot. Yet Hadoop lacks the SQL support, low latency, and ACID properties (atomicity, consistency, isolation, and durability) of traditional platforms. This is why Hadoop is a practical complement to traditional platforms, but cannot replace them.

Users are moving toward multi-platform environments for DW, data integration, and analytics. This is so users can choose the best platform for a given data workload or analytic goal, plus offload certain workloads from the data warehouse. Hadoop is a welcome addition to extended multi-platform architectures in data warehousing, data integration, and analytics because it excels with workloads for massive data, ETL, and new analytic algorithms.

Most of the organizations adopting Hadoop are completely new to it, so they need to educate themselves quickly about emerging best practices. This TDWI Checklist Report will assist with that education by beginning with an overview of the rapidly evolving Hadoop ecosystem. The checklist of best practices presented here can help users make sustainable decisions as they plan their first Hadoop deployments.

[1] See Figure 1 in the 2013 TDWI Best Practices Report Integrating Hadoop into Business Intelligence and Data Warehousing, available at tdwi.org/bpreports.

NUMBER ONE
PLAN HOW YOU WILL USE HADOOP FOR BUSINESS ADVANTAGE.

Look for a problem to solve or an opportunity to leverage. For example, some organizations need to have many years of data on spinning disk, ready for analytics and reporting, but many of the largest data warehouse environments have an upper limit for data volumes, whether it's based on a technical barrier or an economic one. HDFS can cost-effectively extend analytic data sets. As another example, many organizations have collected some big data but have not yet leveraged it. Implementing Hadoop can enable broad exploration of big data as a first step toward reaping business value. In yet another example, the data staging areas of many data warehouses (DWs) are stretched to the limit; Hadoop can handle the growing volumes of detailed source data that many DW staging areas manage and process, plus offload heavy transformation workloads (ETL) to free up DW capacity.

Involve business people in defining applications for Hadoop. If your organization has an existing program for data stewardship or governance, stewards and board members can tell you multiple ways that managing big data could be beneficial. Other people to involve include BI/DW sponsors, the chief information officer (CIO), and the heads of departments that can benefit most from big data (marketing, sales, Web, customer service, and operations).

Consider how others in your industry are leveraging big data and Hadoop. For example, supply chain focused industries such as manufacturing and retail have big data in the form of RFID and XML. These provide valuable information via analytics about products and their movement through a market. Text-laden industries (insurance and healthcare) can use text analytics to get value from large volumes of human-language text. Machine data streaming from robots in manufacturing and sensors on vehicles or packages in logistics firms can lead to improvements in product quality and operational efficiency. Learn the forms of big data your organization has and propose Hadoop-based solutions accordingly.

Identify how Hadoop data can integrate with other enterprise data for broader insights. Many organizations have built complete views of customers based on available enterprise data. Hadoop can extend such views by introducing additional information about customers, customer service, and new customer channels (such as social media or a mobile app). Customers aside, big data managed in Hadoop can also broaden views of products, suppliers, and employees, as well as low-level business operations and activities. Likewise, big data extends analytic applications that depend on large data samples (fraud, risk, customer segmentation).

NUMBER TWO
GET TO KNOW THE EXTENDED HADOOP ECOSYSTEM AND WHAT IT CAN DO FOR YOU.

Apache Hadoop is an open source software project administered by the Apache Software Foundation (ASF). The Apache Hadoop software library is a framework that enables the distributed processing of large data sets across clusters of computers, each offering local computation and storage.

A TDWI survey asked users with hands-on Hadoop experience which Hadoop products they are using today for applications in data warehousing, data integration, and analytics.[2] Their responses identify the five most common Hadoop products in use today:

The Hadoop Distributed File System (HDFS) is a file system (not a database) and therefore lacks capabilities associated with a database management system (DBMS), such as random access to data, support for standard SQL, and query optimization. However, HDFS does things DBMSs don't do as well, such as managing and processing massive volumes of file-based, multi-structured data.

MapReduce is a general-purpose execution engine that handles the complexities of parallel programming for a wide variety of hand-coded logic and other applications, which includes (but is not restricted to) analytics and ETL.

Hive projects structure onto Hadoop data as it scans, so the data can be queried using a SQL-like language called HiveQL.

HBase provides a few database functions for HDFS data. HBase is a simple record store, not a full-blown DBMS.

Pig provides an additional layer of abstraction that enables developers to design logic (specifically for execution by MapReduce) without having to hand code.

Besides the Hadoop products listed above, users will adopt others in coming years (according to the TDWI survey), especially Mahout, Zookeeper, and HCatalog.

Hadoop is an ecosystem of products and technologies. Hadoop products are available as open source from ASF. All can be downloaded at no cost from www.apache.org, and ASF encourages contributions from developers to the source code of all the open source products it manages.

Hadoop products are also available from software vendors. Several provide a distribution of HDFS, which is usually bundled with other Hadoop tools. A few vendors add value to HDFS and other Hadoop products by providing patches (for high availability or security) or additional tools (for administering HDFS clusters). Apache does not provide support and maintenance for Hadoop, but a few software vendors do. Vendor support and vendor-developed functionality help Hadoop achieve the enterprise readiness that many user organizations demand.

The Hadoop ecosystem is rounded out by software vendors that support interfaces to Hadoop. Thanks to this effort, Hadoop now integrates with a growing number of analytic tools, database management systems, reporting tools, and tools for data integration and extract, transform, and load (ETL). This support has made it much easier for Hadoop to play a beneficial role in established technology stacks for data warehousing, data management, analytics, reporting, and enterprise applications.

You can see that many parties are making substantive contributions to the development of the Hadoop product family and its surrounding ecosystem. As a result, Hadoop improves almost daily, making it an even more viable choice for enterprise use.

Hadoop products are almost always deployed in combinations. Some combinations are purely open source. For minimal DBMS functionality, users can layer HBase over HDFS. They can also layer a query framework such as Hive over HDFS or HBase. Note that some implementations of MapReduce require HDFS and others don't. The earliest analytic applications for Hadoop data (by early adopters such as Internet firms) were developed using purely open source Hadoop products. However, as Hadoop goes mainstream across many industries, an emerging best practice is to interface open source Hadoop with a growing variety of enterprise applications and data platforms (as explained in the next section of this report). Adding Hadoop to the extended tool and platform ecosystem adds value to these applications and extends their lives by providing a larger and more diverse store for all data, as well as new analytic functionality.

[2] Ibid, Figure 2.
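
To make the layering described above concrete, here is a minimal HiveQL sketch of how Hive projects structure onto files that already sit in HDFS and then answers a SQL-like query by compiling it into MapReduce jobs. The path, table name, and columns are hypothetical, not from the report.

    -- Hypothetical example: describe raw, tab-delimited log files already in HDFS
    -- as a table, without moving or converting the files themselves.
    CREATE EXTERNAL TABLE web_logs (
      log_time   STRING,
      user_id    STRING,
      url        STRING,
      bytes_sent BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/web_logs';

    -- HiveQL resembles SQL; Hive compiles this query into MapReduce jobs.
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url;

Because the table is EXTERNAL, dropping it removes only Hive's metadata; the underlying HDFS files stay where they are for other tools in the ecosystem to use.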

NUMBER THREE
INTERFACE WITH HDFS DATA THROUGH HADOOP TOOLS.

Integrating Hadoop into a multi-platform environment for warehousing, analytics, BI, data integration, and applications requires knowledge of the available interfaces. Today, most access to Hadoop data is through Hadoop tools that are layered over HDFS, namely MapReduce and tools that run atop MapReduce (Hive and Pig). In a practice that's common at early adopter Internet firms, developers access Hadoop data by hand coding data processing logic (to be executed by MapReduce) or routines in the Hive Query Language (executed by Hive). Note that Hadoop Pig is a high-level tool that enables developers to design data access logic for MapReduce, which otherwise would require code written in Java, C++, C#, Python, R, and so on.

Best practices for accessing Hadoop data are evolving away from hand-coded routines and toward solutions developed with vendor products. In fact, a wide variety of vendor products already support Hadoop by interfacing with MapReduce and Hive, including databases, analytic tools, applications, and tools for ETL and other forms of data integration. The best practice of interfacing with HDFS from a vendor tool or platform has advantages over open source Hadoop tools and hand-coded routines:

Hand-coded solutions are inherently slow to develop, time-consuming to test and debug, difficult to update or reuse, and costly (due to the high payroll of programmers). In Hadoop, coding involves languages that most data professionals don't know.

Compared to spartan open source Hadoop tools, mature vendor tools, platforms, and applications are feature-rich, with modern GUIs that foster collaboration, reuse, standards, and productivity for developers.

Today, most interfaces from vendor-built software to HDFS generate code that a Hadoop tool can execute. For example, for straightforward access to Hadoop data, a data visualization tool might generate HiveQL (or SQL, which is translated to HiveQL), then pass that to Hive for execution. If more extensive processing is needed (say, for analytic algorithms or ETL transformational processing), a tool might generate Java code that is optimized for MapReduce. Although code generation works for Hadoop interfaces, there is performance overhead involved in parsing and compiling the code. To avoid this overhead, the emerging best practice in vendor interfaces is to run natively in Hadoop tools (especially MapReduce) without generating code.
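
As an illustration of the code-generation path just described, the statement below is the sort of HiveQL a reporting or data visualization tool might generate (or translate from the SQL it normally emits) and hand to Hive for execution. The table and column names are hypothetical; a tool that runs natively in MapReduce would skip this generate-and-translate step and its parsing overhead.

    -- Hypothetical tool-generated query: the BI tool emits SQL, translates it to
    -- HiveQL, and submits it to Hive, which runs it as one or more MapReduce jobs.
    SELECT region,
           product_line,
           SUM(sale_amount) AS total_sales
    FROM sales_detail
    WHERE sale_date >= '2013-01-01'
    GROUP BY region, product_line
    ORDER BY total_sales DESC;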

NUMBER FOUR
EXTEND YOUR DATA WAREHOUSE ARCHITECTURE WITH HADOOP.

Hadoop can be the powerful platform that enables scalability and handles diverse data types for certain components of your DW architecture:

Data staging area. Much data processing occurs in a DW's staging area to prepare source data for specific uses (reporting, analytics, OLAP) and for loading into specific databases (DWs, marts, appliances). Much of this processing is done by homegrown or tool-based solutions for extract, transform, and load (ETL). Consider extending your ETL infrastructure by staging and processing a wide variety of big data on HDFS.

Operational data stores (ODSs). To free up capacity on a data warehouse, many organizations manage detailed source data on an ODS consisting of a standalone hardware server running a DBMS instance. There's a need for ODS platforms that cost-effectively handle massive data volumes and more diverse data, which Hadoop can do.

Online data archive. The source data stored long-term in ODSs can approach petabyte scale. Examples include call detail records in telco, sessionized clickstreams in e-commerce, and customer data in financial services. To cope with large volumes, some data is archived offline, which puts it beyond the reach of analytics. Hadoop can keep data online for constant access.

Analytic sandbox. Essentially, this is a 21st century, terabyte-scale data mart. Data analysts, data scientists, and other users need a place to collect large data sets for ad hoc analytics. This work can degrade DW performance and lead to an analytic silo, so it's best done with data in an analytic sandbox, which can be a governed, virtual area within HDFS.

Data integration sandbox. This resembles the analytic sandbox except that it's a work area for data integration specialists designing and testing large joins, aggregations, and transformational logic with big data.

Data lake. This is a pool of most of the data the business has collected for analytics. Instead of categorizing data by type and structure before storing it (which alters the usable content of data), data is left in the form in which it arrived at the lake, so all the source material is there for unforeseen analytic applications. Whether you call it a data lake, a logical data warehouse, or a virtual data warehouse, this is the direction many best practices in big data analytics are headed.

Note that the data warehouse components just listed are logical components that can be physically deployed to the central warehouse or to a variety of other data platforms. Imagine deploying all of them on HDFS, but with each logical component as a virtual data structure over a data lake. Because it is virtual, each data warehouse component is easily created and altered, and all virtual data structures can share data without replicating it.

A long tradition exists of transforming, remodeling, and tweaking data to optimize it for performance. Today, massively parallel processing (MPP) platforms (such as Hadoop and many relational DBMSs) scale and perform so well that on-the-fly aggregations and transformations return results in reasonable time frames. Likewise, virtual views of data in the lake perform well, making best practices in virtual warehousing more pragmatic than in the past.
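
A minimal HiveQL sketch of the virtual-component idea, with hypothetical names and paths: the raw files stay untouched in the data lake on HDFS, while lightweight Hive views define logical structures, such as an analytic sandbox, that can be created, altered, or dropped without replicating data.

    -- Raw call detail records land in the lake untouched.
    CREATE EXTERNAL TABLE lake_call_detail (
      call_time  STRING,
      caller     STRING,
      callee     STRING,
      duration   INT,
      cell_tower STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/lake/telco/call_detail';

    -- A virtual "analytic sandbox" component: just a view over the lake, so it can
    -- be created, altered, or dropped without copying the underlying data.
    CREATE VIEW sandbox_dropped_calls AS
    SELECT caller, cell_tower, call_time
    FROM lake_call_detail
    WHERE duration = 0;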

NUMBER FIVE
EMBRACE NEW BEST PRACTICES IN DATA MANAGEMENT AS ENABLED BY HADOOP.

When possible, take analytic logic to the data, not vice versa. For decades, most analytic tools required that data be transformed to a special model and moved to a special database or file prior to analysis. Given the volumes of today's big data, this is no longer feasible. However, Hadoop was designed for processing data in place. Think of how MapReduce and Hive access and process Hadoop data without moving or remodeling it first.

Once an information system is identified as a source for analytics, pre-load its data into Hadoop. In the long run, this is faster than retrieving the data prior to each run of an analytic process, especially if the data is voluminous.

Keep Hadoop data synchronized with other systems. Pre-loading data into HDFS means you must devise processes that keep Hadoop data up to date. Look for changed data capture functionality in data integration tools that interface with HDFS.

Operationalize the discoveries of analytics as permanent data structures in a data warehouse. It's ironic that data analysts, data scientists, and similar users scan gigantic volumes of data to understand a business problem (what's the root cause of the latest form of churn?) or opportunity (what new customer attributes are emerging?). Then they typically boil it all down to a relatively small data set expressed in a model that represents their epiphany. Too often, analysts share the epiphany with a few peers and managers, then move on to the next analytic assignment. Instead, analysts should always take the outcome of analytics to the BI and DW team in case the team sees the need to operationalize in reports what was initially discovered via analysis. For example, when analysis reveals a new form of churn, metrics for it should be added to the dashboards in which managers track churn. Likewise, when a new customer attribute is discovered, metrics and reports about customers should be updated so managers have complete information about customers.

Offload the data and processes for which DWs are not well suited. This includes data types that few DWs were designed for, such as detailed source data and any file-based data (logs, XML, text documents, unstructured data). It includes most ETL and data integration processes, especially those that must run at massive scale (e.g., aggregating tens of terabytes, sorting millions of call detail records). Hadoop is designed for these data types and operations, and Hadoop capacity is far less expensive than DW capacity. The real point, however, is that offloading allows the DW to do what it does best: provide squeaky-clean, well-modeled data with a well-documented audit trail for standard reports, dashboards, performance management, and OLAP. In turn, this preserves the business's investment in the warehouse, reduces the cost of future expansion, and redirects funds from doomed tasks (managing data that DWs were not designed for) to successful ones (providing quality data for report consumers and OLAP users).

Onboard new data sources quickly and without fear. DW professionals are often hesitant when it comes to integrating data from a new source into the warehouse because it takes time to model new data structures and design ETL jobs. In addition, disaggregating poor-quality or untrustworthy data from the DW's calculated values, time series, and dimensional structures is so difficult as to be impossible. With a data lake approach to HDFS, modeling and ETL are not required up front, and disaggregation can be as simple as altering virtual views or analytic algorithms so they ignore files containing questionable data.

Don't forget mainframe and legacy systems. These, too, have big data that could be pre-loaded to a data lake on Hadoop. Although new sources such as Web logs, machine data, and social media are key to capturing what happens before and after a transaction, in many organizations detailed transactional data is still captured and processed on mainframes. However, capacity on these systems is often so expensive as to preclude analytics, whereas cheap capacity on Hadoop makes analytics economically feasible. Hadoop does not support mainframe data natively, but some commercial data integration tools support both mainframes and Hadoop.
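
One simple way to act on the pre-load and keep-synchronized advice above is to land each source extract as new files in HDFS and register them as Hive partitions, so analytic jobs always see current data without reprocessing history. The HiveQL sketch below assumes a hypothetical daily order feed; the extract or changed data capture job that produces the files would be handled by a data integration tool.

    -- Hypothetical source feed pre-loaded into Hadoop, partitioned by load date.
    CREATE EXTERNAL TABLE src_orders (
      order_id    STRING,
      customer_id STRING,
      order_total DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/lake/src/orders';

    -- Each day, the extract job drops new files into a dated directory;
    -- registering the partition makes them queryable in Hive immediately.
    ALTER TABLE src_orders
      ADD IF NOT EXISTS PARTITION (load_date = '2013-10-15')
      LOCATION '/lake/src/orders/load_date=2013-10-15';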

NUMBER SIX
LEVERAGE HADOOP FOR RELATIVELY STATIC, REGULARLY REPEATED QUERIES AND OTHER WORKLOADS WITH MASSIVE DATA SETS.

Data management professionals depend heavily on Structured Query Language (SQL). Some of them can write optimal SQL code, and they want to leverage this valuable skill. However, most users depend on optimized SQL that is generated by vendor-built tools for reporting, OLAP, data integration, databases, and some forms of analytics. Among the many forms of advanced analytics, query-based analytics is one of the most popular right now; it involves large, complex SQL routines, sometimes consisting of hundreds of lines of standard SQL.

Although SQL is more important than ever, Hadoop is currently weak in its support for standard SQL. The Hive Query Language (HiveQL) resembles SQL, so data professionals can learn it easily. The catch is that HiveQL is not as feature-rich as SQL. A bigger challenge is one of tool compatibility, because most of the SQL that users need is generated by vendor tools. Many tools can now translate generated SQL into HiveQL and pass it to Hive, but this entails overhead. A better approach is for a tool to interoperate natively with the MapReduce framework, thereby avoiding code generation and the additional Hadoop tool layers needed to translate HiveQL, Pig, or Java into MapReduce calls.

In a related issue, MapReduce is inherently a batch process, which amounts to high data latency for queries, ETL processing, and analytics executed via MapReduce. Data latency is exacerbated by the gargantuan volumes managed in HDFS. Furthermore, HDFS is a file system, so it scans large amounts of data when queried, instead of the selective random access to data we expect from a query run in a relational DBMS.

Despite high latency and weak SQL support, queries are very powerful on Hadoop, whether working solely with Hive or also involving SQL-based vendor tools. TDWI has interviewed users who have complex queries in production and who also perform ad hoc queries against Hadoop data. The problem is that Hadoop queries tend to run slowly, some taking hours or days. The development process can also be slow, especially when it involves hand coding and the coordination of multiple tools.

Speed issues aside, some queries are ideally suited to Hadoop, especially those that sum instances of entities at scale. For example, some of the first production queries on Hadoop were at Internet firms that needed to count hits on Web pages, as seen in thousands of Web logs, each with thousands of click records. A leading telco scans millions of file-based call detail records in HDFS, summarizing traffic and switch activities. A smartphone manufacturer queries a few billion quality assurance records to correlate suppliers with bad supplies.

Some queries are not such a good fit with HDFS and Hadoop tools, particularly those that are iterative. For example, a data analyst practicing query-based analytics starts with an ad hoc query, looks at the results, then revises the query and runs it again. Many iterative revisions later, the analyst has a query result set that summarizes the thing he or she hoped to discover, such as the attributes of profitable customers, a list of problematic suppliers, or a leak in bottom-line costs. Ad hoc and iterative queries are possible in Hadoop, but the data analyst must expect long time frames between result sets.

To avoid these delays, an emerging best practice is to manage source data in Hadoop (similar to the data lake discussed earlier) but extract subsets of potentially relevant data and load them into a relational DBMS. That DBMS may be in the core warehouse, a data warehouse appliance, or a standalone system (such as a columnar database). Moving a subset of Hadoop data to a relational platform enables the data analyst to leverage existing tools and skills for SQL, and it makes the analyst more agile and productive by greatly reducing the waiting periods between iterative runs of queries.
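
A minimal sketch of the subset-extraction practice, with hypothetical names: the heavy, relatively static summarization runs once in Hadoop, where the detail lives, and only the much smaller result set is exported for loading into a relational DBMS (via a bulk loader or a Hadoop-to-RDBMS transfer tool), where iterative ad hoc SQL is fast. The lake_call_detail table is the hypothetical external table sketched under NUMBER FOUR.

    -- Heavy, repeatable summarization runs in Hadoop against the raw detail.
    -- The result is written as delimited text files in HDFS that a relational
    -- DBMS can load, so analysts iterate against the small subset instead of
    -- scanning millions of call detail records on every query revision.
    INSERT OVERWRITE DIRECTORY '/exports/churn_candidates'
    SELECT caller,
           COUNT(*) AS total_calls,
           SUM(CASE WHEN duration = 0 THEN 1 ELSE 0 END) AS dropped_calls
    FROM lake_call_detail
    WHERE call_time >= '2013-09-01'
    GROUP BY caller;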

NUMBER SEVEN
GET MANY USES FROM A FEW HDFS CLUSTERS.

Beware of analytic silos. Analytic applications tend to be departmental by nature. Sales and marketing need to own and control customer analytics, procurement owns supply chain analytics, and so on. As analytic applications flourish and department requirements become more important, TDWI is seeing HDFS clusters deployed per department. This results in the age-old data silo problem, but with new big data. Consolidating most analytic data into one HDFS cluster can reduce costs for redundant clusters and nodes. It also pools data for a data lake approach and provides the single version of the truth that credible decision making is based on.

Big data is too big to move. Leading goals for a Hadoop implementation should be to reduce the number of places data is stored and to establish methods that minimally transform or replicate data, as with the data lake, data virtualization, and the logical data warehouse. Moving big data across multiple clusters works against these goals and is not likely to succeed once data volumes swell.

YARN will improve Hadoop's concurrency. Version 1.0 of Hadoop offers limited concurrency in the sense of multiple processes running simultaneously (from MapReduce, Hive, and vendor tools). Version 2.0 is now available, and it includes a new layer called YARN, which provides greater concurrency and administration for multiple tools. In turn, YARN fosters a strategy of more simultaneous processes on fewer HDFS clusters.

Imagine Hadoop as shared infrastructure. In many organizations, IT has evolved into an infrastructure provider, and other teams deploy applications atop that infrastructure. In that spirit, central IT could supply HDFS, similar to the way it provides networks, storage subsystems, and racks of servers. HDFS would then be shared by teams for applications, warehousing, integration, analytics, and so on.

Hadoop isn't free, despite its open source origins. Setting up and maintaining an HDFS cluster is complex and has administrative payroll costs. Although HDFS runs well on commodity-priced hardware, acquisition and maintenance costs for hardware go up as the cluster grows. Organizations can control costs by consolidating redundant clusters. Similarly, they can reduce the number of nodes by using tools that are optimized for Hadoop; this allows them to get more out of individual nodes and therefore perform well with fewer nodes.

NUMBER EIGHT
AUGMENT YOUR HADOOP ENVIRONMENT WITH SPECIAL TECHNOLOGIES FOR REAL-TIME DATA.

Manage streaming data for new business insights. One of the toughest types of big data to process is streaming data, because it comes at you relentlessly from sensors, machines, devices, and applications. Yet streaming data is very promising for businesses, because it represents new, untapped data sources that can be analyzed to understand and improve operational efficiencies, Web behaviors, logistics, machine maintenance, and more.

Correlate real-time data with Hadoop data and enterprise data. Real-time data represents now: an event that just happened or the state of an entity that just changed, such as a customer touch point, the current location of a delivery truck, or a machine that suddenly needs maintenance. To understand the full relevance of an event that happened a moment ago, it's best to correlate data about the event with historical or seasonal data about that entity, as found in a warehouse, Hadoop, or an operational application. Batch-oriented, high-latency HDFS can barely capture data that streams in real time, much less process it in real time. However, special technologies for complex event processing (CEP), operational intelligence (OI), or in-memory analytics are known to make such correlations in true real time, in seconds or milliseconds.

Capture and store streaming data for offline analytics later. Streaming data should be captured and stored en masse for offline analytics later. Most events, messages, transactions, alerts, clicks, and so on in a stream have a record structure, and these can be captured and appended to a flat file. Hadoop excels in the management and analysis of such file-based data.

Enable many right-time speeds and frequencies. Most organizations today need fresher data that is collected, processed, and delivered more frequently than it was in the past, but that doesn't always mean true real time. For example, moving the refresh of reports and analytic models from overnight only to three times daily provides executives with data at a freshness level that's just right for the processes they manage. Hadoop can be adapted to some of the data techniques in use today for various right-time speeds, namely microbatches, data virtualization, data federation, and changed data capture.
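
As a sketch of the capture-now, analyze-later pattern (hypothetical paths and fields): an external collector appends streamed event records to flat files in hour-stamped HDFS directories; Hadoop itself does no real-time processing, but a partitioned external Hive table makes each completed hour available for offline, batch analysis.

    -- Streamed sensor events are appended to flat files by an external collector
    -- outside Hadoop; HDFS only stores them, and Hive batch-processes them later.
    CREATE EXTERNAL TABLE sensor_events (
      event_time STRING,
      device_id  STRING,
      metric     STRING,
      value      DOUBLE
    )
    PARTITIONED BY (event_hour STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/lake/streams/sensor_events';

    -- Once an hour's files are closed, register that hour for offline analytics.
    ALTER TABLE sensor_events
      ADD IF NOT EXISTS PARTITION (event_hour = '2013-10-15-14')
      LOCATION '/lake/streams/sensor_events/event_hour=2013-10-15-14';

    -- Right-time reporting: summarize the previous day's completed hours in batch.
    SELECT device_id, MAX(value) AS peak_reading
    FROM sensor_events
    WHERE event_hour LIKE '2013-10-14-%'
    GROUP BY device_id;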

ABOUT OUR SPONSORS

www.sap.com
As market leader in enterprise application software, SAP (NYSE: SAP) helps companies of all sizes and industries run better. From back office to boardroom, warehouse to storefront, desktop to mobile device, SAP empowers people and organizations to work together more efficiently and use business insight more effectively to stay ahead of the competition. SAP applications and services enable more than 248,500 customers to operate profitably, adapt continuously, and grow sustainably.

www.syncsort.com/bigdata
Syncsort provides data-intensive organizations across the big data continuum with a smarter way to collect, process, and distribute the ever-expanding data avalanche. With thousands of deployments across all major platforms, including mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer resources and lower TCO. For decades, Syncsort has been the undisputed leader in high-performance data processing technology for the mainframe and the fastest ETL software for Windows, Unix, and Linux. Thanks to breakthrough innovations and ongoing contributions to the Apache Hadoop open source community, organizations can now run the same technology natively within the MapReduce framework. The result is Syncsort DMX-h, high-performance software to collect, transform, and distribute all your data with Hadoop. DMX-h turns Hadoop into a more robust and feature-rich ETL solution, enabling users to maximize the benefits of MapReduce without compromising on the capabilities, ease of use, and typical use cases of conventional ETL tools. Accelerate your data integration initiatives and unleash Hadoop's potential with the only architecture that runs ETL processes natively within Hadoop.

ABOUT THE AUTHOR

Philip Russom is the research director for data management at The Data Warehousing Institute (TDWI), where he oversees many of TDWI's research-oriented publications, services, and events. He's been an industry analyst at Forrester Research and Giga Information Group, where he researched, wrote, spoke, and consulted about BI issues. Before that, Russom worked in technical and marketing positions for various database vendors. Over the years, Russom has produced more than 500 publications and speeches. You can reach him at prussom@tdwi.org.

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for business intelligence and data warehousing professionals worldwide. TDWI Research focuses exclusively on BI/DW issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence and data warehousing solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences, as well as strategic planning services to user and vendor organizations.

ABOUT THE TDWI CHECKLIST REPORT SERIES

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.