
EXECUTIVE SUMMARY

We live in a time of uncertainty for the traditional Enterprise Data Warehouse (EDW). The long-standing requirement to operationalise Business Intelligence (BI) has been accelerated by the needs of real-time operational decisioning. At the same time the EDW must cope with an explosion of data volume, high user expectations and the demands of data discovery.

Enter big data technology, which promises scalability, flexibility and lower cost to serve. Enterprises have begun by experimenting with big data platforms supporting an increasing number of point solutions. This is the origin of our big data refinery architecture, which allows for experimentation without disrupting the existing EDW.

Many customers have successfully used the big data refinery approach for their specific challenges, but as big data platforms develop enterprise-grade features, it becomes feasible for big data to play a bigger role in the EDW. This is the origin of the Data Lake or Data Reservoir.

So what should a Data Lake look like? What is the best blend of big data and traditional database tools for an organisation? How do you avoid being left behind by increasingly agile competition?

This paper considers four architecture models which put big data technology and the data asset itself increasingly at the centre of the enterprise. We look at the challenges each model can help to solve and the potential pitfalls business leaders will need to consider as they determine how best to embrace big data in the digital enterprise.

BAE Systems Applied Intelligence

The Enterprise Data Warehouse (EDW) has become a long-serving, business-critical capability for many organisations. The successful EDW should provide consistent, trusted data for accurate decision-making. However, most organisations are expressing frustrations with their data warehousing solutions due to:

Inflexibility of Evolved Platforms: An inability to deal with a changing business quickly and get data to where it is needed most. The shorter time-to-market for new products and services demands flexible solutions and rapid software development methods. Similarly, data discovery, which forms a key engine of innovation in the digital age, demands more responsive, higher capacity platforms for raw analysis.

Heightened User Expectations: Our personal use of smartphones and tablets has created heightened user expectations about the immediacy and intuitiveness of technology. Technology-driven EDW projects continue to fail, with a string of expensive projects that adopted a "build it and they will come" approach and failed to properly engage end users.

The Volume, Variety & Velocity of the Data: Moving into a world of interaction data, e.g. machine logs, clickstreams, sensor and appliance data (the "Internet of Things"), as well as semi-structured and unstructured data, has opened up opportunities for greater customer and operational insights but poses data processing and storage challenges for traditional database platforms. At the same time, the increasing desire for organisations to be able to respond to events quickly has fuelled requirements for personalised real-time decision support systems which still need to draw on a centralised single version of the truth.

Total Cost of Ownership: IT projects have come under increasing scrutiny to deliver lower cost to serve, where software licensing, data storage, high performance infrastructure and system maintenance are large contributors to TCO. Organisations are challenging their suppliers to step up or move aside as cloud and open source tools continue to mature their enterprise-readiness.

THE PROMISE OF BIG DATA

The emergence of big data technologies has offered some potential solutions to these challenges:

Scale-out storage in petabytes and beyond means that data which may contain critical business advantage can be retained cost-effectively.

Massively Parallel Processing (MPP) on grid computing enables data integration and processing of many sources with very high throughput rates, at a fraction of the cost of traditional MPP platforms.

Schema-on-Read offers the ability to define the data structure at query time, as opposed to load time. This means new data sources can be loaded in their native format and quickly made available for self-service discovery.

Complex data structures can be stored and processed efficiently, alleviating limitations of relational data.

Open Source software on commodity infrastructure helps to relieve licence and support costs.

Moreover, two of the major concerns around big data technology, read consistency and transaction support, do not typically apply to the EDW.

The Big Data Refinery

Our Big Data Refinery (BDR) model is a starting point that illustrates how a big data platform can be employed to unlock hidden value in a wide variety of data sources, and sit alongside the traditional EDW. Big data sources can be presented to data scientists and summarised as a source to the data warehouse. The primary intent of the BDR is to enhance an established, well-functioning warehouse, rather than to offer a complete answer to the challenges posed above.

[Figure: the Big Data Refinery, in three tiers: Sources; Storage & Processing; Presentation & Exploitation. Labels: big data sources, traditional data sources, online transaction processing, legacy systems; data refinery (filter / pre-processing, real time analytics platform, fast search and query platform, batch processing platform), Enterprise Data Warehouse, data marts; access and exploitation tools (bulk analytics, fixed and dynamic reporting, search, BI tools); Big Data Specialists, Traditional BI Analyst.]
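The Schema-on-Read idea described above can be illustrated with a minimal Python sketch. It is not drawn from any particular big data product: the sample clickstream records, the field names and the `read_with_schema` helper are all assumptions made for illustration. The point it shows is that the raw data is landed once in its native format, and each consumer applies its own structure only when it reads.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical raw events, landed exactly as received (JSON lines).
# Nothing is validated or reshaped at load time.
RAW_CLICKSTREAM = [
    '{"user": "u1", "page": "/home", "ms": "120"}',
    '{"user": "u2", "page": "/basket", "ms": "340", "campaign": "spring"}',
    '{"user": "u1", "page": "/checkout"}',
]

# A schema here is chosen by the reader: field name -> (converter, default).
Schema = Dict[str, Tuple[Callable[[Any], Any], Any]]


def read_with_schema(raw_lines: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Parse each raw line and project it onto the reader's schema at query time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if name in record else default)
            for name, (convert, default) in schema.items()
        }


# Two consumers ask different questions of the same raw data.
latency_schema: Schema = {"page": (str, None), "ms": (int, 0)}
campaign_schema: Schema = {"user": (str, None), "campaign": (str, "none")}

latency_rows = list(read_with_schema(RAW_CLICKSTREAM, latency_schema))
campaign_rows = list(read_with_schema(RAW_CLICKSTREAM, campaign_schema))
```

The second schema uses a field (`campaign`) that only some records carry; because the raw records were retained unchanged, that question can be asked without reloading or remodelling the data.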

Unsurprisingly, vendors of traditional tools are pushing this approach, with ever-improving integration with big data technologies. This will be the right choice for many customers, for example those looking to make a small investment to test the returns. This is far and away the most popular approach at present: more than 70% of organisations Gartner surveyed in 2013 were using big data for data marts [1]. Big data solutions have evolved rapidly over recent years with increasingly mature enterprise-grade features, strengthening their ability to support BI and analytics solutions.

Rise of the Data Lake

The Data Lake goes at least one stage beyond the BDR and becomes the initial landing point for enterprise data sources and externally gathered data. Like the BDR, it is underpinned by big data technologies, typically starting with Hadoop. This naturally raises questions about the role the Data Lake can play in the enterprise, such as "Can I host my whole Data Warehouse on Hadoop?" and "In what parts of the EDW will big data technology be most effective?"

We therefore suggest four models through which the Data Lake increasingly encompasses the EDW. There is a logical progression, but that does not imply the same model is ideal for every organisation; the choice depends on several factors.

1. The Active Archive Data Lake undertakes some responsibility for Extract, Transform and Load (ETL) and provides online access to historic data, both raw source information and data archived from the conventional relational stores. Through retention of source data in its native format, business questions can be asked in ways which were not envisaged when the data was written. Replicating the data with low latency to the Data Lake is a cheap way to alleviate the query load of Data Discovery from source systems. At the same time this brings the advantages of a distributed platform for the purpose of advanced analytics. For many years, we have implemented Historical Data Stores (HDS) in traditional EDWs. However, even then there is an analysis and development cost and lag to acquiring new data sources. Furthermore, the HDS pattern is generally unsuited to unstructured data.

[Figure: the Active Archive model. Labels: big data sources, conventional data sources; Data Lake (decision automation, archive, ETL staging, search, data discovery / data science); archive, load; conventional relational data stores (traditional EDW, data marts, OLAP cube); reporting, dashboard.]

[1] Gartner Magic Quadrant Data Warehouse DBMS Survey, Nov 2012 and Nov 2013, and Gartner presentation "What About the Data Warehouse? Start? Stop? Continue?", Mark Beyer, October 2014.

2. The Dual Warehouse Data Lake continues to act as an archive, but in addition presents a replica and extension of existing reporting structures to broaden the use cases it can fulfil. Since the Dual Warehouse replicates data, it can represent a transition step to one of the later models.

[Figure: the Dual Warehouse model. Labels: big data sources, conventional data sources; Data Lake (search, decision automation, archive, Hadoop DWH, data discovery / data science); replicate, load; conventional relational data stores (traditional EDW, data marts, OLAP cube); reporting, dashboard.]

3. The Hybrid Data Lake is an evolution of the Dual Warehouse; in this case the Operational Data Store, common to many EDW patterns, resides in the Data Lake, and a traditional relational database and OLAP tools are used for data marts.

[Figure: the Hybrid model. Labels: big data sources, conventional data sources; Data Lake (decision automation, Hadoop DWH, search, data discovery / data science); summarise; BI stores (data marts, OLAP cube); reporting, dashboard.]

4. The Enterprise Data Lake, as the endpoint of the evolution, serves all the BI and analytics needs of the organisation. This model is declining in popularity. Gartner's Data Warehouse inquiry data shows "the replacement idea is disappearing": 17% of organisations were considering replacing the EDW with a Big Data solution in 2010, but this had dropped to 3% by 2013 [2].

[Figure: the Enterprise Data Lake model. Labels: big data sources, conventional data sources; Data Lake (Hadoop DWH, decision automation, data marts, OLAP cube, Data Lake EDW, search, data discovery / data science); reporting, dashboard.]

Bridging the gap with Data Virtualisation

The role of Data Virtualisation (DV) is a vital consideration when deciding on an architecture. DV is a form of data integration that allows multiple data sources to be treated as one logical source, but this does not constitute a Data Lake. It offers a means to leverage capabilities of different underlying technologies by presenting an abstracted data access layer that reduces time-to-insight for BI and analytics solutions by accessing data directly at source. DV recognises the reality of a heterogeneous data landscape and allows for optimum tooling to be used in each case.

DV can be used in any of the models presented so far to hide the implementation of the Data Lake (and its evolution) from Data Discovery consumers, and potentially from reporting. It also allows for access to sources yet to be migrated into the Data Lake, or where low latency is an important requirement. DV solutions are not a panacea, however, and may be prone to mixed performance results depending on the query or workload introduced. Moreover, certain types of advanced analytics can only be run effectively by bringing the data into the Data Lake.

[2] Gartner presentation "What About the Data Warehouse? Start? Stop? Continue?", Mark Beyer, October 2014.
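As a rough illustration of the abstracted data access layer that Data Virtualisation provides, the following Python sketch presents two physical sources as one logical source and reads them only at query time. It is a toy model under stated assumptions, not a description of any DV product: the class names (`RelationalSource`, `LakeSource`, `VirtualView`), the `sales` table and the sample rows are all hypothetical, and an in-memory SQLite database stands in for a conventional relational store.

```python
import sqlite3
from typing import Callable, Iterable, Iterator, List


class RelationalSource:
    """Wraps a table in a relational store (an in-memory SQLite database here)."""

    def __init__(self, conn: sqlite3.Connection, table: str) -> None:
        self.conn = conn
        self.table = table

    def rows(self) -> Iterator[dict]:
        # Read directly from the source each time; nothing is copied or cached.
        cursor = self.conn.execute(
            f"SELECT customer_id, revenue FROM {self.table}"
        )
        for customer_id, revenue in cursor:
            yield {"customer_id": customer_id, "revenue": revenue}


class LakeSource:
    """Stands in for records held in the Data Lake (a plain list here)."""

    def __init__(self, records: List[dict]) -> None:
        self.records = records

    def rows(self) -> Iterable[dict]:
        return iter(self.records)


class VirtualView:
    """One logical source over several physical ones."""

    def __init__(self, *sources) -> None:
        self.sources = sources

    def query(self, predicate: Callable[[dict], bool] = lambda row: True) -> Iterator[dict]:
        for source in self.sources:
            for row in source.rows():
                if predicate(row):
                    yield row


# Hypothetical sample data in each underlying store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("c1", 100.0), ("c2", 250.0)])

lake = LakeSource([
    {"customer_id": "c1", "revenue": 40.0},
    {"customer_id": "c3", "revenue": 75.0},
])

view = VirtualView(RelationalSource(conn, "sales"), lake)


def revenue_for(customer_id: str) -> float:
    """Consumers query the view without knowing where each row physically lives."""
    return sum(
        row["revenue"]
        for row in view.query(lambda row: row["customer_id"] == customer_id)
    )
```

Because the view reads at source, a row inserted into the relational store is visible on the next query. The same design also hints at the performance caveat noted above: every query scans every underlying source, so the cost depends heavily on the workload.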

Comparison of the models

The benefits of each model are shown below. The risks reflect those of placing an increasing reliance on big data platforms; these are discussed in the next section.

Model / Benefits / Where most effective

Active Archive
Enables online access to historical data, retained for long periods
Capacity pressures on conventional relational stores are alleviated by offloading some ETL processing to the Data Lake, therefore capitalising on specialist BI infrastructure investment
Minimal disruption to traditional BI solutions
An existing, successful enterprise warehouse solution exists, critical to business operations, and the appetite for risk of a complete replatform of the existing solution is low
Existing ETL processing is under pressure to satisfy batch windows

Dual Warehouse
Analytics migrated to the lake, enabling self-service insight over a wide variety of data formats
Greater flexibility to choose the right tool for the right job, leveraging the strengths of each on a case-by-case basis
Option to migrate conventional capabilities on demand as Data Lake technologies mature over time

Hybrid
Separate solutions optimised for different workload types (e.g. batch vs interactive query)
Reduces infrastructure cost by offloading high volume storage completely to the Data Lake
Organisations are committed to a strategy for the Data Lake in the enterprise, but desire the ability to selectively transition capabilities from the conventional relational data stores
There is an ambition to reduce infrastructure costs
Strong technical expertise exists to deliver, maintain and support Hadoop-based solutions

Enterprise Data Lake
Maintains conventional options for enterprise applications and dashboards
Centralised data warehouse on single architecture for self-service analytics and BI solutions
Single data storage platform for enforcement of governance policies and controls
Reduced complexity
Complex data access functionality is implemented in existing BI applications that is non-trivial to port on to the Data Lake
Organisations have a significant appetite and skills to embrace emerging technologies
The requirement is for a greenfield site with no legacy system replacement or risk to existing capabilities

So What?

Before embarking on a warehouse enhancement, a Hadoop-based experiment or a major new data strategy, the following need to be considered:

Risks: All BI projects come with the same notorious risks of failure, which the Data Lake doesn't change, such as insufficient attention to business sponsorship and user engagement. However, the use of big data technologies at the centre of the corporate IT estate brings a number of new considerations. The open source community is enormously creative in plugging capability gaps, so many of these challenges are diminishing. However, commercial support options which offer greater stability inevitably lag behind open source developments.

Data access: Access is simplified where SQL interfaces to Hadoop can be used. These interfaces are evolving extremely quickly, as those facing the greatest challenges to embrace the Data Lake are BI vendors. On some Hadoop platforms, however, only a subset of SQL functionality is supported. A schema-on-read approach is not necessarily straightforward to implement, especially if schemas change over time.

Data governance: The conventional data warehouse is a proven enabler for enterprise data governance processes through its tight controls and relational database functionality. The Data Lake has even greater responsibility to enforce governance policies given its flexibility to receive, process and store data in a variety of forms. In an environment that includes multiple teams of data scientists, the enforcement of data governance policies including data retention, access controls, audit, data quality, ownership and stewardship is critical. We recommend drawing a distinction between the level of governance required for data services that are used to run the business and those used to discover new transformational business opportunities.

Availability: For the Data Lake to support the enterprise it will need to satisfy service levels similar to those expected of relational databases (e.g. high availability, monitoring, vendor response times). Hadoop-based solutions are still maturing in this area; the answer is to match the service level to the use case.

Data model: Many standard data models aligned to industry sectors are available as accelerators for traditional database systems. Implementation of these relies on standard relational database features such as data integrity constraints, individual record updates and highly structured data formats to support data quality standards. Organisations wanting to use the Enterprise Data Lake model will need to consider the cost of translating these onto big data platforms, if this is really needed. Likewise there are standard patterns for ETL, such as change data capture or customer matching, which will need to be re-invented for big data. The cost of this needs to be recognised.

Data skills: Big data solutions implemented in emerging technologies face a greater barrier to entry because of the limited availability of skilled resources. Over time this will be mitigated by wider adoption of big data and the emergence of more user-friendly technologies.

Professional Services: A cited benefit of big data technologies is cost saving through deployment on commodity infrastructure, but on-premise mission-critical deployments will still require support services from infrastructure vendors. Cloud provision can defer some of these costs; in fact some cloud providers are moving up the stack from infrastructure to platform and cluster provisioning. While this is an attractive alternative, the increasing variety of cloud offerings necessitates yet another skill set.

Taking the Plunge

The Data Lake is well placed to tackle some of the frustrations currently experienced with the traditional data warehouse, while leveraging the new opportunities the digital age demands. Big data platforms, however, bring their own risks and nervousness for architects, developers, administrators and analysts in an emerging technology space. Today, many organisations are testing the water with big data capabilities in the form of proof-of-concept initiatives and point solutions. Thus big data skills and expertise will naturally continue to evolve; as they do so, the case for embracing the Data Lake in the enterprise is strengthened. Choosing which lake to swim in has never been more important.

ABOUT US

BAE Systems Applied Intelligence delivers solutions which help our clients to protect and enhance their critical assets in the connected world. Leading enterprises and government departments use our solutions to protect and enhance their physical infrastructure, nations and people, mission-critical systems, valuable intellectual property, corporate information, reputation and customer relationships, and competitive advantage and financial success.

We operate in four key domains of expertise:

Cyber Security: helping our clients across the complete cyber security risk lifecycle
Financial Crime: identifying, combating and preventing financial threats, risk, loss or penalties
Communications Intelligence: providing sophisticated network intelligence, protection and controls
Digital Transformation: creating competitive advantage and enhancing operating performance by exploiting data and digital connectivity

We enable organisations to be more agile, increase trust and operate more confidently. Our solutions help to strengthen national security and resilience, for a safer world. They enable enterprises to manage their business risks, optimise their operations and comply with regulatory obligations.

We are part of BAE Systems, a global defence, aerospace and security company delivering a wide range of products and services including advanced electronics, security and information technology solutions.
Global Headquarters
BAE Systems Applied Intelligence
Surrey Research Park, Guildford, Surrey GU2 7RQ, United Kingdom
T: +44 (0) 1483 816000

BAE Systems Applied Intelligence Australia
Level 1220 Bridge Street, Sydney NSW 2000, Australia
T: +61 (2) 9255 0400

BAE Systems Applied Intelligence Dubai
Dubai Internet City, Building 17, Office Ground Floor 53, PO Box 500523, Dubai
T: +971 4369 4369

BAE Systems Applied Intelligence Malaysia
Level 28, Menara Binjai, 2 Jalan Binjai, 50450 Kuala Lumpur
T: +60 3 2191 3000

BAE Systems Applied Intelligence USA
265 Franklin Street, Boston MA 02110, USA
T: +1 (617) 737 4170

E: learn@baesystems.com
W: www.baesystems.com/ai
www.twitter.com/baesystems_ai
www.linkedin.com/company/baesystemsai

Copyright BAE Systems plc 2014. All rights reserved. BAE SYSTEMS, the BAE SYSTEMS Logo and the product names referenced herein are trademarks of BAE Systems plc. BAE Systems Applied Intelligence Limited registered in England & Wales (No.1337451) with its registered office at Surrey Research Park, Guildford, England, GU2 7RQ. No part of this document may be copied, reproduced, adapted or redistributed in any form or by any means without the express prior written consent of BAE Systems Applied Intelligence.