How to Run a Successful Big Data POC in 6 Weeks




A Practical Workbook to Deploy Your First Proof of Concept and Avoid Early Failure

Executive Summary

As big data technologies move into the mainstream, more and more enterprises are starting to implement their own big data projects. But before you can have a successful big data project, you have to have a successful big data proof of concept (POC). So we're compiling all the best practices, lessons learned, and advice we've gathered over the years into a practical guide to running a successful POC, with one important feature: because the key to POC success is proving value in a short amount of time, our workbook aims to help you complete your POC in just six weeks. It's an ambitious but feasible timeframe, and it's crucial if you're trying to prove your big data project can cut it in a real-world enterprise.

As we develop the workbook, we'd like to invite you to review the outline below and tell us whether we're addressing the issues that are important to you. We want the workbook to be genuinely useful for you. Click the link below to leave your feedback and to pre-register for the workbook, which will be published in early 2016.

Leave Feedback

Workbook Outline

Introduction: The Importance of Starting Small

Eating Our Own Dog Food: The marketing team at Informatica recently ran a successful POC for a marketing data lake to fuel account-based marketing. So in addition to examples and stories from our experience with other companies, we'll also be sharing some of the lessons we learned on our own journey.

At this point there's little doubt that big data represents real and tangible value for enterprises. But the journey from idea to operation is a tricky one. A phased approach to big data isn't just about making projects more manageable; it's about ensuring they can adequately prove the value of big data to the enterprise. And without a well-planned proof of concept to get the ball rolling, it's far too easy for enterprises to miss out on big data's true potential. We've written this workbook to help you run a POC that:

1. Is small enough to be completed in six weeks.
2. Is important enough to prove the value of big data to the rest of the organization.
3. Is intelligently designed so you can grow the project incrementally.

Part I. General Lessons

1. Why POCs Fail

Far too often, POCs run over budget, fall behind schedule, or fail to live up to considerable expectations. So before we dive into what makes a POC succeed, let's look at some of the common reasons for failure:

- Lack of alignment with executives: Either by going beyond the agreed scope or by promising too much, POCs fail when the executives backing them aren't involved and their expectations aren't managed.
- Lack of planning and design: You don't want to spend three of your six weeks just reconfiguring your Hadoop cluster and architecting an enterprise-wide Hadoop-based data lake.
- Scope creep: As excitement around the project grows, it's common for people to add requirements and alter priorities.
- Ignoring data management: It's easy to forget the basic principles of data management and overestimate the tools in the Hadoop ecosystem.

2. Big Data Management Critical Success Factors

Big data management has a crucial role to play in the success of big data initiatives, even at an early stage.

- Abstraction matters: The key to scale is leveraging all available hardware platforms and production artifacts (e.g., existing data pipelines). So when it comes to data management, it's important for the logic and metadata to be abstracted away from the execution platform. This will help preserve all the work you put into a project as systems and technologies change.
- Leverage your existing talent: Rather than trying to bring in outside talent for every new skill, it helps to leverage your existing data management expertise to start sooner, keep costs down, and keep best practices consistent.
- Leverage automation to operationalize: Data management in a POC is hard. It may appear you can finish a POC with a small team of experts doing a lot of hand coding, but that really only proves you can get Hadoop to work in a POC environment. It is no guarantee that what you

built will continue to work in a production data center, comply with industry regulations, scale to production and future workloads, or be maintainable as technologies change and new data is onboarded. The key to a successful POC is appreciating what comes next. At this stage you need automation to scale your project, so make the future scale of your initiative part of your planning and design decisions now.
- Metadata matters: Repositories of metadata, rules, and logic make it far easier to apply a rationalized set of integration patterns and transformations in a repeatable way.
- Data quality matters: Data needs to be fit for purpose. The only thing worse than no data is bad data.
- Start thinking about security now: Security at scale is a whole other ball game when it comes to big data projects. You don't want your Hadoop cluster to become a free-for-all at the point of operationalization and pose a security risk.

Part II. Picking the First Use Case

1. Limiting the Scope of Your POC

The real challenge in picking the first use case lies in finding something important enough that the business will value it, yet small enough that you can solve the problem in six weeks with a low starting cost. So once you've identified a serious business challenge, your next step should be to minimize the scope of the project; the smaller the better. You can limit your scope along these dimensions:

- The business unit: In a six-week POC, it's smarter to serve one business unit than to try to serve the whole enterprise. Additionally, the smaller the team of business stakeholders, the easier it will be to meet and manage expectations.
- The executive: Executive support is crucial, but trying to cater to the needs of three executives takes much-needed focus and resources away from your six-week timeframe.
- The team (form, storm, and norm): Identify champions in business and IT to forge collaborative partnerships of mixed personas. Include only the most essential team members, such as subject-matter experts, data scientists, architects, data engineers, data stewards, and DevOps.
- The data sources: Aim for as few data sources as possible. Go back and forth between what needs to be achieved and what's possible in six weeks until you have only a few sources to manage. You can always add new ones once you've proved the value.
- The data: Focus only on the dimensions and measures that actually matter to your specific POC. If web analytics data from Adobe Analytics (previously SiteCatalyst) has 400 fields but you only need 10 of them, you can rationalize your efforts by focusing on those 10 alone.
- Cluster size: Typically, your starting cluster should be only 5-10 nodes. However, you should work with your Hadoop vendor to size your cluster appropriately for the POC. Hadoop has already proven its linear scalability, so focus on solving the core challenges of your business use case within your six-week timeframe.
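The field-pruning tactic described above can be sketched in a few lines. This is a minimal illustration, assuming the extract arrives as rows of key-value pairs; the field names are hypothetical, not actual Adobe Analytics columns:

```python
# Minimal sketch: narrow a wide extract to only the fields the POC needs.
# The field names below are illustrative, not real Adobe Analytics columns.
NEEDED_FIELDS = ["visitor_id", "visit_ts", "page_url", "referrer", "campaign_id"]

def prune_fields(rows, needed):
    """Keep only the needed fields, failing fast if the extract lacks any of them."""
    for field in needed:
        if rows and field not in rows[0]:
            raise KeyError(f"Source extract is missing expected field: {field}")
    return [{field: row[field] for field in needed} for row in rows]

# Example: a 400-plus-field record narrowed to the POC's working set.
wide_row = {f"col_{i}": 0 for i in range(400)}
wide_row.update({f: "x" for f in NEEDED_FIELDS})
narrow = prune_fields([wide_row], NEEDED_FIELDS)
print(len(narrow[0]))  # 5 fields instead of 405
```

Keeping a single explicit list of needed fields also doubles as lightweight documentation of the POC's data scope.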

Building a Marketing Data Lake for Account-Based Marketing: Explaining our marketing team's starting use case.

2. Sample Use Cases

The following use cases are from POCs that successfully proved value and limited their scope in a way that ensured they could be built on once the POC was complete.

i. Business Use Cases

- Using general ledger data for real-time fraud processing
- Analyzing website visitor behavior with clickstream data
- Processing a subset of customer data for a customer 360 initiative

ii. Technical Use Cases

In some cases, IT teams have the resources and authority to invest in technical use cases for big data, for instance to identify the technical challenges and possibilities of building an enterprise data lake or analytics cloud. Some of the most common are:

- Building a staging environment to offload data storage and ETL workloads from a data warehouse
- Extending your data warehouse with a Hadoop-based operational data store (ODS)
- Building a centralized data lake that stores all enterprise data required for big data analytics projects

Part III. Proving Value

1. Defining the Right Metrics

For your POC to succeed in six weeks, you need to translate expectations of success into clear metrics for success.

Capturing Usage Metrics to Prove Value: One big indicator of success for our marketing team was the number of people actually using the tools they created. By recording usage metrics, they could prove adoption and make sure they were on the right track.

Based on your first use case, list out your metrics below:

Quantitative measures:
- E.g., marketing campaign uplift
- E.g., impact on sales pipeline
- E.g., increased lead conversion rates

Qualitative measures:
- E.g., agility
- E.g., transparency
- E.g., productivity and efficiency
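One simple way to capture an adoption metric like the one described above is to count distinct users per week from a usage log. A minimal sketch, assuming a hypothetical event log of (user, day) pairs rather than any specific tool's telemetry:

```python
from collections import defaultdict
from datetime import date

# Hypothetical usage log: (user, day) events from the POC's analytics tooling.
events = [
    ("ana", date(2015, 11, 2)), ("ana", date(2015, 11, 2)),
    ("bo",  date(2015, 11, 3)), ("cy",  date(2015, 11, 9)),
    ("ana", date(2015, 11, 10)),
]

def weekly_active_users(events):
    """Count distinct users per ISO week -- a simple adoption metric."""
    weeks = defaultdict(set)
    for user, day in events:
        weeks[tuple(day.isocalendar())[:2]].add(user)  # (year, week) key
    return {week: len(users) for week, users in sorted(weeks.items())}

print(weekly_active_users(events))
```

A rising weekly-active-user count gives you a quantitative adoption story to put alongside the qualitative measures listed above.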

Part IV. The Technical Components of a Successful Big Data POC

1. The Three Layers of a Big Data POC

Big data projects rely on a foundation of three crucial layers, even though only two of them get most of the attention:

- Visualization and analysis: This includes self-service analytics and visualization tools like Tableau and Qlik.
- Big data management: This includes the technology needed to integrate, govern, and secure big data, such as pre-built connectors and transformations, data quality, data lineage, and data masking.
- Storage persistence layer: This includes persistence technologies and distributed frameworks like Hadoop.

Should Big Data POCs Be Done On-Premise or in the Cloud? Evaluating the pros and cons of both approaches and explaining when it makes sense to run a Hadoop cluster on your own hardware.

2. The Three Pillars of Big Data Management

i. Big Data Integration

To deliver high-throughput ingestion and at-scale processing for analysts, it's crucial to:

- Speed up development and maintenance with pre-built transformations and design pattern templates
- Handle all types of data, including data with dynamic schemas
- Optimize performance across platforms
- Make it easier to add new sources

ii. Big Data Quality (and Governance)

To manage data at scale and ensure different users get the data quality they need, it's crucial to:

- Detect anomalies and make data quality rules scalable
- Manage technical and operational metadata
- Match and link entities from disparate data sets
- Provide end-to-end lineage (for auditability and compliance)
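As a concrete illustration of a reusable data quality rule of the kind mentioned above, the sketch below flags fields whose null rate exceeds a threshold. The record layout and threshold are illustrative assumptions, not part of any specific product:

```python
# Minimal sketch of a reusable data-quality rule: flag any field whose
# share of missing values exceeds a threshold. Names are illustrative.
def null_rate_check(rows, fields, max_null_rate=0.05):
    """Return the fields whose null rate exceeds the threshold."""
    if not rows:
        return []
    failures = []
    for field in fields:
        nulls = sum(1 for row in rows if row.get(field) in (None, ""))
        if nulls / len(rows) > max_null_rate:
            failures.append(field)
    return failures

records = [
    {"customer_id": "c1", "email": "a@x.com"},
    {"customer_id": "c2", "email": None},
    {"customer_id": None, "email": "c@x.com"},
    {"customer_id": "c4", "email": "d@x.com"},
]
# Both fields fail here: each is missing in 1 of 4 rows (25% > 20%).
print(null_rate_check(records, ["customer_id", "email"], max_null_rate=0.20))
```

Expressing rules as data (field lists plus thresholds) rather than hand-written checks is what makes them repeatable as new sources are onboarded.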

iii. Big Data Security

Data security is usually a checkbox during POCs, but the second your project succeeds, security becomes a whole other ball game. Foundational security is necessary to:

- Monitor sensitive data wherever it lives and wherever it's used
- Carry out risk assessments and alert stakeholders to potential security risks
- Mask data in different environments to de-identify sensitive data

Documentation: Even if your POC is successful, you'll need to be able to share its details, both to expand the scope of your project and to socialize your new capabilities. As such, it's essential that you document the following:

- The goals of your POC
- The success of your POC, as defined by clear metrics
- How data flows between different systems
- Your reference architecture and explanations of design choices
- Development support

3. A Checklist for Big Data POCs

i. Visualization and Analytics

1. List the users (and their job roles) who will be working with this layer.
2. If you're aiming for self-service discovery, which tools will your users need (e.g., self-service data prep)?
3. If you're aiming for reporting or predictive analytics, which tools will you be deploying?
4. Which format will you be delivering data in? (Review sample data in the required format.)

ii. Big Data Integration

1. Will you be ingesting real-time streaming data?
2. Which source systems are you ingesting from?
3. Outline the process for accessing data from those systems. (Who do you need to speak to? Will it be a flat file dump?)
4. How many integration patterns will you need for your use case? (Try to get this number down to 2-3.)
5. How many transformations will your use case call for?
6. What type of files are you ingesting? (Review sample data in the required format.)
7. How many dimensions and measures are necessary for your six-week timeframe?
8. How much data is needed for your use case?

iii. Big Data Quality and Governance

1. Will your end users need business-intelligence levels of data quality?
2. Or will they be able to work with only a few modifications to the raw data and then prepare the data themselves?
3. Will you need to record the provenance of the data for documentation, auditing, or security needs?
4. What system can you use for metadata management?
5. Can you track and record transformations as workflows?
6. Will your first use case require data stewards? If so, list their names and job titles.
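The masking requirement raised earlier (de-identifying sensitive data differently per environment) can be sketched as a deterministic hash applied outside production, so records stay joinable without exposing real values. The field names and the hashing choice are illustrative assumptions, not the behavior of any particular masking tool:

```python
import hashlib

# Illustrative list of sensitive fields to de-identify outside production.
SENSITIVE_FIELDS = {"ssn", "email"}

def mask_record(record, environment):
    """Hash sensitive fields outside production; pass production data through unchanged."""
    if environment == "production":
        return dict(record)
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS and value is not None:
            # Deterministic hash: same input always maps to the same token,
            # so masked records remain joinable across data sets.
            masked[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[key] = value
    return masked

row = {"customer_id": "c1", "email": "a@x.com", "ssn": "123-45-6789"}
dev_row = mask_record(row, "development")
print(dev_row["customer_id"], dev_row["email"] != row["email"])
```

Note that a plain hash like this is only a sketch; production-grade masking typically adds salting or format-preserving encryption to resist dictionary attacks.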

iv. Big Data Security

1. Will you need different data masking rules for different environments (e.g., development and production)?
2. Which tools will you use for encryption?
3. Have you set up profiles for Kerberos authentication?
4. Have you set your users up to access data only within your network?
5. Do you need application security policies to work within your POC environment?

v. Big Data Storage Persistence Layer

1. Will you be using Hadoop for pre-processing before moving data to your data warehouse?
2. Or will you be using Hadoop to both store and analyze large volumes of unstructured data?

4. Reference Architectures for Big Data Management

Conclusion: Making the Most of Your POC

Big data is a big opportunity for enterprises. Poorly planned POCs aren't just failures in and of themselves; they're also a failure to prove the value of big data. Use the advice and lessons shared in this workbook to hit the ground running and complete a successful POC in six weeks. A successful POC is just the start of a constant, iterative, and valuable journey toward innovation and insight.

Leave Feedback

About Informatica

Informatica is a leading independent software provider focused on delivering transformative innovation for the future of all things data. Organizations around the world rely on Informatica to realize their information potential and drive top business imperatives. More than 5,800 enterprises depend on Informatica to fully leverage their information assets residing on-premise, in the cloud, and on the internet, including social networks.

Worldwide Headquarters, 2100 Seaport Blvd, Redwood City, CA 94063, USA
Phone: 650.385.5000 Fax: 650.385.5500 Toll-free in the US: 1.800.653.3871
informatica.com linkedin.com/company/informatica twitter.com/informatica

© 2015 Informatica LLC. All rights reserved. Informatica and "Put potential to work" are trademarks or registered trademarks of Informatica in the United States and in jurisdictions throughout the world.
All other company and product names may be trade names or trademarks. IN08-1115-03021