From Lab to Factory: The Big Data Management Workbook



How to Operationalize Big Data Experiments in a Repeatable Way and Avoid Failures

Executive Summary

Businesses looking to uncover new opportunities for growth and efficiency are quickly realizing two things. First, they need to define a big data strategy that delivers real business impact. Second, after years of hype and uncertainty, there is now a clear path to defining such a strategy.

Over the past few years, businesses have been creating big data laboratory and big data factory environments. But it's only now, with the maturation of technologies and the establishment of best practices, that enterprises can successfully deploy both. Central to this strategy is a solid foundation of big data management.

We're compiling actionable best practices, lessons learned, and advice into a practical workbook to help businesses like yours develop this dual strategy. As we develop the workbook, we'd like to invite you to review the outline below and tell us whether we're addressing the issues that matter to you as you operationalize big data experiments. We want the workbook to be as useful to you as possible. Please click on the link below to leave your comments and to pre-register for the completed workbook. Thank you for your comments!

Leave Feedback

Workbook Outline

Introduction: From Experimentation to Monetization

Beyond the hype, it's now clear that big data is an important route to growth and invention for large companies. But big data success depends on your ability to get at the data, and to interrogate and productize data assets. That means enterprise architectures must serve two purposes. The first is to make data available for analysis in a lab environment. The second is to make data production-ready for more specific, strategic projects and products.

The good news is that some early movers have already started building out these dual capabilities. The key is that, by relying on common architectural components for both environments, they've been able to streamline and consolidate their investments in technology. Even better, they've been able to move their experiments from lab to factory environments in a repeatable, efficient, and intelligent way. This workbook aims to share their lessons and insights.

Part I. Riding the Elephant in the Room

1. Confronting Hadoop's Limitations

There's a lot of investment and interest in Hadoop, and rightly so: Hadoop is a big opportunity to simultaneously reduce data storage and processing costs and to allow for scale. But it isn't a silver bullet. In fact, it has some non-trivial limitations:

- There is a significant shortage of skills in Hadoop (and related big data technologies like Pig, Hive, etc.). This makes big data projects risky and expensive.
- The Hadoop ecosystem lacks crucial data management capabilities such as data governance, data quality (DQ), master data management (MDM), metadata management, and data security.
- Security is still limited: Kerberos, encryption, and access controls aren't enough for production-grade environments that need to push and pull data across multiple geographies.
- The ecosystem is still rapidly evolving, with new releases and new technologies emerging all the time. This isn't always good news, because a direction can quickly turn into a dead end: many who bet on MapReduce are struggling to leverage Spark (see the sketch after this list).
- At the point of operationalization, staffing, maintaining, and managing your environments is a whole other ball game. A new realm of costs, and a new scale, makes early hand-coding expensive and challenging to augment.
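To make the dead-end risk concrete, here is a minimal sketch (in PySpark; the file path and column names are invented for illustration, and none of this code comes from the workbook itself) of the same aggregation written twice: once against the older RDD, MapReduce-style API and once against the DataFrame API. The business rule never changes, but hand-coded logic must be rewritten wholesale when the ecosystem moves on.

```python
# Illustrative only: the HDFS path and the two-column file layout are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-churn-sketch").getOrCreate()

# MapReduce-style logic, hand-coded against the RDD API.
lines = spark.sparkContext.textFile("hdfs:///raw/events.csv")
counts_rdd = (lines.map(lambda line: (line.split(",")[0], 1))
                   .reduceByKey(lambda a, b: a + b))

# The same rule ("count events per key") re-expressed for the DataFrame
# API: a full rewrite, even though the requirement never changed.
events = spark.read.csv("hdfs:///raw/events.csv").toDF("key", "payload")
counts_df = events.groupBy("key").count()
```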

2. Why Big Data Management Matters

Considering the non-trivial limitations of Hadoop, data management has a crucial role to play in the success of big data initiatives. It helps you leverage your existing data management and developer talent in big data environments, because everyone can still use the tools and interfaces they're used to. Additionally, self-service data preparation tools ensure your analysts and data scientists don't have to learn new skills like Java.

Data integration, governance and quality, security, and master data are actually even more important when dealing with large volumes of data. In a factory environment, encryption and access control aren't enough security; depending on requirements, data profiling, data masking, and even protection for data-in-motion are necessary. Smart metadata management and repositories of rules and logic are needed to ensure teams can apply flexible standards regardless of the runtime environment (see the rules-repository sketch below). By automating data management you can save time and money at the point of operationalization: time, because developers don't have to go through twenty pages of code every time they need to add or change a transformation; money, because you don't have to rely on manual labour for constant data management and auditing.

Part II. The Lab, the Factory, and the Strategy

(Here, we'll look at the requirements for a lab environment, the requirements for a factory environment, and the data management requirements to enable both in a repeatable way.)

1. Building a Big Data Lab

Innovation starts with experimentation. So before enterprises can really turn big data into operational assets, whether as data products or as analytical environments that streamline and inform operations, they need to understand what's possible and what isn't. The good news is that the starting costs of a big data lab are relatively low. And once your analysts can roll up their sleeves and start turning insights into viable actions, a big data lab comfortably pays for itself by fueling increased productivity, new opportunities, and smarter products.

A big data lab environment has some specific requirements:

- People: The focus is on leveraging the intelligence of analysts and the domain knowledge of data stewards. Central IT needs to get out of the way.
- Process: Here, central IT needs to be able to provision good-enough data for analysts to interrogate and experiment on. That means enabling mass ingestion of multiple data sources for data discovery and self-service data preparation (see the ingestion sketch below).
- Technology: Analysts rely on search and spreadsheets more than any other tools. So any self-service tools made available to streamline analysis should offer semantic search and a spreadsheet-based interface, equipped with automated profiling and discovery capabilities.

2. Building a Big Data Factory

Operationalizing (and potentially monetizing) data assets and products has a different set of requirements:

- People: Here, central IT and developers take over. Analysts and stewards are still needed to oversee production and make sure the requirements of specific projects are met.
- Process: Central IT manages DevOps, production support, and the capital investment necessary to operationalize projects, ensuring data is ingested, cleaned, and secured in a reliable way.
- Technology: In addition to scalable and flexible data pipelines that turn raw source data into actionable information, security and governance have to be ramped up.
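The "twenty pages of code" problem above has a compact illustration. Below is a minimal sketch in plain Python (every name in it is invented; this is not Informatica's implementation) of a rules repository: transformation logic is stored as data, so adding or changing a rule means editing one entry rather than a hand-written pipeline, and the same function can be shipped to any runtime, for example as a per-record map in Spark.

```python
# Hypothetical rules repository: (field, rule name, transformation).
RULES = [
    ("email",   "lowercase",      lambda v: v.lower()),
    ("country", "standardize_us", lambda v: "US" if v in {"USA", "U.S."} else v),
]

def apply_rules(record: dict) -> dict:
    """Apply every repository rule to a record, independent of the engine."""
    out = dict(record)
    for field, _name, transform in RULES:
        if out.get(field) is not None:
            out[field] = transform(out[field])
    return out

# The same logic runs anywhere: a lab notebook, a factory pipeline, a Spark map.
print(apply_rules({"email": "Ana@Example.COM", "country": "USA"}))
# -> {'email': 'ana@example.com', 'country': 'US'}
```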
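The lab's mass-ingestion requirement can be sketched the same way (again, all names, formats, and paths are hypothetical): each source is a registry entry rather than a bespoke pipeline, so central IT can land "good enough" data for analysts quickly.

```python
# Hypothetical source registry: onboarding a new source is one new entry.
from pyspark.sql import SparkSession

SOURCES = [
    {"name": "web_events",   "format": "json", "uri": "s3a://landing/web/"},
    {"name": "crm_accounts", "format": "csv",  "uri": "s3a://landing/crm/"},
]

def ingest_all(spark, target="/data/lake/raw"):
    """Land every registered source into the raw layer, unmodified."""
    for src in SOURCES:
        df = spark.read.format(src["format"]).load(src["uri"])
        df.write.mode("overwrite").parquet(f"{target}/{src['name']}")

if __name__ == "__main__":
    ingest_all(SparkSession.builder.appName("mass-ingestion-sketch").getOrCreate())
```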

3. Why It Makes Sense to Build a Common Infrastructure

- Consolidated investment
- Shared metadata
- A common set of tools, making it easier to add new staff

4. The Three Pillars of Big Data Management

Here's what a common infrastructure needs in order to serve both lab and factory environments in the context of big data.

i. Big Data Integration

- Universal connectivity: High-throughput data integration for multiple schemas, as well as low-latency handling of real-time streams.
- Pre-built tools: Connectors, transformations, and parsers, to avoid reinventing the wheel every time analysts need to access new sources.
- Abstraction: So that rules, logic, and metadata are abstracted away from the execution platform. The key to flexible and maintainable production deployments is leveraging all available infrastructure and hardware, regardless of runtime environment.
- A brokerage model: Leveraging hub-based orchestration of data flows to reduce redundant effort, standardize, and govern.
- Staging: So that, depending on the requirements of different users, only the necessary modifications are made to the data (see the first sketch after pillar iii).

How One Multinational Stages Its Data
[Description of one of our clients' four data layers in Hadoop: raw, published, core, and projects.]

ii. Big Data Quality (and Governance)

- Automated data quality: To manage data at scale and maintain the consistency and reliability of the data being fed to analysts. Give them bad data and they'll give you bad insight (see the scorecard sketch after pillar iii).
- Business context: So that stewards can provision standard business terms and definitions through glossaries and similar assets.
- Easy discovery of exceptions: Normalcy scorecards and exception-record management, so that stewards don't have to eyeball huge volumes of data across multiple sources.
- Data lineage: To provide auditability and transparency into the provenance of data and who has used it.
- Relationship management: So that the stack can infer and detect relationships between entities and domains at scale.

Four Dimensions of Big Data Security
1. Encryption
2. Data masking
3. Data-in-motion
4. Application security

iii. Big Data Security

- Profiling and identification: A 360-degree view of sensitive data is crucial for a risk-centric approach that holistically protects your data assets, even in the case of a perimeter security breach, by de-identifying sensitive data.
- Risk scorecards and analytics: To automate the detection of high-risk scenarios and exceptions, based on modelling and score trends.
- Universal protection: To provide masking, encryption, and access control across data types (both live and stored data) and across environments (production and non-production); see the masking sketch below.
- Centralized, policy-based security: To deliver the necessary security and compliance at scale (e.g., some privacy laws mandate location- and role-based data controls).
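Three short sketches follow, one per pillar. First, staging: a sketch in plain Python (the four layer names come from the client example above; the paths and dataset names are invented) of how a layered lake lets each consumer read from the shallowest layer that meets its needs, so only the necessary modifications are ever applied.

```python
from pathlib import Path

LAYERS = ["raw", "published", "core", "projects"]   # from the client example

def layer_path(layer: str, dataset: str, base: str = "/data/lake") -> Path:
    """Return the canonical location of a dataset within a staging layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return Path(base) / layer / dataset

# A lab analyst explores lightly processed data...
print(layer_path("published", "web_events"))   # /data/lake/published/web_events
# ...while a factory pipeline consumes the governed core layer.
print(layer_path("core", "customer_master"))
```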
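Second, automated data quality: a minimal sketch (plain Python; the checks, thresholds, and field names are invented) of profiling records against simple rules to produce a pass-rate scorecard and an exceptions list, so stewards review a summary instead of eyeballing raw data.

```python
import re

# Hypothetical quality rules: each field maps to a predicate it must satisfy.
CHECKS = {
    "email":  lambda v: v is not None and bool(re.match(r"[^@]+@[^@]+\.[^@]+$", v)),
    "amount": lambda v: v is not None and v >= 0,
}

def profile(records):
    """Return a per-field pass-rate scorecard and the failing records."""
    passed = {field: 0 for field in CHECKS}
    exceptions = []
    for rec in records:
        for field, check in CHECKS.items():
            if check(rec.get(field)):
                passed[field] += 1
            else:
                exceptions.append({"field": field, "record": rec})
    scorecard = {f: passed[f] / len(records) for f in CHECKS}
    return scorecard, exceptions

scorecard, exceptions = profile([
    {"email": "ana@example.com", "amount": 12.5},
    {"email": "not-an-email",    "amount": -3.0},
])
print(scorecard)   # {'email': 0.5, 'amount': 0.5}
```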
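Third, universal protection: a sketch (plain Python, standard library only; the salt handling and field list are deliberately simplified placeholders) of deterministic masking, where a sensitive value is replaced by a repeatable token so non-production copies remain joinable without exposing the original data. A production deployment would add format-preserving encryption and centralized policy, as described above.

```python
import hashlib

SALT = b"rotate-me-in-production"   # placeholder; manage via a secrets store
SENSITIVE_FIELDS = {"ssn", "email"}

def mask(record: dict) -> dict:
    """Replace sensitive values with deterministic, non-reversible tokens."""
    out = dict(record)
    for field in SENSITIVE_FIELDS & out.keys():
        token = hashlib.sha256(SALT + str(out[field]).encode()).hexdigest()
        out[field] = token[:12]     # same input -> same token, so joins survive
    return out

print(mask({"ssn": "123-45-6789", "email": "ana@example.com", "amount": 12.5}))
```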

iv. A Reference Architecture for Big Data Management

Conclusion: Interrogate, Invent, Invest

Big data experimentation is essential to corporate innovation. But a big data lab that doesn't rapidly implement innovative solutions, by streamlining them into a production-grade factory environment, is only half complete. By making smart architectural and infrastructural decisions, you can de-risk experimentation and streamline production all at once. The key is leveraging the big data management capabilities necessary to maximize investments and interest in new storage technologies.

Leave Feedback

About Informatica

Informatica is a leading independent software provider focused on delivering transformative innovation for the future of all things data. Organizations around the world rely on Informatica to realize their information potential and drive top business imperatives. More than 5,800 enterprises depend on Informatica to fully leverage their information assets residing on-premise, in the cloud, and on the internet, including social networks.

Worldwide Headquarters, 2100 Seaport Blvd, Redwood City, CA 94063, USA
Phone: 650.385.5000  Fax: 650.385.5500  Toll-free in the US: 1.800.653.3871
informatica.com  linkedin.com/company/informatica  twitter.com/informatica

© 2015 Informatica LLC. All rights reserved. Informatica and Put potential to work are trademarks or registered trademarks of Informatica in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks. IN08-1115-03020