Data Discovery, Analytics, and the Enterprise Data Hub




Version: 101

Table of Contents

Summary
Used Data and Limitations of Legacy Analytic Architecture
The Meaning of Data Discovery & Analytics
Machine Learning in Data Discovery and Analytics
Conclusion
About the Author

"We think too small, like the frog at the bottom of the well. He thinks the sky is only as big as the top of the well. If he surfaced, he would have an entirely different view." -Mao Zedong

Summary

There are two kinds of reporting and analytical environments in organizations today. Until recently, most organizations provided structured, cleansed and integrated data, summarized at levels convenient for conventional platforms. Data Warehousing and Business Intelligence dominate in these architectures. Other organizations, notably those that are primarily internet-centric, developed alternative ways to manage and analyze very large amounts of data from their own websites, search engines, and social physics (the analysis of external data from social media), now generally referred to as big data. Only in the latter case can true data discovery and analytics be enabled, but the tools and techniques of big data are rapidly becoming the accepted architecture in organizations.

Used Data and Limitations of Legacy Analytic Architecture

Operational systems typically support a constrained set of functions, even if that set is vast, as in an ERP system. Data is captured and stored in a logical way that fits the functions of the system, often in structures and semantics that are understandable only to those familiar with the system internals. Many systems provide only perfunctory reporting and, of course, do not provide integration of their data with other systems.
In most cases, it is not feasible to access this data directly for reporting and analytical purposes because:

- Analytical queries tend to be large and can affect the performance of the system
- Security is typically enforced through the application software and could be compromised by direct access to the data
- Performance is critical in operational systems; therefore the physical design of databases favors performance over separation from the application logic, making retrieval difficult
- Most analytical work involves data from more than one operational system. Abstraction techniques that provide a single view of multiple data structures, such as federation/virtualization, have proven (with the current technology) to perform poorly and are difficult to set up and maintain

For those reasons, there is always a need to work with secondhand, or used, data for purposes that go beyond the operational system. For example, a system may keep track of inventory and contractual compliance, but linking this information with financial information to determine customer profitability is not possible within that system. This was the reason that Decision Support Systems (DSS), data warehouses, and Business Intelligence emerged. They provide tools for knowledge workers to access information from various systems to support all of their needs and processes. But the gathering of all of the data never really happens, because previous technologies are too costly and not agile enough to handle the scale and variety of data that is needed today. An enterprise data hub (EDH) provides not only a cost-effective container for big data but also supports a myriad of tools and applications to optimize your use and understanding of data.

Those who deal with used data need to discover and analyze information by formulating queries that reveal patterns or underlying relationships in the data. This process spans multiple systems and operations.
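To make the idea of used data concrete, here is a minimal Python sketch of the customer profitability case: extracts from an inventory system and a financial system are blended on a shared customer key. The record layouts, field names, and figures are all hypothetical, invented for illustration; a real hub would hold far larger extracts and would likely use a query engine rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical extracts ("used data") from two operational systems.
# Every field name and figure here is invented for illustration.
inventory_rows = [
    {"customer_id": "C1", "units_shipped": 120, "unit_cost": 4.0},
    {"customer_id": "C2", "units_shipped": 40, "unit_cost": 4.0},
    {"customer_id": "C1", "units_shipped": 30, "unit_cost": 5.0},
]
finance_rows = [
    {"customer_id": "C1", "invoiced": 900.0},
    {"customer_id": "C2", "invoiced": 150.0},
]


def customer_profitability(inventory, finance):
    """Blend both extracts by customer_id and return revenue minus cost."""
    cost = defaultdict(float)
    for row in inventory:
        cost[row["customer_id"]] += row["units_shipped"] * row["unit_cost"]

    revenue = defaultdict(float)
    for row in finance:
        revenue[row["customer_id"]] += row["invoiced"]

    # Include customers that appear in either system.
    customers = sorted(set(cost) | set(revenue))
    return {c: revenue[c] - cost[c] for c in customers}


profit = customer_profitability(inventory_rows, finance_rows)
for customer, value in profit.items():
    print(customer, value)
```

Neither source system could answer the profitability question on its own; the answer only appears once both extracts sit side by side and share a key.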
These interrogations and discoveries can take many forms, from simple data set discovery with search, to point-and-click queries, to machine learning and esoteric ensemble techniques. But the platforms and data stores they use must mask the scale and complexity of the data, allowing knowledge workers to pursue their thought process seamlessly and not have their productivity dragged down by platforms, tools, and approaches.

Unfortunately, old habits die hard. When it comes to BI, the industry is largely constrained by a drag of technology. What passes as acceptable BI in organizations today is rarely much more than re-platforming reports and queries that are ten to twenty years old. For analytics and BI to truly pay off in organizations, IT needs to shift its focus from deciding the informational needs of the organization through technical architecture and discipline to responding to those needs as quickly as they arise by creating an agile data environment.

The Meaning of Data Discovery & Analytics

This is the meaning of Data Discovery & Analytics: rather than pre-arranging data and structures to address known informational needs, data discovery and analytics combines massive repositories of all kinds of data with the tools and computing power that enable knowledge workers to find patterns, build models, and create new value from used data. This means not just data from an organization's operational systems, but all forms of external data as well. Big data opened up the possibility of managing Social Physics, the ability to capture and use data from social media and a host of other non-traditional data sources.

What does the term Data Discovery mean in today's landscape of tools and services? It is an imprecise term, but the industry adopted it despite its varied meanings. Even the word discovery is a little misleading. Discovering data is not the desired outcome; it is just one step in the process. Discovering an insight that leads to value is the main point of Data Discovery. The term arose as an alternative to highly structured business intelligence. This approach provides the ability to explore and analyze data more or less free of the constraining models of data warehouses and other data sources.

With a Data Hub, analysts can use tools that profile data sources in the EDH. These tools include machine learning applications that automate the search for interesting patterns and correlations that are not obvious given the volumes and variety of data now available. Beyond the initial efforts, analysts filter, transform, clean, enrich, and manipulate the data, all without pre-designed structures and queries (though there are many situations where those are necessary and appropriate).

What can you expect to see in Data Discovery and Analytics mode in an EDH?
In a collaborative environment, it is typical for analysts to create new data in the hub, such as:

- Predictions, time series, descriptions (metadata) and narratives of their investigations
- Derived and blended data, drawn from existing data sets, that has never been seen before, including additional attributes that add richness to the data
- Predictive models and other code for quantitative analysis

These iterative data sets were once ignored due to the scarcity of storage space and the rigid nature of systems. Data hubs built on Hadoop have solved this by enabling:

- Larger sample sizes to create a complete view
- Access to archived and historical data because of the linear scalability of Hadoop
- Access to full-fidelity data, so that adding a new dimension doesn't take months
- A system with integrated search, SQL, and machine learning capabilities instead of just SQL
- The ability to reduce data preparation time through parallel processing

In addition to the data itself, the data discovery process is enhanced by tools and insights in what is generally an iterative and ongoing process:

- Weather data
- Rules engines and decision models
- Recommendation engines, both developed and licensed
- Broad quantitative tools, including statistics
- Streaming data capture and real-time analysis
- Graphing and charting tools
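The profiling and enrichment steps described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular EDH tool: the `profile` helper, the column names, and the sample rows are all hypothetical.

```python
def profile(rows):
    """Return per-column counts of present, missing, and distinct values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat absent keys, None, and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "present": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical raw records with no pre-designed schema: the third row
# lacks a "returns" field entirely and the second has it set to None.
rows = [
    {"region": "west", "sales": 100, "returns": 5},
    {"region": "east", "sales": 80, "returns": None},
    {"region": "west", "sales": 100},
]

report = profile(rows)

# Enrichment: derive a new attribute and keep it alongside the source data.
enriched = []
for row in rows:
    returns = row.get("returns")
    rate = returns / row["sales"] if returns is not None else None
    enriched.append({**row, "return_rate": rate})
```

The derived `return_rate` column is an example of the new, blended data that analysts write back into the hub for others to reuse.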

Machine Learning in Data Discovery and Analytics

There are two primary techniques for data discovery: manual development of queries, and guided or unguided machine learning. In the latter case, data scientists can provide various parameters to a machine learning algorithm, but as long as there is a person seeding the algorithms there is the problem of unintentional bias. This issue is more pronounced when the specialists are more informed about the tools than about the domain they are examining. The preferred place to minimize the risk of introducing bias is the detection phase of machine learning. Data scientists can then analyze the output of the machine learning process for patterns, issues and anomalies that are still best observed by a person, not a machine.

The Hadoop ecosystem enables critical and highly sophisticated analytic algorithms to be applied in the background. This allows users to find or predict issues by sifting through enormous amounts of heterogeneous data while minimizing bias, elapsed time, and excessive false positives. The goal of unattended machine learning is to derive useful, accurate and timely results for a wide range of requirements and investigations without much manual intervention.

Data scientists are a scarce commodity, and anything that makes them more productive can reduce the costs (and errors) of data discovery by replacing expensive development efforts with packaged algorithms. The EDH provides a single source of data, relieving data scientists from extracting and cataloging many data sources for each analytic model. It not only provides access to the data, but can also employ metadata schemes that make identifying and using the data in the EDH far simpler and less error-prone. Finally, there is a growing and already robust set of analytical tools that work directly and efficiently with the EDH.
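As a small illustration of an unattended detection phase whose output a person then reviews, the sketch below flags outliers with a modified z-score based on the median absolute deviation. This technique is a simple stand-in chosen for brevity, not a method named in this paper; the function name and sample readings are hypothetical, and 0.6745 and 3.5 are the conventionally cited constants for this score.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    less distorted by the outliers themselves than mean and standard
    deviation would be.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical sensor readings with one obvious spike at index 6.
readings = [10, 11, 9, 10, 12, 10, 48, 11, 10, 9]
suspects = flag_anomalies(readings)
print(suspects)
```

Apart from the conventional threshold, no person seeds the detection step with domain expectations; the flagged indices are then handed to an analyst, who judges whether the spike is an error, an event, or an insight.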
Conclusion

The adoption of analytics will move an organization's efforts from simply informing decisions to taking action and tracking the effectiveness of those actions, thereby closing the loop. A giant leap in analytics is possible with the implementation of a modern architecture for managing and analyzing a broad collection of data with a rapidly developing community of tools and methods.

About the Author

Neil Raden, based in Santa Fe, NM, is an industry analyst and active consultant, widely published author and speaker, and the founder of Hired Brains Research LLC, http://www.hiredbrains.com. Hired Brains provides research, advisory and consulting services in Analytics, Big Data, and Decision Management for clients worldwide. Neil is also the co-author of the Dresner Advisory Services Wisdom of BI series on Advanced and Predictive Analytics. Neil was a contributing author to one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall. He is a contributor to publications such as Wall Street Week, Forbes, Information Week and ComputerWorld. He welcomes your comments at nraden@hiredbrains.com or on his blog at http://hiredbrains.wordpress.com.

About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified platform for big data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,400 partners and a seasoned professional services team help deliver faster time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry, plus top public sector organizations globally, run Cloudera in production.

For additional information, please visit us at www.cloudera.com or call 1-888-789-1488 or 1-650-362-0488.

Cloudera, Inc., 1001 Page Mill Road, Palo Alto, CA 94304, USA

© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.