Data Discovery, Analytics, and the Enterprise Data Hub

Version: 101
Table of Contents

- Summary
- Used Data and Limitations of Legacy Analytic Architecture
- The Meaning of Data Discovery & Analytics
- Machine Learning in Data Discovery and Analytics
- Conclusion
- About the Author
We think too small, like the frog at the bottom of the well. He thinks the sky is only as big as the top of the well. If he surfaced, he would have an entirely different view.
- Mao Zedong

Summary

There are two kinds of reporting and analytical environments in organizations today. Until recently, most organizations provided structured, cleansed, and integrated data, summarized at levels convenient for conventional platforms. Data Warehousing and Business Intelligence dominate in these architectures. Other organizations, notably those that are primarily internet-centric, developed alternative ways to manage and analyze very large amounts of data from their own websites, search engines, and social physics (the analysis of external data from social media). These approaches are now generally referred to as big data. Only in the latter case can true data discovery and analytics be enabled, but the tools and techniques of big data are rapidly becoming the accepted architecture in organizations of both kinds.

Used Data and Limitations of Legacy Analytic Architecture

Operational systems typically support a constrained set of functions, even if that set is vast, as in an ERP system. Data is captured and stored in a logical way that fits the functions of the system, often in structures and semantics that are understandable only to those familiar with the system internals. Many systems provide only perfunctory reporting and, of course, do not integrate their data with other systems.
In most cases, it is not feasible to access this data directly for reporting and analytical purposes because:

- Analytical queries tend to be large and can affect the performance of the system.
- Security is typically enforced through the application software and could be compromised by direct access to the data.
- Performance is critical in operational systems, so the physical design of their databases favors performance over separation from the application logic, which makes retrieval difficult.
- Most analytical work involves data from more than one operational system. Abstraction techniques that provide a single view of multiple data structures, such as federation/virtualization, have proven (with the current technology) to perform poorly and are difficult to set up and maintain.

For those reasons, there is always a need to work with secondhand or used data for purposes that go beyond the operational system. For example, a system may keep track of inventory and contractual compliance, but linking this information with financial information to determine customer profitability is not possible within that system. This was the reason that Decision Support Systems (DSS), data warehouses, and Business Intelligence emerged. They provide tools for knowledge workers to access information from various systems to support all of their needs and processes. But the gathering of all of the data never really happens, because previous technologies are too costly and not agile enough to handle the scale and variety of data that is needed today. An enterprise data hub (EDH) provides not only a cost-effective container for big data but also supports a myriad of tools and applications to optimize your use and understanding of data.

Those who deal with used data need to discover and analyze information by formulating queries that reveal patterns or underlying relationships in the data. This process spans multiple systems and operations.
These interrogations and discoveries can take many forms, from simple data set discovery with search, to point-and-click queries, to machine learning and esoteric ensemble techniques. But the platforms and data stores they use must mask the scale and complexity of the data, allowing knowledge workers to pursue their thought process seamlessly and not have their productivity dragged down by platforms, tools, and approaches.

Unfortunately, old habits die hard. When it comes to BI, the industry is largely held back by the drag of its technology. What passes as acceptable BI in organizations today is rarely much more than re-platforming reports and queries that are ten to twenty years old. For analytics and BI to truly pay off in organizations, IT needs to shift its focus from deciding the informational needs of the organization through technical architecture and discipline to responding to those needs as quickly as they arise by creating an agile data environment.
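The customer profitability example earlier in this section can be made concrete with a small sketch. It assumes two hypothetical extracts of used data, one from an operational inventory system and one from a financial system, that share only a customer key. All field names and figures are invented for illustration; the point is only that the question is answerable once both extracts sit in one place.

```python
from collections import defaultdict

# Hypothetical extract from an operational inventory/contract system.
shipments = [
    {"customer": "acme", "units": 120, "unit_cost": 4.0},
    {"customer": "acme", "units": 30, "unit_cost": 5.0},
    {"customer": "globex", "units": 200, "unit_cost": 4.5},
]

# Hypothetical extract from a separate financial system.
invoices = [
    {"customer": "acme", "amount": 900.0},
    {"customer": "globex", "amount": 800.0},
]


def customer_profitability(shipments, invoices):
    """Blend the two extracts on the customer key and return revenue minus cost."""
    cost = defaultdict(float)
    for row in shipments:
        cost[row["customer"]] += row["units"] * row["unit_cost"]

    revenue = defaultdict(float)
    for row in invoices:
        revenue[row["customer"]] += row["amount"]

    # Include customers that appear in only one of the two systems.
    customers = sorted(set(cost) | set(revenue))
    return {c: revenue.get(c, 0.0) - cost.get(c, 0.0) for c in customers}


profit = customer_profitability(shipments, invoices)
```

Neither source system could produce this result alone, which is the limitation the section describes.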
The Meaning of Data Discovery & Analytics

This is the meaning of Data Discovery & Analytics: rather than pre-arranging data and structures to address known informational needs, data discovery and analytics combines massive repositories of all kinds of data with the tools and computing power that enable knowledge workers to find patterns, build models, and create new value from used data. This means not just data from an organization's operational systems, but all forms of external data as well. Big data opened up the possibility of managing social physics, the ability to capture and use data from social media and a host of other non-traditional data sources.

What does the term Data Discovery mean in today's landscape of tools and services? It is an imprecise term, but the industry adopted it despite its varied meanings. Even the word discovery is a little misleading. Discovering data is not the desired outcome; it is just one step in the process. Discovering an insight that leads to value is the main point of Data Discovery.

The term arose as an alternative to highly structured business intelligence. This approach provides the ability to explore and analyze data more or less free of the constraining models of data warehouses and other data sources. With a data hub, analysts can use tools that profile data sources in the EDH. These tools include machine learning applications that automate the search for interesting patterns and correlations that are not obvious given the volumes and variety of data now available. Beyond the initial efforts, analysts filter, transform, clean, enrich, and manipulate the data, all without pre-designed structures and queries (though there are many situations where those are necessary and appropriate).

What can you expect to see in Data Discovery and Analytics mode in an EDH?
In a collaborative environment, it is typical for analysts to create new data in the hub, such as:

- Predictions, time series, descriptions (metadata), and narratives of their investigations
- Derived and blended data from existing data sets, never before seen, including additional attributes that add richness to the data
- Predictive models and other code for quantitative analysis

These iterative data sets were once ignored because of the scarcity of storage space and the rigid nature of systems. Data hubs built on Hadoop have solved this by enabling:

- Larger sample sizes to create a complete view
- Access to archived/historical data because of the linear scalability of Hadoop
- Access to full-fidelity data, so that adding new dimensions doesn't take months
- A system with integrated search, SQL, and machine learning capabilities instead of just SQL
- The ability to reduce data preparation time through parallel processing

In addition to the data itself, the data discovery process is enhanced by tools and insights in what is generally an iterative and ongoing process:

- Weather data
- Rules engines and decision models
- Recommendation engines, both developed and licensed
- Broad quantitative tools, including statistics
- Streaming data capture and real-time analysis
- Graphing/charting tools
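As a rough illustration of what profiling a data source without a pre-designed structure can look like, the sketch below summarizes whatever fields happen to appear in a set of records. It is a minimal stand-in written for this paper, not any particular profiling tool; the `profile` function, the field names, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict


def profile(records):
    """Summarize each field found in a list of dict records.

    No schema is declared up front; fields are discovered as they appear.
    Values are assumed to be hashable (numbers, strings, and so on).
    """
    total = len(records)
    seen = defaultdict(list)
    for rec in records:
        for field, value in rec.items():
            if value is not None:
                seen[field].append(value)

    summary = {}
    for field, values in seen.items():
        summary[field] = {
            # Share of records that carry a non-null value for this field.
            "fill_rate": len(values) / total,
            "distinct": len(set(values)),
            # Mixed types often signal a field that needs cleaning.
            "types": dict(Counter(type(v).__name__ for v in values)),
        }
    return summary


# Hypothetical records blended from two sources with inconsistent fields.
records = [
    {"id": 1, "region": "west", "temp_f": 71.5},
    {"id": 2, "region": "west"},
    {"id": 3, "region": "east", "temp_f": None},
    {"id": 4, "region": "east", "temp_f": "n/a"},
]

report = profile(records)
```

The resulting summary is itself an example of the descriptive metadata that analysts write back to the hub alongside the data.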
Machine Learning in Data Discovery and Analytics

There are two primary techniques for data discovery: manual development of queries, and guided or unguided machine learning. In the latter case, data scientists can provide various parameters to a machine learning algorithm, but as long as a person is seeding the algorithms, there is the problem of unintentional bias. This issue is more pronounced when the specialists know more about the tools than about the domain they are examining. The risk of introducing bias is best minimized during the detection phase of machine learning. Data scientists can then analyze the output of the machine learning process for patterns, issues, and anomalies that are still best observed by a person, not a machine.

The Hadoop ecosystem enables critical and highly sophisticated analytic algorithms to be applied in the background. This allows users to find or predict issues by sifting through enormous amounts of heterogeneous data while minimizing bias, elapsed time, and excessive false positives.

The goal of unattended machine learning is to derive useful, accurate, and timely results for a wide range of requirements and investigations without much manual intervention. Data scientists are a scarce commodity, and anything that makes them more productive can reduce the costs (and errors) of data discovery by replacing expensive development efforts with packaged algorithms.

The EDH provides a single source of data, relieving data scientists from extracting and cataloging many data sources for each analytic model. It not only provides access to the data but can also employ metadata schemes that make identifying and using the data in the EDH far simpler and less error-prone. Finally, there is a growing and already robust set of analytical tools that work directly and efficiently with the EDH.
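The division of labor described in this section, an unattended detection pass followed by human review, can be sketched in a few lines. The example below uses a modified z-score based on the median and the median absolute deviation (MAD), a common robust outlier test, purely as a simple stand-in for the packaged algorithms the section refers to. The function name, the sample readings, and the choice of technique are illustrative assumptions; the 3.5 cutoff is a conventional default rather than a value tuned by an analyst.

```python
import statistics


def flag_anomalies(values, threshold=3.5):
    """Return the indexes of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD. Using the median
    instead of the mean keeps a few extreme values from hiding themselves.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sensor readings with one obvious outlier at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2, 10.1]
suspects = flag_anomalies(readings)
```

The pass runs with no per-data-set seeding, and its output is a short list of candidates that a data scientist can then examine with domain knowledge.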
Conclusion

The adoption of analytics will move an organization's efforts from simply informing decisions to taking action and tracking the effectiveness of those actions, thereby closing the loop. A giant leap in analytics is possible with the implementation of a modern architecture for managing and analyzing a broad collection of data, supported by a rapidly developing community of tools and methods.

About the Author

Neil Raden, based in Santa Fe, NM, is an industry analyst and active consultant, a widely published author and speaker, and the founder of Hired Brains Research LLC, http://www.hiredbrains.com. Hired Brains provides research, advisory, and consulting services in Analytics, Big Data, and Decision Management for clients worldwide. Neil is also the co-author of the Dresner Advisory Services Wisdom of BI series on Advanced and Predictive Analytics. Neil was a contributing author to one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall. He is a contributor to publications such as Wall Street Week, Forbes, Information Week, and ComputerWorld. He welcomes your comments at nraden@hiredbrains.com or on his blog at http://hiredbrains.wordpress.com.
About Cloudera

Cloudera is revolutionizing enterprise data management by offering the first unified platform for big data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, access, process, secure, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data. Cloudera's open source big data platform is the most widely adopted in the world, and Cloudera is the most prolific contributor to the open source Hadoop ecosystem. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,400 partners and a seasoned professional services team help deliver greater time to value. Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.

For additional information, please visit us at www.cloudera.com.

1-888-789-1488 or 1-650-362-0488
Cloudera, Inc. 1001 Page Mill Road, Palo Alto, CA 94304, USA

© 2015 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.