Data Catalogs for Hadoop: Achieving Shared Knowledge and Re-usable Data Prep
Neil Raden, Hired Brains Research, LLC
Traditionally, the job of gathering and integrating data for analytics fell to data warehouses. Data warehouses cleaned and aggregated the operational exhaust of business transactions. They created precision, accuracy and consistency out of messy transactional structures, so that business people could look back and make sense of the past. As data warehousing evolved, so did our operational applications, and with the advent of the Internet a whole new challenge to precision, accuracy and consistency emerged.

Why All the Work Before Analysis Is Getting the Attention

For business analysts and data scientists alike, the work now involves much more searching and discovery of data than navigating through stable structures. It is no longer a singular effort. Sharing results and persisting new views of data and models is core to success, and collaboratively governing data catalogs is the approach that works.

It is important to make the distinction between tools that provide connectors operating on a physical level (either generating SQL from structural metadata or using APIs, Web Services, etc.) and those that have a rich understanding of the data based on its content, structure and use. Connectors are not sufficient. Someone, or some thing, has to understand the meaning of the data to provide a catalog analysts can work from.
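To make that distinction concrete, here is a minimal, hypothetical sketch (plain Python against a throwaway SQLite table, standing in for no particular vendor's connector) of the structural metadata a physical-level connector can harvest. Names and types come back readily; what a column actually means, and how analysts use it, does not.

```python
import sqlite3

# A stand-in for a source system a connector might crawl (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ord_hdr (
        ord_id   INTEGER,
        cust_seg TEXT,   -- segment code: its meaning lives in analysts' heads
        amt_02   REAL,   -- which amount? list? net? discounted?
        ts       TEXT
    )
""")

# What a physical-level connector sees: structural metadata only.
for cid, name, col_type, *_ in conn.execute("PRAGMA table_info(ord_hdr)"):
    print(f"column={name!r:12} type={col_type}")

# The output lists names and types, but nothing about content, lineage or use:
# the semantic layer a catalog still has to supply.
```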
Data Scripts Re-Defined for More Agile Analysis

Keep in mind that all of the data used by data scientists, analysts and other applications is essentially used data. In other words, it was originally created for purposes other than analysis: supporting operations, capturing events, and recording transactions. Even when the data at the source is clear, consistent and error-free (which is rare in an integrated context), it will still contain semantic errors, missing entries, or inconsistent formatting for the secondary context of analytics. It must be handled before it can be used, especially if the goal is integration with data from other sources.

Data warehousing addressed this problem of dealing with used data long ago, with processes that executed before the data was stored, and this is where data warehousing and Hadoop differ. ETL tools, methodologies and best practices ensured that any analyst working with data in a data warehouse accessed a single source of truth that was already cleaned and aggregated to produce a predefined business metric, often labeled a key performance indicator. (Though, in fairness, applications built on data warehouses are often quite creative.) These solutions are partially useful for data scientists as well, but because the data was pre-processed to fit a specific model or schema, the richness of the data is lost when it comes to hypothesis testing in a more exploratory manner. In addition, a data warehouse is typically too slow to implement when experiments and discoveries are happening in an unplanned fashion ("Our competitor just issued a press release about a new pricing model. How should we respond?").

The innovation of Hadoop introduced a better storage mechanism and best practices for hypothesis testing on rich, raw data at low cost. Hadoop evolved out of a system designed to capture digital exhaust. Initially that exhaust was the byproduct of online activities; today it also includes a wealth of machine-generated events gathered from sensors in real time. While data warehouses typically stored aggregated information based on application transactions, digital exhaust arrives in a multitude of forms, like XML and JSON, that are typically referred to as unstructured, though more accurately described as not highly structured.

There is a challenge to working with data stored in its most raw form: any manipulation of that data done for hypothesis testing results in a new, unique data structure. While the data may be in one place physically (typically a Hadoop cluster), a collection of data silos is logically created, whereby each new analysis creates a new data-structure silo of its own. This gives tremendous agility to analysts, who can now ask any question they'd like of the data, but it has the downside of making it harder to find a single source of truth.

In the big data world, most of this agility is accomplished by hand coding, writing scripts or manually editing data in a spreadsheet: a tedious and time-consuming effort, made worse by slow network connections and under-powered platforms. The one redeeming quality of data preparation, even though it consumes so much valuable time, is that the process itself often yields new insights about the meaning of the data, what assumptions are safe to make about it, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply. But that is the work of analysts and data scientists. For an enterprise platform supporting many needs and uses, there are data preparation tools. The Hadoop argument was, "Why do it in expensive cycles on your RDBMS data warehouse when you can do it in Hadoop?" The reason is that writing code is not quite the same as an intelligent tool that provides data preparation assistance with versioning, collaboration, reuse, metadata and many existing transforms built in. It's also a little contradictory: if Hadoop is for completely flexible and novel analysis, who is going to write transformation code for every project? This approach involves mechanical, time-consuming data preparation and filtering that is often one-off, consumes a large percentage of the data scientist's time and provides no reference to the content or meaning of the data.
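To make the hand-coding point concrete, here is a minimal sketch of the kind of one-off cleaning script described above. It normalizes a few lines of JSON clickstream exhaust with inconsistent timestamps and missing fields; the field names and formats are hypothetical. The point is that every assumption ends up buried in throwaway code rather than captured anywhere others can find it.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw "digital exhaust": inconsistent timestamps, missing fields.
raw_lines = [
    '{"user": "u1", "ts": "2016-03-01T09:15:00Z", "amt": "19.99"}',
    '{"user": "u2", "ts": "03/01/2016 09:16", "amt": null}',
    '{"user": "u3", "amt": "7.50"}',
]

def parse_ts(value):
    """Guess at one of the timestamp formats seen so far; more surely exist."""
    if value is None:
        return None
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%m/%d/%Y %H:%M"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            pass
    return None

clean = []
for line in raw_lines:
    rec = json.loads(line)
    ts = parse_ts(rec.get("ts"))
    amt = float(rec["amt"]) if rec.get("amt") is not None else None
    if ts is None or amt is None:
        continue  # silently dropped -- an assumption nobody else will ever see
    clean.append({"user": rec["user"], "ts": ts.isoformat(), "amt": amt})

print(clean)  # one usable record; the rules that produced it live only here
```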
Re-thinking the Model for Organizational Scale

Is there a solution? Let's use a metaphor to describe it: let the gorillas write the script. The 1988 film Gorillas in the Mist, a biopic about the life and work of the late Dian Fossey with the highland gorillas in Rwanda, hit a crisis during filming that required an emergency meeting back at the studio. The full cast and crew were filming on a 12,000-foot mountain in Rwanda and had burned through half of the budget with little progress because, as the director put it, "The gorillas can't learn their lines." The suggestion was made to send back a skeleton crew to just film the gorillas, bring the footage back and write a new script around it. The film was a huge success, and from this was born the expression, "Let the gorillas write the script."

We hear a lot today about democratizing analytics or pervasive analytics, but as the producers of Gorillas in the Mist learned almost thirty years ago, there is great utility in letting things roll and learning as you go, especially if you have tools to capture those interactions and to provide all of the features that a big data catalog platform needs to be effective with operational data and digital exhaust alike.
The relevant questions today are not, "Is this a single version of the truth?" or "Who has access to what parts of the schema?" Instead, it's the (excuse the big word) phenomenology of how analysts actually work that matters:

How do people find and use information in their work? How do they collaborate with others, and how do they share insights?

How do we make use of today's ample resources in hardware and algorithms to create information agents and advisers tailored to people's (changing) needs?

How do we stitch together the discovery of data (what's there, not the market segment called "data discovery"), modeling and presentation without boiling the ocean for each new analysis?

How do we make the computer understand me, anticipate in a non-trivial way what I do and what I need, and help me? And a hundred other things besides.

The answer lies in moving the task of data integration and data extraction to more advanced knowledge integration and knowledge extraction, which draws not only on machine learning but also on human collaboration. Alation's applications for collaboration help capture knowledge from subject matter experts (knowledge captured by Alation automatically during the act of composing and executing queries) and help encourage documentation by making it easy for analysts to tag, annotate and share within their existing SQL-based workflows for analytics. What Alation does is take the guesswork out of what the data means, what it is related to and how it can be dynamically linked together, without endless data modeling and remodeling.
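The idea that query logs encode analysts' working knowledge can be illustrated with a deliberately naive sketch (this is not Alation's implementation, just a toy): counting which tables analysts actually query gives a usage-based signal about which datasets matter, before anyone has written a line of documentation.

```python
import re
from collections import Counter

# Hypothetical query-log excerpts; real logs would come from the warehouse or
# the Hadoop SQL engine, not a hard-coded list.
query_log = [
    "SELECT cust_id, cust_seg FROM warehouse.orders JOIN warehouse.customers ON ...",
    "SELECT cust_seg, SUM(amt) FROM warehouse.orders GROUP BY cust_seg",
    "SELECT * FROM staging.orders_tmp_2016_03_01",
]

table_refs = Counter()
for sql in query_log:
    # Naive extraction of table names after FROM/JOIN; a real catalog would
    # parse the SQL properly and resolve aliases, views and lineage.
    table_refs.update(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE))

# Usage-based weighting: heavily queried tables are probably the trusted ones;
# one-off staging tables probably are not.
for table, hits in table_refs.most_common():
    print(f"{table}: referenced in {hits} queries")
```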
Both Human Collaboration and Machine Learning Are the Solution

The behavior of human analysts, captured in logs of queries and in the reports and analyses of BI tools, provides crucial guidance to the work of machine-learning algorithms. Machine-learning algorithms, on the other hand, are invaluable for discovering patterns and relationships that the business analyst may never perceive. Letting the gorillas (i.e., the analysts) write the script, instead of waiting for incremental mapping of datasets, ensures that critical context based on use and implied semantic meaning is captured next to the data. Data unification approaches that tie various schemas together by column names or column content are useful up to a point, but they lack the critical weighting of the data implied by its previous and ongoing use.

A great deal of our problems, in both past and current approaches to analysis, lie in understanding data. That was already the case when we had orders of magnitude less of it, and it is only magnified by today's greater volumes. The semantics of data, how it is captured, how it is modeled, and the gap between the real phenomena we think we're seeing and what an application actually captures and encodes are all core to understanding data. Data mining, predictive models, and machine learning are just that: models built on imperfect, used data. The process of understanding data and preparing it for analysis demands input from both machine models and people, assisted by software like Alation that brings the two together.

About the Author

Neil Raden is an author, consultant and industry analyst, featured internationally, and the founder and Principal Analyst at Hired Brains Research, a firm specializing in the application of data management and analytics. His focus is enabling better-informed decision-making.
nraden@hiredbrains.com
hiredbrains.wordpress.com

About Alation

Alation is the first data catalog built for collaboration. With Alation, analysts are empowered to search, query and collaborate to achieve faster, more accurate insights.
Contact us: info@alation.com | (650) 799-4440 | alation.com