Infosys Labs Briefings
VOL 11 NO 1
2013

Metadata Management in Big Data
By Gautham Vemuganti

Big data analytics must reckon with the importance and criticality of metadata

Big data, true to its name, deals with large data sets characterized by volume, variety and velocity. Any enterprise that is deploying, or considering deploying, Big data applications has to address the metadata management problem. Traditionally, much of the data that business users work with is structured. This, however, is changing with the exponential growth of data, or Big data.

Metadata defining this data, however, is spread across the enterprise in spreadsheets, databases, applications and even in people's minds (the so-called "tribal knowledge"). Most enterprises do not have a formal metadata management process in place because of the misconception that it is an Information Technology (IT) imperative with no impact on the business. However, the converse is true. It has been shown that a robust metadata management process is not merely helpful but required for successful information management.

Big data introduces large volumes of unstructured data for analysis. This data could be in the form of a text file or any multimedia file (e.g., audio or video). To bring this data into the fold of an information management solution, its metadata should be correctly defined.

Metadata management solutions provided by various vendors usually have a narrow focus: an ETL vendor will capture metadata for the ETL process, and a BI vendor will provide metadata management capabilities for its BI solution. The siloed nature of metadata does not give business users an opportunity to have a say and actively engage in metadata management. A good metadata management solution must provide visibility across multiple solutions and bring business users into the fold for a collaborative, active metadata management process.

METADATA MANAGEMENT CHALLENGES

Metadata, simply defined, is data about data.
In the context of analytics, some common examples of metadata are report definitions, table definitions, the meaning of a particular master data entity (sold-to customer, for example), ETL mappings, and formulas and computations. The importance of metadata cannot be overstated. Metadata drives the accuracy of reports, validates data transformations, ensures accuracy of calculations and enforces consistent definition of business terms across multiple business users.

In a typical large enterprise that has grown by mergers, acquisitions and divestitures, metadata is scattered across the enterprise in various forms, as noted in the introduction.

In large enterprises, there is wide acknowledgement that metadata management is critical, but most of the time there is no enterprise-level sponsorship of a metadata management initiative. Even where there is, it is focused only on one specific project sponsored by one specific business.

The impact of good metadata management practices is not consistently understood across the various levels of the enterprise. Conversely, the impact of poorly managed metadata comes to light only after the fact, i.e., after a certain transformation happens, a report or a calculation is run, or two divisional data sources are merged.

Metadata is typically viewed as the exclusive responsibility of the IT organization, with business having little or no say in its management. The primary reason is that there are multiple layers of organization between IT and business, which introduce communication barriers between the two.

Finally, metadata is not viewed as a very exciting area of opportunity. It is only addressed as an afterthought.

DIFFERENCES BETWEEN TRADITIONAL AND BIG DATA ANALYTICS

In traditional analytics implementations, data is typically stored in a data warehouse. The data warehouse is modeled using one of several techniques, is developed over time and is a constantly evolving entity. Analytics applications developed using the data in a data warehouse are also long-lived. Data governance in traditional analytics is a centralized process, and metadata is managed as part of it. In traditional analytics, data is discovered, collected, governed, stored and distributed.

Big data introduces large volumes of unstructured data. This data is highly dynamic and therefore needs to be ingested quickly for analysis. Big data analytics applications, however, are characterized by short-lived, quick implementations focused on solving a specific business problem. The emphasis of Big data analytics applications is more on experimentation and speed than on long, drawn-out modeling exercises.

The need to experiment and derive insights quickly from data changes the way data is governed. In traditional analytics there is usually one central governance team focused on governing the way data is used and distributed in the enterprise. In Big data analytics, there are multiple governance processes in play simultaneously, each geared towards answering a specific business question. Figure 1 illustrates this.

[Figure 1: Data Governance Shift with Big Data Analytics. Panel labels: single monolithic governance process; multiple governance processes. Source: Infosys Research]

Most of the metadata management challenges described in the previous section relate to typical enterprise data that is highly structured. To analyze unstructured data, additional metadata definitions are necessary.

To illustrate the need to enhance metadata to support Big data analytics, consider sentiment analysis using social media conversations as an example. Say someone posts this message on Facebook: "I do not like my cell-phone reception. My wireless carrier promised wide cell coverage but it is spotty at best. I think I will switch carriers." To infer the intent of this customer, the inference engine has to rely on metadata as well as the supporting domain ontology.
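As a rough illustration of how an inference engine might lean on metadata and an ontology for this post, the sketch below tags the text with entities and then applies a rule that maps a combination of entities to a conclusion. Everything in it is an assumption made for illustration: the cue phrases, the entity labels, the two rules and the function names `tag_entities` and `infer` are hypothetical, and simple substring matching stands in for whatever text analysis a real engine would use.

```python
# Hypothetical metadata: entities of interest and the cue phrases that signal them.
ENTITY_CUES = {
    "WirelessCarrier": ("wireless carrier", "carriers"),
    "Sentiment:negative": ("do not like", "spotty"),
    "Intent:switch": ("switch carriers",),
}

# Hypothetical ontology rules: a set of required entities and the conclusion
# they support. More specific rules are listed first.
ONTOLOGY_RULES = [
    (
        {"WirelessCarrier", "Sentiment:negative", "Intent:switch"},
        "customer intends to switch wireless carriers",
    ),
    (
        {"WirelessCarrier", "Sentiment:negative"},
        "customer is dissatisfied with the wireless carrier",
    ),
]


def tag_entities(text):
    """Return the set of metadata entities whose cue phrases occur in the text."""
    lowered = text.lower()
    return {
        entity
        for entity, cues in ENTITY_CUES.items()
        if any(cue in lowered for cue in cues)
    }


def infer(text):
    """Return the first ontology conclusion supported by the tagged entities."""
    found = tag_entities(text)
    for required, conclusion in ONTOLOGY_RULES:
        if required <= found:  # every required entity was detected
            return conclusion
    return None


POST = (
    "I do not like my cell-phone reception. My wireless carrier promised "
    "wide cell coverage but it is spotty at best. I think I will switch carriers."
)

if __name__ == "__main__":
    print(sorted(tag_entities(POST)))
    print(infer(POST))
```

The point of the sketch is the separation of concerns: the cue table plays the role of metadata, the rule list plays the role of the ontology, and the engine itself stays generic. Without the metadata definitions, the engine has nothing to match the text against.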
The metadata will define Wireless Carrier, Customer, Sentiment and Intent. The inference engine will leverage the ontology that depends on this metadata to infer that this customer wants to switch cell phone carriers.

Big data is not restricted to text. It could also contain images, videos and voice files. Understanding, categorizing and creating metadata to analyze this kind of non-traditional content is critical.

It is evident that Big data introduces additional challenges in metadata management. For enterprises to be successful with Big data analytics, there is a clear need for a robust metadata management process that governs metadata with the same rigor as data.

To summarize, a metadata management process specific to Big data should incorporate the context and intent of data, support non-traditional sources of data and be robust enough to handle the velocity of Big data.

ILLUSTRATIVE EXAMPLE

Consider an existing master data management system in a large enterprise. This master data system has been developed over time and has specific master data entities such as product, customer, vendor and employee. The master data system is tightly governed, and data is processed (cleansed, enriched and augmented) before it is loaded into the master data repository.

This enterprise is considering bringing in social media data for enhanced customer analytics. The social media data is to be drawn from multiple sources and incorporated into the master data management system.

As noted earlier, social media conversations have context, intent and sentiment. The context refers to the situation in which a customer was mentioned, the intent refers to the action that an individual is likely to take, and the sentiment refers to the state of being of the individual.

For example, suppose an individual sends a tweet or starts a Facebook conversation about a retailer from a football game. The context would then be a sports venue. If the tweet or conversation consisted of positive comments about the retailer, the sentiment would be determined as positive. If the update highlighted a promotion by the retailer, the intent would be to collaborate or share with the individual's network.

If such social media updates have to be incorporated into any solution within the enterprise, the master data management solution has to be enhanced with metadata about Context, Sentiment and Intent. Static lookup information will need to be generated and stored so that an inference engine can leverage it to provide inputs for analysis.

This will also necessitate a change in the back end. The ETL processes that are responsible for this master data will now have to incorporate the social media data as well. Furthermore, the customer information extracted from these feeds needs to be standardized before being loaded into any transaction system.

FRAMEWORK FOR METADATA MANAGEMENT IN BIG DATA ANALYTICS

We propose that metadata be managed using the five components shown in Figure 2.

Metadata Discovery
Discovering metadata is critical in Big data for the reasons of context and intent noted in the prior section. Social data is typically drawn from multiple sources, each with a different format. Once metadata for a certain entity is discovered for one source, it needs to be harmonized across all sources of interest. This process for Big data will need to be formalized using metadata governance.

Metadata Collection
A metadata collection mechanism should be implemented.
A robust collection mechanism should aim to minimize or eliminate metadata silos. As with discovery, a technology or a process for metadata collection should be put in place.

[Figure 2: Metadata Management Framework for Big Data Analytics. Components: Metadata Discovery, Metadata Collection, Metadata Governance, Metadata Storage, Metadata Distribution. Source: Infosys Research]

Metadata Governance
Metadata creation and maintenance need to be governed. Governance should include resources from both the business and IT teams, and a collaborative framework between business and IT should be established to provide it. Appropriate processes (manual or technical) should be utilized for this purpose. For example, on-boarding a new Big data source should be a collaborative effort between business users and IT. IT will provide the technology to enable business users to discover metadata.
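The discovery, harmonization and governance steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference design: the class names `FieldMetadata` and `GovernanceReview`, the function `harmonize`, the source names and the source field names (`screen_name`, `user_name`, `place`) are all hypothetical placeholders rather than the schema of any real social media feed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldMetadata:
    """One piece of discovered metadata: a source field and the entity it describes."""
    source: str            # hypothetical source name
    source_field: str      # hypothetical field name in that source
    canonical_entity: str  # enterprise-wide entity, e.g. Customer or Context


# Hypothetical output of the discovery step for two sources.
DISCOVERED = [
    FieldMetadata("twitter", "screen_name", "Customer"),
    FieldMetadata("facebook", "user_name", "Customer"),
    FieldMetadata("twitter", "place", "Context"),
]


def harmonize(discovered):
    """Group source-specific fields under the canonical entity they describe."""
    harmonized = {}
    for item in discovered:
        harmonized.setdefault(item.canonical_entity, []).append(
            (item.source, item.source_field)
        )
    return harmonized


@dataclass
class GovernanceReview:
    """Tracks sign-off on one harmonized entity by both business and IT."""
    entity: str
    approvals: set = field(default_factory=set)

    REQUIRED_ROLES = frozenset({"business", "IT"})  # class constant, not a field

    def approve(self, role):
        if role not in self.REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.approvals.add(role)

    @property
    def is_approved(self):
        # On-boarding is collaborative: both sides must sign off.
        return self.approvals == set(self.REQUIRED_ROLES)


if __name__ == "__main__":
    print(harmonize(DISCOVERED))
    review = GovernanceReview("Customer")
    review.approve("business")
    review.approve("IT")
    print(review.is_approved)
```

Requiring both roles before an entity counts as approved is one simple way to encode the collaborative on-boarding described in the text. A real process would likely add versioning, audit trails and escalation paths.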
Metadata Storage
Multiple models for enterprise metadata storage exist. The Common Warehouse Meta-model (CWM) is one example. A similar model, or an extension of it, can be utilized for this purpose. If no such model fits the requirements of an enterprise, suitable custom models can be developed.

Metadata Distribution
This is the final component. Metadata, once stored, will need to be distributed to consuming applications. A formal distribution model should be put into place to enable this distribution. For example, some applications can integrate directly with the metadata storage layer, while others will need specialized interfaces to be able to leverage this metadata.

We note that in traditional analytics implementations, a framework similar to the one we propose exists, but for data. The metadata management framework should be implemented alongside a data management framework to realize Big data analytics.

[Figure 3: Equal Importance of Metadata & Data ing for Big Data Analytics. Parallel tracks: Metadata Discovery / Data Discovery, Metadata Collection / Data Collection, Metadata Governance / Data Governance, Metadata Storage / Data Storage, Metadata Distribution / Data Distribution, leading to Big Data Distribution. Source: Infosys Research]

THE PARADIGM SHIFT

The discussion in this paper brings to light the importance of metadata and the impact it has not only on Big data analytics but on traditional analytics as well. We are of the opinion that if enterprises want to get value out of their data assets and leverage the Big data tidal wave, then the time is right to shift the paradigm from data governance to metadata governance and make data management part of the metadata governance process.

A framework is only as good as the way it is viewed and implemented within the enterprise. The metadata management framework succeeds when there is sponsorship for this effort from the highest levels of management. This includes both business and IT leadership within the enterprise. The framework can be viewed as being very generic. Change is a constant in any enterprise, and the framework can be made flexible enough to adapt to the changing needs and requirements of the business.

All the participants and personas engaged in the data management function within an enterprise should participate in the process. This will promote and foster collaboration between business and IT. The process should be made sustainable and followed diligently by all the participants until the framework is used to onboard not only new data sources but also new participants in the process.

Metadata management is an oft-ignored area in enterprises, with multiple consequences. The absence of robust metadata management processes leads to erroneous results, project delays and multiple interpretations of business data entities. These are all avoidable with a good metadata management framework.

The consequences affect the entire enterprise, either directly or indirectly. From the lowest-level employee to the senior-most executive, incorrect or poorly managed metadata affects not only operations but also the top-line growth and bottom-line profitability of an enterprise. Big data is widely viewed as an innovation that can bring tremendous value to enterprises. Without a proper metadata management framework, this value might not be realized.

CONCLUSION

Big data has created quite a bit of buzz in the marketplace. Pioneers like Yahoo and Google created the foundations of what is today called Hadoop. There are multiple players in the Big data market today, developing everything from technology to manage Big data, to applications needed to analyze Big data, to companies engaged in Big data analysis and selling that content. In the midst of all the innovation in the Big data space, metadata is often forgotten.
It is important to recognize the importance of metadata management and the critical impact it has on enterprises. If enterprises wish to remain competitive, they have to embark on Big data analytics initiatives. In this journey, enterprises cannot afford to ignore the metadata management problem.

REFERENCES

1. Davenport, T., and Harris, J. (2007), Competing on Analytics: The New Science of Winning, Harvard Business School Press.
2. Jennings, M., What role does metadata management play in enterprise information management (EIM)? Available at http://searchbusinessanalytics.techtarget.com/answer/the-importance-of-metadatamanagement-in-eim.
3. Metadata Management Foundation Capabilities Component (2011). Available at http://mike2.openmethodology.org/wiki/Metadata_Management_Foundation_Capabilities_Component.
4. Rogers, D. (2010), Database Management: Metadata is more important than you think. Available at http://www.databasejournal.com/sqletc/article.php/3870756/Database-Management-Metadata-is-moreimportant-than-you-think.htm.
5. Data Governance Institute (2012), The DGI Data Governance Framework. Available at http://datagovernance.com/fw_the_dgi_data_governance_framework.html.
Author's Profile

GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys. He can be contacted at Gautham_Vemuganti@infosys.com.