Building Governance into Big Data

A metadata-based approach for ensuring visibility and control for your Hadoop data architecture

A Hortonworks White Paper, September 2015
Contents

Overview
Why data governance matters
Four essential elements of Hadoop data governance
Why metadata and taxonomy hold the key to comprehensive data governance
The Data Governance Initiative (DGI): building cross-industry metadata services in Hadoop
Addressing cross-industry use cases
DGI becomes Apache Atlas
Supporting data governance across industries through a flexible type system
Key characteristics and capabilities of Atlas
Competitive analysis
Summary
Overview

As organizations pursue Hadoop initiatives to capture new opportunities for data-driven insight, data governance requirements can pose a key challenge. The management of information to identify its value and enable effective control, security and compliance for customer and enterprise data is a core requirement for both traditional and Modern Data Architectures. However, it hasn't yet been clear how to address these requirements easily using Hadoop. Traditional data governance tools either treat Hadoop as a black box, with no visibility or access into internal data manipulation (ETL and the like), or impose significant restrictions in order to meet these requirements, such as requiring every job to be authored within a single tool, undermining the value of the breadth of tooling across the Hadoop Modern Data Architecture. While Hadoop produces a large amount of operational and application-related data that can be used for auditing purposes, attempting to discern meaning from this information through a forensic, rear-view-mirror approach can yield inconsistent and inaccurate results. As a result of these challenges, a Data Lake can easily become a data swamp as users lose track of what data it contains, where it came from and the processes used to shape it.

Hortonworks, committed to innovation at the core, has been a leader in industry efforts to weave data governance into the fabric of the Modern Data Architecture. Recognizing that Hadoop isn't an island of data, our approach has been to ensure that everything we build is open and can integrate within the context of the Modern Data Architecture. This approach provides our customers with a comprehensive view of data as it moves between systems and is transformed and accessed. The realization of this approach has revolved around a common set of metadata services: information that describes and provides context about other data.
Through an open, collaborative initiative with a small number of industry thought leaders, Hortonworks has helped develop capabilities and frameworks that can be applied across industries to ensure effective management and governance of Big Data environments. Working with these thought leaders, Hortonworks launched Apache Atlas to apply consistent metadata and taxonomy across the data ecosystem. Hortonworks empowers data managers to ensure the transparency, reproducibility, auditability and consistency of the Data Lake and the assets it contains. Hadoop-centric information can be leveraged in this broader context using third-party products to form a comprehensive view. In this way, Apache Atlas sits at the core of data governance for Hadoop and makes it possible for enterprises to capitalize on the power of Big Data to drive growth, differentiation and competitive advantage while maintaining full control and oversight.
Why data governance matters

Data governance is a matter of critical importance for every organization that relies on data to drive business value; in other words, virtually every organization today. Businesses in highly regulated industries such as finance and healthcare must maintain effective control and visibility over data to ensure auditability and compliance. For other companies, data governance is crucial for securing sensitive information and protecting customer privacy while helping employees leverage the full value of information to drive growth and differentiation. But every company, as it grows and expands its Data Lake beyond the first few use cases and applications, needs an easy way to explore the data sets that exist within the lake. At the same time, data governance needs to be built in and automated as much as possible. The approach should support the process of bringing data into Hadoop and be applied consistently across every subsequent access point to the data itself. What enterprises need is an approach to data governance for Big Data that creates value by:

- Enabling rapid discovery of datasets already contained within Hadoop, eliminating requests for duplicate data to be curated or ingested
- Addressing compliance reporting requirements for Hadoop related to data access and lineage, to reduce both cost and regulatory risk
- Supporting comprehensive data governance initiatives that span Hadoop and traditional data systems

As Hadoop enables enterprises to grow the volume, velocity and variety of data that can be leveraged for insight, the importance of governance grows in tandem with the scale of the Data Lake. By building effective data governance into the architecture that powers Big Data, businesses can realize the full value of their information assets while ensuring effective risk management.
Four essential elements of Hadoop data governance

Critics of the Data Lake approach have characterized it as "throw all the data into the cluster now, and worry about cleansing, reconciliation and enrichment later." Hadoop's schema-on-read functionality allows users to forgo the definition and organization of data as it enters the system, while its distributed architecture facilitates the persistence of data. As a result, organizations have unchecked permission to store virtually any type of data while delegating data management and governance to application layers operating on top of the platform. This approach is all too likely to transform an organization's Data Lake into a data swamp while fostering additional governance risks. To realize the full value of Hadoop, enterprises must reconcile data management realities when they bring existing and new data from disparate sources into the Hadoop platform. Metadata and its use in the context of data governance are vital parts of any enterprise-ready Data Lake, and must be built into the ecosystem from the outset to prevent increasingly complex data management challenges moving forward.
The Hortonworks philosophy for data governance in the enterprise revolves around four tenets:

- Auditability: All relevant events and assets must be traceable with appropriate lineage
- Transparency: Governance standards and protocols must be clearly defined, consistently applied and available to all
- Reproducibility: Relevant data landscapes should be reproducible at any given point in time
- Consistency: Compliance programs must be policy-driven

Why metadata and taxonomy hold the key to comprehensive data governance

The success of data governance fundamentally revolves around capturing metadata and defining meaningful taxonomies for data. A definition of these concepts can provide a useful context for understanding their value within Hadoop and the broader data ecosystem.

Metadata is information that describes and provides context about other data. This may include data models, schemas and administrative information in addition to attributes such as title, author, subject, tags, date created and description. Once defined and documented, these attributes can be used to search, link, aggregate and grant access to the associated dataset. Metadata falls into three broad categories:

- Technical metadata: database name, table name, column name, data type
- Business metadata: business names, business definition, business classification, sensitivity tags
- Operational metadata: who (security access), what (job information), when (logs/audit trails), where (location)

Taxonomy refers to any structure that is used to organize and classify information.
Taxonomies are used as part of metadata fields to support consistent and accurate indexing of data structures, and to define the relationships among them. A taxonomy may include a standardized list of terms (a vocabulary) that can be used to consistently order data classification structures and/or hierarchies into parent-child relationships. One can think of metadata as a framework or filing cabinet for data, and taxonomy as a mechanism for organizing it into folders. This approach makes it possible to organize even vast amounts of information consistently, just as a similar hierarchical approach is used to categorize the millions of different life forms on Earth into a rational and manageable structure of families, genera and species. This can be contrasted with the simple name-value pairs used elsewhere, which are really free-form labels with no hierarchical structure and a vulnerability to error and duplication. It should also be noted that taxonomies can and do change over time, so accounting for such changes is critical to the success of any system that leverages them.
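The parent-child structure described above can be sketched as a simple mapping. The terms below are taken from the biological analogy in the text, and the code is purely illustrative; it is not any Atlas data model.

```python
# A taxonomy as explicit parent-child relationships. Each term points to
# its parent, so the full classification path can always be recovered --
# something flat name-value labels cannot express.
parent = {
    "Panthera leo": "Panthera",   # species -> genus
    "Panthera": "Felidae",        # genus -> family
    "Felidae": None,              # family is the root here
}

def classification_path(term):
    """Walk from a term up to the root, returning the hierarchy."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path

print(classification_path("Panthera leo"))
# -> ['Panthera leo', 'Panthera', 'Felidae']
```

With flat name-value labels, by contrast, nothing connects "Panthera leo" to "Felidae"; any grouping must be re-encoded by hand on every record, which is where duplication and typos creep in.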
Combining technical and business taxonomical metadata is the key to consistent data governance within Hadoop and the broader data ecosystem. A common metadata and classification framework ensures that all applications operating on top of the Hadoop infrastructure will relate to and treat data in the same way.

Data + Metadata + Business Taxonomy = Audit & Governance

- Data: HDFS files, HCatalog definitions, Falcon pipelines, Ranger sets of users
- Metadata: title, description, author, subject, date created, date modified, data sensitivity
- Business taxonomy: organizational hierarchy, customer/industry vocabulary, industry compliance standards
- Audit & governance: who did what, where, when and how

The Data Governance Initiative (DGI): building cross-industry metadata services in Hadoop

The application of data governance best practices for Hadoop is complicated by its current lack of a comprehensive approach to delivering visibility and control into workflows that require audit, lineage and security. While a number of available vendor solutions seek to fill this gap, they are not integrated into the broader Hadoop ecosystem and require a siloed, monolithic workflow. These vendors' support for multi-tenancy and concurrency is less than ideal, as current offerings have no visibility into activity outside their own narrow focus. Hortonworks has been a leader in industry efforts to address these challenges for Open Enterprise Hadoop. As part of our promise to drive enterprise readiness for Hadoop, Hortonworks established the Data Governance Initiative (DGI) in collaboration with Aetna, Merck, Target, and SAS. The charter of this initiative was to introduce a common, metadata-powered approach to data governance into the open source community, and to establish a framework with the flexibility to be applied across industries. Since its inception, this co-development effort has grown to include Schlumberger and a global financial institution.
DGI members set forth two guiding principles:

- The Hadoop data governance framework must integrate seamlessly with existing frameworks and exchange metadata with them
- The framework must also address governance across all the components or data engines that operate on top of the Hadoop platform
Figure 1: The Data Governance Initiative (DGI) laid the foundation for a common, metadata-powered approach to data governance.

DGI members worked on this shared framework to determine how users access data within Hadoop while interoperating with, and extending its capability to, existing third-party data governance and management tools.

Addressing cross-industry use cases

By bringing together leading companies with deep expertise across a range of industries, DGI made it possible to develop a truly cross-industry, extensible framework. DGI members actively worked to materialize real industry data governance solutions through the open source community at an unprecedented rate. The expertise the members brought to the DGI manifested itself in addressing the following industry use cases across financial services, healthcare, pharmaceuticals and telecommunications.

Chain of custody (compliance): The financial services sector operates under strict regulations that require detailed audit tracking of every event's origin, access and transformation in order to comply with customer and governmental inquiries. This involves tracking every copy, backup and derivation of each dataset, in addition to every action granting or denying data access. Financial services companies must be able to recreate the narrative for every dataset, from its creation through its disposition, at any given time.

Healthcare ad hoc reporting (30-day measures): Reimbursements by the Centers for Medicare & Medicaid Services (CMS) represent a significant portion of healthcare provider revenues. A healthcare institution's bottom line can be adversely affected if it is penalized by CMS for poor patient readmission rates, making it essential to be able to assess and track patient outcomes over their entire history. This involves analyzing a wide set of sensitive patient data from disparate data sources on an ad hoc basis for timely remediation.
The work done as part of the DGI can be used to discover, catalog and score patient data rapidly and accurately, and to present it in the relevant context.
Licensing of research data (data masking): To optimize return on investment for product development cycles that can stretch 10 to 20 years, pharmaceutical companies often license research data to other companies or partners. Each licensing agreement has specific requirements, often requiring data to be shared in its entirety with licensing customers or partners. To complicate matters, this data may contain sensitive personally identifiable information (PII), protected health information (PHI) or both. To prevent regulatory violations, the licensing company must mask this sensitive information while still making the entire dataset available to users based on their roles or data attributes. All these factors must be managed and coordinated in an efficient way. Energy companies often rely on similar licensing deals to monetize their own research data; while the regulatory environment differs, some of the same challenges come into play.

Log analysis (customer experience): Data from telephony, networked devices, set-top boxes and websites holds vast quantities of information about the experience of individual telecommunications customers. This information is highly valuable to telecom companies, as inconsistent customer service can easily increase customer attrition and lower service margins. However, current data technologies make it extremely difficult to correlate customer events spread across a number of years and petabytes of data, making insights more difficult to expose. Opt-in customer data is specific to device, subscribed product, time and geography, and the lineage of all these attributes must be tracked to enable effective analysis. To mitigate subpar customer experience, providers must perform both real-time and predictive analysis of live streaming data, correlated with deep historical analysis. This analysis must be performed using compliant methods grounded in established data governance practices.
DGI laid the foundation to provide true visibility into Hadoop business processes such as these and other key use cases across industries.

DGI becomes Apache Atlas

Building on the success of DGI, Hortonworks, Aetna, Merck, SAS, Schlumberger, Target and others carried their groundbreaking co-development efforts into a new Apache project. In April 2015, they submitted a proposal for a new incubator project called Apache Atlas to the Apache Software Foundation. The founding members of the project include all the members of the DGI and others from the Hadoop community. Apache Atlas was proposed to provide governance capabilities in Hadoop. At its core, Atlas is designed to exchange metadata both within and outside of the Hadoop stack. By reconciling logical data models and forensic events, enriched by business taxonomy metadata, Atlas enables a scalable set of core governance services. These services enable enterprises to effectively and efficiently address their compliance requirements by providing:

- Search and lineage for datasets
- Metadata-driven data access control
- Indexed and searchable centralized audit for operational events
- Comprehensive data lifecycle management, from ingestion to disposition
- Metadata interchange with other metadata tools
In this way, Atlas allows organizations to establish reliable and safe information products and better utilize information assets to generate revenue. By helping to eliminate duplicate data and its associated costs, Atlas makes it easier for IT to support data exploration and compliance. As Hadoop enables enterprises to grow the volume, velocity and variety of data they can leverage for insight, the importance of governance grows with the Data Lake. A common metadata store provides the foundation for addressing these requirements and delivering a broad range of data governance capabilities for Hadoop. It also provides a focal point for interoperability with any metadata consumer within the ecosystem and within the Modern Data Architecture, rather than requiring each project or component within the Hadoop stack to provide its own unique interface. This further reduces cost and complexity for IT while enabling a holistic approach to data governance across the Data Lake. Rather than requiring each third-party product (ETL tools, broader data governance tools, etc.) to understand which projects and components are within the Hadoop ecosystem, Atlas provides a focal point for interoperability and information exchange. Of course, this isn't delivered in a big-bang approach, but rather as a sustained open source effort. The community has decided to take a gradual approach to delivering comprehensive interoperability capabilities, and has come together to define and build the core of Apache Atlas. The community has also outlined a clear roadmap to integrate a number of Hadoop ecosystem components with the common metadata store. Hive was chosen as the starting point due to its maturity, its existing footprint among current Hadoop users and the fact that it is similar in concept to existing enterprise data warehouse technologies that are subject to these same data governance challenges.
Figure 2: Atlas delivers out-of-the-box integration with Apache Hive as its starting point, with plans to expand from there.
Supporting data governance across industries through a flexible type system

The Apache community built Atlas with the realization that when it comes to data governance, one size doesn't fit all. It would be impractical for the community to attempt to create a single "super data model" that would satisfy the unique requirements of all the diverse industries and business processes. This approach would also result in duplicate data models, given that enterprises across industries have already invested significant resources in building and refining the data models that reflect the unique ways in which they do business. A much more effective and efficient approach is to provide enterprises with the ability to import and export metadata as it currently exists in non-Hadoop systems such as ETL tools, ERP systems or data warehouses. The Atlas adaptive model streamlines compliance efforts by allowing companies to import existing metadata structures from other sources via REST-based APIs to leverage legacy investments, or to pre-load a taxonomy-rule combination for a specific industry or line of business. This approach is especially relevant for companies in the payment card industry (PCI), where a consistent metadata vocabulary ensures that downstream audit and compliance processes will match perfectly with metadata tags and access rules. With Atlas, data stewards also have the ability to define, annotate and automate the capture of relationships between data sets and underlying elements, including source, target and derivation processes. Atlas ensures downstream metadata consistency across the ecosystem by enabling enterprises to easily export metadata to third-party systems. The advantages of the flexible type system can be seen in its day-to-day use. Atlas empowers IT to model business organizations as well as technical metadata about enterprise data.
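As a concrete illustration of REST-based metadata import, the sketch below builds a trait (tag) definition payload in the JSON shape used by early Atlas releases and notes where it would be submitted. The host name is a placeholder, and the exact endpoint path and field names should be checked against the Atlas version actually deployed.

```python
import json

# Hypothetical Atlas endpoint; port 21000 is the usual Atlas web port,
# but the API path has varied across releases -- verify before use.
ATLAS_TYPES_URL = "http://atlas-host:21000/api/atlas/types"

# A minimal trait (tag) definition. The four top-level keys mirror the
# early Atlas type-system payload; treat the shape as illustrative.
pii_trait_def = {
    "enumTypes": [],
    "structTypes": [],
    "traitTypes": [
        {
            "typeName": "PII",
            "superTypes": [],
            "attributeDefinitions": [],
        }
    ],
    "classTypes": [],
}

payload = json.dumps(pii_trait_def)
# In a live cluster this would be submitted with an HTTP client, e.g.:
#   requests.post(ATLAS_TYPES_URL, data=payload,
#                 headers={"Content-Type": "application/json"})
print(json.loads(payload)["traitTypes"][0]["typeName"])  # -> PII
```

Once a trait like PII is registered, it can be attached to entities such as Hive columns, which is what downstream tools and Ranger policies key off.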
Administrators can create ad hoc or bulk structures that allow users to assign a business tag (taxonomy) to physical data structures including databases, tables or columns. For example, a data steward can assign a PII (personally identifiable information) tag to a column in a Hive table that contains employees' Social Security numbers. Whenever that column is used as part of a business workflow or queried for analysis purposes, it carries the PII tag with it and the user is notified of its appropriate use. Since Atlas is aware of how and when a tagged data structure was accessed, copied or modified, it can construct its lineage at any given time based on actual data events. This approach provides enterprises with confidence that their data governance processes are comprehensive enough to pass independent audit. The same approach applies to logical data structures (business taxonomy) such as hierarchies of departments or products. A data administrator can tag a data structure once at the parent level, and all the associated child elements automatically inherit that tag. For example, a human resources data asset group can be tagged "sensitive" or "PII", and all child groups inside that parent group, such as Drivers or Timesheets, inherit the attribute.

Figure 3: Apache Atlas enables business tags applied to the parent entity to be automatically inherited by child entities.
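The parent-level tagging just described can be sketched in a few lines. The group names come from the Drivers/Timesheets example in the text; the inheritance logic is a simplified stand-in for what Atlas does internally, not its actual implementation.

```python
# Tags are stored only where they were applied; the effective tags of a
# group are computed by walking up the parent chain, so every child
# automatically inherits what was applied at the parent level.
parent = {
    "Drivers": "HumanResources",
    "Timesheets": "HumanResources",
    "HumanResources": None,
}
applied_tags = {"HumanResources": {"PII", "sensitive"}}

def effective_tags(group):
    """Union of tags applied to the group and all of its ancestors."""
    tags = set()
    while group is not None:
        tags |= applied_tags.get(group, set())
        group = parent[group]
    return tags

print(sorted(effective_tags("Drivers")))  # -> ['PII', 'sensitive']
```

The key design point is that the tag is written once: retagging every child (and every future child) individually is exactly the error-prone duplication that flat name-value labels force.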
Key characteristics and capabilities of Atlas

As a result of the collaborative approach to its development, Atlas provides a robust and comprehensive framework for addressing governance for Big Data. The following attributes contribute to its unique effectiveness in this regard.

Prescriptive lineage

Lineage typically refers to the steps a dataset took to arrive at its current state, as well as any copies that may have been created. However, simply looking at audit or log correlations alone is not enough to determine whether the lineage is flawed, because it is not possible to determine with certainty whether the route a data workflow took was correct or in compliance. Data governance approaches based on time-based algorithms are especially problematic, as this inaccurate process can lead to misplaced confidence in a method that would never pass serious compliance scrutiny. Without a more comprehensive understanding, it is impossible to take any action that might be warranted. The correct approach is to combine logical models of workflow with log events for validation and completeness, an approach called prescriptive lineage. This is the path that Atlas takes.

Dynamic, metadata-based access policies for real-time policy enforcement

Governance control cannot be passive or simply forensic; reports on who did what, when, are not enough. Apache Ranger is an open source project that provides authorization and authentication for the Hadoop ecosystem. By integrating with Ranger, Atlas empowers enterprises to rationalize compliance policy at runtime based on Atlas's data classification schemes, leveraging Ranger to enforce flexible attribute-based policies that prevent violations from occurring. Ranger's centralized platform empowers data administrators to define security policy once, based on Atlas metadata tags or attributes defined by a data steward or administrator, and apply that policy in real time to an entire hierarchy of assets.
Data stewards can focus on discovery and tagging while another group manages compliance policy. This decoupling of explicit policy offers two important benefits:

- Dynamic policy enforcement: data-analysis-driven tags can be enforced immediately
- Reusability: one policy can be applied to many assets, simplifying management

Apache Ranger enforces both role-based (RBAC) and attribute-based (ABAC) access control to create a flexible security profile that meets the needs of data-driven enterprises. The initial set of policies being constructed within the community is defined as:

1. Attribute-based access controls: For example, a column in a particular Hive table is marked with the metadata tag PII. This tag is then used to assign multiple entitlements to a group. This is an evolution from role-based entitlements, which require discrete and static one-to-one mappings.

2. Prohibition against dataset combinations: It's possible for two datasets, for example one consisting of account numbers and the other of customer names, to be in compliance individually but pose a violation if combined. Administrators can apply a metadata tag to both sets to prevent them from being combined, helping avoid such a violation.

3. Time-based access policies: Administrators can use metadata to define access according to time windows in order to enforce compliance with regulations such as SOX 90-day reporting rules.

4. Location-specific access policies: Similar to time-based access policies, administrators can define entitlements differently by geography. For example, a U.S.-based user might be granted access to data while in a domestic office, and then travel to Switzerland. Although the same user may be trying to access the same data, the different geographical context applies, triggering a different set of privacy rules to be evaluated.
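The four policy types above can be sketched as a single evaluation function. This is an illustrative toy, not Apache Ranger's policy engine or API; the tag names, attribute names and cutoff date are all invented for the example.

```python
from datetime import datetime, timezone

def allowed(user_attrs, resource_tags, now, location, session_tags=frozenset()):
    """Toy tag-driven access check combining the four policy types."""
    # 1. Attribute-based: PII-tagged data requires a clearance attribute.
    if "PII" in resource_tags and "pii_cleared" not in user_attrs:
        return False
    # 2. Combination prohibition: some tag pairs (e.g. account numbers
    #    plus customer names) must never co-occur in one session.
    forbidden_pairs = {frozenset({"ACCOUNT_NO", "CUSTOMER_NAME"})}
    for pair in forbidden_pairs:
        if pair <= (session_tags | resource_tags):
            return False
    # 3. Time-based: a reporting-window tag closes access after a cutoff.
    if "SOX_WINDOW" in resource_tags and now > datetime(2015, 12, 31, tzinfo=timezone.utc):
        return False
    # 4. Location-specific: some data is readable only from approved regions.
    if "US_ONLY" in resource_tags and location != "US":
        return False
    return True

now = datetime(2015, 10, 1, tzinfo=timezone.utc)
print(allowed({"pii_cleared"}, {"PII"}, now, "US"))        # cleared user, OK
print(allowed(set(), {"PII"}, now, "US"))                  # no clearance, denied
print(allowed({"pii_cleared"}, {"CUSTOMER_NAME"}, now, "US",
              session_tags={"ACCOUNT_NO"}))                # forbidden combination
```

Because every rule keys off metadata tags rather than specific tables or users, retagging an asset immediately changes the access decision everywhere, which is the "dynamic policy enforcement" benefit described above.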
These policies can be used in combination to create a sophisticated security access policy for each user at a given point in time and location. Moreover, the reach that Apache Ranger provides in terms of authorization for an ever-growing number of Hadoop ecosystem components (eight at the time of this writing) allows organizations to consistently define and apply metadata-based data access policies regardless of the route by which a user or application attempts to reach the data itself.

Audit and reporting

Atlas leverages a common metadata store and policy rules, and the community plans to extend this further with centralized log data for advanced reporting and analysis. Customers can recreate the data landscape at any given time by capturing security access information for every application, process and interaction with data, thereby providing insight into operational information for completed tasks as well as intermediate steps and activities. In the future, by combining the capabilities of HDP log search with a cross-component globally unique identifier (GUID), Atlas will strive to provide greater visibility into the entire HDP stack.

RESTful APIs

Atlas facilitates exploration of audit information by providing pre-defined navigation paths to data classification and audit information. Text-based search features in Atlas locate relevant data and audit events across the Data Lake quickly and accurately. Data stewards can visualize a data set's lineage and then drill down into operational, security and provenance-related details.

Native connector for Hive integration

HDP 2.3 saw the initial release of Atlas, which includes a native connector that automatically captures all SQL activity on HiveServer2. All activity through HiveServer2 is tracked, providing lineage of both the data and the schema. This is then combined with business taxonomy to provide an enriched search and discovery capability.
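The lineage that the Hive connector captures can be pictured as a graph of datasets connected by the processes that produced them. The sketch below walks such a graph upstream; the dataset and process names are invented for the example, and real Atlas lineage records far more detail than this.

```python
# A toy lineage graph: each dataset maps to (process, source) pairs
# describing how it was derived. Names are illustrative only.
lineage = {
    "report.customer_churn": [("etl.join_accounts", "staging.accounts"),
                              ("etl.join_accounts", "staging.customers")],
    "staging.accounts": [("sqoop.import", "oltp.accounts")],
    "staging.customers": [("sqoop.import", "oltp.customers")],
}

def upstream(dataset, graph):
    """Collect every ancestor dataset by walking the graph upstream."""
    seen = set()
    stack = [dataset]
    while stack:
        for _process, source in graph.get(stack.pop(), []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen

print(sorted(upstream("report.customer_churn", lineage)))
```

An auditor asking "where did this report's figures come from?" is effectively running this walk; prescriptive lineage additionally checks the walk against the declared workflow model to catch missing or unexpected steps.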
Governance-ready certification

Atlas strives to foster a vibrant ecosystem that addresses Hadoop application integration requirements based on a centralized metadata store. A certification program aims to create a curated group of partners that contribute a rich set of data management features encompassing data preparation, integration, cleansing, tagging, ETL visualization and collaboration. Certified partners will define a set of metadata standards for exchanging data and contribute conforming data integration features to the metadata store. Customers can then subscribe to the features they want to deploy, with low switching costs and faster ramp-up times. Smaller firms can differentiate themselves by contributing innovative features to the program, and benefit from other features to devise end-to-end workflow processing.
Competitive analysis

As a result of the collaborative development of Atlas following the principles of Open Enterprise Hadoop, HDP offers key advantages over solutions developed through a proprietary approach to Hadoop.

Metadata services

HDP: Metadata built around a core flexible type system that can model any organizational and data structure. Support for hierarchies and inheritance of attributes (parent-to-child elements).
Proprietary Hadoop: Flat modeling using name-value pairs. Coarse and inelegant data modeling. No hierarchy or inheritance support.

HDP: Open, platform-wide metadata integration to provide cross-component lineage and dependencies.
Proprietary Hadoop: Lineage support for HCatalog, Hive and HDFS only. No support for Kafka or Storm.

HDP: Open metadata services coordinate and support the entire platform, including complete SQL lineage, tag-based real-time policy protection and a common taxonomy for data pipelines. Custom connections supported through a rich REST API set.
Proprietary Hadoop: Limited proprietary point integrations for certain components only (HCatalog, Hive and HDFS).

Prescriptive lineage

HDP: Business and operational: combines logical models of workflow and log events for validation and completeness.
Proprietary Hadoop: Operational event data lineage assembled by algorithm. Backward-looking only, with no validation for missing elements.

HDP: Taxonomy: lineage searchable both by hierarchical business taxonomy (classification) and tags (traits) such as PII, and by data type (Hive table, column, etc.).
Proprietary Hadoop: Search only on operational data and flat labels; no validation against taxonomy for duplications or typos.

HDP: Advanced search: a domain-specific language (DSL, SQL-like search) that supports keyword and full-text search.
Proprietary Hadoop: Full-text search only.
Data life cycle

HDP: Reusable: a logical model to create reusable and repeatable workflows.
Proprietary Hadoop: Manually create each job and schedule.

HDP: Built-in data management policies: late data handling, replication (both HDFS and Hive) and eviction (disposition).
Proprietary Hadoop: Manually create each job and schedule.

Third-party support

HDP: Governance-ready certification: certification that partners are being good citizens. Common metadata store, no proprietary formats; partners must use open APIs and meet an SLA for lineage commits.
Proprietary Hadoop: Not available.

HDP: Low cost and no vendor lock-in: a common metadata store allows HDP users to change vendors with minimal switching cost. Customers retain metadata control and ownership.
Proprietary Hadoop: Not available. Vendor lock-in with the typical cycle of configuration and migration.

HDP: Agility and rapid customization: common metadata allows rapid deployment of new vendors or features with minimal downtime and risk. Data management tools available a la carte instead of only in rigid suites.
Proprietary Hadoop: Vendor-specific proprietary point solutions. No shared metadata. Not open.
Summary

The transformative value of Big Data has driven the rapid adoption of Hadoop across businesses and industries of all kinds, but for Hadoop to be a truly enterprise-ready technology, its implications for data governance must be recognized and addressed. To manage risk, organizations need a comprehensive and effective way to ensure full visibility, control and compliance for the corporate and customer information in the Data Lake. Recognizing data governance as an essential element of Open Enterprise Hadoop, Hortonworks has collaborated with industry partners to create a flexible, open framework based on metadata and taxonomy to ensure the auditability, transparency, reproducibility and consistency of the Data Lake and the information it contains. This metadata-based approach is embodied in Apache Atlas, a project developed collaboratively by Hortonworks and a diverse group of large enterprises. Atlas provides a single gateway that interfaces with all the diverse components in the HDP stack and harmonizes them with the rest of the enterprise data ecosystem. A core flexible type system allows modeling of any organizational or data structure, with built-in support for hierarchies and inheritance of attributes or tags (parent-to-child elements). Administrators also benefit from rich capabilities to define and enforce policies flexibly to support a wide range of industry use cases, and to take action quickly when data governance policies are violated. The power and versatility of Hadoop is the direct result of its open and collaborative development. By continuing this approach to address key enterprise requirements for data governance, Hortonworks helps companies leverage the strength of the open source community to manage risk without compromising productivity or data accessibility. In this way, customers can be confident that their Big Data strategy is built on a foundation of visibility, control and compliance.
About Hortonworks Hortonworks develops, distributes and supports the only 100% open source Apache Hadoop data platform. Our team comprises the largest contingent of builders and architects within the Hadoop ecosystem who represent and lead the broader enterprise requirements within these communities. Hortonworks Data Platform deeply integrates with existing IT investments upon which enterprises can build and deploy Hadoop-based applications. Hortonworks has deep relationships with the key strategic data center partners that enable our customers to unlock the broadest opportunities from Hadoop. For more information, visit www.hortonworks.com.