WHITE PAPER

SECURING YOUR ENTERPRISE HADOOP ECOSYSTEM
Realizing Data Security for the Enterprise with Cloudera
Table of Contents

Introduction
Importance of Security
Growing Pains
Security Requirements of the Enterprise
Security Capabilities and Hadoop
Authentication
Authorization
Data Protection
Audit
Conclusion
ABSTRACT

Enterprises are transforming data management with unified systems that combine distributed storage and computation at limitless scale, for any amount of data and any type, and with powerful, flexible analytics, from batch processing to interactive SQL to full-text search. Yet to realize their full potential, these Enterprise Data Hubs require authentication, authorization, audit, and data protection controls. Cloudera makes the Enterprise Data Hub a reality for information-driven business by delivering applications and frameworks for deploying, managing, and integrating the data security controls demanded by today's regulatory environments and compliance policies. With Cloudera, an Enterprise Data Hub powered by Apache Hadoop includes Apache Sentry for role-based authorization, Cloudera Manager for centralized authentication management, and Cloudera Navigator for audit and lineage capabilities.

Introduction

Information-driven enterprises recognize that the next generation of data management is a single, central system that stores all data in any form or amount and provides business users with a wide array of tools and applications for working with this data. This evolution in data management is the Enterprise Data Hub, and Apache Hadoop is at its core. Yet as organizations increasingly store data in this system, a growing portion is business-sensitive and subject to regulations and governance controls. This information requires that Hadoop provide strong capabilities and measures to ensure security and compliance.
This paper aims to explore and summarize the common security approaches, tools, and practices found within Hadoop at large, and how Cloudera is uniquely positioned to provide, strengthen, and manage the data security required for an Enterprise Data Hub.

Importance of Security

Data security is a top business and IT priority for most enterprises. Many organizations operate within a business environment awash in compliance regulations that dictate controls and protection of data. For example, retail merchants and banking service firms must adhere to the Payment Card Industry Data Security Standard (PCI DSS) to protect their clients and their transactions, services, and account information. Healthcare providers and insurers, as well as biotechnology and pharmaceutical firms involved in research, are subject to the Health Insurance Portability and Accountability Act (HIPAA) compliance standards for patient information. And US federal organizations and agencies must comply with the various information security mandates of the Federal Information Security Management Act of 2002 (FISMA).

Organizations also have business mandates and objectives that establish internal information security standards to guard their data. For instance, firms may endeavor to maintain strict client privacy as a market differentiator and business asset. Others may view data security as a wall protecting their hard-earned intellectual property or politically sensitive information. These internal policies can also be seen as insurance against negative public relations and press that may come as a result of a breach or accidental exposure.

Growing Pains

Historically, early Hadoop developers did not prioritize data security, as the use of Hadoop during this time was limited to small, internal audiences and designed for workloads and data sets that were not deemed overly sensitive and subject to regulation.
The resulting early controls were designed to address user error (to protect against accidental deletion, for example) and not to protect against misuse. For stronger protection, Hadoop relied on the surrounding security capabilities inherent to existing data management and networking infrastructure. As Hadoop matured, the various projects within the platform, such as HDFS, MapReduce, Oozie, and Pig, addressed their own particular security needs. As is often the case with distributed, open source efforts, these projects established different methods for configuring the same types of security controls.

The data management environment and needs are vastly different today. The flexible, scalable, and cost-efficient storage capabilities of Hadoop now offer businesses the opportunity to acquire a greater range of data for analysis and the ability to retain this data for longer periods of time. This concentration of data assets makes Hadoop a natural target for those with malicious intent, but with the advances in data security and security management in Hadoop today, enterprises have robust and comprehensive capabilities to prevent intrusions and abuse.
HADOOP ECOSYSTEM

Data Store: Hadoop Distributed File System (HDFS), Apache HBase, Cloudera Search, Apache Accumulo
Batch Processing: MapReduce, MapReduce Streaming
Query Processing: Apache Hive, Cloudera Impala, Apache Spark
Data Flow: Apache Pig, Apache Crunch
Data Acquisition: Apache Sqoop, Apache Flume
Workflow Management: Apache Oozie
User Interface: Cloudera Hue
Enterprise Management: Cloudera Manager, Cloudera Navigator

The maturity of Hadoop today also means many data paths into and out of this information source. The breadth of ways in which a business can work with data in Hadoop, from interactive SQL and full-text search to advanced machine-learning analysis, does offer organizations unprecedented technical capabilities and insights for their multiple business audiences and constituencies. However, the projects and products of this ecosystem are loosely coupled, and while they often share the same data, metadata, and resources, data can and does move between them, and often each project will have separate channels for getting data into and out of the platform. Enterprise security standards demand examination of each pathway, and this poses challenges for IT teams, as some channels and methods may have the concept of fields, tables, and views, for example, while other channels might only address the same data as more granular files and directories.

Moreover, while all of these capabilities allow organizations the many means to explore data more fully and realize the potential value of their information, these acts of discovery might also uncover previously unknown sensitivity in the data. Organizations can no longer assume that critical and sensitive data has been identified prior to storage within Hadoop. The powerful and far-reaching capabilities inherent in Hadoop thus also pose substantial challenges to organizations.
While Hadoop stands as the core of the platform for big data, employing Hadoop as the foundation of an Enterprise Data Hub requires that the data and processes of this platform meet the security requirements of the enterprise. Without such assurances, organizations simply cannot realize the full potential of and capitalize on an Enterprise Data Hub for data management.

Security Requirements of the Enterprise

How then can organizations provide these assurances for data security? A common approach to understanding how enterprises meet these challenges is to view data security as a series of business and operational capabilities, including:

> Perimeter Security and Authentication, which focuses on guarding access to the system, its data, and its various services;
> Governance and Transparency, which consists of the reporting and monitoring on the where, when, and how of data usage;
> Entitlement and Access, which includes the definition and enforcement of what users and applications can do with data; and
> Data Protection, which comprises the protection of data from unauthorized access, either at rest or in transit.

Enterprise security requirements also demand that these controls are managed centrally and automated across multiple systems and data sets, in order to provide consistency of configuration and application according to IT policies and to ease and streamline operations. This requirement is particularly critical for distributed systems, where deployment and management events are common throughout the lifetime of the services and underlying infrastructure. Lastly, enterprises require that systems and data integrate with existing IT tools, such as directories like Microsoft's Active Directory (AD) or the Lightweight Directory Access Protocol (LDAP), and logging and governance infrastructures like security information and event management (SIEM) tools. Without strong integration, data security becomes fragmented.
This fragmentation can lead to gaps in coverage and weak spots in the overall security umbrella, creating entry points for breaches and exploitation.
Security Capabilities and Hadoop

Security in Hadoop today is very different from what was provided even two years ago. While Hadoop still faces the challenges outlined above, Cloudera has responded by providing security controls and management that meet today's enterprise requirements. As a result of this effort, the functions of the 20+ separate common Hadoop projects are beginning to consolidate. For example, with Cloudera Manager, security for data-in-motion can be configured across multiple projects. With Cloudera Navigator, IT teams can now automatically gather events from the major Hadoop projects and send them to a single logging tool of their choice. And the Apache Sentry project enables administrators to centrally manage the permissions for data that can be viewed in multiple tools and computing engines.

The remainder of this paper focuses on how Hadoop now provides enterprise-grade security and how IT teams can support deployments in environments that face frequent regulatory compliance audits and governance mandates. To do so, it is helpful to explore in greater detail the technical underpinnings, design patterns, and practices manifested in Hadoop for the four common factors of enterprise security: Authentication, Authorization, Data Protection, and Auditing.
Kerberos is a centralized network authentication system providing secure single sign-on and data privacy for users and applications. Developed as a research project at the Massachusetts Institute of Technology (MIT) in the 1980s to tackle mutual authentication and encrypted communication across unsecured networks, Kerberos has become a critical part of any enterprise security plan. The Kerberos protocol offers multiple encryption options that satisfy virtually all current security standards, ranging from single DES and DIGEST-MD5 to RC4-HMAC and the latest NIST encryption standard, the Advanced Encryption Standard (AES). Learn more about Kerberos at http://web.mit.edu/kerberos/

"The Security Assertion Markup Language (SAML) is an XML-based framework for communicating user authentication, entitlement, and attribute information. By defining standardized mechanisms for the communication of security and identity information between business partners, SAML makes federated identity, and the cross-domain transactions that it enables, a reality." - Organization for the Advancement of Structured Information Standards (OASIS). Learn more about SAML at http://saml.xml.org/

Authentication

Authentication mitigates the risk of unauthorized usage of services, and the purpose of authentication in Hadoop, as in other systems, is simply to prove that a user or service is who he claims to be. Typically, enterprises manage identities, profiles, and credentials through a single distributed system, such as a Lightweight Directory Access Protocol (LDAP) directory. LDAP authentication consists of straightforward username/password services backed by a variety of storage systems, ranging from file to database. Another common enterprise-grade authentication system is Kerberos. Kerberos provides strong security benefits, including capabilities that render intercepted authentication packets unusable by an attacker.
Kerberos virtually eliminates the threat of impersonation present in earlier versions of Hadoop and never sends a user's credentials in the clear over the network. Both authentication schemes have been widely used for more than 20 years, and many enterprise applications and frameworks leverage them at their foundation. For example, Microsoft's Active Directory (AD) is an LDAP directory that also provides Kerberos for additional authentication security. Virtually all of the components of the Hadoop ecosystem are converging to use Kerberos authentication, with the option to manage and store credentials in LDAP or AD.

Patterns and Practices

Hadoop, given its distributed nature, has many touchpoints, services, and operational processes that require robust authentication in order for a cluster to operate securely. Hadoop provides critical facilities for accomplishing this goal.

Perimeter Authentication

The first common approach concerns secured access to the Hadoop cluster itself; that is, access that is external to or on the perimeter of Hadoop. As mentioned above, there are two principal forms of external authentication commonly available within Hadoop: LDAP and Kerberos. Best practices suggest that client-to-Hadoop services use Kerberos for authentication, and, as mentioned earlier, virtually all ecosystem clients now support Kerberos. When Hadoop security is configured to utilize Kerberos, client applications authenticate to the edge of the cluster via Kerberos RPC, and web consoles utilize HTTP SPNEGO Kerberos authentication. Despite the different methods used by clients to handle Kerberos tickets, all leverage a common Kerberos core and associated central directory of users and groups, thus maintaining continuity of identification across all services of the cluster and beyond.
To further aid IT teams with access integration, Cloudera has recently updated Hue and Cloudera Manager to offer additional Single Sign-On (SSO) options by employing the Security Assertion Markup Language (SAML) for authentication.

Cluster and RPC Authentication

Virtually all intra-Hadoop services mutually authenticate each other using Kerberos RPC. These internal checks prevent rogue services from introducing themselves into cluster activities via impersonation and subsequently capturing potentially critical data as a member of a distributed job or service.

Distributed User Accounts

Many Hadoop projects operate at file-level granularity on data sets within HDFS, and this activity requires that a given user's account exist on all processing and storage nodes (NameNode, DataNode, JobTracker/ApplicationMaster, and TaskTracker/NodeManager) within a cluster in order to validate access rights and provide resource segmentation. The host operating system provides the mechanics to ensure user account propagation - for example, Pluggable Authentication Modules (PAM) or the System Security Services Daemon (SSSD) on Linux hosts - allowing user accounts to be centrally managed via LDAP or AD, which facilitates the required process isolation. (See Authorization: Process Isolation)
Authorization

Authorization is concerned with who or what has access to or control over a given resource or service. Since Hadoop merges together the capabilities of multiple, varied, and previously separate IT systems as an Enterprise Data Hub that stores and works on all data within an organization, it requires multiple authorization controls with varying granularities. Each control is modeled after controls already familiar to IT teams, allowing them to easily select the right tool for the right job.

For example, common data flows in both Hadoop and traditional data management environments require different authorization controls at each stage of processing. First, a commercial ETL tool may gather data from a variety of sources and formats, ingesting that data into a common repository. During this stage, an ETL administrator has full access to view all elements of input data in order to verify formats and configure data parsing. Then the processed data is exposed to business intelligence (BI) tools through a SQL interface to the common repository. End users of the BI tools have varying levels of access to the data according to their assigned roles. In the first stage (data ingestion), file-level access control is in place, and in the second stage (data access through BI tools), field- and row-level access control applies.

In cases where multiple authorization controls are needed, Hadoop management tools can increasingly simplify setup and maintenance by:

> Tying all users to groups, which can be specified in existing LDAP or AD directories;
> Providing a single policy control for similar interaction methods, like batch and interactive SQL and search queries. Apache Sentry permissions apply to Hive (HiveServer2), Impala, and Search, for example.
This consolidation continues to accelerate with anticipated features in Cloudera Manager that will expose permission configurations through its centralized management interface.

Patterns and Practices

Authorization for data access in Hadoop typically manifests in three forms: POSIX-style permissions on files and directories, Access Control Lists (ACLs) for management of services and resources, and Role-Based Access Control (RBAC) for certain services with advanced access controls to data. These forms do share some similarities.

Process Isolation

Hadoop, as a shared services environment, relies on the isolation of computing processes from one another within the cluster, thus providing strict assurance that each process and user has explicit authorization to a given set of resources. This is particularly important within MapReduce, as the Tasks of a given Job can execute Unix processes (i.e. MapReduce Streaming), individual Java VMs, and even arbitrary code on a host server. In frameworks like MapReduce, the Task code is executed on the host server using the Job owner's UID, thus providing reliable process isolation and resource segmentation at the OS level of the Hadoop cluster.

POSIX Permissions

The majority of services within the Hadoop ecosystem, from client applications like the CLI shell to tools written to use the Hadoop API, directly access data stored within HDFS. HDFS uses POSIX-style permissions for directories and files; each directory and file is assigned a single owner and group. Each assignment has a basic set of permissions available; file permissions are simply read, write, and execute, and directories have an additional permission to determine access to child directories.
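The owner/group/other model described above can be sketched in a few lines. The following is an illustrative Python sketch of how an HDFS-style read check resolves (the `FileStatus` structure and `can_read` helper are assumptions for illustration, not the Hadoop API):

```python
# A minimal sketch of POSIX-style permission checks as HDFS applies them.
from collections import namedtuple

FileStatus = namedtuple("FileStatus", ["owner", "group", "mode"])  # mode, e.g. 0o750

def can_read(status, user, user_groups):
    """Return True if `user` may read, per the owner/group/other read bits."""
    if user == status.owner:
        return bool(status.mode & 0o400)   # owner read bit
    if status.group in user_groups:
        return bool(status.mode & 0o040)   # group read bit
    return bool(status.mode & 0o004)       # other read bit

# A directory owned by etl:analysts with mode 750 (rwxr-x---)
status = FileStatus(owner="etl", group="analysts", mode=0o750)
print(can_read(status, "etl", []))              # owner: True
print(can_read(status, "alice", ["analysts"]))  # group member: True
print(can_read(status, "bob", ["interns"]))     # other: False
```

Because the check resolves group membership from the central directory (LDAP/AD, via PAM or SSSD as noted above), the same identity yields the same answer on every node in the cluster.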
APACHE SENTRY
Role-based Access Control Hierarchies

MetaStore Objects: Server, Database, Table, Partition, Columns, View, Index, Function/Routine, Lock
MetaStore Privileges: Insert, Select, All
Search Objects: Collection
Search Privileges: Query, Update, All

Role-Based Access Control (RBAC)

For finer-grained access to data accessible via schema -- that is, data structures described by the Apache Hive Metastore and utilized by computing engines like Hive and Impala, as well as collections and indices within Cloudera Search -- Cloudera developed Apache Sentry, which offers a highly modular, role-based privilege model for this data and its given schema. (Cloudera donated Apache Sentry to the Apache Foundation in 2013.) Sentry governs access to each schema object in the Metastore via a set of privileges like SELECT and INSERT. The schema objects are common entities in data management, such as SERVER, DATABASE, TABLE, COLUMN, and URI, i.e. file location within HDFS. Cloudera Search has its own set of privileges, e.g. QUERY, and objects, e.g. COLLECTION.

As with other RBAC systems that IT teams are already familiar with, Sentry provides for:

> Hierarchies of objects, with permissions automatically inherited by objects that exist within a larger umbrella object;
> Rules containing a set of multiple object/permission pairs;
> Groups that can be granted one or more roles;
> Users that can be assigned to one or more groups.

Sentry is normally configured to deny access to services and data by default, so that users have limited rights until they are assigned to a group that has explicit access roles.
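The chain described above (user to groups, groups to roles, roles to privileges on a hierarchy of objects, with default deny) can be sketched as follows. This is a conceptual Python sketch of the model, not Sentry's actual API; the dictionary names and the tuple-based object paths are assumptions for illustration:

```python
# Hedged sketch of an RBAC resolution in the Sentry style: access is denied
# by default, and a privilege granted on a parent object (e.g. a database)
# is inherited by the objects beneath it (e.g. its tables).
groups_for_user = {"alice": {"analysts"}}
roles_for_group = {"analysts": {"read_sales"}}
# Each privilege is (object path, action); paths are hierarchical tuples.
privileges_for_role = {"read_sales": {(("server1", "sales_db"), "SELECT")}}

def allowed(user, obj_path, action):
    for group in groups_for_user.get(user, ()):
        for role in roles_for_group.get(group, ()):
            for granted_path, granted_action in privileges_for_role.get(role, ()):
                # Inherited if the granted path is a prefix of the object path.
                inherited = obj_path[:len(granted_path)] == granted_path
                if inherited and granted_action in (action, "ALL"):
                    return True
    return False  # default deny

print(allowed("alice", ("server1", "sales_db", "orders"), "SELECT"))  # True
print(allowed("alice", ("server1", "sales_db", "orders"), "INSERT"))  # False
print(allowed("bob", ("server1", "sales_db", "orders"), "SELECT"))    # False
```

Note that the user never appears in a privilege rule directly; revoking a group membership in LDAP/AD removes all derived rights at once, which is what makes the model manageable at enterprise scale.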
Column-level Security, Row-level Security, and Masked Access

Using the combination of Sentry-based permissions, SQL views, and User Defined Functions (UDFs), developers can gain a high degree of access control granularity for SQL computing engines through HiveServer2 and Impala, including:

> Column-level security - To limit access to only particular columns of entire tables, users can access the data through a view, which contains either a subset of the columns in the table or certain columns masked. For example, a view can filter a column to only the last four digits of a US Social Security number.
> Row-level security - To limit access by particular values, views can employ CASE statements to control the rows to which a group of users has access. For example, a broker at a financial services firm may only be able to see data within her managed accounts.

Access Control Lists

Hadoop also maintains general access controls for the services themselves, in addition to the data within each service and in HDFS. Service access control lists (ACLs) range from NameNode access to client-to-DataNode communication. In the context of MapReduce and YARN, user and group identifiers form the basis for determining permission for job submission or modification.

Apache HBase also uses ACLs for data-level authorization. HBase ACLs authorize various operations (READ, WRITE, CREATE, ADMIN) by column, column family, and column family qualifier. HBase ACLs are granted and revoked to both users and groups.
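The column- and row-level controls described above can be illustrated with a small sketch. This is plain Python standing in for what a SQL view accomplishes in HiveServer2 or Impala (the `masked_view` function and the sample rows are illustrative assumptions, not engine syntax):

```python
# Sketch of what a masking view plus a row filter achieves: the caller only
# ever sees masked columns and the rows assigned to them.
rows = [
    {"ssn": "123-45-6789", "account": "A1", "manager": "carol"},
    {"ssn": "987-65-4321", "account": "B2", "manager": "dave"},
]

def masked_view(rows, current_user):
    """Column-level: expose only the SSN's last four digits.
    Row-level: return only the rows the current user manages."""
    out = []
    for r in rows:
        if r["manager"] == current_user:                    # row-level filter
            out.append({"ssn": "xxx-xx-" + r["ssn"][-4:],   # column mask
                        "account": r["account"]})
    return out

print(masked_view(rows, "carol"))  # [{'ssn': 'xxx-xx-6789', 'account': 'A1'}]
print(masked_view(rows, "dave"))   # [{'ssn': 'xxx-xx-4321', 'account': 'B2'}]
```

In practice the filtering predicate lives in the view definition and Sentry grants access to the view rather than the base table, so the underlying unmasked columns are never reachable by the restricted group.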
Data Protection

The goal of data protection is to ensure that only authorized users can view, use, and contribute to a data set. These security controls not only add another layer of protection against potential threats by end-users, but also thwart attacks from administrators and malicious actors on the network or in the data center. The means to achieving this goal consists of two elements: protecting data when it is persisted to disk or other storage mediums, commonly called data at rest, and protecting data while it moves from one process or system to another, or data in transit.

Several common compliance regulations call for data protection in addition to the other security controls discussed in this paper. While Hadoop does provide native security capabilities for data in transit, the storage framework (HDFS) does not provide data-at-rest protection out of the box. For these needs, Cloudera's certified security partners provide a wide variety of data protection capabilities.

Patterns and Practices

Hadoop, as a central data hub for many kinds of data within an organization, naturally has different needs for data protection depending on the audience and processes using a given data set. Hadoop provides a number of facilities for accommodating these objectives.

Data Protection Technologies

There are several core data protection methods offering various degrees of coverage and security that can be applied within an organization's governance and compliance policies.

> Encryption - A mathematical transformation that can only be reversed through the use of a key. Strong encryption leverages NIST-recognized methods such as 256-bit AES. Encryption may use either format-preserving modes (e.g. a 9-digit Social Security number encrypts to another 9-digit number) or other modes that expand the length of the value.
> Tokenization - A reversible transformation usually based on a randomly generated value that substitutes for the original value. Through the use of a lookup table, a given value, such as a credit card number, will always tokenize to the same random value. Tokens are usually format-preserving, so a 16-digit credit card number remains 16 digits. Tokenization has specific advantages for credit card data, for example, which is subject to PCI DSS compliance: systems that contain only tokenized credit card numbers may be removed from audit scope, thereby simplifying annual audits.
> Data Masking - An irreversible transformation of data using a variety of methods, such as replacing a value with another value from a list of realistic values for a given data type (e.g. replacing the name "Steve" with "Mike"), or simply replacing certain digits with zeros or the letter x (e.g. replacing the first five digits of a US Social Security number with x, leaving the last four digits unchanged).

Key Management

Encryption requires encryption keys. All of the above data protection techniques involve encryption: even tokenization and masking use internal tables that are protected using encryption. Key generation, storage, and access all need to be carefully managed to avoid a security breach or data loss. For example, keys must never be stored with the data, so that encrypted data remains useless if stolen. Key management is outside the scope of standard, out-of-the-box Hadoop; however, Cloudera's certified security partners provide a number of enterprise-grade key management solutions.
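The lookup-table tokenization described above can be sketched minimally. The following Python sketch shows the defining properties (format-preserving, stable, reversible only via the protected table); the `Tokenizer` class is an illustrative assumption, not a vendor API, and a real system would encrypt and harden the vault:

```python
# Sketch of lookup-table tokenization: a credit card number always maps to
# the same random, same-length token, reversible only through the vault.
import secrets

class Tokenizer:
    def __init__(self):
        self._token_for = {}   # original value -> token
        self._value_for = {}   # token -> original value (the "vault")

    def tokenize(self, card_number):
        if card_number not in self._token_for:
            # Generate random digits of the same length; retry on collision.
            while True:
                token = "".join(secrets.choice("0123456789")
                                for _ in range(len(card_number)))
                if token not in self._value_for:
                    break
            self._token_for[card_number] = token
            self._value_for[token] = card_number
        return self._token_for[card_number]

    def detokenize(self, token):
        return self._value_for[token]

t = Tokenizer()
tok = t.tokenize("4111111111111111")
assert len(tok) == 16                          # format-preserving
assert t.tokenize("4111111111111111") == tok   # stable mapping
assert t.detokenize(tok) == "4111111111111111"
```

The PCI DSS scope-reduction argument follows directly from the structure: any system holding only `_token_for` outputs, without access to the vault, cannot recover card numbers.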
Data Granularity

Data protection can be applied at a number of levels within the Hadoop data infrastructure.

> Field-level - Encryption, tokenization, and data masking can all be applied to specific sensitive data fields, such as credit card numbers, Social Security numbers, or names. When protection is applied at this level, it is generally applied only to specific sensitive fields, not to all data.
> Filesystem-level - Encryption can be applied at the operating system file system level to cover some or all files in a volume.
> Network-level - Encryption can be applied at this level to encrypt data just before it gets sent across a network and to decrypt it as soon as it is received. In Hadoop this means coverage for data sent from client user interfaces as well as service-to-service communication like remote procedure calls (RPCs).

Field-level
Advantages:
> Protection travels with the data; it can protect data in motion and at rest.
> Minimizes CPU overhead, since protection is applied only to a fraction of the data.
> Protection can serve some of the same purposes as data masking (revealing certain digits, e.g. the last 4 digits of an SSN, while concealing others) when used as a reversible, format-preserving method.
Disadvantages:
> Must be able to identify where the sensitive data exists in order to protect it.
> Must find integration points in the data flow to intercept the sensitive data fields to protect and expose them. Due to the large number of data flow options inside or outside Hadoop, this can be a complex operation.

Filesystem-level
Advantages:
> All (or large chunks) of a data set are encrypted, so there is no need to identify specific fields that need protection.
> Encryption is applied at the center of the Hadoop ecosystem (HDFS), so there is only one logical place where data needs to be intercepted to encrypt and decrypt.
Disadvantages:
> Necessitates CPU overhead on all nodes, since all data is being encrypted and decrypted constantly. This may or may not be an issue, depending on workloads on the cluster.
> Strictly protection for data-at-rest; guards against the recovery of data if a disk or host is removed from the data center for service. Does not protect data if copied from HDFS to a local machine or if exposed when used in report generation.

Network-level
Advantages:
> All data across the network is protected; no need to identify particular fields for protection.
> Available now on virtually all transmissions within the Hadoop ecosystem.
> Uses industry-standard protocols like SSL/TLS and SASL.
Disadvantages:
> Coverage limited to the network; must rely on other controls to protect data before and after transmission.
> CPU overhead required to encrypt and decrypt data transmissions.

Choosing Among the Protection Options

In designing a data protection strategy for Hadoop, most companies strive to cover both data-at-rest and data-in-motion to address a large number of threats and regulatory compliance requirements. As the options above illustrate, there are multiple ways to address this challenge. Below are examples of approaches that companies may choose to secure their data, at rest and in transit. Note that while these approaches are common, they are not the only ways to reach a given objective, such as regulatory compliance.

Healthcare data governed by HIPAA - Often data-in-motion protection at the network level is employed for all Hadoop ecosystem tools used, and tools like Cloudera Manager largely automate the application of these protections. For data-at-rest protection, firms commonly encrypt all file systems, including HDFS, using one of the Cloudera certified security partners. This protection has some CPU overhead, but it is completely transparent to end-users.

Credit card data governed by PCI DSS - Credit card numbers are tokenized upstream before they are stored in HDFS.
Most analysis can be done with tokenized data, but occasionally an application needs original credit card numbers and can call an API to de-tokenize. This approach simplifies PCI DSS audits by keeping Hadoop out of scope, but application changes are required to tokenize and de-tokenize data.
Hadoop has historically lacked centralized cross-component audit capabilities. Recent advances such as Cloudera Navigator add secured, real-time audit components to key data and access frameworks in Hadoop. For example, Navigator captures all activity within HDFS, Hive, HBase, Impala, and Sentry.

An alternative approach for credit card data involves encryption for data-at-rest (as with the healthcare example above), coupled with masking of credit card data so that the majority of users can only see the first 6 and last 4 digits, while the middle digits are replaced with X or 0. This approach enables PCI DSS compliance, but without the audit scope reduction offered by tokenization.

Insurance data used for modeling - Sensitive fields are irreversibly masked upon acquisition into HDFS and typically remain masked. Data is processed to be as realistic as possible in order to still be used to build models with no application changes. However, with this approach, use cases requiring original data are not an option.

Audit

The goal of auditing is to capture a complete and immutable record of all activity within a system. Auditing plays a central role in three key activities within the enterprise. First, auditing is part of a system's security regime and can explain what happened, when, and to whom or what in the case of a breach or other malicious intent. For example, if a rogue administrator deletes a user's data set, auditing provides the details of this action, and the correct data may be retrieved from backup. The second activity is compliance: auditing participates in satisfying the core requirements of regulations associated with sensitive or personally identifiable information (PII), such as the Health Insurance Portability and Accountability Act (HIPAA) or the Payment Card Industry Data Security Standard (PCI DSS).
Auditing provides the touchpoints necessary to construct the trail of who, how, when, and how often data is produced, viewed, and manipulated. Lastly, auditing provides the historical data and context for data forensics. Audit information leads to an understanding of how various populations use different data sets and can help establish the access patterns of those data sets. This examination, such as trend analysis, is broader in scope than compliance and can assist content and system owners in their data optimization and ROI efforts.

The challenges facing auditing are the reliable, timely, and tamper-proof capture of all activity, including administrative actions. Until recently, the native Hadoop ecosystem has relied primarily on log files. Log files are unacceptable for most enterprise audit use cases: real-time monitoring is impossible, and log mechanics can be unreliable - a system crash before or during a write commit can compromise integrity and lose data. Other alternatives have included application-specific databases and other single-purpose data stores, yet these approaches fail to capture all activity across the entire cluster. In addition, an enterprise audit solution must allow for simple central enforcement of logging policies, such as retention periods, high availability, tamper resistance, and common logging formats.

Patterns and Practices

The Hadoop ecosystem has multiple applications, services, and processes that constitute the corpus of activity, and the following standard enterprise auditing approaches and architectures must be applied across this ecosystem.

Event Capture

The linchpin of these approaches is the efficient, reliable, and secure capture of event activity as it occurs. Enterprise auditing should cover everything from the creation, viewing, modification, and deletion of data to permission changes and queries executed across a number of systems and frameworks.
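One common technique for meeting the tamper-resistance requirement above is hash-chaining: each audit record carries a hash that binds it to its predecessor, so any after-the-fact edit or deletion breaks verification. The sketch below illustrates the idea only; the class and field names are assumptions for this example, not an actual audit-system schema.

```python
import hashlib
import json
import time

class AuditTrail:
    """Illustrative tamper-evident event capture via hash chaining.
    Each record's hash covers its content plus the previous record's
    hash, so modifying or removing any record invalidates the chain."""

    GENESIS = "0" * 64

    def __init__(self):
        self._events = []
        self._last_hash = self.GENESIS

    def record(self, user: str, action: str, resource: str) -> dict:
        event = {
            "user": user, "action": action, "resource": resource,
            "ts": time.time(), "prev": self._last_hash,
        }
        payload = json.dumps(event, sort_keys=True).encode()
        event["hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = event["hash"]
        self._events.append(event)
        return event

    def verify(self) -> bool:
        prev = self.GENESIS
        for event in self._events:
            body = {k: v for k, v in event.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if event["prev"] != prev:
                return False
            if hashlib.sha256(payload).hexdigest() != event["hash"]:
                return False
            prev = event["hash"]
        return True
```

A verifier that holds only the most recent hash can detect whether any earlier record - including an administrator's own actions - was altered after capture.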
Hadoop has historically lacked centralized cross-component audit capabilities, which greatly curtailed its use within a broader enterprise context, especially with regulated and sensitive data and workloads.
However, recent advances such as Cloudera Navigator add secured, real-time components to key data and access frameworks in Hadoop. For example, Navigator captures all activity within HDFS, Hive, HBase, and now Impala and Sentry, producing an event stream that the Navigator Server captures, commits, and serves to both internal and external reporting systems. External systems could include SIEM products such as RSA enVision. Prior to Cloudera Navigator, gathering a complete set of events required administrators to write scripts that collected logs from the individual Hadoop components and then to find ways to search across them and correlate events - a cumbersome and error-prone proposition given the variety of log formats and the distributed architecture of the platform.

Reporting

Activity reviews by auditors and systems occur with various scopes, filters, and timeframes; arguably the most ubiquitous is regular and ad-hoc report generation for formal audit procedures. Report generation comes in two forms. The first is the assisted report, where the auditor works with the system administrator to capture the data, screenshots, and other details required for, say, quarterly reports. The other approach provides separate access and user interfaces to the audit data: the auditor builds his or her own reports with little or no assistance from the system administrator. Cloudera Navigator is the first native Hadoop application for audit reporting and provides a real-time dashboard and query interface for report generation and data gathering, as well as export capabilities to external systems.

Monitoring and Alerting

Auditing also includes real-time monitoring and alerting of events as they occur. Typical SIEM tools have rule engines to alert appropriate parties to specific events.
Integration

The monitoring and reporting of Hadoop systems, while critical to its enterprise usage, are only a part of an enterprise's total audit infrastructure and data policy. Often these enterprise tools and policies require that all audit information route through a central interface to aid comprehensive reporting, and Hadoop-specific audit data can be integrated with these existing enterprise SIEM applications and other tools. For example, Cloudera Navigator exposes Hadoop audit data through several delivery methods: via syslog, acting as a mediator between the raw event streams in Hadoop and the SIEM tools; via a REST API for custom tools; or simply exported to a file, such as comma-delimited text. In addition, certain Cloudera partner vendors are actively working to add their event data to the Cloudera Navigator event stream so that applications and services closely tied to Hadoop can share events in a common framework.
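The delivery methods above - syslog lines for SIEM ingestion and comma-delimited export for file-based pulls - can be sketched as simple formatters. The facility/severity value, hostname, and field names below are illustrative assumptions, not an actual Navigator export format.

```python
import csv
import io
import json

def to_syslog_line(event: dict, host: str = "edh-node01") -> str:
    """Format an audit event as an RFC 3164-style syslog line for SIEM
    ingestion; <134> encodes facility local0, severity informational."""
    return "<134>{ts} {host} audit: {msg}".format(
        ts=event["ts"], host=host,
        msg=json.dumps(event, sort_keys=True))

def to_csv(events: list) -> str:
    """Export the same events as comma-delimited text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["ts", "user", "action", "resource"])
    writer.writeheader()
    writer.writerows(events)
    return buf.getvalue()
```

Either form can then be shipped to the SIEM by an existing syslog forwarder or picked up from a spool directory, keeping the Hadoop cluster a producer of audit data rather than a second, parallel audit system.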
Conclusion

Next-generation data management - the Enterprise Data Hub - requires authentication, authorization, audit, and data protection controls in order to establish a place to store and operate on all data within the enterprise, from batch processing to interactive SQL to advanced analytics. Hadoop is at its foundation, yet the core elements of Hadoop are only part of a complete enterprise solution. As more data, of more varieties, moves to the hub, it becomes more likely that this data will be subject to compliance and security mandates. Comprehensive and integrated security within Hadoop, then, is a keystone to establishing the new center of data management. Cloudera is making it easier for enterprises to deploy, manage, and integrate the necessary data security controls demanded by today's regulatory environments and to realize the Enterprise Data Hub. With Cloudera, security in Hadoop has matured rapidly, with developments such as Apache Sentry for role-based authorization, Cloudera Manager for centralized authentication management (planned), and Cloudera Navigator for audit capabilities. It is clear that with Cloudera, enterprises can consolidate the security functions of the Hadoop ecosystem, from authentication and authorization to data protection and auditing.

Cloudera, Inc Page Mill Road, Palo Alto, CA or cloudera.com 2013 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.