Planning a data security and auditing deployment for Hadoop
Contents

Introduction: Key requirements for detecting data breaches and addressing compliance.
Architecture: In-depth look at the architecture throughout the Hadoop stack.
Plan: Best practices to consider when building out your data security plan.
Implement: Building blocks for implementing effective data monitoring.
Operationalize: Operationalize your processes, with extra emphasis given to handling security breaches and forensic investigations.
Conclusion: With InfoSphere Guardium, jump start your organization's use of Hadoop for enhanced business value.
Monitoring and auditing: Another step toward enterprise readiness for Hadoop

Hadoop is delivering insights for many organizations that are using it. However, the security risks remain high. Although some Hadoop distributions do support various security and authentication solutions, there has not been a comprehensive data activity monitoring solution for Hadoop until now. Considering that even robust and mature enterprise relational database systems are often the target of attacks, the relative lack of controls around Hadoop makes it an attractive target, especially as more sensitive and valuable data from a wide variety of sources moves into the Hadoop cluster.

Organizations that tackle this issue head-on, sooner rather than later, position themselves to expand their use of Hadoop for enhanced business value. They can proceed with the confidence that they can address regulatory requirements and detect breaches quickly, thus reducing overall business risk for the Hadoop project. Ideally, organizations should be able to integrate big data applications and analysis into an existing data security infrastructure, rather than relying on homegrown scripts and monitors, which can be labor-intensive, error-prone and subject to misuse.

With IBM InfoSphere Guardium data security solutions, much of the heavy lifting is taken care of for you. You define security policies that specify what data needs to be retained and how to react to policy violations. Data events are written directly to a hardened appliance, leaving no opportunity for even privileged users to access that data and hide their tracks. Out-of-the-box reports and policies get you up and running quickly, and those reports and policies are easily customized to align with your audit requirements. Using InfoSphere Guardium can dramatically simplify your path to audit-readiness by providing targeted, actionable information.
A scalable architecture and support for a wide variety of data platforms make it possible to integrate Hadoop activity monitoring with other database activity, giving security administrators an enterprise-wide monitoring and alerting system for detecting and preventing threats.
Comprehensive monitoring for Hadoop

InfoSphere Guardium helps you make sense of what's going on by actively monitoring activity throughout the Cloudera or IBM InfoSphere BigInsights Hadoop stack (see Figure 1), including Hue/Beeswax or the BigInsights Web Console, MapReduce, Hive, HBase and HDFS. Not only does this comprehensive monitoring help with data protection, it can also help you find and react to breaches or unauthorized access more quickly by making it easier to see what is happening. Even though much of the activity in Hadoop breaks down to MapReduce and HDFS, at that level you may not be able to tell what a user higher up in the stack was really trying to do, or even who the user was. It is similar to showing disk segment I/O operations instead of an audit trail of a database.

Figure 1. InfoSphere Guardium can capture activity as it flows through the Hadoop stack, from the user interface (Hue, the BigInsights Web Console, Oozie) through applications (MapReduce, Hive, HBase) down to storage (HDFS), answering questions such as: Who submitted the job or query? What jobs and queries ran? Is this an authorized job? Were there permission exceptions? What files and HBase tables were accessed?
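To make the database analogy concrete, the toy sketch below contrasts what an auditor sees with and without the higher-level capture. The paths, field names and records are invented for illustration; this is not InfoSphere Guardium output.

```python
# The same user action seen at two levels of the Hadoop stack.
# All values here are invented examples, not real captured data.
hive_view = {"user": "david", "query": "SELECT * FROM DavidTest"}
hdfs_view = [
    {"op": "open", "path": "/user/hive/warehouse/davidtest/part-00000"},
    {"op": "open", "path": "/user/hive/warehouse/davidtest/part-00001"},
]

def describe(hive, hdfs):
    # With the Hive-level capture, the intent ("who ran what query") is
    # clear; without it, an auditor sees only anonymous-looking file reads.
    return f"{hive['user']} ran {hive['query']!r}, touching {len(hdfs)} HDFS files"

print(describe(hive_view, hdfs_view))
```

Monitoring only at the HDFS level would show the two file reads but not the query, or even the end user, behind them.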
Comprehensive monitoring for Hadoop (continued)

By providing monitoring at different levels, InfoSphere Guardium makes it more likely that you can understand the activity, and it also lets you audit activities that come in directly through lower points in the stack. For example, the Hue/Beeswax report included with InfoSphere Guardium will show you the actual Hive queries that were run, as shown in Figure 2. A report over the same time period for HDFS would show you that activity at a file-system level.

Figure 2. See commands, users, exceptions and more. The report columns include Server Type, Server IP, Hive Parsed SQL, Hive User, Hive Command, Hive Database, Hive Table Name and Hive Error. Sample rows show statements such as CREATE EXTERNAL TABLE demo22 (a int, b int, c int), SELECT * FROM DavidTest, DROP TABLE sample_07 and SELECT * FROM sample_08, run by users david and cloudera with commands create_table and get_table against the default database, along with errors such as NoSuchObjectException(message: default.JoeD2222 table not found).
Architecture of the solution

As shown in Figure 3, InfoSphere Guardium continuously monitors data activity using lightweight software probes called S-TAPs, without relying on logs. The S-TAPs also do not require any changes to the Hadoop servers or applications. The S-TAP was originally designed for performance with low overhead; after all, the S-TAP is also used to monitor production database environments.

Because privileged users can delete or modify logs, InfoSphere Guardium helps ensure separation of duties by immediately intercepting and forwarding data activity to a separate hardened appliance, known as a Collector. There, the activity messages are compared to previously defined policies to detect violations that could, for example, generate an alert in real time. The relevant activity is stored in the Guardium repository, from which you can also do forensic analysis and schedule regular audit reports.

Figure 3. The architecture enforces separation of duties: S-TAPs on the cluster forward Hive queries, MapReduce jobs, and HDFS and HBase commands from clients to the InfoSphere Guardium Collector for reporting and alerting.
Make a plan

Data activity monitoring for Hadoop is newer than Hadoop itself, but with InfoSphere Guardium, a wide variety of enterprise data sources can be monitored using the same scalable environment. If you are already monitoring relational databases, the planning concepts will be similar, even if the specifics are different. Here are some questions to aid in planning a monitoring and auditing solution for Hadoop:

- Who needs to be involved?
- Where is the monitoring software installed? Where should the appliances be located?
- How should the deployment be rolled out?
- What are the business requirements for monitoring?

Who needs to be involved?
Rolling out an auditing solution requires the cooperation of a cross-disciplinary team, both during and after implementation. For the implementation stage, Table 1 provides some ideas of the roles and responsibilities involved. Adjust it for your own organization.

Recommendation: For new implementations of InfoSphere Guardium, it is always a good idea to consider using IBM services. The IBM services team can provide you with best practices and hands-on assistance to get up and running successfully.
Make a plan (continued)

Where is the software installed? Where should the appliances be located?
InfoSphere Guardium consists of software components that sit on the Hadoop cluster servers (the S-TAPs and the optional installation manager agents) and separate hardware or software appliances. The appliances can be fully configured software solutions delivered on physical appliances provided by IBM, or software images that you deploy on your own hardware.

A scalable architecture
The distributed architecture is built to scale from small to very large using a graduated system of Collectors and Aggregators, as well as the ability to perform load balancing (see Figure 4).

Primary team members:
- Business Analyst: Collects and documents business requirements for auditing, monitoring and logging.
- Data Monitoring Architecture team: Responsible for defining reports, policies and audit processes. To properly observe segregation of duties requirements, members of this team should not have privileges to install policies or modify the contents of groups that are defined for use in Guardium policies and reports, such as authorized users, privileged users or sensitive data.

Contributing team members:
- System Administrator: Typically installs software on operating systems, including components such as S-TAPs.
- Project Manager: Manages product implementations and upgrades.
- Network Engineer: Assigns IP addresses to the appliance, and ensures connectivity through network infrastructure including firewalls.
- Storage and Backup Engineers: Ensure that retention period policies are in compliance and proper operational procedures are in place.
- Security Escalation: Performs/activates forensic analysis if a data security breach is reported.
- Security Team: Produces standards for monitoring; stays up to date on industry data security requirements and government regulations.
- Technology Group: Evaluates, tests and certifies new software releases and patches; produces technical documentation.
- Application Managers: Keep the application administrator informed of non-BAU activity and implementation of new modules that may impact data collection.
- Hadoop Administrator: Keeps the application administrator informed of changes in the platform environment, such as OS upgrades and the introduction of new servers.

Table 1. Team members for an InfoSphere Guardium deployment
Make a plan (continued)

A Collector is used to collect data activity, analyze it in real time, and log it in the internal repository for further analysis and/or real-time reaction (alerting). Depending on how much audit data you collect (which is determined by your business requirements for auditing), you may need multiple Collectors, which should be co-located in the same data center as the Hadoop cluster.

The Aggregator is used to collect and merge information from multiple appliances (Collectors and other Aggregators) to produce a holistic view of the entire environment and generate enterprise-level reports. The Aggregator does not collect data itself; it just aggregates data from multiple sources. A single Aggregator can support up to ten Collectors. The Aggregator can be located anywhere, but requires network connectivity to the Collector units.

The Central Manager is used to manage the entire deployment (all the Collectors and Aggregators) from a single console, including patch installation, software updates, and the management and configuration of queries, reports, groups, users and policies. The Central Manager and Aggregator can be on the same appliance.

Figure 4. The scalable, distributed architecture: policies, groups and users are pushed down from the Central Manager; definitions are pushed up from Collectors and Aggregators to the Central Manager; and audit data is uploaded nightly from Collectors to Aggregators.
Make a plan (continued)

S-TAPs reside on Hadoop servers
Think of the S-TAP as the listener for data activity; one is installed on each Hadoop server that requires monitoring (see Figure 5). Each S-TAP must be configured with one or more inspection engines; this is how you tell the S-TAP which ports to monitor. For example, if you have the HDFS NameNode and the Hive master on the same machine, you would need one S-TAP configured with two inspection engines. To configure the inspection engines, you will need to work with the network or Hadoop administrator to get a list of the ports, such as the JobTracker and HBase master ports.

IBM provides a centralized solution for installing and updating multiple S-TAPs using the Guardium Installation Manager (GIM). GIM sits on a Central Manager and provides a UI to make S-TAP management, including applying software maintenance, simpler and more automated. This requires the installation of a GIM agent on each server, which you can do during any maintenance window; you then use GIM to install the S-TAPs.

Figure 5. Hadoop servers with Guardium S-TAPs. S-TAPs sit on the master servers: the HBase Master (NoSQL database), the MapReduce JobTracker (distributed data processing), the HDFS NameNode and Secondary NameNode (distributed data storage) and the HiveServer (distributed query processing). On the DataNode/TaskTracker/HBase RegionServer nodes, an optional S-TAP is required only for monitoring HBase Put commands.
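As a planning aid, the sketch below enumerates the inspection engines needed per host. This is a hypothetical planning helper, not the S-TAP's actual configuration format, and the port numbers are common CDH4-era defaults that must be confirmed with your network or Hadoop administrator.

```python
# Hypothetical planning sketch; not the S-TAP's real configuration format.
# Port numbers are common CDH4-era defaults -- confirm the actual ports
# with your network or Hadoop administrator.
DEFAULT_PORTS = {
    "hdfs_namenode": 8020,   # HDFS NameNode RPC
    "hive_server": 10000,    # HiveServer (Thrift)
    "jobtracker": 8021,      # MapReduce JobTracker
    "hbase_master": 60000,   # HBase Master
}

def plan_inspection_engines(services_on_host, ports=DEFAULT_PORTS):
    """One S-TAP per host; one inspection engine per monitored service port."""
    unknown = [s for s in services_on_host if s not in ports]
    if unknown:
        raise ValueError(f"no known port for: {unknown}")
    return [{"service": s, "port": ports[s]} for s in services_on_host]

# A host running both the HDFS NameNode and the Hive master needs a single
# S-TAP configured with two inspection engines:
engines = plan_inspection_engines(["hdfs_namenode", "hive_server"])
```

Keeping the service-to-port mapping in one reviewed list makes it easier to hand a definitive port inventory to whoever configures the inspection engines.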
Make a plan (continued)

How should the deployment be rolled out?
As with any significant IT infrastructure enhancement, it's a good idea to do a proof of concept in a sandbox environment. Not only will this help you validate the auditing solution, it will give you the opportunity to see for yourself how data activity is stored. It may also help you identify processes and procedures you need to put in place to make sure the production deployment will go smoothly, and to help support automation procedures. For example, it is possible to automatically update privileged-user groups or sensitive data objects in the system on a scheduled basis. For a production deployment, IBM services can help you create a project plan that includes education, planning, installation and configuration.

What are the business requirements for data monitoring?
Although InfoSphere Guardium provides a comprehensive data monitoring solution, in reality you don't need to monitor everything. For example, Hadoop has a chatty protocol, so InfoSphere Guardium includes a built-in policy with rules to filter out some of the internal messages the system uses for health checks. Over time, you can add rules to ensure that you are capturing activity that is required for audit. There are different levels of auditing to consider:

- Privileged user audit applies only to specific users or groups of users, and everything else is filtered out before even being sent to the InfoSphere Guardium appliance.
- Selective auditing means that only a subset of data activity is logged. However, in this case, everything is sent to the appliance, where it is determined whether the information is relevant and should be retained.
- Comprehensive auditing means that everything is audited and logged.

If you are already using database activity monitoring for audit and compliance, someone with Hadoop expertise may be able to map between the requirements on databases and those for Hadoop. For example, permission exceptions in Hadoop are file system permission errors rather than database authorization errors.
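The three auditing levels can be sketched as filters. This is a toy illustration with invented event records and group contents; it is not how Guardium implements these levels internally.

```python
# Toy illustration of the three auditing levels; event shapes and the
# privileged-user group are invented, not Guardium's internal logic.
PRIVILEGED_USERS = {"hdfs", "hbase", "cloudera"}

def privileged_user_audit(events):
    # Non-privileged activity is filtered out before it is ever
    # sent to the appliance.
    return [e for e in events if e["user"] in PRIVILEGED_USERS]

def selective_audit(events, keep):
    # Everything reaches the appliance; only relevant events are retained.
    return [e for e in events if keep(e)]

def comprehensive_audit(events):
    # Everything is audited and logged.
    return list(events)

events = [
    {"user": "cloudera", "op": "DROP TABLE sample_07"},
    {"user": "analyst1", "op": "SELECT * FROM sample_08"},
]
```

The practical trade-off is where the filtering happens: privileged user audit reduces network and appliance load at the source, while selective auditing keeps the filtering decision centralized on the appliance.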
Implement monitoring

After you get the appliances and S-TAPs installed and connected on the network, all the planning work you did around business requirements will pay off when implementing monitoring. You will start with the basic building blocks of creating groups and build upon that as follows:

1. Define and populate groups. This includes groups of users, sensitive data objects, applications, server IPs and client IPs.
2. Define a security policy.
3. Customize out-of-the-box auditing reports, or create your own.

Create and populate groups to simplify management and maintenance
Groups are central to simplifying management and control of the auditing environment. By classifying users, applications, servers, data objects and more into groups, you can take full advantage of the flexibility and power of the system, while also keeping it manageable. Think about some of the following groups:

- Privileged users (administrators)
- Sensitive objects (files or HBase tables)
- Applications
- Server IPs (this will help with managing traffic coming from multiple IPs)
- Client IPs, to help you manage and track back suspicious activity
- Commands (are there certain commands you want to capture and/or filter out?)

For example, you may not want to audit the activity of certain authorized applications, but you do want to be made aware of any application that starts accessing data without having gone through your internal approval processes. Therefore, with InfoSphere Guardium, you can create a group that you keep maintained with all authorized programs, and you can set up a report to track activity only for those applications NOT in this group.
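The "NOT in the authorized group" reporting idea can be sketched as follows. The group members and row fields are illustrative, not the actual Guardium schema or report definition.

```python
# Sketch of the "NOT in the authorized group" report idea; group members
# and row fields are illustrative, not the actual Guardium schema.
AUTHORIZED_JOBS = {"MapReduce", "sortlines"}

def unauthorized_activity(rows, authorized=AUTHORIZED_JOBS):
    """Keep only activity from programs outside the authorized group."""
    return [r for r in rows if r["source_program"] not in authorized]

rows = [
    {"source_program": "sortlines", "user": "svoruga"},    # authorized: filtered out
    {"source_program": "PiEstimator", "user": "svoruga"},  # not authorized: reported
]
flagged = unauthorized_activity(rows)
```

Because the report is defined against the group rather than a hard-coded list, authorizing a new application is a one-step change: add it to the group and it disappears from the exception report.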
Implement monitoring (continued)

For example, Figure 6 shows how to create a group of authorized programs. The new group is named Hadoop Authorized Job List; use the Guardium Group Builder to populate the group with members such as MapReduce and sortlines.

Figure 6. Creating a group of authorized programs
Implement monitoring (continued)

Figure 7 shows a partial report from Cloudera (CDH4) that includes a query to show activity from any application that is NOT in the authorized job list group. The program PiEstimator has not been added to the authorized job list, and you can see its activity in this report.

There are several options for creating groups. You will probably use several approaches to create and automate the update of these groups, including:

- Manual entry, by working with application owners to identify sensitive data objects for specific environments
- An API to script the creation of groups from your own input
- Populating from a query using observed traffic
- LDAP/Active Directory integration to import users

The automation process can be scheduled to run on a periodic basis to pick up any new changes in the Hadoop system, such as new users.

Figure 7. Extract of an unauthorized job activity report, with columns DB User Name (SVORUGA), MapReduce User (svoruga), MapReduce Name (PiEstimator), MapReduce Job (job_ _0007) and Source Program (HADOOP PROTOBUF CLIENT PROGRAM).
Implement monitoring (continued)

Define a security policy
Policies are sets of rules and actions that direct the operations and behavior of the system, including which traffic is ignored and which is logged, which activities require more granular logging, and when to prompt real-time alerts.

InfoSphere Guardium includes a Hadoop policy that you can customize, as shown in Figure 8. The purpose of the predefined policy rules is to filter out traffic that is not needed for auditing. The policies make use of predefined groups such as Hadoop-SkipObjects; you will likely create and modify such groups based on the observed traffic in your system. You can then add additional rules, such as ignoring trusted sessions or logging the activities of privileged users with more detail. Again, this is where your predefined group of privileged users will help.

Figure 8. Hadoop policy built-in rules
Implement monitoring (continued)

In addition, you can use policies to define real-time alerts. For example, you can create a rule in which an alert is fired whenever a user from a particular group, such as a privileged user, attempts to access a sensitive data set that they are not authorized to access. This requires creating a group of privileged users and a group of sensitive data objects. Figure 9 is an example of how this alert appears on the Guardium Incident Management tab. Alerts can also be sent to email addresses.

Figure 9. Alert on access to sensitive files by a user who is not authorized
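The alert rule above can be sketched as a simple membership check against the two groups. This is a simplified illustration with invented group contents and event fields; a real policy rule carries more conditions (session context, authorization exceptions) than this.

```python
# Simplified sketch of the alerting rule described above; group contents
# and event fields are invented, not Guardium's internal representation.
PRIVILEGED_USERS = {"hdfs", "hbase"}                             # group: privileged users
SENSITIVE_OBJECTS = {"/secure/salaries.csv", "hr:employee_ssn"}  # group: sensitive data

def evaluate(event):
    """Fire an alert when a privileged user touches a sensitive object."""
    if event["user"] in PRIVILEGED_USERS and event["object"] in SENSITIVE_OBJECTS:
        return "ALERT"  # would surface on the Incident Management tab or by email
    return "LOG"
```

Because both sides of the check are groups, the rule itself never changes as users and data sets come and go; only the group memberships are maintained.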
Implement monitoring (continued)

Customize reports and create compliance automation workflows
Because InfoSphere Guardium stores all information from all monitored sources in a common schema, many existing reports included with the product will show valid information for Hadoop, such as session information. InfoSphere Guardium also includes several reports that have already been tailored for Hadoop, including MapReduce activity, detection of unauthorized MapReduce jobs, Hue/Beeswax reports for Hive, HDFS activity, and full-details reports. You can customize these reports, or build your own tailored to your audit process requirements using the robust query-building and report-building capabilities in InfoSphere Guardium.

InfoSphere Guardium includes workflow capabilities to enable the distribution and signoff of audit reports. Results can be delivered to users, groups of users, or roles. (Using roles is recommended to enable more than one user to review and sign off. Roles also make it easier to manage employee absence and turnover.) Start by:

1. Identifying who should receive reports for what job function (for example, the information security manager).
2. Identifying groups of users with the same job function and grouping them into roles. You can use the predefined roles in InfoSphere Guardium, or create your own customized roles.
3. Creating users and assigning them to their appropriate roles.
4. Determining how often reports need to be generated.
5. Determining who receives the reports, whether review/signoff is required, and whether the delivery should stop at any user or role until they complete the required action.
Operationalize your processes

Operational procedures should be defined for each of the teams that are involved in administering the environment or in evaluating and acting on monitoring results. Process flows can be very useful in defining responsibilities and the sequence of steps needed to address a particular situation, such as when new users are authorized to the InfoSphere Guardium system, or when policy rules need to change. Extra emphasis should be given to processes related to handling security breaches and forensic investigations. The support team needs to be made aware of the rules and trained on steps to be performed in case a security breach occurs.

Based on the business requirements, daily, weekly, monthly, quarterly and cyclical tasks should be defined and documented. Here is a simplified example of a plan:

Daily
The Administrator:
- Verifies archiving/aggregation and backup
- Follows up on self-monitoring alerts from the previous night
The Audit team:
- Performs review of the automated audit processes set up on the system
- Investigates any activity that is not business as usual
- Escalates data security breach attempts

Weekly
The Administrator:
- Verifies space utilization on the appliance
- Verifies that data is being logged correctly
- Verifies that the appliance is purging and archiving correctly
- Verifies that all scheduled jobs are executed on time
The Audit team:
- Meets with the members of the Hadoop administration and Application teams to discuss findings
Conclusion

The implementation of an InfoSphere Guardium data activity monitoring solution for Hadoop can help jump start your organization's use of Hadoop for enhanced business value. With the correct planning and an understanding of your business requirements for monitoring and auditing, InfoSphere Guardium can help you address regulatory requirements and reduce your risk of data breaches from hackers or insiders.

Resources
For more information, please visit ibm.com/guardium.

- InfoSphere Guardium Data Security v9: Deliver real-time activity monitoring and automated compliance reporting for big data security.
- Big data security and auditing with IBM InfoSphere Guardium: Monitor and audit access for IBM InfoSphere BigInsights and Cloudera Hadoop.
- Understanding holistic database security: 8 steps to successfully securing enterprise data sources.
For more information on managing database security in your organization, visit ibm.com/guardium

© Copyright IBM Corporation 2012

IBM Corporation
Software Group
Route 100
Somers, NY

Produced in the United States of America
December 2012

IBM, the IBM logo, ibm.com, InfoSphere, and Guardium are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information".

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

NIB03017-USEN-00