DATAGUISE WHITE PAPER
SECURING HADOOP: DISCOVERING AND SECURING SENSITIVE DATA IN HADOOP DATA STORES

OVERVIEW
The rapid expansion of corporate data being transferred or collected and stored in Hadoop HDFS is creating a critical problem for Chief Information Security Officers, compliance professionals, and IT staff responsible for data management and security. Frequently, the people responsible for securing corporate data are not even aware that Hadoop has been installed and is in use within the company. Dataguise DG for Hadoop scans data stores, locates sensitive content, and then applies remedies such as data masking and encryption to ensure compliance with industry regulations, such as HIPAA and PCI, as well as internal corporate data governance policies.
BIG DATA EXPLOSION
Petabytes of data - structured, semi-structured, and unstructured - are accumulating and propagating across your business. A good portion of this data comes from external sources and from customer interaction channels such as web sites, call center records, log files, Facebook, and Twitter. To mine these large volumes and varieties of data in a cost-efficient way, companies are adopting new technologies such as Hadoop.

TRADITIONAL DATA WAREHOUSES
What about traditional data warehouses? While they offer many advantages for decision support, traditional data warehouses are hugely expensive and require that the schema be decided well in advance, taking away the flexibility to slice and dice data as new methods of analysis emerge. Because Hadoop can be set up and expanded rapidly using commodity hardware, and the schema may be defined at the time the data is read, it is becoming the new platform of choice for processing and analyzing big data.

BUT WE DON'T HAVE HADOOP
Chief Information Security Officers (CISOs), CIOs, and others involved in corporate information security will often say that their organization does not have any Hadoop clusters. They rely on processes in place to ensure that any software installed in the enterprise has gone through an extensive approval and procurement process before it is implemented. They are therefore often surprised to learn that Hadoop is already installed and running. This happens because Hadoop is a free download, available directly from the Apache website or from one of the leading distributors of Hadoop such as Cloudera, MapR, IBM (InfoSphere BigInsights), and Hortonworks. It is very simple for any number of employees to create a Hadoop installation and be up and running very quickly.
Even if Hadoop is only being used in a sandbox (isolated from the production systems) or in a test environment, corporate data stored in Hadoop must still adhere to the same rigorous corporate standards in place for the rest of the data infrastructure, or the company risks the consequences of failing a compliance audit or, even worse, a data breach.

THE AGE OF BIG DATA HAS ARRIVED. Even companies that think they don't have Hadoop are surprised to learn it has been downloaded and is in use with sensitive corporate data. DO YOU KNOW WHERE YOUR RISK EXPOSURES ARE?

FINDING SENSITIVE DATA IN HADOOP
Sensitive data includes items such as taxpayer IDs, employee names, addresses, credit card numbers, and many more. Data theft prevention, sound governance practices, and the need to satisfy
compliance requirements for industry regulations such as PCI and HIPAA, and for personally identifiable information (PII), make it imperative that organizations implement the necessary processes to identify and protect sensitive information. The first step in securing Hadoop is to search for and locate sensitive information, and to determine the volume and types of data that are at risk. The challenge is that many search products are designed to work only with structured data, using basic regular expressions. Scanning data in Hadoop requires a sophisticated discovery tool that can scan large volumes of both structured and unstructured data, and do it rapidly.

THE NEED TO PROTECT SENSITIVE DATA
Once it has been determined that Hadoop is in the corporate infrastructure and contains sensitive information, CIOs and CISOs should be very nervous about the potential exposure. Because Hadoop has so far been used mostly by social media companies, and is only now being adopted by financial, healthcare, and other security-conscious enterprises, options for data protection in Hadoop have been limited. Whereas there are numerous options for legacy databases and structured data stores, Hadoop poses a new challenge for companies that need to maintain compliance. The same type of strong protection in use for traditional data stores is needed for the Hadoop environment as well.

CHOOSE MASKING OR ENCRYPTION
Whether sensitive data was stored in Hadoop intentionally or unintentionally, once it is discovered and documented there are two main approaches to remediation: encryption and masking. Encryption is typically used when access to the sensitive content is needed for analytical purposes; the encrypted data can be decrypted by an authorized user at the time of use. Masking is used to protect sensitive data when there is no need for the actual sensitive content, as masking replaces sensitive data with realistic (but not real) data. Optionally, consistency may be maintained to retain the statistical distribution of the data.
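To make the distinction concrete, here is a minimal, stdlib-only Python sketch of the two remedies. A toy XOR cipher stands in for real encryption (e.g. AES), and a salted hash provides one-way, consistent masking; none of this reflects Dataguise's actual implementation.

```python
import hashlib

def encrypt(value: str, key: bytes) -> bytes:
    # Toy XOR cipher standing in for real encryption (e.g. AES):
    # reversible by anyone holding the key.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(value.encode()))

def decrypt(blob: bytes, key: bytes) -> str:
    # XOR is its own inverse, so decryption reuses the same operation.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(blob)).decode()

def mask_ssn(ssn: str, salt: bytes = b"per-tenant-salt") -> str:
    # One-way, consistent masking: the same input always yields the same
    # realistic-looking (but fake) SSN, preserving joins and distributions,
    # while the salted hash cannot be reversed to recover the original.
    n = int.from_bytes(hashlib.sha256(salt + ssn.encode()).digest()[:8], "big")
    return f"{n % 900 + 100:03d}-{n // 1000 % 90 + 10:02d}-{n // 100000 % 10000:04d}"

key = b"secret-key"
assert decrypt(encrypt("123-45-6789", key), key) == "123-45-6789"  # reversible
assert mask_ssn("123-45-6789") == mask_ssn("123-45-6789")          # consistent
```

An authorized analyst holding the key can recover encrypted values at the time of use; masked values remain protected even if the data store itself is exposed.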
Although there are some similarities between data masking and encryption, they differ in usage, technology, and deployment strategy. Encryption conceals private data and allows it to be decrypted with the corresponding encryption keys. Data masking, on the other hand, conceals private data irreversibly: masked values cannot be reversed.

"It doesn't take a clairvoyant or, in this case, a research analyst to see that 'big data' is becoming (if it isn't already, perhaps) a major buzzword in security circles. Much of the securing of big data will need to be handled by thoroughly understanding the data and its usage patterns. Having the ability to identify, control access to, and where possible mask sensitive data in big data environments based on policy is an important part of the overall approach."
Ramon Krikken, Research VP, Security and Risk Management Strategies Analyst, Gartner

DO YOU HAVE THE TOOLS TO PROTECT YOUR DATA?

THE DATAGUISE SOLUTION
Dataguise specializes in sensitive data protection in large repositories. We began with relational databases, expanded to shared file systems
and Microsoft SharePoint, and now we are bringing our enterprise-class expertise to securing Hadoop. Bringing together experienced technology professionals from the database, security, and search specialties, we combine the best of these disciplines to secure Hadoop in the enterprise. Dataguise's core product for Hadoop, DG for Hadoop, combines sensitive data discovery, user and event 4R reports, and options for both encryption and masking to provide the most comprehensive data security solution in the market today.

INTRODUCING DGSECURE
The purpose of DG for Hadoop is simple yet crucial: to detect and protect sensitive data in Hadoop implementations. As part of the Dataguise DgSecure product line, DG for Hadoop is the ideal solution to help ensure that compliance standards are met while reaping the benefits of using Hadoop to manage large amounts of structured and unstructured data. To accommodate various usage patterns, DG for Hadoop supports detection and protection at the source, before moving data to Hadoop; in flight, while moving data to Hadoop; and at rest, after unprotected data has been moved to HDFS. Just-in-time protection is provided through incremental scans of newly added data in HDFS. Once sensitive data is located and identified, either masking or encryption can be chosen as the protection method, based on the specific requirements of the organization or the purpose of storing and managing the data.

HOW IT WORKS
DG for Hadoop gives the Chief Security Officer and other entities responsible for conforming to industry regulations and corporate Governance, Risk, and Compliance (GRC) requirements the ability to define policies. These policies define what data is considered sensitive, based on a combination of pre-built data types and custom data types that the user can add. The policies also allow the sensitive data types to be grouped in alignment with regulations, and allow remedial actions to be specified.
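A policy of this kind can be pictured as a small mapping from regulations to sensitive data types, each with a detection pattern and a remedial action. The Python sketch below is purely illustrative; the structure, patterns, and names are assumptions, not Dataguise's actual configuration format.

```python
import re

# Hypothetical policy: sensitive data types grouped per regulation,
# each with a detection pattern and a remedial action.
POLICY = {
    "PCI": {
        "credit_card": {"pattern": r"\b(?:\d[ -]?){15}\d\b", "action": "encrypt"},
    },
    "PII": {
        "ssn":   {"pattern": r"\b\d{3}-\d{2}-\d{4}\b", "action": "mask"},
        "email": {"pattern": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "action": "mask"},
    },
}

def classify(text: str) -> list[tuple[str, str, str]]:
    """Return (regulation, data_type, action) for every policy hit in text."""
    hits = []
    for regulation, types in POLICY.items():
        for name, spec in types.items():
            if re.search(spec["pattern"], text):
                hits.append((regulation, name, spec["action"]))
    return hits
```

For example, classify("contact 123-45-6789") reports a PII hit with a "mask" action, which downstream remediation code can then carry out.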
The policy provides guidance for those handling the data on what to do. All details of the data repositories and the actions taken are fully logged, so the Chief Security Officer and others can track risk profiles through the dashboard, which also provides actionable details, and can use reports to audit actions and ensure that the right people have the right access to the right data.

DEFINE POLICY
The first step is to search for any sensitive data in Hadoop data stores, located on premises or in the cloud. A user operating under corporate policy guidance (PCI, PII, HIPAA, etc.) creates a task definition against one or more files, directories, or combinations of them and executes the job.

DISCOVER
DG for Hadoop scans all the targeted data stores to find data that meets those criteria and takes the appropriate, task-specified remediation actions. Additional search features:
- Custom data types: add custom expressions to augment the built-in capabilities
- Columnar searches: allow for searches of structured data
- Incremental scans: quickly search only the new data added since the last scan
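The incremental-scan idea (only touching data added or changed since the last run) can be sketched with a persisted map of file modification times. This is an illustrative sketch over a local directory tree; a real scanner would use the HDFS client API rather than os.walk, and would persist the state between runs.

```python
import os

def incremental_scan(root: str, state: dict):
    """Yield paths under root that are new or modified since the last scan.

    `state` maps path -> modification time seen at the previous scan; it is
    updated in place so the next call skips files already processed.
    """
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if state.get(path) != mtime:   # unseen, or changed since last scan
                state[path] = mtime
                yield path
```

The first call yields every file; subsequent calls with the same state yield only files added or modified in between, which is what keeps repeat scans fast on large stores.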
ANALYZE
After collecting the information about sensitive data, DG for Hadoop delivers risk assessment analytics to users in the form of easy-to-interpret graphical summaries. Users can then evaluate their compliance exposure profiles and decide on the most appropriate remediation policies to implement.

REMEDIATION
DG for Hadoop provides three main options for remediation:
1) Notification (search only): whenever new data has been ingested into Hadoop, DG for Hadoop processes the content and informs the designated users of the presence of sensitive data.
2) Search and mask: as part of locating sensitive data, masking can also be executed based on a predefined policy, either in flight as data enters Hadoop or within Hadoop once the data is there.
3) Search and encrypt: as part of locating sensitive data, DG for Hadoop can optionally encrypt entire rows or just specific fields, in flight or in Hadoop HDFS.

DASHBOARD AND REPORTING
DG for Hadoop provides top-level summary (directory) and in-depth (file-level) detailed information about where sensitive content resides and which remediation method(s) were applied, highlighting the gaps in protection and providing actionable data for appropriate follow-up.

DG FOR HADOOP: A SOLUTION FOR ALL USAGE PATTERNS
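The three remediation modes above can be sketched as a single scan with a pluggable action. The pattern, mode names, and ENC(...) placeholder are illustrative assumptions, not DG for Hadoop's actual behavior.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def remediate(line: str, mode: str) -> tuple:
    """Apply one of three remediation modes to a line of ingested data.

    Returns the (possibly rewritten) line and the number of hits, so a
    'notify' run can report findings without altering the data.
    """
    hits = len(SSN.findall(line))
    if mode == "notify" or hits == 0:
        return line, hits                       # report only, data untouched
    if mode == "mask":
        return SSN.sub("XXX-XX-XXXX", line), hits   # irreversible replacement
    if mode == "encrypt":
        # Placeholder token; a real pipeline would emit recoverable ciphertext.
        return SSN.sub(lambda m: f"ENC({m.group(0)[-4:]})", line), hits
    raise ValueError(f"unknown mode: {mode}")
```

In "notify" mode the line passes through and only the hit count is reported; "mask" rewrites matches irreversibly, while a real "encrypt" mode would produce ciphertext that an authorized user could later decrypt.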
BENEFITS
DG for Hadoop provides a unique and important solution for enabling data security in Hadoop. Tangible benefits include the ability to conform to regulatory requirements, reduce the risk of failing a compliance audit, and ensure that valuable corporate data is safe from security breaches. Implementing DG for Hadoop enables organizations to:
- Simplify Data Compliance Management in Hadoop: eliminates the need to build custom applications or patch together disparate tools to search for and protect sensitive data.
- Improve Operational Efficiencies: less staff time is required to administer custom Hadoop security controls and custom reports, or to move data to databases or other data stores for remediation.
- Reduce Regulatory Compliance Costs: one tool can now take care of tasks that previously took multiple software products and costly consulting hours to achieve.
- Automate Compliance Assessment and Enforcement: the DG for Hadoop dashboard and reports summarize sensitive data content with actionable details of exposure risks.

CONCLUSION
Protecting sensitive data in Hadoop is critical as volumes of data continue to expand in the enterprise and Hadoop becomes the technology of choice at an increasing rate. An effective data protection strategy must start with finding all of the sensitive data, putting the proper remediation policies in place, and monitoring data flow to ensure that the established procedures are followed. DG for Hadoop is the leading solution for securing Hadoop effectively and quickly, and for ensuring adherence to sound data governance practices across the entire big data environment.

ABOUT DATAGUISE
Dataguise helps organizations safely leverage their enterprise data with a comprehensive, risk-based data protection solution.
By automatically locating sensitive data; transparently protecting it with high-performance masking, encryption, or quarantine; and providing enterprise security intelligence to managers responsible for regulatory compliance and governance, Dataguise improves data risk management and operational efficiencies while reducing regulatory compliance costs. For more information, call 510-824-1036 or visit www.dataguise.com. Dataguise, Inc. 2012. All rights reserved.