Data Domain Profiling and Data Masking for Hadoop




© 1993-2015 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

In TDM, you can use Hadoop connections to lower the cost of raw data storage and to run large-scale analytics with the distributed computing feature of Hadoop. You can perform data domain profiling, data masking, and data movement on Big Data Edition Hadoop clusters. This article describes how to perform data domain profiling and data masking on a relational database and then move the data into the target Hive database.

Supported Versions

Test Data Management 9.7.0

Table of Contents

Overview
Scenario
Prerequisites
Step 1. Create a Target Hive Database Connection
Step 2. Create a Credit Card Masking Rule and Add to Project
Step 3. Create a Data Domain
Step 4. Create and Run a Data Domain Profile
Step 5. Assign the Credit Card Masking Rule
Step 6. Create a Hadoop Plan
Step 7. Generate and Run the Workflow

Overview

You can perform data masking and data domain profiling for Hadoop. Use Hadoop sources to lower the cost of raw data storage and to run large-scale analytics with the distributed computing feature of Hadoop. For example, when you move sensitive data into Hadoop, you can classify the data for analytics, for test data provisioning, or for other purposes. Analytics processed on Hadoop are faster and more cost-effective, and you can extract the analytics results to a conventional database.

Scenario

You work in the software testing department of an organization, and you want to ensure that raw sensitive data is moved into Hadoop and is made available to testers to build analytical reports on Hive. You need to profile the data to identify sensitive columns, such as credit card numbers, and mask them. The resulting test data must have data integrity.

To perform data domain profiling and data masking operations for Hadoop, you must perform the following tasks:
1. Create a Hive database target connection.
2. Create a credit card masking rule and add it to a project.
3. Create a data domain.
4. Create and run a data domain profile to identify sensitive data, such as credit card numbers.
5. Assign the credit card masking rule to the sensitive column.
6. Create a Hadoop plan.
7. Generate and run the workflow.

Prerequisites

Before you begin, you must perform the following tasks:
1. Install and configure TDM.
2. Create an Oracle database source connection.
3. Create a project and import data sources.

Step 1. Create a Target Hive Database Connection

In Test Data Manager, you can configure the Hive properties, create a Hive database connection, and use the connection as the target.
1. Log in to Test Data Manager and click Administrator > Preferences.
2. In the Hive Properties section, click Edit. The Edit-Preferences dialog box appears.
3. Select a Hive connection to run the Hadoop plan. Note: You cannot use this connection as the target connection in the Hadoop plan.
4. To store the mapping in the Model repository for future use, select Persist Mapping.
5. Click OK.
6. Click the Connections tab.
7. Click Actions > New Connection. A tab opens to display the new connection properties.
8. Select the Hive connection type.
9. Enter the connection name, description, and owner information. The connection name must begin with an alphabetic character. If the connection name begins with a numeric character, a workflow that uses the connection might fail.
10. Click Next.
11. To use Hive as a target, select Access Hive as a source or target.
12. To use the connection to run mappings in the Hadoop cluster, select Use Hive to run mappings in Hadoop cluster.
13. Enter the user name and environment SQL details.
14. Click Next.
15. To access the metadata from the Hadoop cluster, enter the metadata connection string.
16. To access data from the Hadoop cluster, enter the data access connection string.
17. To run mappings in the Hadoop cluster, enter the following parameters:
   Database Name. Enter the name default for tables that do not have a specified database name.
   Default FS URI. Enter the URI to access the default HDFS.
   JobTracker/Yarn Resource Manager URI. Enter the URI of the JobTracker or YARN Resource Manager node in the Hadoop cluster.
   Hive Warehouse Directory on HDFS. Enter the HDFS file path of the default database.
   Metastore Execution Mode. To connect to a remote metastore, select Remote.
   Metastore Database URI. To access metadata in a remote metastore setup, specify the metastore URI with the thrift server details.
[Image: Hive connection properties]
18. To test the connection, click Test Connection.
19. To save the connection, click Finish. The connection is visible in the Administrator Connections view.

Step 2. Create a Credit Card Masking Rule and Add to Project

You create a credit card masking rule to mask the credit card numbers. Add the masking rule to a project.
1. To access the Policies view, click Policies.
2. Click Actions > New > Masking Rule. The Rule Wizard appears.
3. Enter a name and optional description for the rule.
4. Select the String data type.
5. Choose Standard and select the Credit Card masking rule from the list.
6. To enable users to override masking parameters for a rule, select the Override Allowed option.
7. Click Next.
8. Select Repeatable Output and enter a seed value.
9. Select Keep Card to return the same credit card type for the masked credit card.
10. Enter the exception handling options. Configure how to handle null or empty spaces. Configure whether to continue processing on error.
[Image: credit card masking rule parameters]
11. Click Finish.
12. To add the masking rule to the project, click Projects. You can see a list of projects.
13. Open the project that you created. The project window opens in a separate tab.
14. Click Overview > Policies.
15. Click Actions > Add Additional Rules. The Add Additional Rules dialog box appears.
16. Select the credit card masking rule from the list.
17. Click OK. The masking rule appears under the Additional Rules list.
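To make the Repeatable Output, seed, and Keep Card options from Step 2 more concrete, the following Python sketch shows one way such a rule could behave. It is not the TDM masking algorithm, which this article does not describe. The function names, the HMAC-based digit generation, and the choice to treat the first six digits as the card type prefix are all assumptions made for illustration.

```python
import hashlib
import hmac


def luhn_check_digit(partial: str) -> str:
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost digit of the partial number is doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Check that a digit string ends with a correct Luhn check digit."""
    return number.isdigit() and luhn_check_digit(number[:-1]) == number[-1]


def mask_card(card: str, seed: int, keep_prefix: int = 6) -> str:
    """Hypothetical repeatable mask: same card and seed give the same output.

    Assumption: the first `keep_prefix` digits identify the card type and are
    kept; the remaining digits are replaced and the check digit is recomputed.
    """
    digits = "".join(ch for ch in card if ch.isdigit())
    if len(digits) < keep_prefix + 2:
        raise ValueError("card number is too short to mask")
    prefix = digits[:keep_prefix]
    body_len = len(digits) - keep_prefix - 1
    # A keyed hash makes the replacement digits depend on both seed and input.
    stream = hmac.new(str(seed).encode(), digits.encode(), hashlib.sha256).digest()
    body = "".join(str(b % 10) for b in stream[:body_len])
    partial = prefix + body
    return partial + luhn_check_digit(partial)
```

Because the output depends only on the input digits and the seed, the same source value masks to the same result wherever it appears, which is the property that lets masked test data keep its integrity across tables. Calling mask_card("4111-1111-1111-1111", seed=42) twice returns the same 16-digit string, starting with 411111 and passing luhn_valid.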

Step 3. Create a Data Domain

When you create a data domain, you can add the masking rules and enable the default masking rule.
1. To create a data domain, click Policies. The Policies view shows a list of the policies, data domains, and rules in the TDM repository.
2. Click Actions > New > Data Domain.
3. Enter a name, sensitivity level, and description for the data domain. Click Next.
4. Click Next.
5. Optionally, enter a regular expression to filter columns by data pattern.
6. Click Next.
7. To skip adding a regular expression for metadata, click Next.
8. To add preferred masking rules to the data domain, click Add Rules. The Add Rules dialog box appears.
9. Select the credit card data masking rule that you created.
10. Click OK.
11. Enable the rule as the default masking rule.
12. Click Finish.

Step 4. Create and Run a Data Domain Profile

Create and run data domain profiles in the Discover view. A project must contain policies before you create a data domain profile. The policies contain the data domains that you can use in a profile for data discovery.
1. Open the project and click the Discover view.
2. Click the Profile view. The Profile view shows a list of the profiles in the project.
3. To create a new profile, click Actions > New Profile.
4. In the New Profile dialog box, enter the profile name and description.
5. Choose to create a data domain profile.
6. Click Add Tables and select the tables to profile. Click OK.
7. Click Next.
8. Select Data Domain.
9. Select the data domain that you created.
10. In the Sampling panel, select Data and Column name to run data discovery.
11. Enter the maximum number of rows to profile.
12. Enter the minimum conformance percent. Not all rows might conform to the data domain expression pattern, so you can enter the minimum percentage of the profiled rows that must conform.
[Image: list of data domains and the sampling options that you can configure]
13. Click Save. The profile opens in a tab.
14. To run the profile, click Actions > Execute.
15. In the Execute Profile dialog box, click Execute. View the progress of the profile run from the Monitor tab within the project.
16. Open the profile and click the Data Domain tab.
17. Click Actions > Mark domain classification as completed.
[Image: data domain profile review]

Step 5. Assign the Credit Card Masking Rule

Assign the masking rule from the data domain to the sensitive column in the project source that you want to mask.
1. In the project, click Define Data Masking to access the Data Masking view.
2. Select the CREDIT_CARD column to assign the masking rule to.
3. Click inside the Masking Rule column to view the list of available rules. The data domain preferred rules appear at the top of the list.
4. Select the credit card masking rule that you created.
[Image: data masking rule assignment to the CREDIT_CARD column]
5. To save the assignment, click Save.
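The data discovery configured in Step 4 (a data pattern, a maximum number of rows to profile, and a minimum conformance percent) identifies the column that receives the masking rule in Step 5. The following Python sketch illustrates the idea for data patterns only. It is an illustration, not how TDM implements profiling: the regular expression, the function names, the sample table, and the decision to ignore null values are all assumptions.

```python
import re

# Hypothetical data pattern for a credit card data domain:
# 13 to 16 digits, optionally grouped by spaces or hyphens.
CARD_PATTERN = re.compile(r"\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{1,4}")


def conformance_percent(values, pattern, max_rows):
    """Percentage of sampled, non-null values that fully match the pattern."""
    sample = [v for v in values[:max_rows] if v is not None]
    if not sample:
        return 0.0
    matches = sum(1 for v in sample if pattern.fullmatch(str(v).strip()))
    return 100.0 * matches / len(sample)


def discover_columns(table, pattern, max_rows=1000, min_conformance=80.0):
    """Return the columns whose conformance meets the minimum percentage."""
    found = {}
    for name, values in table.items():
        pct = conformance_percent(values, pattern, max_rows)
        if pct >= min_conformance:
            found[name] = pct
    return found


# Made-up sample data: column name mapped to a list of values.
table = {
    "CUSTOMER_NAME": ["Ann", "Bob", "Cy", "Di"],
    "CREDIT_CARD": [
        "4111-1111-1111-1111",
        "5500 0000 0000 0004",
        "n/a",
        "340000000000009",
    ],
}
```

With this sample, three of the four CREDIT_CARD values match, so discover_columns(table, CARD_PATTERN, min_conformance=70.0) reports the column at 75.0 percent, while the default threshold of 80.0 reports nothing. This is why the minimum conformance percent matters: a threshold set too high can miss a sensitive column that contains a few malformed values.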

Step 6. Create a Hadoop Plan

To perform data domain profiling and data masking operations for Hadoop connections, you can create a Hadoop plan. Add data masking components to the Hadoop plan.
1. Open a project and click Execute.
2. Click Actions > New.
3. In the New Plan dialog box, enter a name and optional description for the plan.
4. Select the Hadoop plan type.
5. Click Next.
6. To add the masking rule to the plan, click Add Masking Components.
7. Select the credit card masking rule to add to the plan. Click OK.
8. Click Next.
9. To skip adding the groups, click Next.
10. Review the masking component.
11. Click Next.
12. Configure the Oracle source connection and the Hive target connection.
13. Configure target properties, error and recovery settings, and advanced settings.
[Image: Hadoop plan settings]
14. Click Next.
15. To override plan settings, click Override Plan Settings and enter the properties.
16. To override table settings, click Override Data Source Settings and enter the properties.
17. Click Finish.

Step 7. Generate and Run the Workflow

After you create the Hadoop plan, generate and run the workflow to populate the masked data in the Hive database.
1. In the project, click Execute to access the plans in the project.
2. Select the Hadoop plan that you created.
3. Click Actions > Generate Workflow. The Generate Workflow dialog box appears.
4. Select Schedule Now.
5. Click Generate Workflow. View the status of the workflow generation in the Monitor tab.
6. In the project, click Execute to access the plans in the project.
7. Click the Hadoop plan that you created. The plan opens in a separate tab.
8. Click Actions > Execute Workflow. The Execute Workflow dialog box appears.
9. Select the Data Integration Service.
10. Select Schedule Now.
11. Click Execute Workflow. View the status of the workflow run in the Monitor tab.

Author

Vinita Arun Kumar
Senior Technical Writer

Acknowledgements

The author would like to acknowledge the Development and QA teams.