Data Security as a Business Enabler, Not a Ball & Chain
Big Data Everywhere, May 12, 2015
Les McMonagle
Protegrity - Director, Data Security Solutions
Les has over twenty years of experience in information security. He has held the position of Chief Information Security Officer (CISO) for a credit card company and ILC bank, founded a computer training and IT outsourcing company in Europe, and helped several security technology firms develop their initial product strategy. Les founded and managed Teradata's Information Security, Data Privacy and Regulatory Compliance Center of Excellence and is currently Director of Data Security Solutions for Protegrity. Les holds a BS in MIS, CISSP, CISA, ITIL and other relevant industry certifications.
Les McMonagle (CISSP, CISA, ITIL)
Mobile: (617) 501-7144
Email: Les.McMonagle@Protegrity.com
The Problem...
The cost of cybercrime is staggering:
  The annual cost to the global economy is in excess of $400 billion.
  Businesses that are victims of cybercrime need an average of 18 days to resolve the problem and suffer average costs of over $400K.
  The tangible and intangible costs associated with some of the recent high-profile cases exceed $400M.
Traditional network security, firewalls, IDS, SIEM, AV and monitoring solutions do not offer the comprehensive security needed to protect the target data against current, new and evolving threats.
Typical Phases of an Attack
http://eval.symantec.com/mktginfo/enterprise/white_papers/b-anatomy_of_a_data_breach_wp_20049424-1.en-us.pdf
Factors to Consider
Bad guys search for the easy targets
  Large repositories of valuable, unprotected data
  Systems with weaker controls and/or more access paths
  Financial data or Personally Identifiable Information (PII)
Blurring of network boundaries
  Where does your company network end and another begin?
  BYOD
  Cloud
  IoT (Internet of Things)
Insider threats remain the biggest threat
Advanced Persistent Threats (APTs)
  Coordinated, comprehensive attack strategies
Types of Sensitive Data Potentially Stored in Hadoop
SSN, Credit Card PAN, Bank Account Numbers, PIN, Pending Patents, Health History, DOB, Production Planning, Prescriptions, Employee Personnel Records, Best Practices, Trade Secrets, Customer Lists, Health Records, Sales Forecasts, Payroll Data, Accounts Receivable, Order History, Accounts Payable, Customer Contact Information, R&D, Home Addresses, Income Data, Salary Data, Location Data, Passwords, Project Plans
What to do about it?
[Diagram labels: Process, Policy, Sponsors]
  Engage Information Security (CISO & InfoSec)
  Work with Legal and Compliance
  Establish a good Data Governance program
  Apply consistent protection throughout the data flow
  Limit access on a need-to-know basis
  Protect the actual data itself (regardless of where it is)
  De-identify data without losing analytics value
Engage InfoSec, Legal, Compliance, Privacy
Engage Information Security rather than avoid them
  CISOs and InfoSec ultimately have the same goals
  They will help fund and implement effective data protection
Legal, Privacy and Compliance
  Identify and interpret regulatory and compliance requirements
  Help protect the business by identifying risks to consider
  Incorporate generally accepted Privacy Principles
Data Governance Program
Establish a good data governance program
  Identified Data Owners
  Identified Data Stewards
  Identified Data Custodians
  RACI Roles and Responsibilities
Data Governance subject areas
  Data Ownership
  Data Quality
  Data Integration
  Metadata Management
  Master Data Management
  Data Architecture
  Data Security & Privacy
Protect sensitive data consistently wherever it goes
  At Rest
  In Transit
  In Use
Ideally with a single, centralized enterprise solution
What Data to Tokenize or Encrypt?
Important questions to ask...
  What policy and regulatory compliance requirements apply?
  What risks must be mitigated?
  How and why are protected columns accessed and used?
  What other mitigating controls are available?
  What is the appropriate balance between business needs and data privacy/security?
When is Tokenization or Encryption most appropriate?
Utilization and access control limitations of Hadoop / Hive
Alternative protection options to consider
  Full Disk Encryption (FDE)
Important Data Security Architecture Questions
To Encrypt or Tokenize... This is the Question
[Diagram: Tokenization versus Encryption, with data elements arranged by increasing data sensitivity]
Scales shown:
  Field size relative to width of lookup table: Large to Small
  Structured: More to Less
  Logic in portions of the data element: More to Less
  Percent of access requiring clear text: Less to More
Data elements shown: SSN, CC-PAN, Healthcare Records, PIN, CID, CV2, Password, X-Ray, Cat Scan, HIV-Pos*, Diagnosis report, Patient ID #, Bank Acct No., Customer ID #, DOB
* With Initialization Vector (IV)
Potential Additional Controls to Consider
Piecemeal independent security tools for Hadoop
  Tokenization or Encryption farther upstream in the data flow
  Do not load unnecessary regulated data to Hadoop
  Access Hadoop Hive Tables through Teradata (QueryGrid)
  HDFS file-level access control
  Accumulo cell-level access control (Row/Column intersection)
  Knox Gateway (authentication for multiple Hadoop clusters)
  Coarse grained HDFS File Encryption
  XASecure (now HDP Advanced Security)
  Ambari (Hadoop Cluster Management)
  Kerberos (Authentication): all or nothing
Reduce Your Exposure and Risk
Improve Security Posture Without Impacting Analytics Value
[Diagram: three user populations and three views of the SSN (SSN Token, SSN Last 4 Digits, SSN Full)]
  Population of users who have access to SSN today
  Population of users who can perform their job function with only the last 4 digits of the SSN
  Population of users who need access to the full SSN to perform their job function
Vaultless Tokenization is a form of data protection that converts sensitive data into fake data. The real data can be retrieved only by authorized users. It is often a more usable form of protection than encryption.
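The three SSN views on this slide can be sketched in a few lines. This is a toy illustration only: the per-position substitution tables, the seed, and the role names (analyst, support, fraud) are assumptions made for the example, and the scheme is not cryptographically secure or representative of any vendor's tokenization algorithm. It only shows the idea of a reversible, format-preserving token that keeps the last four digits usable.

```python
import random

DIGITS = "0123456789"


def build_tables(secret_seed: int, positions: int = 5) -> list[dict[str, str]]:
    """Build one secret digit-substitution table per protected position (toy stand-in)."""
    rng = random.Random(secret_seed)
    tables = []
    for _ in range(positions):
        shuffled = list(DIGITS)
        rng.shuffle(shuffled)
        tables.append(dict(zip(DIGITS, shuffled)))
    return tables


def tokenize_ssn(ssn: str, tables: list[dict[str, str]]) -> str:
    """Substitute the first five digits; keep separators and the last four digits."""
    if sum(c.isdigit() for c in ssn) != 9:
        raise ValueError("expected an SSN with exactly nine digits")
    out, position = [], 0
    for c in ssn:
        if c.isdigit():
            out.append(tables[position][c] if position < len(tables) else c)
            position += 1
        else:
            out.append(c)
    return "".join(out)


def detokenize_ssn(token: str, tables: list[dict[str, str]]) -> str:
    """Reverse the substitution by applying the inverted tables."""
    inverse = [{v: k for k, v in table.items()} for table in tables]
    return tokenize_ssn(token, inverse)


def view_for_role(token: str, role: str, tables: list[dict[str, str]]) -> str:
    """Return the SSN view a (hypothetical) role is entitled to."""
    if role == "fraud":        # needs the full SSN
        return detokenize_ssn(token, tables)
    if role == "support":      # works with the last four digits only
        return "***-**-" + token[-4:]
    return token               # everyone else sees the token


tables = build_tables(secret_seed=2015)
token = tokenize_ssn("123-45-6789", tables)
```

Because the token keeps the original length, separators and last four digits, it can be stored in the same column and joined on consistently, while only the role that needs the clear value triggers detokenization.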
What to look for in a good Enterprise Solution
Critical Core Requirements:
  Single Solution Across All Core Platforms
  Scalable, Centralized Enterprise-class Solution
  Segregation of Duties between DBA and Security Admin
  Good Encryption Key or Token Lookup Table Management
  Data Layer Solution
  Tamper-proof Audit Trail
  Transparent (as far as possible) to Authorized Users
  High Availability (HA)
  Optional In-database vs. Ex-database Encryption/Tokenization
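One common way to approach the audit trail requirement is a hash chain, where each entry commits to the one before it. The sketch below is an assumption about how such a log could be built, not a description of any particular product. Strictly speaking it makes tampering evident rather than impossible: an attacker who can rewrite the whole log could recompute the chain unless the latest hash is also stored somewhere they cannot reach.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _entry_hash(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited, removed or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: list = []
append_entry(audit_log, {"user": "dba1", "action": "select", "column": "ssn", "result": "token"})
append_entry(audit_log, {"user": "sec_admin", "action": "policy_update", "column": "ssn"})
```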
Other "nice to have" Features...
  Flexible protection options (Encrypt, Tokenize, DTP/FPE, Masking)
  Broadest possible support for a range of data types
  Built-in DR, Dual Active, Key and system recovery capability
  Minimal performance impact to applications/end users
  Optimized operations to minimize CPU utilization
  Proven implementation methodology
  PCI-DSS compliant solution (meeting all relevant requirements)
  Deep partnership with Teradata and other database providers
  Minimal impact on system upgrades
  Maintain consistent referential integrity and indexing capability
  Low Total Cost of Ownership (TCO)
What to look for in a good solution for Hadoop
Coarse Grained and Fine Grained Protection Capability
  HDFS File Encryption, Multi-Tenant File Encryption, HDFS FP (HDFS Codec)
  Column/Field Level Fine Grained Protection
Multi-Tenant Row Level Protection
  Allow authorized users access to specific rows only
  Unprotect columns for authorized users only
Heterogeneous Protection Capabilities
  Protect upstream sources of data and downstream targets of data
Vaultless Tokenization, often less intrusive than encryption, reversible protection
  Reversible where masking is not
Deployed on the (Data) Nodes
  Leverage MPP architecture of Hadoop
  Avoid appliance-based solutions that can slow down Hadoop
Tokenization capability for Hive access to HDFS Files/Tables
  Hive does not support VarByte data type (Encryption = Binary Ciphertext)
Granularity of Protecting Sensitive Data
Coarse Grained Protection (File/Volume)
  Methods: File or Volume encryption
    OS File System Encryption
    HDFS Encryption
  All or nothing approach
  Secures data at rest and in transit
  Does NOT secure file contents in use
Fine Grained Protection (Data/Field)
  Methods:
    Vaultless Tokenization
    Masking
    Encryption (Strong, Format Preserving)
  Operates at the individual field level
  Data is protected in use and wherever it goes
  Business logic can be retained
Data Security Platform
[Diagram labels: Enterprise Security Administrator, Policy, Audit Log, Protector; platforms shown: RDBMS, Applications, EDW, Big Data, IBM Mainframe, Netezza, File Servers, File and Cloud Gateway Servers, Protection Servers]
Protegrity Confidential
Protegrity's Big Data Protector for Hadoop
[Diagram labels: Hadoop Cluster, Hadoop Node, Hive, Pig, Other, MapReduce, YARN, HBase, HDFS, OS File System, Policy, Audit]
Protegrity Big Data Protector for Hadoop delivers protection at every node and is delivered with its own cluster management capability.
All nodes are managed by the Enterprise Security Administrator, which delivers policy and accepts audit logs.
The Protegrity Data Security Policy contains information about how data is de-identified and who is authorized to have access to that data. Policy is enforced at different levels of protection in Hadoop.
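A policy of the kind described here pairs each data element with a protection method and the roles allowed to see it in the clear, and every decision is written to an audit log. The sketch below is a hypothetical data structure to make that concrete; the class names, field names and roles are assumptions for illustration and do not reflect the actual policy format of any product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ElementRule:
    """How one data element is de-identified and who may unprotect it."""
    method: str                     # e.g. "tokenize", "encrypt", "mask"
    authorized_roles: frozenset


@dataclass
class Policy:
    """Maps data element names to rules and records every access decision."""
    rules: dict
    audit_log: list = field(default_factory=list)

    def may_unprotect(self, user: str, role: str, element: str) -> bool:
        rule = self.rules.get(element)
        # Default deny: unknown elements and unlisted roles never see clear text.
        allowed = rule is not None and role in rule.authorized_roles
        self.audit_log.append((user, role, element, "granted" if allowed else "denied"))
        return allowed


policy = Policy(rules={
    "ssn": ElementRule(method="tokenize", authorized_roles=frozenset({"fraud"})),
    "dob": ElementRule(method="mask", authorized_roles=frozenset()),
})
```

Keeping the policy in one place and pushing it to every node is what lets the same rule apply whether the data is reached through Hive, Pig or a MapReduce job.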
Rich Security Layer over the Hadoop Ecosystem
[Diagram: stack of Pig / Hive, MapReduce, YARN, HBase, HDFS, OS File System]
Pig / Hive
  UDF Support for Pig
  UDF Support for Hive
  Hive - Tokenization
MapReduce
  Java API Support for MapReduce
HBase
  Coprocessor support via UDFs
Cassandra
  UDT
HDFS
  Encryption through the HDFS Codec
  HDFS Commands Extended for Security Functions
  HDFS Interface for Java Programs
  De-identify before Ingestion into HDFS
OS File System
  File System Encryption; Folder/File or Volume
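De-identifying before ingestion means the sensitive columns are replaced before the file ever lands in HDFS. The sketch below shows the idea on a CSV extract using keyed hashing (HMAC-SHA256) from the Python standard library. Note the substitution: this is one-way pseudonymization, not the reversible vaultless tokenization the slides describe, and the column names, key and 16-character truncation are assumptions for the example.

```python
import csv
import hashlib
import hmac
import io


def pseudonymize(value: str, key: bytes) -> str:
    """Deterministic keyed pseudonym: equal inputs give equal outputs, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def deidentify_csv(text: str, sensitive_columns: set, key: bytes) -> str:
    """Replace the sensitive columns of a CSV extract before it is loaded anywhere."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in sensitive_columns:
            if row.get(column) is not None:
                row[column] = pseudonymize(row[column], key)
        writer.writerow(row)
    return out.getvalue()


raw = "ssn,state,amount\n123-45-6789,MA,10\n987-65-4321,CT,20\n123-45-6789,MA,30\n"
protected = deidentify_csv(raw, {"ssn"}, key=b"demo-key-not-for-production")
rows = list(csv.DictReader(io.StringIO(protected)))
```

Because the pseudonym is deterministic, the first and third rows still share the same value in the ssn column, which preserves referential integrity for analytics while the clear SSN never reaches the cluster.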
Coarse Grained Protection: File / Volume Encryption
[Diagram: stack of Pig / Hive, MapReduce, YARN, HBase, HDFS, File System]
At the Pig / Hive and MapReduce layers, all fields are in the clear.
In HDFS, the file contains identifiable data elements and the entire file is encrypted.
The volume encryption option will encrypt the entire volume rather than the files themselves.
Coarse Grained with HDFS Staging Area
[Diagram labels: Ingest into HDFS, HDFS Staging Area, MapReduce Jobs; stack of Pig / Hive, MapReduce, YARN, HBase, HDFS, File System]
Coarse Grained Multi-Tenant Protection
[Diagram: tenants T1, T2 and T3 ingest into HDFS; the T1 folder, T2 folder and T3 folder are shown with Key 1, Key 2 and Key 3, alongside a clear folder; stack of Pig / Hive, MapReduce, YARN, HBase, HDFS, File System]
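The multi-tenant layout boils down to choosing an encryption key by folder. The sketch below shows only that key-selection step; the /data/<tenant>/... directory layout and the function name are assumptions for illustration, and the actual encryption would be done by whatever file-encryption layer is in use.

```python
from pathlib import PurePosixPath

# Key identifiers per tenant, as labeled on the slide.
TENANT_KEYS = {"T1": "Key 1", "T2": "Key 2", "T3": "Key 3"}
CLEAR_FOLDER = "clear"


def key_for_path(path: str, root: str = "/data"):
    """Return the key ID for a file, or None if it lives in the clear folder.

    Assumes a hypothetical layout of <root>/<tenant or 'clear'>/...
    Raises ValueError for paths outside root and KeyError for unknown tenants.
    """
    relative = PurePosixPath(path).relative_to(root)
    top_level = relative.parts[0] if relative.parts else ""
    if top_level == CLEAR_FOLDER:
        return None
    return TENANT_KEYS[top_level]
```

Failing loudly on an unknown top-level folder is a deliberate choice here: a file that cannot be matched to a tenant should not silently be written unencrypted.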
Fine Grained Protection
Production Systems
  Encryption
    Reversible
    Policy Control (Authorized / Unauthorized Access)
    Lacks Integration Transparency
    Not searchable or sortable
    Complex Key Management
    Example: !@#$%a^.,mhu7///&*B()_+!@
  Vaultless Tokenization / Pseudonymization
    Reversible
    Policy Control (Authorized / Unauthorized Access)
    or Not Reversible
    No Complex Key Management
    In either case: Integrates Transparently; Searchable and sortable
    Business Intelligence: 0389 3778 3652 0038
Non-Production Systems
  Masking
    Not reversible
    No Policy, Everyone Can Access the Data
    Integrates Transparently
    No Complex Key Management
    Example: Date of Birth 2/15/1967 masked as xx/xx/1967
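The masking example on this slide is simple enough to write out. A minimal sketch, assuming dates arrive as M/D/YYYY strings; the function name is made up for illustration.

```python
import re

_DOB_PATTERN = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def mask_dob(dob: str) -> str:
    """Mask month and day, keep the year. The original value cannot be recovered."""
    match = _DOB_PATTERN.match(dob.strip())
    if match is None:
        raise ValueError("expected a date in M/D/YYYY form")
    return "xx/xx/" + match.group(3)
```

The year survives, so age-band analytics still work in non-production systems, but unlike a token there is nothing to look up or reverse, which is why no policy or key management is needed.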
Enterprise-wide Protection
[Diagram labels: Source Systems (Internal / External), Input File Source, FPG, ETL, Sqoop, Java Program, Edge Node, File Protector, Application Protector; Hadoop nodes with Ecosystem Components Pig, Hive, MapReduce, YARN, HBase, HDFS, OS FS; Database Server, Database Protector; Consumption BI Systems, Target Systems (Internal / External), Downstream Systems; ESA with Policy Deployment and Audit Collection]
If the Edge Node is a Hadoop Node, Hadoop resources can be used.
Traditional IT Environment: Protegrity Protection
[Diagram labels: Typical Enterprise Today, Internet, Inside the Firewall, Apps, EDW, DBs, Files, Hadoop]
Today's IT Environment: Protegrity Protection
[Diagram labels: Typical Enterprise Today, Internet, Inside the Firewall, Apps, Cloud, Protector Gateway, DBs, Files, File Protector Gateway, EDW, ESA, HG, Hadoop]
In Summary
  Establish Good Data Governance
  Protect the actual data itself
  Maintain referential integrity
  De-identify data while maintaining analytics capability
  Apply consistent protection throughout the data flow
  Engage Information Security, Legal and Compliance
Build security in rather than bolt it on later
Sign Up for a Free ½ Day Risk Assessment Workshop
Thank You
Q & A
Les McMonagle (CISSP, CISA, ITIL)
Mobile: (617) 501-7144
Email: Les.McMonagle@Protegrity.com