Hadoop Security
Olivier Renault, Solution Engineer, Hortonworks
Agenda
- Why security
- Kerberos
- HDFS ACL security
- Network security - Knox
- Hive
  - doAs = false
  - ATZ-NG
- YARN ACL p67-91
  - Capacity scheduler ACL
  - Killing job
- Data encryption
  - on disk
  - on network
- Audit
- Apache Ranger to the rescue
Security Needs
Security needs are changing
- YARN unlocks the data lake
- Multi-tenant: multiple applications for data access
- Changing and complex compliance environment
- ETL of non-sensitive data can yield sensitive data
- Fall 2013: largely siloed deployments with single-workload clusters
- Summer 2014: 65% of clusters host multiple workloads
5 areas of security focus
- Administration: central management and consistent security
- Authentication: authenticate users and systems
- Authorization: provision access to data
- Audit: maintain a record of data access
- Data Protection: protect data at rest and in motion
Kerberos Authentication
Why Kerberos?
  $ su -l hdfs -c 'hdfs dfs -ls -R /data'
  drwx------ - olivier olivier 0 2014-10-09 01:05 /data/sec_data
  drwx------ - olivier olivier 0 2014-10-09 00:53 /data/sec_data/user_data
  $ su bad_user
  $ hdfs dfs -ls -R /data
  ls: Permission denied: user=bad_user, access=READ_EXECUTE, inode="/data":olivier:olivier:drwx------
  $ export HADOOP_USER_NAME=olivier && hdfs dfs -ls -R /data
  drwx------ - olivier olivier 0 2014-10-09 01:05 /data/sec_data
  drwx------ - olivier olivier 0 2014-10-09 00:53 /data/sec_data/user_data
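The transcript above shows that, without Kerberos, the cluster accepts whatever user name the client claims. A minimal Python sketch of that behaviour follows. It is a toy model, not Hadoop code: the function names are hypothetical, group permissions are ignored, and the only fact taken from Hadoop is that simple authentication honours the HADOOP_USER_NAME environment variable.

```python
def effective_user(os_user, env):
    """Simple (non-Kerberos) auth: the client's claimed name is taken at face value."""
    return env.get("HADOOP_USER_NAME") or os_user


def can_list(user, owner, mode):
    """Toy check for listing a directory: owner bits for the owner, 'other' bits otherwise."""
    bits = (mode >> 6) & 7 if user == owner else mode & 7
    return bits & 0b101 == 0b101  # listing needs read + execute


# /data is owned by olivier with mode drwx------ (0o700), as in the transcript.
DATA_OWNER, DATA_MODE = "olivier", 0o700

denied = can_list(effective_user("bad_user", {}), DATA_OWNER, DATA_MODE)
spoofed = can_list(
    effective_user("bad_user", {"HADOOP_USER_NAME": "olivier"}), DATA_OWNER, DATA_MODE
)
print("bad_user as himself:", denied)
print("bad_user claiming to be olivier:", spoofed)
```

With Kerberos enabled, the identity comes from a ticket issued by the KDC rather than from a value the client sets itself, which is what closes this hole.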
Why Kerberos?
- SSO: users don't need to re-login at every service
- Hadoop accounts do not need to be created
  - Caveat: YARN jobs require a Unix account (might go away with Linux/Docker containers)
- Hadoop tokens (delegation tokens) supplement Kerberos authentication
  - Delegation tokens deal with delayed job execution
- Capabilities to deal with the distributed nature of Hadoop
  - Trusted proxies: third-party services act as proxy (Oozie, HDFS Proxy, ...)
- Kerberos tickets: symmetric encryption is orders of magnitude faster than alternatives like SSL
Kerberos + Active Directory/LDAP
- Use existing directory tools to manage users
- Use Kerberos tools to manage host + service principals
Diagram labels:
- AD / LDAP (user store): users, e.g. smith@example.com
- KDC: hosts, e.g. host1@hadoop.example.com; services, e.g. hdfs/host1@hadoop.example.com
- Cross-realm trust between AD / LDAP and the KDC
- Client authentication to the Hadoop cluster
HDFS ACL Authorisation
Existing HDFS Permissions Model
- HDFS permissions at the file and directory level
- Managed by a set of 3 distinct user classes: owner, group and others
- 3 permissions for each user class: read (r), write (w), execute (x)
  - For files: r to read, w to write
  - For directories: r to list content, w to create/delete files and directories, x to access a child of the directory
Diagram: an HDFS directory with rwx bits for Owner, Group and Others
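The owner/group/others model can be sketched in a few lines of Python. This is an illustrative model only, assuming POSIX-style octal modes; the Inode class, the permitted function and the sample directory are hypothetical names, not part of HDFS.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1  # the r, w and x bits


@dataclass
class Inode:
    owner: str
    group: str
    mode: int  # octal mode, e.g. 0o750 for rwxr-x---


def permitted(inode, user, user_groups, wanted):
    """Pick exactly one user class (owner, then group, then others) and test its bits."""
    if user == inode.owner:
        bits = (inode.mode >> 6) & 7
    elif inode.group in user_groups:
        bits = (inode.mode >> 3) & 7
    else:
        bits = inode.mode & 7
    return bits & wanted == wanted


# Hypothetical directory: owner olivier, group analysts, mode rwxr-x---
sales = Inode(owner="olivier", group="analysts", mode=0o750)
print(permitted(sales, "olivier", set(), WRITE))                 # owner class
print(permitted(sales, "tim", {"analysts"}, READ | EXECUTE))     # group class
print(permitted(sales, "tim", {"analysts"}, WRITE))              # group class, no w bit
print(permitted(sales, "eve", {"guests"}, READ))                 # others class
```

Only one class is ever consulted per request, which is why a single owner and a single group quickly become too coarse for shared data.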
HDFS Extended ACLs
The problem
- No longer feasible for Olivier to control all modifications to the file
- New requirement: Olivier, Diane and Clark are allowed to make modifications
- New requirement: a new group called executives should be able to read the sales data
- The current permissions model only allows permissions for 1 group and 1 user
HDFS Extended ACLs solve this issue
- Now assign different permissions to different users and groups
Diagram: an HDFS directory with rwx entries for Owner, Group, Others, Group D, Group F and User Y
HDFS ACL
  [olivier@sandbox ~]$ hdfs dfs -ls /data
  Found 1 items
  drwxr-xr-x - olivier analysts 0 2014-10-25 19:03 /data/olivier
  [olivier@sandbox ~]$ hdfs dfs -getfacl /data/olivier
  # file: /data/olivier
  # owner: olivier
  # group: analysts
  user::rwx
  group::r-x
  other::r-x
  [olivier@sandbox ~]$ hdfs dfs -setfacl -m user:tim:r-x /data/olivier
  [olivier@sandbox ~]$ hdfs dfs -setfacl -m group:developers:rwx /data/olivier
  [olivier@sandbox ~]$ hdfs dfs -ls /data
  Found 1 items
  drwxr-xr-x+ - olivier analysts 0 2014-10-25 19:03 /data/olivier
  [olivier@sandbox ~]$ hdfs dfs -getfacl /data/olivier
  # file: /data/olivier
  # owner: olivier
  # group: analysts
  user::rwx
  user:tim:r-x
  group::r-x
  group:developers:rwx
  mask::rwx
  other::r-x
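The getfacl output above can be read as an ordered decision: owner entry first, then a named user entry, then any matching group entry, then other, with the mask filtering named-user and group entries. The Python sketch below models that order on the entries shown in the transcript. It is a simplified illustration; the function names and the extra users (dev1, ana, eve) are hypothetical.

```python
ACL_TEXT = """
user::rwx
user:tim:r-x
group::r-x
group:developers:rwx
mask::rwx
other::r-x
"""


def parse_acl(text):
    """Turn 'kind:name:perms' lines into a {(kind, name): set of permission letters} map."""
    entries = {}
    for line in text.split():
        kind, name, perms = line.split(":")
        entries[(kind, name)] = frozenset(perms.replace("-", ""))
    return entries


def acl_permits(entries, owner, owning_group, user, groups, wanted):
    wanted = frozenset(wanted)
    mask = entries.get(("mask", ""), frozenset("rwx"))
    if user == owner:                      # 1. owner entry, not filtered by the mask
        return wanted <= entries[("user", "")]
    if ("user", user) in entries:          # 2. named user entry, filtered by the mask
        return wanted <= entries[("user", user)] & mask
    matching = []                          # 3. owning group and named group entries
    if owning_group in groups:
        matching.append(entries[("group", "")])
    matching += [entries[("group", g)] for g in groups if ("group", g) in entries]
    if matching:
        return any(wanted <= perms & mask for perms in matching)
    return wanted <= entries[("other", "")]  # 4. everyone else


entries = parse_acl(ACL_TEXT)
print(acl_permits(entries, "olivier", "analysts", "tim", set(), "rx"))
print(acl_permits(entries, "olivier", "analysts", "dev1", {"developers"}, "w"))
print(acl_permits(entries, "olivier", "analysts", "ana", {"analysts"}, "w"))

# Tightening the mask removes write access from the developers group entry.
tight = parse_acl(ACL_TEXT.replace("mask::rwx", "mask::r-x"))
print(acl_permits(tight, "olivier", "analysts", "dev1", {"developers"}, "w"))
```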
Hive
Hive ATZ-NG: Improving Hive Authorization
What is it?
- Initiative to improve Hive authorization; addresses authorization gaps in Hive
- SQL standard authorization, based on the SQL:2011 standard
What are the key improvements?
- Access policy managed with RDBMS-style SQL statements
  - GRANT action ON [table | view] TO [role | user]
- Access policy stored in the metastore
- The default authorization provider in Hadoop for Hive
- Fine-grained access control to data in Hive via users/roles
  - Control access on a per-table and per-column basis
- Improves the platform by creating a SQL-compliant security model for Hive
Hive Authorization: Objects
- Users: provided by the authentication system
- Roles: function like groups
- Tables: SQL tables
- Views: SQL views defined as queries involving tables or other views
Hive Authorization: Actions
Grant
- GRANT CREATE
- GRANT INSERT
- GRANT SELECT
- GRANT UPDATE
- GRANT DROP
- GRANT DELETE
- GRANT ALL
Revoke
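How users, roles, grants and revokes fit together can be sketched with a small in-memory model. This is a toy in Python, not Hive's implementation: the class and method names are hypothetical, the table name is made up, and the rule that ALL implies every action is an assumption of the sketch.

```python
class ToyHiveAuthorizer:
    """In-memory stand-in for grants that Hive would keep in the metastore."""

    def __init__(self):
        self.role_members = {}  # role name -> set of user names
        self.grants = set()     # (action, object, principal kind, principal name)

    def add_role_member(self, role, user):
        self.role_members.setdefault(role, set()).add(user)

    def grant(self, action, obj, kind, principal):
        # Models: GRANT action ON obj TO [role | user] principal
        self.grants.add((action, obj, kind, principal))

    def revoke(self, action, obj, kind, principal):
        # Models: REVOKE action ON obj FROM [role | user] principal
        self.grants.discard((action, obj, kind, principal))

    def is_allowed(self, user, action, obj):
        principals = [("user", user)]
        principals += [("role", r) for r, members in self.role_members.items() if user in members]
        return any(
            (a, obj, kind, name) in self.grants
            for kind, name in principals
            for a in (action, "ALL")  # assumption: ALL covers every action
        )


authz = ToyHiveAuthorizer()
authz.add_role_member("executives", "diane")
authz.grant("SELECT", "sales", "role", "executives")
print(authz.is_allowed("diane", "SELECT", "sales"))   # via the role
print(authz.is_allowed("diane", "INSERT", "sales"))   # never granted
authz.revoke("SELECT", "sales", "role", "executives")
print(authz.is_allowed("diane", "SELECT", "sales"))   # revoked again
```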
HBase
HBase ACL
  [olivier@sandbox ~]$ hbase shell
  hbase(main):001:0> list
  TABLE
  super_secret_squirrel
  hbase(main):002:0> scan 'super_secret_squirrel'
  ROW COLUMN+CELL
  ERROR: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions for user 'olivier' for scanner open on table super_secret_squirrel

  [hbase@sandbox ~]$ hbase shell
  hbase(main):001:0> grant 'olivier', 'R'
  hbase(main):002:0> user_permission 'super_secret_squirrel'
  User   Table,Family,Qualifier:Permission
  hbase  super_secret_squirrel,,: [Permission: actions=READ,WRITE,EXEC,CREATE,ADMIN]
  hbase(main):004:0> user_permission
  User     Table,Family,Qualifier:Permission
  olivier  hbase:acl,,: [Permission: actions=READ]
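The letters passed to grant map onto the action names printed by user_permission: R, W, X, C and A stand for READ, WRITE, EXEC, CREATE and ADMIN. The Python sketch below models a grant with and without a table scope. It is a toy illustration, and the ToyAccessController class is a hypothetical name rather than HBase's own access controller.

```python
# Permission letters used by the HBase shell `grant` command.
ACTIONS = {"R": "READ", "W": "WRITE", "X": "EXEC", "C": "CREATE", "A": "ADMIN"}


def expand(letters):
    """'RW' -> {'READ', 'WRITE'}"""
    return {ACTIONS[c] for c in letters}


class ToyAccessController:
    def __init__(self):
        # (user, table) -> set of actions; table None means a global grant
        self.grants = {}

    def grant(self, user, letters, table=None):
        self.grants.setdefault((user, table), set()).update(expand(letters))

    def allowed(self, user, action, table):
        scopes = [(user, None), (user, table)]  # a global grant applies to every table
        return any(action in self.grants.get(scope, set()) for scope in scopes)


acl = ToyAccessController()
print(acl.allowed("olivier", "READ", "super_secret_squirrel"))   # before the grant
acl.grant("olivier", "R")                                        # grant 'olivier', 'R'
print(acl.allowed("olivier", "READ", "super_secret_squirrel"))   # after the grant
print(acl.allowed("olivier", "WRITE", "super_secret_squirrel"))  # only R was granted
```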
YARN
YARN ACL
- Enable users to control only their own jobs
- Guarantee resources to the user: no one can jump to another queue
- Capacity scheduler
  - No need to specify the queue anymore: default queue per group / user
YARN ACL
Without ACL
  [tim@sandbox ~]$ mapred job -list
  Total jobs:1
  JobId                   State    StartTime      Username
  job_1396200012809_0002  RUNNING  1396201153018  olivier
  [tim@sandbox ~]$ mapred job -kill job_1396200012809_0002
  Killed job job_1396200012809_0002
With ACL
  [tim@sandbox ~]$ mapred job -kill job_1396192703139_0001...
  Exception in thread "main" java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: java.security.AccessControlException: User tim cannot perform operation MODIFY_APP on application_1396192703139_0001
      at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
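The two transcripts differ only in whether ACLs are enforced. A rough Python sketch of the decision follows, assuming the usual YARN ACL string format (comma-separated users, a space, comma-separated groups, with "*" meaning everyone and a single space meaning nobody). The function names are hypothetical, and cluster administrators are left out for brevity.

```python
def parse_acl(spec):
    """Parse 'user1,user2 group1,group2'; returns None when the ACL is '*' (everyone)."""
    if spec.strip() == "*":
        return None
    users, _, groups = spec.partition(" ")
    return (
        {u for u in users.split(",") if u},
        {g for g in groups.split(",") if g},
    )


def in_acl(acl, user, groups):
    if acl is None:
        return True
    acl_users, acl_groups = acl
    return user in acl_users or bool(acl_groups & set(groups))


def can_modify_app(user, groups, app_owner, queue_admin_acl, acls_enabled):
    """MODIFY_APP (for example, kill) is allowed for the owner or a queue administrator."""
    if not acls_enabled:
        return True  # without ACLs, any user can kill any job
    return user == app_owner or in_acl(parse_acl(queue_admin_acl), user, groups)


# tim tries to kill olivier's job, first without and then with ACLs enforced.
print(can_modify_app("tim", ["analysts"], "olivier", " ", acls_enabled=False))
print(can_modify_app("tim", ["analysts"], "olivier", " ", acls_enabled=True))
print(can_modify_app("olivier", ["analysts"], "olivier", " ", acls_enabled=True))
```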
Apache Ranger
Central Security Administration
- Delivers a single pane of glass for the security administrator
- Centralizes administration of security policy
- Ensures consistent coverage across the entire Hadoop stack
Set Up Authorization Policies
- File-level access control, flexible definition
- Control permissions
Monitor through Auditing
Authorization and Auditing w/ Ranger
Diagram labels:
- Enterprise Users, Legacy Tools, Integration API
- Ranger Administration Portal, Ranger Policy Server, Ranger Audit Server
- RDBMS, Hadoop distributed file system (HDFS)
- Hadoop components, each with a plugin: HDFS, HBase, Hive Server2, Knox, Storm, TBD*
* Future integration
Apache Knox
What does Perimeter Security really mean?
- Knox Gateway controls all Hadoop REST API access through the firewall
- Firewall required at the perimeter (today)
- Firewall only allows connections through specific ports from the Knox host
- Hadoop cluster mostly unaffected
Diagram: User, Firewall, Gateway, Firewall, Hadoop Services, connected via REST API
Why Knox?
Enhanced Security
- Protect network details
- Partial SSL for non-SSL services
- WebApp vulnerability filter
Centralized Control
- Central REST API auditing
- Service-level authorization
- Alternative to SSH edge node
Simplified Access
- Kerberos encapsulation
- Extends API reach
- Single access point
- Multi-cluster support
- Single SSL certificate
Enterprise Integration
- LDAP integration
- Active Directory integration
- SSO integration
- Apache Shiro extensibility
- Custom extensibility
Current Hadoop Client Model
- FileSystem and MapReduce Java APIs
- HDFS, Pig, Hive and Oozie clients (that wrap the Java APIs)
- Typical use of the APIs is via an edge node that is inside the cluster
- Users SSH to the edge node and execute API commands from a shell
Diagram: User, SSH, Edge Node, Hadoop
Hadoop REST APIs
Service: API
- WebHDFS: supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming
- WebHCat: job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands
- Hive: Hive REST API operations, JDBC/ODBC over HTTP
- HBase: HBase REST API operations
- Oozie: job submission and management, and Oozie administration
Useful for connecting to Hadoop from outside the cluster
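To make the REST access concrete, the Python sketch below builds the URL for a WebHDFS directory listing, once directly against a NameNode and once through a Knox gateway. It only constructs strings and sends nothing. The host names, ports and the topology name "default" are assumptions for illustration; the /webhdfs/v1 prefix, the op=LISTSTATUS operation and the /gateway/<topology> prefix follow the usual WebHDFS and Knox URL layout.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(base, path, op, **params):
    """Build a WebHDFS URL: <base>/webhdfs/v1<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"{base}/webhdfs/v1{quote(path)}?{query}"


# Direct access: the client must be able to reach the NameNode itself (hypothetical host).
direct = webhdfs_url("http://namenode.example.com:50070", "/data/olivier", "LISTSTATUS")

# Via Knox: a single HTTPS endpoint in front of the cluster (hypothetical host and topology).
via_knox = webhdfs_url(
    "https://knox.example.com:8443/gateway/default", "/data/olivier", "LISTSTATUS"
)

print(direct)
print(via_knox)
```

The service path stays the same in both cases; only the base changes, which is what lets the firewall expose the gateway alone.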
Data Protection
Data Protection
HDP allows you to apply data protection policy at two different layers across the Hadoop stack.
Layer: Storage
- What? Encrypt data on disk
- How?
  - Volume level: LUKS (Linux), BitLocker (Windows)
  - Native in Hadoop: HDFS TDE
  - Partners: Voltage, Protegrity, DataGuise, Vormetric
  - OS-level encryption
Layer: Transmission
- What? Encrypt data as it moves
- How?
  - Native in HDP: SSL & SASL
  - AES 256 for SSL & DTP with SASL
Data at Rest Encryption Protection
Encryption of data at rest: choices
1. HDFS TDE
   - Open source and native in Hadoop data encryption
   - Selective: encrypt directories/files in HDFS
2. Encryption through partners: Voltage, Protegrity, DataGuise
   - Encryption, masking, data redaction in HDFS, Hive, HBase
3. Leverage volume level with LUKS
   - Encrypt everything on the node
Diagram layers, top to bottom:
- Hadoop-level encryption: HDFS TDE
- Partner (Voltage, Protegrity, Dataguise, Vormetric)
- OS file-level encryption (open source: ecryptfs)
- Volume-level encryption (open source: LUKS, DMCrypt; BitLocker on Windows)
HDFS Transparent Data Encryption: How it works
Diagram labels:
- HDFS Client: crypto stream (r/w with DEK), KeyProvider API (HADOOP-10141)
- NameNode (HDFS-6134): KeyProvider API; encryption zone (attributes: EZKey ID, version), 1 to N encrypted files (attributes: EDEK, IV)
- Key Management System (KMS) (HADOOP-10433): holds the EZKs; DEKs and EDEKs are exchanged through the KeyProvider API
- Stack labels: Data Access, Security, YARN, HDFS (Hadoop Distributed File System), Data Management
Acronym: Description
- EZ: Encryption Zone (an HDFS directory)
- EZK: Encryption Zone Key; master key associated with all files in an EZ
- DEK: Data Encryption Key; unique key associated with each file. The EZ key is used to generate the DEK
- EDEK: Encrypted DEK; the NameNode only has access to the encrypted DEK
- IV: Initialization Vector
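The key hierarchy on this slide (EZK held by the KMS, one DEK per file, only the EDEK stored at the NameNode) is a form of envelope encryption. The Python sketch below walks through it with toy classes. Everything here is an assumption made for illustration: the class and function names are hypothetical, one IV is reused for both key wrapping and file data, and the cipher is a SHA-256 keystream XOR chosen only to keep the example free of dependencies. It is not AES and must not be used for real data.

```python
import hashlib
import os


def keystream_xor(key, iv, data):
    """Toy symmetric cipher (NOT AES): XOR data with a SHA-256 based keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + iv + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class ToyKMS:
    """Holds the encryption zone keys (EZKs); they never leave this object."""

    def __init__(self):
        self._ezks = {}

    def create_zone_key(self, name):
        self._ezks[name] = os.urandom(32)

    def generate_edek(self, name):
        iv, dek = os.urandom(16), os.urandom(32)
        return iv, keystream_xor(self._ezks[name], iv, dek)  # DEK wrapped with the EZK

    def decrypt_edek(self, name, iv, edek):
        return keystream_xor(self._ezks[name], iv, edek)


class ToyNameNode:
    """Stores only (EZ key name, IV, EDEK) per file, never the plain DEK."""

    def __init__(self, kms):
        self.kms = kms
        self.files = {}

    def create(self, path, zone_key):
        iv, edek = self.kms.generate_edek(zone_key)
        self.files[path] = (zone_key, iv, edek)
        return self.files[path]


def client_write(nn, kms, path, zone_key, plaintext):
    name, iv, edek = nn.create(path, zone_key)
    dek = kms.decrypt_edek(name, iv, edek)      # client asks the KMS to unwrap the EDEK
    return keystream_xor(dek, iv, plaintext)    # ciphertext is what reaches the DataNodes


def client_read(nn, kms, path, ciphertext):
    name, iv, edek = nn.files[path]
    dek = kms.decrypt_edek(name, iv, edek)
    return keystream_xor(dek, iv, ciphertext)


kms = ToyKMS()
kms.create_zone_key("ezk1")
nn = ToyNameNode(kms)
stored = client_write(nn, kms, "/secure/a.txt", "ezk1", b"sales figures")
print(stored != b"sales figures")
print(client_read(nn, kms, "/secure/a.txt", stored))
```

The point of the split is that compromising the NameNode or a DataNode alone yields only EDEKs and ciphertext; the EZK needed to unwrap them stays in the KMS.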
Summary
Hadoop Security with HDP
Centralized Security Administration with Ranger
Authentication: Who am I/prove it?
- HDP 2.2: Kerberos in native Apache Hadoop; HTTP/REST API secured with Apache Knox
- Ranger: as-is, works with current authentication methods
Authorization: Restrict access to explicit data
- HDP 2.2: HDFS permissions, HDFS ACL, Hive ATZ-NG
- Ranger: HDFS, Hive and HBase fine-grained access control; RBAC
Audit: Understand who did what
- HDP 2.2: audit logs in HDFS & MR; Knox
- Ranger: centralized audit reporting; policy and access history
Data Protection: Encrypt data at rest & in motion
- HDP 2.2: wire encryption in Hadoop; HDP data encryption; partner solutions
- Ranger: future integration
HDP Security Features
HDP with Ranger
Authentication
- Kerberos support
- Perimeter security: for services and REST API
Authorization
- Fine-grained access control: HDFS, HBase and Hive
- Role-based access control
- Column level
- Permission support: Create, Drop, Index, Lock, User
Auditing
- Resource access auditing: extensive auditing
- Policy auditing
HDP Security Features
HDP w/ Advanced Security
Data Protection
- Wire encryption
- Volume encryption
- File/column encryption
Reporting
- Global view of policies and audit data
Manage
- User/group mapping
- Global policy manager, Web UI
- Delegated administration
+ Partners
END
Questions?