Extended Attributes and Transparent Encryption in Apache Hadoop

Uma Maheswara Rao G, Yi Liu (刘轶)

Who are we?
Uma Maheswara Rao G (umamahesh@apache.org)
- Software Engineer at Intel
- PMC member and committer, Apache Hadoop
- PMC member and committer, Apache BookKeeper
Yi Liu (刘轶) (yliu@apache.org)
- Software Engineer at Intel
- Active committer, Apache Hadoop
- PMC member and committer, Apache Tajo
- Senior big data security expert

Intel Big Data Team
Global team, local focus
- Worldwide teams (China, US and India), more than 80% in China
- Local collaborations (industry and academic) are a high priority
Greater impact through open source
- Active open source development (Spark, Hadoop, HBase, Storm, etc.)
- Widely used in the industry (from Facebook to Alibaba to Cloudera to China Mobile)
- Strong influence in the open source community, with about 10 project committers in the team
Technology and innovation oriented
- Next generations of big data technologies: real-time, in-memory, complex analytics (statistical modeling, machine learning, graph analysis)
- Bridging advanced research and real-world applications

Agenda
- Extended Attributes
- Transparent Encryption

Hadoop Ecosystem (diagram)
- Processing: batch (MapReduce, Hive, Pig), search, SQL, stream, Spark, machine learning
- Services: ZooKeeper, HBase
- Resource management: YARN
- Storage: HDFS (Hadoop Distributed File System)
- Data integration: Sqoop, Flume

HDFS Extended Attributes (HDFS-2006)

Introduction
- Extended attributes (XAttrs) let users associate additional metadata with files and directories.
- An XAttr is a key-value pair that can be set on any INode.
- The file system does not interpret XAttrs.
- The feature is derived from Linux XAttrs and is functionally similar.
- Users can choose the encoding format of XAttr values.

Namespaces of XAttrs
XAttr names must be prefixed with a namespace. HDFS supports five XAttr namespaces:
- USER: access is governed by the file/directory permission bits. On sticky directories, only the owner and privileged users can write.
- TRUSTED: visible and accessible only to privileged users.
- SYSTEM: not visible to users; reserved for internal use by the system.
- SECURITY: not visible to users; reserved for internal use by the system to store security information.
- RAW: like SYSTEM attributes, but accessible to the superuser only, and only on files/directories reached through /.reserved/raw.
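
The prefix rule above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the helper name split_xattr_name is hypothetical, and treating the namespace prefix as case-insensitive is an assumption made for the sketch.

```python
# Illustrative only: split_xattr_name is a hypothetical helper, not an HDFS API.
NAMESPACES = ("user", "trusted", "system", "security", "raw")


def split_xattr_name(full_name):
    """Split '<namespace>.<name>' and check that the namespace is known."""
    prefix, sep, name = full_name.partition(".")
    if not sep or not name:
        raise ValueError("XAttr name must look like '<namespace>.<name>'")
    namespace = prefix.lower()  # assumption: prefix is matched case-insensitively
    if namespace not in NAMESPACES:
        raise ValueError("unknown XAttr namespace: " + prefix)
    return namespace, name


print(split_xattr_name("user.myXattr"))
```

Only the first dot separates the namespace, so a name such as trusted.a.b keeps "a.b" as the attribute name.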

Implementation details
- XAttrs are implemented as a separate INode feature in the NameNode.
- XAttrs are persisted as part of the INode information.
- XAttrs are validated against the namespaces at the NameNode.
- No compatibility issues: upgrades are handled automatically because XAttrs are stored as an INode feature.
- XAttrs development was tracked under HDFS-2006.

Configuration
- dfs.namenode.xattrs.enabled: whether XAttr support is enabled in HDFS.
- dfs.namenode.fs-limits.max-xattrs-per-inode: maximum number of XAttrs per INode. Default: 32.
- dfs.namenode.fs-limits.max-xattr-size: maximum combined size of the name and value of an XAttr. Default: 16384 bytes.
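
To make the two limits concrete, here is a small Python sketch of the checks they imply, using the default values from the slide. The function check_xattr_limits is hypothetical, and counting the UTF-8 bytes of the full prefixed name is a simplifying assumption; the NameNode's exact accounting may differ.

```python
# Defaults taken from the configuration slide.
MAX_XATTRS_PER_INODE = 32   # dfs.namenode.fs-limits.max-xattrs-per-inode
MAX_XATTR_SIZE = 16384      # dfs.namenode.fs-limits.max-xattr-size, in bytes


def check_xattr_limits(existing, name, value):
    """Hypothetical check; 'existing' maps XAttr names already on the inode to bytes.

    Returns the combined size of name and value when both limits are respected.
    """
    # Assumption: size is the UTF-8 length of the full name plus the value length.
    size = len(name.encode("utf-8")) + len(value)
    if size > MAX_XATTR_SIZE:
        raise ValueError("XAttr is too big: %d > %d bytes" % (size, MAX_XATTR_SIZE))
    # Replacing an existing XAttr does not increase the count.
    count = len(existing) + (0 if name in existing else 1)
    if count > MAX_XATTRS_PER_INODE:
        raise ValueError("too many XAttrs on inode: %d > %d" % (count, MAX_XATTRS_PER_INODE))
    return size


print(check_xattr_limits({}, "user.a", b"xyz"))
```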

Use Cases
- Storing encrypted data encryption keys (EDEKs) as XAttrs in an encrypted HDFS cluster environment
- Storing the storage policy for heterogeneous storage

Release
- HDFS-2006 branch merged to trunk and branch-2
- Feature released in hadoop-2.5.0

How to use?
- Java API
- Command line
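
The slide content for these two routes did not survive transcription. As a hedged illustration of the command-line route, the Python sketch below only builds argument lists for the hadoop fs -setfattr and -getfattr shell commands; it does not run them. The helper names, the path /file and the attribute user.myAttr are made up for the example. On the Java side, the corresponding FileSystem methods are setXAttr, getXAttr, getXAttrs, listXAttrs and removeXAttr.

```python
import shlex


def setfattr_cmd(path, name, value):
    """Build 'hadoop fs -setfattr -n <name> -v <value> <path>' as an argument list."""
    return ["hadoop", "fs", "-setfattr", "-n", name, "-v", value, path]


def getfattr_cmd(path, name=None):
    """Build a -getfattr command: -d dumps all visible XAttrs, -n selects one."""
    selector = ["-d"] if name is None else ["-n", name]
    return ["hadoop", "fs", "-getfattr"] + selector + [path]


def removefattr_cmd(path, name):
    """Build 'hadoop fs -setfattr -x <name> <path>', which removes an XAttr."""
    return ["hadoop", "fs", "-setfattr", "-x", name, path]


# Hypothetical path and attribute, for illustration only.
for cmd in (setfattr_cmd("/file", "user.myAttr", "myValue"),
            getfattr_cmd("/file"),
            removefattr_cmd("/file", "user.myAttr")):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to subprocess.run on a machine where the hadoop client is installed and configured.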

Transparent Encryption in Hadoop (HADOOP-10150 & HDFS-6134)

Outline
- Transparent to upper-layer applications, with transparent access to encrypted files by all HDFS clients.
- High performance: encryption is not a bottleneck.
- Encryption is independent of file type and data format.
- Scalable key management.
- End-to-end encryption: data can only be encrypted and decrypted by the client. This satisfies two typical requirements for encryption: at-rest encryption and in-transit encryption.
- Security: HDFS never handles unencrypted data or unencrypted data encryption keys.

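Write file (diagram)
Components: DFS Client, KMS with backing keystore, NameNodes (NN), DataNodes (DN), HDFS.
Steps legible in the transcription:
2. The NameNode takes an EDEK from its cache and persists it to the file metadata.
4. Decrypt the EDEK and get the DEK (at the KMS).
5. The DFS Client encrypts the data using the DEK.
The EDEK cache is filled in the background.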

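Read file (diagram)
Components: DFS Client, KMS with backing keystore, NameNodes (NN), DataNodes (DN), HDFS.
Steps legible in the transcription:
2. The NameNode reads the EDEK from the file metadata.
4. Decrypt the EDEK and get the DEK (at the KMS).
6. The DFS Client decrypts the data using the DEK.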
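
The write and read paths can be mimicked with a toy Python model. Everything here is a hypothetical stand-in: ToyKMS, file_metadata and datanode_blocks are not Hadoop classes, and toy_stream_xor is a SHA-256 based keystream used only so the sketch runs with the standard library. It is not AES-CTR and must not be used for real encryption. The sketch shows only the key handling: the EDEK lives in the file metadata, the KMS alone can turn it into a DEK, and the stored bytes are always ciphertext.

```python
import hashlib
import os


def toy_stream_xor(key, iv, data):
    """Toy keystream cipher (NOT AES-CTR, illustration only). XOR is its own inverse."""
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + iv + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)


class ToyKMS:
    """Hypothetical stand-in for the KMS: only it holds encryption zone keys."""

    def __init__(self):
        self._zone_keys = {}

    def create_key(self, name):
        self._zone_keys[name] = os.urandom(16)

    def generate_edek(self, key_name):
        # Make a fresh DEK and return it only in encrypted form (the EDEK).
        dek, iv = os.urandom(16), os.urandom(16)
        edek = toy_stream_xor(self._zone_keys[key_name], iv, dek)
        return {"key_name": key_name, "edek_iv": iv, "edek": edek}

    def decrypt_edek(self, info):
        return toy_stream_xor(self._zone_keys[info["key_name"]], info["edek_iv"], info["edek"])


kms = ToyKMS()
kms.create_key("mykey")
file_metadata = {}    # stands in for NameNode file metadata
datanode_blocks = {}  # stands in for DataNode storage


def write_file(path, plaintext):
    info = kms.generate_edek("mykey")
    info["file_iv"] = os.urandom(16)
    file_metadata[path] = info    # EDEK is persisted with the file metadata
    dek = kms.decrypt_edek(info)  # client obtains the DEK from the KMS
    datanode_blocks[path] = toy_stream_xor(dek, info["file_iv"], plaintext)


def read_file(path):
    info = file_metadata[path]    # EDEK is read from the file metadata
    dek = kms.decrypt_edek(info)
    return toy_stream_xor(dek, info["file_iv"], datanode_blocks[path])


write_file("/zone/helloworld", b"hello world")
print(read_file("/zone/helloworld"))
```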

Implementation details
- Positioned read (pread) is supported. Because AES-CTR is used, the original file and the cipher file have the same length and correspond byte for byte.
- AES-NI support on Intel platforms is used to improve encryption performance, for a 20x speedup.
- We define encryption zones; files in a zone are transparently encrypted and decrypted.
- We use two layers of keys: the encryption zone key (EZK), and the data encryption key (DEK), which is encrypted by the EZK. Each file has a different DEK.
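
The byte-for-byte correspondence is what makes positioned reads cheap: in counter mode, the keystream for any offset can be derived without touching earlier data. The Python sketch below shows that arithmetic for standard CTR with a 128-bit big-endian counter. The function name ctr_position is hypothetical, and the sketch does not claim to reproduce Hadoop's exact IV calculation.

```python
AES_BLOCK_SIZE = 16  # bytes


def ctr_position(init_iv, offset):
    """Return (counter_block, skip) for a read that starts at byte 'offset'.

    counter_block is the initial IV advanced by the number of whole AES blocks
    before the offset; skip is how many keystream bytes of that block to discard.
    """
    if len(init_iv) != AES_BLOCK_SIZE:
        raise ValueError("IV must be 16 bytes")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    block_index, skip = divmod(offset, AES_BLOCK_SIZE)
    # Assumption: the counter is a 128-bit big-endian integer that wraps on overflow.
    counter = (int.from_bytes(init_iv, "big") + block_index) % (1 << 128)
    return counter.to_bytes(AES_BLOCK_SIZE, "big"), skip


iv = bytes(15) + b"\x05"
counter_block, skip = ctr_position(iv, 35)
print(counter_block.hex(), skip)
```

For offset 35 the read starts in the third AES block (index 2), three bytes in, so the counter advances by 2 and the first 3 keystream bytes are skipped.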

Encryption/Decryption for HDFS Blocks (diagram)

User Ops
Create Key
  hadoop key create <keyname> [-cipher <cipher>] [-size <size>] [-description <description>] [-attr <attribute=value>] [-provider <provider>]
Roll Key
  hadoop key roll <keyname> [-provider <provider>]
Delete Key
  hadoop key delete <keyname> [-provider <provider>]
List Keys
  hadoop key list [-provider <provider>] [-metadata]

Admin Ops
Create Encryption Zone
  hdfs crypto -createZone -keyName <keyName> -path <path>
List Encryption Zones
  hdfs crypto -listZones

Usage Example
As a normal user, create a new encryption key:
  $ hadoop key create mykey
As the super user, create a new empty directory and make it an encryption zone:
  $ sudo -u hdfs hadoop fs -mkdir /zone
  $ sudo -u hdfs hdfs crypto -createZone -keyName mykey -path /zone
Change its ownership to the normal user:
  $ sudo -u hdfs hadoop fs -chown myuser:myuser /zone
As the normal user, put a file in and read it back out:
  $ hadoop fs -put helloworld /zone
  $ hadoop fs -cat /zone/helloworld

Release
- fs-encryption branch merged to trunk and branch-2
- Feature released in hadoop-2.6.0

Performance (chart): TestDFSIO benchmark with AES-NI enabled

Call for Collaborations
Close collaborations with local ecosystems
- Intel Big Data engineering teams, industry partners and academic research
Building the next generations of big data technologies
- Real-time, in-memory, complex analytics, etc.
Bridging advanced research and real-world applications
- Highly impactful through open source, university research (e.g., UC Berkeley) and industry adoption (e.g., Alibaba, Cloudera, etc.)

Q & A Thanks!

Notices and Disclaimers: Intel, the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Optimization Notice Intel's compilers may or may not optimize to the same degree for non-intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. No computer system can be absolutely secure. 
Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from publish.