Integrating Kerberos into Apache Hadoop



Similar documents
Hadoop Security Design

Hadoop Security Design Just Add Kerberos? Really?

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

TIBCO Spotfire Platform IT Brief

Like what you hear? Tweet it using: #Sec360

-- Mario G. Salvadori, Why Buildings Fall Down*

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

How To Create A Data Visualization With Apache Spark And Zeppelin

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Large scale processing using Hadoop. Ján Vaňo

Cloudera Manager Training: Hands-On Exercises

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop & its Usage at Facebook

CS54100: Database Systems

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Hadoop Architecture. Part 1

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Deploying Hadoop with Manager

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Note that if at any time during the setup process you are asked to login, click either Cancel or Work Offline depending upon the prompt.

Communicating with the Elephant in the Data Center

Hadoop & its Usage at Facebook

How to set up Outlook Anywhere on your home system

Where is Hadoop Going Next?

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Open source Google-style large scale data analysis with Hadoop

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

From Wikipedia, the free encyclopedia

Hadoop Security Analysis NOTE: This is a working draft. Notes are being collected and will be edited for readability.

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Hadoop Job Oriented Training Agenda

HDFS Users Guide. Table of contents

Implementation of Hadoop Distributed File System Protocol on OneFS Tanuj Khurana EMC Isilon Storage Division

Cloudera Backup and Disaster Recovery

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

The Hadoop Distributed File System

THE HADOOP DISTRIBUTED FILE SYSTEM

Qsoft Inc

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

HDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1

Insights to Hadoop Security Threats

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Design and Evolution of the Apache Hadoop File System(HDFS)

Apache Hadoop. Alexandru Costan

How to Install and Configure EBF15328 for MapR or with MapReduce v1

A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Hadoop Ecosystem B Y R A H I M A.

CDH AND BUSINESS CONTINUITY:

Apache Hadoop new way for the company to store and analyze big data

Introduction to Cloud Computing

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Application Development. A Paradigm Shift

MapReduce, Hadoop and Amazon AWS

Apache Hadoop FileSystem and its Usage in Facebook

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to HDFS. Prasanth Kothuri, CERN

System Protection for Hyper-V User Guide

Secure Your Hadoop Cluster With Apache Sentry (Incubating) Xuefu Zhang Software Engineer, Cloudera April 07, 2014

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Cloudera Backup and Disaster Recovery

Hadoop: Embracing future hardware

Using Kerberos for Web Authentication. Wesley Craig University of Michigan

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Big Data and Apache Hadoop s MapReduce

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Using Data Mining and Machine Learning in Retail

BIG DATA HADOOP TRAINING

Kerberos and Single Sign On with HTTP

Big Data With Hadoop

Single Sign-on (SSO) technologies for the Domino Web Server

docs.hortonworks.com

BlackBerry Enterprise Service 10. Secure Work Space for ios and Android Version: Security Note

Processing of Hadoop using Highly Available NameNode

Accelerating and Simplifying Apache

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Can Storage Fix Hadoop

Extending Hadoop beyond MapReduce

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

Big Data Management and Security

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Introduction to Cloud Computing

A Brief Outline on Bigdata Hadoop

Hadoop Scalability at Facebook. Dmytro Molkov YaC, Moscow, September 19, 2011

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

Optimize the execution of local physics analysis workflows using Hadoop

Use Enterprise SSO as the Credential Server for Protected Sites

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

How to Hadoop Without the Worry: Protecting Big Data at Scale

Big Data - Infrastructure Considerations

Introduction to MapReduce and Hadoop

Olivier Renault Solu/on Engineer Hortonworks. Hadoop Security

Configuration Worksheets for Oracle WebCenter Ensemble 10.3

Pulse Policy Secure. UAC Solution Guide for SRX Series Services Gateways. Product Release 5.1. Document Revision 1.0 Published:

Transcription:

Integrating Kerberos into Apache Hadoop Kerberos Conference 2010 Owen O Malley owen@yahoo-inc.com Yahoo s Hadoop Team

Who am I An architect working on Hadoop full time Mainly focused on MapReduce Tech-lead on adding security to Hadoop Before Hadoop Yahoo Search WebMap Before Yahoo NASA, Sun PhD from UC Irvine

What is Hadoop? A framework for storing and processing big data on lots of commodity machines. Up to 4,000 machines Up to 20 PB Open Source Apache project High reliability done in software Automated failover for data and computation Implemented in Java

What is Hadoop? HDFS Distributed File System Combines cluster s local storage into a single namespace. All data is replicated to multiple machines. Provides locality information to clients MapReduce Batch computation framework Tasks re-executed on failure User code wrapped around a distributed sort Optimizes for data locality of input

What is Hadoop NOT? Hadoop is aimed at moving large amounts of data efficiently. It is not aimed at doing real-time reads or updates. Hadoop moves data like a freight train, slow to start but very high bandwidth. Databases answer queries quickly, but can t match the bandwidth.

Case Study: Yahoo Front Page Personalized for each visitor twice the engagement Result: twice the engagement Recommended links News Interests Top Searches +79% clicks vs. randomly selected +160% clicks vs. one size fits all +43% clicks vs. editor selected 6

Thousands of Servers Petabytes Scaling and Stability Yahoo is the largest contributor and user of Hadoop. It has become the platform of choice for big data analytics. Moved from Research to Science to Production 90 80 70 60 50 40 30 20 38K Servers 170 PB Storage 1M+ Monthly Jobs Hadoop Servers Hadoop Storage (PB) 250 200 150 100 50 10 0 Today 2006 2007 2008 2009 2010 0

Hadoop Community 2007 2008 2009 2010 The Datagraph Blog 8

Problem Yahoo! has more yahoos than clusters. Hundreds of yahoos using Hadoop each month 38,000 computers in ~20 Hadoop clusters. Requires isolation or trust. Different users need different data. Not all yahoos should have access to sensitive data financial data and PII In Hadoop 0.20, easy to impersonate. Segregate different data on separate clusters 9

Solution Prevent unauthorized HDFS access All HDFS clients must be authenticated. Including tasks running as part of MapReduce jobs And jobs submitted through Oozie. Users must also authenticate servers Otherwise fraudulent servers could steal credentials Integrate Hadoop with Kerberos Provides well tested open source distributed authentication system. 10

Requirements Security must be optional. Not all clusters are shared between users. Hadoop must not prompt for passwords Makes it easy to make trojan horse versions. Must have single sign on. Must handle the launch of a MapReduce job on 4,000 Nodes

Definitions Authentication Determining the user Hadoop 0.20 completely trusted the user User passes their username and groups over wire We need it on both RPC and Web UI. Authorization What can that user do? HDFS had owners, groups and permissions since 0.16. Map/Reduce had nothing in 0.20.

Authentication Changes low-level transport RPC authentication using SASL Kerberos (GSSAPI) Token Simple Browser HTTP secured via plugin Use auth_to_local name translation to map principals to user names.

Authorization HDFS Command line and semantics unchanged Web UI enforces authentication MapReduce added Access Control Lists Lists of users and groups that have access. mapreduce.job.acl-view-job view job mapreduce.job.acl-modify-job kill or modify job Code for determining group membership is pluggable.

Delegation Tokens To prevent a flood of authentication requests at the start of a job, NameNode can create delegation tokens. Allows user to authenticate once and pass credentials to all tasks of a job. JobTracker automatically renews tokens while job is running. Cancels tokens when job finishes.

Primary Communication Paths 16

API Changes Very Minimal API Changes MapReduce added secret credentials Available from JobConf and JobContext Never displayed via Web UI Automatically get tokens for HDFS Primary HDFS, File{In,Out}putFormat, and DistCp Can set mapreduce.job.hdfs-servers 17

Web UIs Hadoop relies on the Web UIs. These need to be authenticated also Web UI authentication is pluggable. Yahoo uses an internal package We have written a very simple static auth plug-in SPNEGO plugin being developed All servlets enforce permissions.

Proxy-Users Some services access HDFS and MapReduce as other users. Can t store credentials, since they expire. Configure services with the proxy user: Group of users that the proxy can impersonate Which hosts they can impersonate from Provides control without over burdening operations. 19

Out of Scope Encryption RPC transport easy Block transport protocol difficult On disk difficult File Access Control Lists Still use Unix-style owner, group, other permissions Non-Kerberos Authentication Much easier now that framework is available

Schedule The security team worked hard to get security added to Hadoop on schedule. Roughly 6 months of calendar time. Security Development team: Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang Currently on production clusters

Questions? Questions should be sent to: common/hdfs/mapreduce-user@hadoop.apache.org Security holes should be sent to: security@hadoop.apache.org Available from http://developer.yahoo.com/hadoop/distribution/ Also a VM with Hadoop cluster with security Thanks!