Big Data Management and Security


Big Data Management and Security: Audit Concerns and Business Risks
Tami Frankenfield, Sr. Director, Analytics and Enterprise Data, Mercury Insurance

What is Big Data? Velocity + Volume + Variety = Value

The Big Data Journey
Big Data is the next step in the evolution of analytics, answering critical and often highly complex business questions. That journey, however, seldom starts with technology; it requires a broad approach to realize the desired value. Its stages:

- Reporting: report on standard business processes.
- Business Intelligence: focus on what happened and, more importantly, why it happened.
- Modeling and Predicting: leverage information for forecasting and predictive purposes.
- Big Data: leverage large volumes of multi-structured data for advanced data mining and predictive purposes.
- Fast Data: analyze streams of real-time data, identify significant events, and alert other systems.
- Data Management: establish initial processes and standards.

Implications of Big Data
Enterprises face both the challenge and the opportunity of storing and analyzing Big Data:

- Handling more than 10 TB of data
- Data with a changing structure, or no structure at all
- Very high-throughput systems: for example, globally popular websites with millions of concurrent users and thousands of queries per second
- Business requirements that differ from the relational database model: for example, swapping ACID (Atomicity, Consistency, Isolation, Durability) for BASE (Basically Available, Soft State, Eventually Consistent)
- Processing of machine learning queries that are inefficient or impossible to express in SQL

"Shift thinking from the old world where data was scarce, to a world where business leaders demonstrate data fluency." - Forrester
"Information governance focus needs to shift away from more concrete, black and white issues centered on truth, toward more fluid shades of gray centered on trust." - Gartner
Enterprises can leverage the data influx to glean new insights; Big Data represents a largely untapped source of customer, product, and market intelligence. - IBM CIO Study

Taking a Look at the Big Data Ecosystem
Big Data is supported and moved forward by a number of capabilities throughout the ecosystem. In many cases, vendors and resources play multiple roles and are continuing to evolve their technologies and talent to meet changing market demands. The ecosystem spans:

- Big Data Analytics
- BI / Data Visualization
- Big Data Integration
- Appliances
- Big Data File and Database Management
- Stream Processing and Analysis

Big Data Storage and Management: The Apache Hadoop Ecosystem
A Hadoop-based solution is designed to leverage distributed storage and a parallel processing framework (MapReduce) to address the big data problem. Hadoop is an Apache Software Foundation open source project. Core components of the ecosystem include:

- HDFS: the distributed storage layer
- Hadoop MapReduce: job scheduling and the parallel processing engine
- HBase: a distributed database on top of HDFS
- Hive: SQL-style querying
- Pig: data-flow scripting
- Oozie: workflow and scheduling
- ZooKeeper: coordination
- Flume and Sqoop: data integration

Connectors link the cluster to traditional data warehouses and databases, advanced analytics tools, and MPP or in-memory solutions.
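The MapReduce model at the heart of this stack can be illustrated without a cluster. The sketch below mimics the map, shuffle, and reduce phases over an in-memory word count; the corpus and counts are illustrative, not from the deck.

```python
from collections import defaultdict

# Hypothetical toy corpus; on a real cluster each line would live in an HDFS block.
lines = ["big data security", "big data management", "hadoop security"]

# Map phase: emit (word, 1) pairs, as a Hadoop mapper would.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key (done by the framework between map and reduce).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
```

On a cluster the same three phases run in parallel across nodes; only the grouping step moves data over the network.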

Big Data Storage and Management
The need for Big Data storage and management has resulted in a wide array of solutions, spanning advanced relational databases, non-relational databases, and file systems. The choice of solution is dictated primarily by the use case and the underlying data type.

- Non-relational databases have been developed to address the need for semi-structured and unstructured data storage and management.
- Relational databases are evolving to address the need for structured Big Data storage and management.
- Hadoop HDFS is a widely used distributed file system designed for Big Data processing.

Big Data Security Scope
Big Data security should address four main requirements: perimeter security and authentication, authorization and access, data protection, and audit and reporting. Centralized administration and coordinated enforcement of security policies should also be considered. These requirements apply across environments: production, research, and development.

- Perimeter Security & Authentication: required to guard access to the system, its data, and its services. Authentication makes sure the user is who he claims to be. Two levels of authentication need to be in place: perimeter and intra-cluster. Typical building blocks: Kerberos, LDAP/Active Directory, and business-unit security tool integration.
- Authorization & Access: required to manage access and control over data, resources, and services. Authorization can be enforced at varying levels of granularity and in compliance with existing enterprise security standards. Typical building blocks: file and directory permissions, ACLs, and role-based access controls.
- Data Protection: required to control unauthorized access to sensitive data, whether at rest or in motion. Data protection should be considered at the field, file, and network levels, and appropriate methods should be adopted: encryption at rest and in motion, data masking, and tokenization.
- Audit & Reporting: required to maintain and report activity on the system. Auditing is necessary for managing security compliance and other requirements such as security forensics, via audit data and audit reporting.
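As a rough illustration of the authorization pillar, the sketch below models role-based access control as sets of (service, resource, action) grants. All users, roles, and resources here are hypothetical; a real cluster would delegate this to a dedicated tool such as Apache Sentry or Ranger.

```python
# Hypothetical role grants: which (service, resource, action) each role allows.
ROLE_GRANTS = {
    "analyst": {("hive", "sales_db", "read")},
    "admin": {("hive", "sales_db", "read"), ("hive", "sales_db", "write")},
}

# Hypothetical user-to-role assignments.
USER_ROLES = {"alice": {"analyst"}, "bob": {"admin"}}

def is_authorized(user, service, resource, action):
    """Return True if any of the user's roles grants the requested action."""
    return any(
        (service, resource, action) in ROLE_GRANTS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

The same check can be enforced at coarser or finer granularity (database, table, column) simply by changing what counts as a resource.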

Big Data Security, Access, Control and Encryption
Security, access control, and encryption must integrate across the major components of the Big Data landscape, at every level of granularity: database, table, and column in Hive; table and column family in HBase; folder and file in HDFS. The security and access-control UI should provide the ability to define roles, add and remove users, assign roles to users, and scale across platforms, backed by LDAP/Active Directory, restricted access, and encrypted data.

Security, Access, Control and Encryption: Details

Guidelines & Considerations

Encryption / Anonymization
- Data should be natively encrypted during ingestion into Hadoop, regardless of whether the data is loaded into HDFS, Hive, or HBase.
- Encryption key management should be maintained at the Hadoop administrator level, thereby properly preserving the sanctity of the encryption.

Levels of granularity in relation to data access and security
- HDFS: folder- and file-level access control
- Hive: table- and column-level access control
- HBase: table- and column-family-level access control

Security implementation protocol
- Security and data-access controls should be maintained at the lowest level of detail within a Hadoop cluster.
- The overhead of security and data-access controls should be minimal on any CRUD operation.

Manageability / scalability
- A GUI to create and maintain roles, users, etc., enabling security and data access for all areas of Hadoop.
- Ability to export and share the information held in the GUI across other applications and platforms.
- The same GUI or application should be able to scale across multiple platforms (Hadoop and non-Hadoop).

Key Terms Defined
In Hadoop, Kerberos currently provides two aspects of security:
- Authentication: enabled by mapping UNIX-level Kerberos IDs to Hadoop. In a mature environment, Kerberos is linked to an organization's Active Directory or LDAP system. Maintenance of the mapping is typically complicated.
- Authorization: the mapping done at the authentication level is leveraged for authorization, and users can be authorized to access data at the HDFS folder level.

Point of View
Large organizations should adopt a more scalable solution with finer-grained access control and encryption/anonymization of data. Select a tool that is architecturally highly scalable and offers: levels of granularity in data access and security, a sound security implementation protocol, manageability and scalability, and encryption/anonymization.
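One common way to anonymize a sensitive field at ingestion time is deterministic tokenization with a keyed hash: the token is stable, so records can still be joined on it, but cannot be reversed without the key. A minimal sketch, assuming a hypothetical admin-held key and an illustrative record:

```python
import hashlib
import hmac

# Placeholder secret; in practice this key would be held by the Hadoop
# administrators, per the key-management guideline above.
SECRET_KEY = b"managed-by-hadoop-admins"

def tokenize(value: str) -> str:
    """Replace a sensitive value with a deterministic, irreversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical record being ingested; only the sensitive field is replaced.
record = {"name": "Jane Doe", "ssn": "123-45-6789"}
safe_record = {**record, "ssn": tokenize(record["ssn"])}
```

Because the same input always yields the same token under a given key, downstream joins and aggregations still work on the anonymized column.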

Multi-Tenancy: Details

Guidelines & Considerations
Multi-tenancy in Hadoop primarily needs to address two things:
1. Resource sharing between multiple applications, ensuring that no application impacts the cluster through heavy usage
2. Data security and auditing: users and applications of one application should not be able to access the HDFS data of other applications

Vendors vary in their support for a POSIX file system: MapR provides it by default, while IBM BigInsights and Pivotal provide POSIX-compliant add-on packages (General Parallel File System and Isilon, respectively). POSIX-compliant file systems simplify the initial migration onto the Hadoop distribution; their relative advantage over other distributions decreases as the load approaches steady state.

Access Control Lists (ACLs) facilitate multi-tenancy by ensuring that only certain groups and users can run jobs.

With the evolution of YARN, the capacity scheduler can be used to set a minimum guaranteed resource per application as a percentage of available RAM (a percentage of CPU will be possible in the future). This is a more efficient way of sharing resources between different groups within an organization. Before YARN, resources in Hadoop were available only as a number of map and reduce slots, so although multi-tenancy was possible, it was not very efficient.

Key Terms Defined
- POSIX-compliant file system: provides high availability for Hadoop by fully distributing name node data, efficiently removing the single point of failure; enables HDFS federation.
- MapR volume: a collection of directories containing the data for a single business unit, application, or user group. Policies may be applied to a volume to enforce security or ensure availability.
- YARN: decouples workload management from resource management, enabling multi-tenancy.

Point of View
Multi-tenancy begins with a multi-user-capable platform. In a clustered computing environment, it further involves multi-tenancy at the data (file system) level, the workload (jobs and tasks) level, and the systems (node and host) level. The recent advancements of YARN and Hadoop 2.0 are quickly closing the feature gap between proprietary file systems and HDFS.
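The capacity scheduler's guarantee is simple arithmetic: each queue's minimum share is its configured percentage of the cluster's resources. A sketch with a hypothetical 256 GB cluster and three illustrative tenant queues:

```python
# Hypothetical cluster memory and per-tenant queue capacities (percent).
total_memory_gb = 256
queue_capacity_pct = {"etl": 50, "analytics": 30, "adhoc": 20}

# Queue capacities must cover the whole cluster, so no queue's guarantee
# can starve another's.
assert sum(queue_capacity_pct.values()) == 100

# Each queue's guaranteed minimum is its percentage of total memory;
# the same idea applies to CPU shares in later YARN versions.
guaranteed_gb = {
    queue: total_memory_gb * pct / 100
    for queue, pct in queue_capacity_pct.items()
}
```

Above its guarantee, a queue can borrow idle capacity from the others, which is exactly what made YARN's model more efficient than fixed map/reduce slots.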

Big Data Security: Options and Recommendations

Perimeter Security & Authentication
Background:
- Hadoop supports Kerberos as a third-party authentication mechanism, used both for intra-service authentication (NameNode-DataNode, Task Tracker, Oozie, etc.) and for end-user-to-service authentication (Hue, the file browser, the CLI, etc.)
- Kerberos uses ticket-based authentication and operates within a realm (the Hadoop cluster) as well as across realms; users and services rely on a third-party Kerberos server to authenticate each other
- Offers one-way and two-way trust options, plus LDAP integration options
Recommendations:
- Kerberos is highly recommended, as it supports authentication mechanisms throughout the cluster
- Use Apache Knox for perimeter authentication to Hive, HDFS, HBase, etc.
- Manage user groups in the POSIX layer (at least initially); resolving user groups from LDAP introduces a corporate IT dependency
- Hadoop security implementation is not easy, so take a phased approach:
  1. Configure Kerberos for Hadoop
  2. Provision an initial set of users and user groups in the POSIX layer
  3. Integrate LDAP/single sign-on (Kerberos, Hue, Ambari)
  4. Prepare the final list of users and user groups, and provision them in the POSIX layer matching the LDAP principals
  5. Optionally configure a Knox gateway for perimeter authentication of external systems (with LDAP or SSO)

Authorization & Access
Background:
- HDFS permission bits, HDFS ACLs, and YARN ACLs
Recommendations:
- Control access to Hive/HBase with Apache Knox, including column- and row-level access restrictions for users
- Optionally configure Accumulo if cell-level restrictions are required for HBase/Hive

Data Protection
Background:
- Hadoop supports data encryption in motion: HTTPS (web clients, REST), RPC encryption (APIs, Java frameworks), and the data transfer protocol (NameNode-DataNode)
- No options are available for data encryption at rest at the moment
Recommendations:
- Configure data-in-motion wire encryption
- Agree on data encryption algorithms for data that needs to be exported outside the lake within the Zurich network
- Optionally implement data-at-rest encryption

Audit & Reporting
Background:
- HDFS auditing is centrally available via Ranger (XA Secure); perimeter access auditing is available at Knox; job auditing is available in the admin console and Job Tracker logs
Recommendations:
- Monitor your YARN capacity; it is key to enhancing multi-tenancy
- Set up a security compliance process
- Set up a user administration and auditing process
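For audit reporting, HDFS NameNode audit entries are typically key=value lines that are straightforward to parse into compliance reports. A sketch assuming the common layout (the sample line is invented, and field names can vary slightly across Hadoop versions):

```python
import re

# Hypothetical audit entry in the usual NameNode key=value layout.
line = (
    "2015-10-19 10:00:00,000 INFO FSNamesystem.audit: "
    "allowed=true ugi=alice ip=/10.1.2.3 cmd=open "
    "src=/data/sales.csv dst=null perm=null"
)

def parse_audit(entry: str) -> dict:
    """Extract the key=value fields from one audit log line."""
    return dict(re.findall(r"(\w+)=(\S+)", entry))

fields = parse_audit(line)
```

Aggregating these fields (who accessed what, when, and whether access was allowed) is the raw material for the audit reporting and security-forensics requirements above.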