Securing the Big Data Ecosystem

SESSION ID: STU-T07A
Davi Ottenheimer, Senior Director of Trust, EMC
@daviottenheimer

COWS NOT PETS
Systematic Treatment of Illness
Easily Identified
Routine Treatment
Minimum Judgment
1. Identify Sick Cattle ASAP
2. Keep Adequate Records
3. Evaluate Daily Sick Cattle
4. Adapt Until Noted Improvement

2012 PRESENTATION
1854: London Cholera Death Scale
1854: London Cholera Death Polygons
2012: Data Breach Investigations
http://www.flyingpenguin.com/?p=18259
Source Observation
2010: Rinderpest
2013: AIDS

DATA
https://secure.flickr.com/photos/boston_public_library/6192821769/

BIG DATA
http://www.farm-equipment.com/wysiwyg/images/1120572208_img63164_opt.jpg

OBSTACLES TO BIG DATA
Slow Ingest and Process Time
Isolated Analysis
Untapped Sources

PATHS TO BIG DATA
Speedy Ingest and Process Time
Analysis of Data Lakes
Access to Sources

CIO SURVEY: TOP CONCERNS
54% What to Collect
85% How to Analyze
Sources: Barclays September 2013 CIO Survey, KPMG January 2014 CIO/CFO Survey

NEW ERA OF DATA INDUSTRIALIZATION
54% What to Collect
  Data Streams to Data Lakes
  Save Everything (Lake Conservation)
  Indirect Communications - Internet of Things (26 billion devices by 2020)
85% How to Analyze
  No Two Anomalies Alike
  Indicators of Leaks, Tampering or Loss

WHY TYPICAL SOLUTIONS DON'T FIT

CONTROLS AND DE-IDENTIFICATION

IDENTIFIABLE     ANONYMOUS
LOCATION      >  STATE
NAME          >  NO NAME
UNIQUE ID#    >  TEMP ID
DATE          >  YEAR

CONTROL                             DE-IDENTIFICATION
Role-Based Access Controls (AAA)    Scrubbing / Substitution
Encryption / Secure-Erase           k-anonymity
In-Memory Processing                Limits (Classification)
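The four identifiable-to-anonymous mappings on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` fields, the "City, ST" location format, and the salted-hash scheme for the temporary ID are all hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from datetime import date
import hashlib


@dataclass
class Record:
    location: str    # assumed format: "City, ST"
    name: str
    unique_id: str
    when: date


def de_identify(record: Record, salt: str) -> dict:
    """Generalize one record along the slide's four mappings."""
    # LOCATION > STATE: keep only the text after the last comma.
    state = record.location.rsplit(",", 1)[-1].strip()
    # UNIQUE ID# > TEMP ID: salted, truncated hash (hypothetical scheme).
    digest = hashlib.sha256((salt + record.unique_id).encode("utf-8"))
    temp_id = digest.hexdigest()[:12]
    # NAME > NO NAME: the name is simply not copied to the output.
    # DATE > YEAR: keep only the year.
    return {"state": state, "temp_id": temp_id, "year": record.when.year}


example = Record("Cambridge, MA", "Jane Doe", "ID-0001", date(2014, 2, 25))
anonymous = de_identify(example, salt="demo-salt")
```

Generalizing fields this way reduces detail but does not by itself guarantee anonymity, since the remaining fields can still be combined.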

DE-IDENTIFICATION HURDLES: FEW CHARACTERISTICS NEEDED
Latanya Sweeney, 2000
87.1% of U.S. IDs Unique by Zip+Sex+DOB
53% of U.S. IDs Unique by City+Sex+DOB
State / Year Advised Minimums
Voter Reg. Compared To Group Insurance Commission (GIC) Data
2014: Neighbor Identity for $22
2000: MA Governor Identity for $20
Cambridge Voters / GIC Customers: Birth Date, Gender, Zipcode
http://dataprivacylab.org/projects/identifiability/index.html
http://www.uclalawreview.org/?p=1353
http://www.zeit.de/2014/07/harald-martenstein-datenschutz
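As a rough illustration of why a few characteristics are enough, the sketch below measures how many records in a dataset are unique on a chosen set of fields, and the dataset's k (the size of its smallest matching group). The rows and field names are invented toy data, not Sweeney's data or method.

```python
from collections import Counter


def _combo(row, keys):
    """Tuple of the row's values for the chosen quasi-identifier fields."""
    return tuple(row[k] for k in keys)


def unique_fraction(rows, keys):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    counts = Counter(_combo(row, keys) for row in rows)
    unique = sum(1 for row in rows if counts[_combo(row, keys)] == 1)
    return unique / len(rows)


def k_anonymity(rows, keys):
    """Size of the smallest group of rows sharing the same `keys` values."""
    counts = Counter(_combo(row, keys) for row in rows)
    return min(counts.values())


# Hypothetical toy records for illustration only.
rows = [
    {"zip": "02138", "city": "Cambridge", "sex": "F", "dob": "1960-01-01"},
    {"zip": "02138", "city": "Cambridge", "sex": "F", "dob": "1960-01-01"},
    {"zip": "02139", "city": "Cambridge", "sex": "M", "dob": "1950-05-05"},
    {"zip": "02140", "city": "Cambridge", "sex": "M", "dob": "1950-05-05"},
]

by_zip = unique_fraction(rows, ["zip", "sex", "dob"])
by_city = unique_fraction(rows, ["city", "sex", "dob"])
```

In this toy set, coarsening zip code to city removes every unique record, which is the same direction of effect as the drop from 87.1% to 53% on the slide.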

EXAMPLE: WASTEWATER ANALYSIS
Meta, Ripples, Tails, Exhausts, Shadows, etc.
1.5B gal/day Chicago
Environment
Disease
Drugs?
"Know estimated numbers of people served by each waste water treatment plant"
"Can back-calculate daily [drug] loads" - Dr Kasprzyk-Hordern
http://phys.org/news/2012-03-wastewater-clues-illicit-drug.html
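The back-calculation quoted on this slide is, at its simplest, concentration times flow, then divided by the population served. The sketch below uses the slide's 1.5B gal/day flow figure; the concentration and population values are hypothetical placeholders, and the correction factors used in real studies (excretion rates, compound stability) are deliberately left out.

```python
import math

LITERS_PER_US_GALLON = 3.785411784


def daily_load_grams(concentration_ng_per_l: float, flow_gal_per_day: float) -> float:
    """Mass of a compound passing through the plant per day, in grams."""
    flow_l_per_day = flow_gal_per_day * LITERS_PER_US_GALLON
    # ng/L * L/day = ng/day; 1e-9 converts nanograms to grams.
    return concentration_ng_per_l * flow_l_per_day * 1e-9


def load_mg_per_1000_people(load_grams: float, population: int) -> float:
    """Normalize a daily load to milligrams per day per 1,000 people served."""
    return load_grams * 1000.0 / population * 1000.0


flow = 1.5e9            # gal/day, from the slide
concentration = 100.0   # ng/L, hypothetical measurement
population = 2_000_000  # people served, hypothetical

load = daily_load_grams(concentration, flow)
per_capita = load_mg_per_1000_people(load, population)
```

With these placeholder inputs the load works out to roughly 568 g/day, or about 284 mg/day per 1,000 people.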

EXAMPLE: WASTEWATER ANALYSIS
Croatia, Italy, Finland, London, Oregon, Canada
http://gizmodo.com/meth-in-london-heroin-in-zagreb-the-answer-is-found-i-1508209127

ON THE OTHER HAND: CONTROLS
Authentication
Authorization
Encryption
Caveats
  All or Nothing (Security Required to Communicate)
  Rolling Upgrades Impossible

DATA CONTROL DESIGN
Brakes, Suspension, Horn, Mirrors, Seatbelts

BIG DATA CONTROL DESIGN: TRUST REDEFINED
Brakes, Suspension, Horn, Mirrors, Seatbelts
Threat Avoidance
Checklists (Rapid Repairs)
10X More Data, Accessible 24x7x365

SIMPLE CHECKLISTS / INTELLIGENT ANALYSIS
PCI DSS Requirement 2.1: Always change vendor-supplied defaults before installing
http://www.mdjonline.com/view/full_story/9738998/article-father-trains-son--to-fly-helicopters-with-night-vision
http://www.dvidshub.net/image/962244/oklahoma-national-guard-pilots-train-war-time-standard
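A checklist item such as "change vendor-supplied defaults" is easy to automate, which is what makes it a simple check rather than intelligent analysis. The sketch below flags services still configured with a known default login. The credential list, the service inventory, and its field names are all hypothetical, not taken from PCI DSS or from the presentation.

```python
# Hypothetical list of vendor-supplied default logins.
VENDOR_DEFAULTS = {
    ("admin", "admin"),
    ("admin", "password"),
    ("root", "root"),
}


def find_default_credentials(services):
    """Return the names of services still using a known default login."""
    return [
        svc["name"]
        for svc in services
        if (svc["user"], svc["password"]) in VENDOR_DEFAULTS
    ]


# Hypothetical inventory for illustration.
inventory = [
    {"name": "cluster-manager", "user": "admin", "password": "admin"},
    {"name": "reporting-ui", "user": "analyst", "password": "s3cure-and-long"},
    {"name": "db-sync", "user": "root", "password": "root"},
]

flagged = find_default_credentials(inventory)
```

A pass/fail check like this sits at the binary end of an analysis scale: it says whether a default is present, not whether anyone has used it.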

THREAT ANALYSIS MATURITY SCALE
BINARY RANKED MEANING ZERO POINT EXACT ERROR MARGIN
CAVEAT: NO FISH IN TOO CLEAR WATER
INTELLIGENCE

INTELLIGENCE REDEFINES CONTROLS
Long-Term <NOUN> Users Apps Content API Alert & Report Investigate & Analyze Visualize Record Sort Collect <ADJ> Time Alias Property Respond GRC Devices Networks Real-Time

TRUST REDEFINES BIG DATA
Annual Savings
33 Years of Time
US$8,000,000
27 Fuel Tanker Trucks
http://rhythmtraffic.com/insyncs-performance/

ARCHITECTURE AND USE CASES

NEW WAVES OF BIG DATA TECHNOLOGY
Hive Pig Mahout Behavior MapReduce R Sentiment Business Hawq Pivotal Predictive Hadoop Sqoop SAS SPSS Network Simulation Objectives Data Analytics Reporting

TYPICAL ARCHITECTURE AND CONTROL

Data               Shared
Nodes              Distributed
Clients            Unauthenticated
Access Controls    Open
Web Services       Open
Networks           Open

PROCESSING
SQOOP (DB SYNC)
INTERFACES (REPORTING)
PIG (PROCEDURAL)
MAP-REDUCE (PROGRAMMING MODEL)
HIVE (DECLARATIVE)
HBASE (RANDOM R/W)
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

PROCESSING ROLES
Clients
Masters: Job Tracker (MapReduce), Name Node (HDFS), 2nd Name Node (checkpoint)
Slaves

PROCESSING AUTOMATION
CLUSTER 1 Admin : 30,000+ Nodes
Diagram labels: job tracker, name node, 2nd name node, client A, client B, A1, A2, A3, B1, B2, B3, switches, Rack 1, Rack 2, Rack 3, Rack 4, Rack n (0.5 PB)

PROCESSING PATHS
Diagram labels: JSON, Splits, Job Tracker, Task, NameNode, Data Node, HDFS Block, RPC Read, MAP, REDUCE, Data, Output Files
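The path on this slide (input splits, map tasks, reduce, output) can be imitated in a single Python process to show the data flow. This is a teaching sketch only: it uses no Hadoop APIs, the function names are invented, and the Job Tracker scheduling, NameNode lookups, and HDFS block reads shown on the slide are not modeled.

```python
from collections import defaultdict
from itertools import chain


def split_input(lines, n_splits):
    """Divide the input into n_splits roughly equal splits."""
    return [lines[i::n_splits] for i in range(n_splits)]


def map_task(split):
    """MAP: emit a (word, 1) pair for every word in the split."""
    for line in split:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Group mapped values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """REDUCE: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines, n_splits=3):
    """Run split -> map -> shuffle -> reduce and return the word counts."""
    splits = split_input(lines, n_splits)
    mapped = chain.from_iterable(map_task(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_task(k, v) for k, v in grouped.items())


counts = run_job(["big data", "Big cows", "data lakes"])
```

In a real cluster each split is processed by a task on a different node, which is why every hop in the diagram is also a point where trust has to be established.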

TRUST DELEGATION VS. BEHAVIOR
Runaway Job! Kill -9
Job Tracker, Name Node, Task Tracker

NEW AND DIFFERENT WAYS TO MANAGE RISK

DATA IS THE NEW CENTER OF GRAVITY
SOCIAL DATA CLOUD MOBILE

TRUST REDEFINED: SPACE-TIME BENDS BECAUSE GRAVITY

TRUSTED ARCHITECTURE
Enhanced DB Services Resource Management & Workflow HBase ANSI SQL + Analytics Xtension Catalog Query Framework Hadoop Services Virtualization (HVE) Optimizer Dynamic Pipeline Pig, Hive Mahout Map Reduce Command Center Configure Deploy Yarn Zookeeper HDFS DataLoader Monitor Manage Sqoop Flume Apache Non-Apache

INTELLIGENCE-DRIVEN SECURITY: EASY, ROUTINE & MINIMUM JUDGMENT
http://images.fineartamerica.com/images-medium-large/the-cow-jumped-over-the-moon-wingsdomain-art-and-photography.jpg

THANK YOU!
SESSION ID: STU-T07A
Davi Ottenheimer, Senior Director of Trust, EMC
@daviottenheimer
2/25/14 (Tuesday) 1:20 PM - West 3012