Securing the Big Data Ecosystem SESSION ID: STU-T07A Davi Ottenheimer Senior Director of Trust, EMC @daviottenheimer
COWS NOT PETS
[ASCII art: cow]
Systematic Treatment of Illness: Easily Identified, Routine Treatment, Minimum Judgment
1. Identify Sick Cattle ASAP
2. Keep Adequate Records
3. Evaluate Daily Sick Cattle
4. Adapt Until Noted Improvement
2012 PRESENTATION
1854: London Cholera Death Scale
1854: London Cholera Death Polygons
2012: Data Breach Investigations
http://www.flyingpenguin.com/?p=18259
Source Observation
2010: Rinderpest
2013: AIDS
DATA https://secure.flickr.com/photos/boston_public_library/6192821769/
BIG DATA http://www.farm-equipment.com/wysiwyg/images/1120572208_img63164_opt.jpg
OBSTACLES TO BIG DATA
Slow Ingest and Process Time
Isolated Analysis
Untapped Sources
PATHS TO BIG DATA
Speedy Ingest and Process Time
Analysis of Data Lakes
Access to Sources
CIO SURVEY: TOP CONCERNS
54% What to Collect
85% How to Analyze
Sources: Barclays September 2013 CIO Survey, KPMG January 2014 CIO/CFO Survey
NEW ERA OF DATA INDUSTRIALIZATION
54% What to Collect: Data Streams to Data Lakes; Save Everything (Lake Conservation); Indirect Communications - Internet of Things (26 billion devices by 2020)
85% How to Analyze: No Two Anomalies Alike; Indicators of Leaks, Tampering or Loss
WHY TYPICAL SOLUTIONS DON'T FIT
CONTROLS AND DE-IDENTIFICATION
IDENTIFIABLE -> ANONYMOUS
LOCATION -> >STATE
NAME -> NO NAME
UNIQUE ID# -> TEMP ID
DATE -> YEAR
CONTROL and DE-IDENTIFICATION techniques: Role-Based Access Controls (AAA); Scrubbing / Substitution; Encryption / Secure-Erase; k-anonymity; In-Memory Processing Limits (Classification)
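The identifiable-to-anonymous substitutions on this slide (location to state, name dropped, unique ID to temp ID, date to year) can be sketched together with a k-anonymity check. This is a minimal illustration, not the presenter's method: the field names, the "City, ST" location format, the ISO date format and the sample records are all assumptions made for the example.

```python
from collections import Counter
from itertools import count

# Hypothetical sample records; every value here is invented for illustration.
SAMPLE = [
    {"name": "A", "unique_id": "111", "location": "Cambridge, MA", "date": "1945-07-31"},
    {"name": "B", "unique_id": "222", "location": "Boston, MA", "date": "1945-01-02"},
    {"name": "C", "unique_id": "333", "location": "Salem, OR", "date": "1980-05-05"},
    {"name": "D", "unique_id": "444", "location": "Portland, OR", "date": "1980-11-30"},
]

def deidentify(records):
    """Coarsen each record: location -> state, name removed,
    unique ID -> temporary ID, full date -> year."""
    temp_ids = {}
    next_id = count(1)
    rows = []
    for r in records:
        if r["unique_id"] not in temp_ids:
            temp_ids[r["unique_id"]] = f"T{next(next_id)}"
        rows.append({
            "state": r["location"].split(",")[-1].strip(),  # assumes "City, ST"
            "temp_id": temp_ids[r["unique_id"]],
            "year": r["date"][:4],                          # assumes YYYY-MM-DD
        })
    return rows

def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share the
    same values for the given quasi-identifier columns."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

if __name__ == "__main__":
    print(k_anonymity(SAMPLE, ("location", "date")))          # before coarsening
    print(k_anonymity(deidentify(SAMPLE), ("state", "year")))  # after coarsening
```

On the sample data the raw records are each unique on location and date, while the coarsened rows fall into groups of two on state and year.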
DE-IDENTIFICATION HURDLES: FEW CHARACTERISTICS NEEDED
Latanya Sweeney, 2000
87.1% of U.S. IDs Unique by Zip+Sex+DOB
53% of U.S. IDs Unique by City+Sex+DOB
State / Year Advised Minimums
Voter Reg. Compared To Group Insurance Commission (GIC) Data
2000: MA Governor Identity for $20
2014: Neighbor Identity for $22
[Figure: Cambridge Voters and GIC Customers, linked by Birth Date, Gender, Zipcode]
http://dataprivacylab.org/projects/identifiability/index.html
http://www.uclalawreview.org/?p=1353
http://www.zeit.de/2014/07/harald-martenstein-datenschutz
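The linkage described on this slide (a voter list compared against "de-identified" insurance data on birth date, gender and zip code) can be sketched as follows. This is an illustrative sketch only: the record layout, field names and the `diagnosis` column are hypothetical, and no real data is involved.

```python
from collections import Counter

def unique_fraction(rows, keys):
    """Fraction of rows whose combination of `keys` values appears only once."""
    if not rows:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)

def link(voters, medical, keys):
    """Re-identify medical rows that match exactly one voter on `keys`.
    Returns (voter name, diagnosis) pairs; both fields are hypothetical."""
    index = {}
    for v in voters:
        index.setdefault(tuple(v[k] for k in keys), []).append(v["name"])
    matches = []
    for m in medical:
        names = index.get(tuple(m[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the row
            matches.append((names[0], m["diagnosis"]))
    return matches
```

For example, with `keys = ("zip", "sex", "dob")`, a medical row carrying no name is still re-identified whenever exactly one voter shares its zip code, sex and date of birth.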
EXAMPLE: WASTEWATER ANALYSIS
Meta, Ripples, Tails, Exhausts, Shadows, etc.
Chicago: 1.5B gal/day
Environment, Disease, Drugs?
"Know estimated numbers of people served by each waste water treatment plant. Can back-calculate daily [drug] loads." - Dr Kasprzyk-Hordern
http://phys.org/news/2012-03-wastewater-clues-illicit-drug.html
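The back-calculation mentioned in the quote can be sketched as simple arithmetic: measured concentration times daily flow gives a daily load, which is then normalized by the population served. This is a simplified sketch under stated assumptions. The function names, the example concentration and the population figure are hypothetical, and the excretion and stability correction factors that real studies apply are deliberately left out.

```python
LITERS_PER_US_GALLON = 3.785411784  # exact definition of the US gallon

def daily_load_grams(concentration_ng_per_l, flow_gal_per_day):
    """Daily mass load in grams = concentration (ng/L) x flow (L/day)."""
    flow_l_per_day = flow_gal_per_day * LITERS_PER_US_GALLON
    return concentration_ng_per_l * flow_l_per_day * 1e-9  # ng -> g

def load_per_1000_people_mg(load_grams, population_served):
    """Normalize a daily load to milligrams per day per 1,000 people."""
    return load_grams * 1e3 / population_served * 1e3

if __name__ == "__main__":
    # 1.5B gal/day is the flow figure from the slide; the 100 ng/L
    # concentration and the population of 1,000,000 are made-up inputs.
    load = daily_load_grams(100, 1.5e9)
    print(round(load, 1), "g/day")
    print(round(load_per_1000_people_mg(load, 1_000_000), 1), "mg/day per 1,000 people")
```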
EXAMPLE: WASTEWATER ANALYSIS
Croatia, Italy, Finland, London, Oregon, Canada
http://gizmodo.com/meth-in-london-heroin-in-zagreb-the-answer-is-found-i-1508209127
ON THE OTHER HAND: CONTROLS
Authentication
Authorization
Encryption
Caveats: All or Nothing (Security Required to Communicate); Rolling Upgrades Impossible
DATA CONTROL DESIGN
Brakes, Suspension, Horn, Mirrors, Seatbelts
BIG DATA CONTROL DESIGN
TRUST REDEFINED
Brakes, Suspension, Horn, Mirrors, Seatbelts
Threat Avoidance
Checklists (Rapid Repairs)
10X More Data, Accessible 24x7x365
SIMPLE CHECKLISTS INTELLIGENT ANALYSIS
PCI DSS Requirement 2.1: Always change vendor-supplied defaults before installing
http://www.mdjonline.com/view/full_story/9738998/article-father-trains-son--to-fly-helicopters-with-night-vision
http://www.dvidshub.net/image/962244/oklahoma-national-guard-pilots-train-war-time-standard
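A checklist item such as PCI DSS Requirement 2.1 lends itself to a simple automated check. The sketch below compares a device's settings against a table of vendor defaults and reports any that were never changed. The setting names and default values are hypothetical placeholders, not taken from any vendor or from the PCI DSS text.

```python
# Hypothetical vendor defaults; a real checklist would come from vendor documentation.
VENDOR_DEFAULTS = {
    "admin_password": "admin",
    "snmp_community": "public",
}

def unchanged_defaults(config):
    """Return the sorted names of settings still at their vendor default."""
    return sorted(
        name for name, default in VENDOR_DEFAULTS.items()
        if config.get(name) == default
    )

if __name__ == "__main__":
    device = {"admin_password": "admin", "snmp_community": "s3cr3t-string"}
    print(unchanged_defaults(device))
```

An empty result means no listed default is still in place; it does not show the replacement values are strong, which is where the slide's "intelligent analysis" would have to take over.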
THREAT ANALYSIS MATURITY SCALE BINARY RANKED MEANING ZERO POINT EXACT ERROR MARGIN CAVEAT: NO FISH IN TOO CLEAR WATER INTELLIGENCE
INTELLIGENCE REDEFINES CONTROLS Long-Term <NOUN> Users Apps Content API Alert & Report Investigate & Analyze Visualize Record Sort Collect <ADJ> Time Alias Property Respond GRC Devices Networks Real-Time
TRUST REDEFINES BIG DATA
Annual Savings: 33 Years of Time; US$8,000,000; 27 Fuel Tanker Trucks
http://rhythmtraffic.com/insyncs-performance/
ARCHITECTURE AND USE CASES
NEW WAVES OF BIG DATA TECHNOLOGY Hive Pig Mahout Behavior MapReduce R Sentiment Business Hawq Pivotal Predictive Hadoop Sqoop SAS SPSS Network Simulation Objectives Data Analytics Reporting
TYPICAL ARCHITECTURE AND CONTROL
Data: Shared
Nodes: Distributed
Clients: Unauthenticated
Access Controls: Open
Web Services: Open
Networks: Open
PROCESSING stack: SQOOP (DB SYNC); INTERFACES (REPORTING); PIG (PROCEDURAL); MAP-REDUCE (PROGRAMMING MODEL); HIVE (DECLARATIVE); HBASE (RANDOM R/W); HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
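The "unauthenticated" and "open" defaults listed on this slide can be checked against a cluster's configuration. The sketch below reads the text of a Hadoop `core-site.xml` and flags the two standard security properties, `hadoop.security.authentication` and `hadoop.security.authorization`, when they are absent or left at their permissive values. The function names and finding messages are illustrative, and a real audit would cover far more than these two settings.

```python
import xml.etree.ElementTree as ET

def read_properties(core_site_xml):
    """Parse the <property><name>/<value> pairs of a Hadoop XML config string."""
    root = ET.fromstring(core_site_xml)
    return {
        (p.findtext("name") or "").strip(): (p.findtext("value") or "").strip()
        for p in root.findall("property")
    }

def open_controls(props):
    """List findings for security settings that are missing or permissive."""
    findings = []
    # Hadoop treats a missing authentication setting as "simple" (no Kerberos).
    if props.get("hadoop.security.authentication", "simple") != "kerberos":
        findings.append("authentication is not kerberos")
    # Service-level authorization is off unless explicitly set to "true".
    if props.get("hadoop.security.authorization", "false") != "true":
        findings.append("service-level authorization is disabled")
    return findings
```

Called on an empty `<configuration/>` document, `open_controls` reports both findings, which matches the slide's picture of a stock cluster.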
PROCESSING ROLES
Clients
Masters: Job Tracker (MapReduce); Name Node and 2nd Name Node (checkpoint) (HDFS)
Slaves
PROCESSING AUTOMATION CLUSTER 1 Admin : 30,000+ Nodes B3 B2 switch switch job tracker name node 2nd name node client A A1 B1 switch A2 A3 switch client B switch Rack 1 Rack 2 Rack 3 Rack 4 Rack n (0.5 PB)
PROCESSING PATHS
[Diagram: Data (JSON) -> Splits -> MAP -> REDUCE -> Output Files; Job Tracker and Tasks; NameNode; Data Nodes holding HDFS Blocks; RPC Read]
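The path on this slide (JSON data divided into splits, a map step, a reduce step, output) can be imitated in a few lines of single-process Python. This is a toy sketch of the MapReduce programming model only, not of Hadoop itself: there is no Job Tracker, NameNode or HDFS here, and the `source` field being counted is a hypothetical example.

```python
import json
from collections import defaultdict

def map_phase(split):
    """Map: parse each JSON line in a split and emit a (key, 1) pair."""
    for line in split:
        record = json.loads(line)
        yield record["source"], 1  # "source" is a made-up field name

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)

def run_job(splits):
    """Run map over every split, shuffle, then reduce each key."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    splits = [
        ['{"source": "web"}', '{"source": "app"}'],
        ['{"source": "web"}'],
    ]
    print(run_job(splits))
```

In a real cluster each split would be mapped on a different node close to its HDFS block; here the splits are just lists processed in turn.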
TRUST DELEGATION VS. BEHAVIOR
Runaway Job! Kill -9
Job Tracker, Name Node, Task Tracker
NEW AND DIFFERENT WAYS TO MANAGE RISK
DATA IS THE NEW CENTER OF GRAVITY SOCIAL DATA CLOUD MOBILE
TRUST REDEFINED: SPACE-TIME BENDS BECAUSE GRAVITY
TRUSTED ARCHITECTURE Enhanced DB Services Resource Management & Workflow HBase ANSI SQL + Analytics Xtension Catalog Query Framework Hadoop Services Virtualization (HVE) Optimizer Dynamic Pipeline Pig, Hive Mahout Map Reduce Command Center Configure Deploy Yarn Zookeeper HDFS DataLoader Monitor Manage Sqoop Flume Apache Non-Apache
INTELLIGENCE-DRIVEN SECURITY
EASY, ROUTINE & MINIMUM JUDGMENT
http://images.fineartamerica.com/images-medium-large/the-cow-jumped-over-the-moon-wingsdomain-art-and-photography.jpg
Securing the Big Data Ecosystem THANK YOU! SESSION ID: STU-T07A Davi Ottenheimer Senior Director of Trust, EMC @daviottenheimer 2/25/14 (Tuesday) 1:20 PM - West 3012