Big data security on.nl: infrastructure and one application

Similar documents
Malicious domains: Automatic Detection with DNS traffic analysis

DNS Big Data

ENTRADA: Enabling DNS Big Data Applications

.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken

TLD Data Analysis. ICANN Tech Day, Dublin. October 19th 2015 Maarten Wullink, SIDN. Klik om de s+jl te bewerken

Cymon.io. Open Threat Intelligence. 29 October 2015 Copyright 2015 esentire, Inc. 1

Big Data in Action: Behind the Scenes at Symantec with the World s Largest Threat Intelligence Data

Domain Name Abuse Detection. Liming Wang

Decoding DNS data. Using DNS traffic analysis to identify cyber security threats, server misconfigurations and software bugs

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

WE KNOW IT BEFORE YOU DO: PREDICTING MALICIOUS DOMAINS Wei Xu, Kyle Sanders & Yanxin Zhang Palo Alto Networks, Inc., USA

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

Ali Ghodsi Head of PM and Engineering Databricks

How To Handle Big Data With A Data Scientist

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

We Know It Before You Do: Predicting Malicious Domains

SEIZE THE DATA SEIZE THE DATA. 2015

Driving in the Cloud: An Analysis of Drive-by Download Operations and Abuse Reporting

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

How Companies are! Using Spark

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Data Refinery with Big Data Aspects

IANA Functions to cctlds Sofia, Bulgaria September 2008

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Big Data on Cloud Computing- Security Issues

Manifest for Big Data Pig, Hive & Jaql

Building your Big Data Architecture on Amazon Web Services

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A Privacy framework for DNS big data. November 28, 2014 Jelte Jansen

Basheer Al-Duwairi Jordan University of Science & Technology

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Interactive data analytics drive insights

Internet Monitoring via DNS Traffic Analysis. Wenke Lee Georgia Institute of Technology

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

In-Database Analytics

Embedded inside the database. No need for Hadoop or customcode. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP

BIG DATA What it is and how to use?

Full-text Search in Intermediate Data Storage of FCART

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Big Data on Google Cloud

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Big Data Analytics - Accelerated. stream-horizon.com

Big Data and Market Surveillance. April 28, 2014

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

BOTNET Detection Approach by DNS Behavior and Clustering Analysis

The Flash-Transformed Financial Data Center. Jean S. Bozman Enterprise Solutions Manager, Enterprise Storage Solutions Corporation August 6, 2014

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Advanced Big Data Analytics with R and Hadoop

Whose IP Is It Anyways: Tales of IP Reputation Failures

Dashboard Engine for Hadoop

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Parquet. Columnar storage for the people

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution

Unified Big Data Processing with Apache Spark. Matei

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

FAQ (Frequently Asked Questions)

In-Memory Analytics for Big Data

Constructing a Data Lake: Hadoop and Oracle Database United!

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Know Your Foe. Threat Infrastructure Analysis Pitfalls

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Beyond Batch Processing: Towards Real-Time and Streaming Big Data

Bringing Big Data Modelling into the Hands of Domain Experts

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data on Microsoft Platform

Workshop on Hadoop with Big Data

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

Next-Generation Cloud Analytics with Amazon Redshift

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

RevoScaleR Speed and Scalability

CentralNic Privacy Policy Last Updated: July 31, 2012 Page 1 of 12. CentralNic. Version 1.0. July 31,

International Journal of Applied Science and Technology Vol. 2 No. 3; March Green WSUS

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Stanford Computer Security Lab. TrackBack Spam: Abuse and Prevention. Elie Bursztein, Peifung E. Lam, John C. Mitchell Stanford University

Big Data Storage Architecture Design in Cloud Computing

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Performance and Scalability Overview

Examining How The Great Firewall Discovers Hidden Circumvention Servers

Approaches for parallel data loading and data querying

Fast Analytics on Big Data with H20

Customized Report- Big Data

A USE CASE OF BIG DATA EXPLORATION & ANALYSIS WITH HADOOP: STATISTICS REPORT GENERATION

Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

Networking in the Hadoop Cluster

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Transcription:

Big data security on.nl: infrastructure and one application Giovane C. M. Moura, Maarten Wullink, Moritz Mueller, and Cristian Hesselman SIDN Labs first.lastname@sidn.nl IEPG Meeting IETF 94 Yokohama, Nov. 1st, 2015

Open.nl DNS datasets http://stats.sidnlabs.nl/ SIDN is the.nl registry; nonprofit 5.6M domains registered; 5th cctld in zone size 2.5M DNSSec signed domains; 1st worldwide aggregated.nl auth servers data (DNS/IP/DNSSEC...) 18 months + ; daily updated open for research collab: talk to me giovane.moura@sidn.nl

Background SIDN is the.nl registry; nonprofit SIDN Labs research arm This presentation: big data security Contains parts of material submitted to NOMS 2016 and PAM 2016 conferences Mini-bio: joined SIDN last May; in academia before

Introduction Newly registered malicious domains have an abnormal initial DNS lookup [1] Random Sample Jan Mar, 2015 Phishing Daily Queries 3.5 4.5 5.5 Daily Queries 0 50 150 0 5 10 15 20 25 30 Days After Registration 0 5 10 15 20 25 30 Days After Registration Figure:.nl DNS lookups - 20K Random vs Netcraft Phishing

Introduction Why is that? Assumption: spam-based business model Automated Maximize profit before being taken down Question: can we use this to improve security in the.nl zone? Or build an early warning system for newly registered domains? We have both registration and DNS traffic data Registry role Privacy framework and board that oversees it

Introduction What we need: 1. High-performance data analytics platform 2. Efficient algorithm that can be used in production This presentation covers both things

ENTRADA: our big data platform Hadoop cluster data streaming warehouse interactive response times 5K USD/EUR per node; low cost Store traffic data from.nl auth servers Enable production applications Based on open-source, can be deployed even in a cloud environment only one part (one converter) we develop in house studying open-source it

ENTRADA: our big data platform By definition, a data-streaming warehouse must deliver interactive response times pcap storage& analysis wouldn t fly not at low cost So, what are the alternatives? Our requirements: Usability = SQL Extensibility: no vendor lock-in Security: Dependability Low cost High performance

ENTRADA: our big data platform Engine Usab. Exten. Perf. Scal. Dep. HBase(HDFS) 0 0 1 1 0 Elasticsearch 0 0 1 1 1 MongoDB 0 0 1 1 1 Hadoop+MapReduce(HDFS) 1 0 0 1 1 PostgreSQL 1 0 0 1 0 Impala+Parquet(HDFS) 1 1 1 1 1 Table: Comparison of Data Query Engines (1 = matches our requirements, 0 does not match) Two Core parts: 1. Optimized Apache Parquet file format (based on Google s Dremel [2]) Column-based storage; reads only necessary columns convert pcap to parquet 2. MPP query engine (Impala [3]) multi parallel queries

ENTRADA: data flow Users (I) Recursive DNSes (II) Internet (III).nl Auth Servers (IV) pcap files STAGING pcap to parquet converter parquet files (V) Impala (VII) Daily Buffer (VI) Spark (VII) Data warehouse (VIII) Analyst and Applications HDFS Hadoop Cluster Figure: ENTRADA data sequence flow

ENTRADA: evaluation Query: select concat_ws('-',day,month,year), count(1) from dns.queries where ipv=4 and year = 'X' and month = 'Y' and day='z' group by concat_ws('-',day,month,year). 10 parallel queries (1 per day) 52 TB of pcap data = 2.2 TB of parquet; Time: 3.5 minutes on 4 data nodes Conclusion: fast, and cheap; and open-source

Part 2: Early Warning System (I) Pull new domains (II) Stats for domains (III) Classify into 1st-timer and re-regs. 1st-timers domains Suspicious Normal (IV) k-means Figure: ndews Architecture Bad domains are likely to be more popular k-means clustering algorithm: unsupervised, classifies according to features Req, IPs, CC, ASes Run it every day

Evaluation 1,5+ years of DNS data on ENTRADA 78B DNS request/responses All registration database Key Value Interval Jan 1st, 2015 to Aug 30th 2015 Average.nl zone size 5,500,000 new domains 586,201 New domains - first timers 476,040(81.2%) New domains - re-registered 110,161 (18.8%) Total DNS Requests 32,864,402,270 DNS request new domains (24h) 826,740 DNS request new domains - first-timers (24h) 420,362 Table: Evaluated datasets (from one.nl auth server)

Evaluation Cluster Size Req IPs CC ASes Normal 132,425 4.31 3.06 1.64 1.43 Suspicious 2,956 55.03 27.87 4.99 7.43 Table: Mean values for features and clusters - excluding domains with 1 request - 1st Timers

Validation Were those suspicious domains really malicious? Very hard to verify on historical data: if they had pages; they might be gone or diff by now Results on historical data: Content analysis: 148 shoes stores, 17 adult/malware 19 phishing domains (out of 49 reported by Netcraft on the same period) VirusTotal: 25 domains matched Results on current data: By far the shoes sites dominate it Adult and malware is also detected; we now download screenshots and content as we classify False positives: rapidly popular political websites and others

Discussion Why so many (5 10) new shoes stores per day? Probably concocted websites [4] Automatically created; spam based

Why shoes? Most counterfeit product = 40% of US Border seizures [5] Large demand Re-current registration suggest profitability; one site down does not affect operations Online fraud is the NL: 5.3 billion EUR in 2 years; many from site websites [6] Evade industry s tools/techniques: Solutions for phishing and malware exist Users left unprotected Shoes are a smart play: high demand, and low penalties Currently: studying how to share/handle this

Summary We showed ENTRADA, our data streaming warehouse Fast & cheap Anyone can deploy it; even on a cloud We showed one application of it: new malicious domains early warning system Under the radar abuse form (shoes) Can be detected by their lookup patterns Run it on a daily basis; have to reduce false positives Studying pilot studies to handle that information More big-data based security applications to come

Questions? Contact: http://sidnlabs.nl giovane.moura@sidn.nl Looking for collaboration to : build and validate systems to improve security; write measurement papers Thank you for your attention

Bibliography I Hao, Shuang and Feamster, Nick and Pandrangi, Ramakant, Monitoring the initial dns behavior of malicious domains, in Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, ser. IMC 11. New York, NY, USA: ACM, 2011, pp. 269 278. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive Analysis of Web-scale Datasets, Proc. VLDB Endow., vol. 3, no. 1-2, pp. 330 339, Sep. 2010. M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs et al., Impala: A modern, open-source SQL engine for Hadoop, in Proceedings of the Conference on Innovative Data Systems Research (CIDR 15), 2015.

Bibliography II A. Abbasi and H. Chen, A comparison of tools for detecting fake websites, Computer, no. 10, pp. 78 86, 2009. N. Schmidle, Inside the Knockoff-Tennis-Shoe Factory - The New York Times, http://www.nytimes.com/2010/08/22/magazine/22fake-t.html, 2010. FraudHelpdesk.nl, Ruim miljoen Nederlanders opgelicht (in Dutch), https://www.fraudehelpdesk.nl/nieuws/ ruim-miljoen-nederlanders-opgelicht-2/, Dec 2014.