Malicious domains: Automatic Detection with DNS traffic analysis




Malicious domains: Automatic Detection with DNS traffic analysis
Giovane C. M. Moura, Maarten Wullink, Moritz Müller, and Cristian Hesselman
SIDN Labs, {first.lastname}@sidn.nl
Network Machine Learning Research Group (NMLRG), IETF 95
April 7, 2016, Buenos Aires, Argentina

Introduction
DNS provides a simple label for hosts, services, and applications on the Internet.
It is often misused in malicious activities:
  - phishing campaigns
  - malware
  - spam
For phishing, two kinds of domains are used:
  1. Compromised domains (the majority): easier
  2. Malicious domains (new domains): more effective?

Introduction
Newly registered malicious domains have an abnormal initial DNS lookup pattern [1].
We see the same on the .nl TLD.
[Figure: .nl DNS lookups, 20K random sample vs. Netcraft phishing domains, Jan-Mar 2015. Panels: Random Sample, Phishing. x-axis: Days After Registration; y-axis: Daily Queries.]

Popular new domains
Why are phishing domains more popular?
Assumption: a spam-based business model
  - Automated
  - Maximizes profit before being taken down
Question: can we detect these domains from DNS traffic as early as possible?

Early Detection of Malicious Domains
What we need:
  1. Centralized data (TLD point of view)
     - As a TLD registry, we observe a fraction of all .nl TLD traffic (due to caching)
     - Plus, we have registration information
  2. High-performance data analytics platform (ENTRADA [2])
     - Our open-source solution: http://entrada.sidnlabs.nl
     - Allows quick hypothesis testing: 53 TB of pcap-equivalent data analyzed in under 3.5 min (4-data-node cluster)
     - In short: pcap analysis is either too slow or too expensive
  3. An efficient algorithm that can be used in production

DNS and TLD traffic: centralized data
[Figure: Resolving a Name. Messages shown, in order:]
  (1) example.nl
  (2) .nl? (to the root)
  (3) .nl is ...
  (4) example.nl? (to .nl)
  (5) example.nl is ...
  (6) example.nl is ...
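The resolution steps in this figure also explain why a TLD registry sees only a fraction of all lookups: while a recursive resolver holds a cached answer, steps (4) and (5) are skipped. A minimal Python sketch of that effect follows. The class names, the single per-name cache, and the 3600-second TTL are illustrative assumptions, not details from the talk.

```python
class TldServer:
    """Stands in for a .nl authoritative server; counts the queries it sees."""

    def __init__(self):
        self.queries_seen = 0

    def query(self, name):
        self.queries_seen += 1
        return f"answer-for-{name}"


class CachingResolver:
    """Hypothetical recursive resolver with a simple per-name TTL cache."""

    def __init__(self, tld, ttl_seconds=3600):
        self.tld = tld
        self.ttl = ttl_seconds
        self.cache = {}  # name -> (answer, expiry time in seconds)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and cached[1] > now:
            return cached[0]           # served from cache; the TLD sees nothing
        answer = self.tld.query(name)  # steps (4) and (5) in the figure
        self.cache[name] = (answer, now + self.ttl)
        return answer


tld = TldServer()
resolver = CachingResolver(tld, ttl_seconds=3600)

# 1,000 user lookups for the same name, one every 10 seconds.
for i in range(1000):
    resolver.resolve("example.nl", now=i * 10)

print(tld.queries_seen)  # 3 of the 1,000 lookups reach the TLD server
```

In this toy setup the TLD server is queried only when the cached entry has expired, so its view of a domain's popularity is driven by the number and spread of resolvers rather than by raw user lookups.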

OK, we've got the data... now analyze it
  - 85 GB of pcap per day, per authoritative name server
  - You can map/reduce it, but it's going to be costly or slow
  - CSV and RDBMS have their own limitations
  - Still, it would be very hard to deliver interactive response times (under a few minutes)

OK, so what can we do?
Build your own data streaming warehouse (DSW)
  - ENTRADA, ours, is a DSW
  - Open source: http://entrada.sidnlabs.nl
  - Analyzes 53 TB of pcap data in less than 3.5 min on a small cluster of 4 data nodes
  - Used in operation for 2+ years; 100+ billion DNS records
  - Our use case: DNS analysis
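To put the headline figure in perspective, the short calculation below converts "53 TB in under 3.5 min on 4 data nodes" into pcap-equivalent rates. It assumes decimal units (1 TB = 10^12 bytes) and reuses the 85 GB per server per day figure quoted earlier. These are not raw disk throughputs, because the engine reads compressed Parquet rather than the pcap files themselves.

```python
TB = 10**12  # decimal units are an assumption; the slides do not specify
GB = 10**9

pcap_bytes = 53 * TB   # pcap-equivalent volume quoted in the talk
elapsed_s = 3.5 * 60   # 3.5 minutes, in seconds
data_nodes = 4

cluster_rate = pcap_bytes / elapsed_s / GB  # GB/s, pcap-equivalent
per_node_rate = cluster_rate / data_nodes
server_days = pcap_bytes / (85 * GB)        # at 85 GB per server per day

print(f"{cluster_rate:.0f} GB/s for the cluster, {per_node_rate:.0f} GB/s per node")
print(f"about {server_days:.0f} server-days of pcap")
```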

How? Why?
Three main reasons:
  1. Efficient file format (Apache Parquet)
  2. Efficient query engine (Cloudera Impala, SQL)
  3. Hadoop cluster under the hood

ENTRADA Data Flow
[Figure: ENTRADA DNS data flow [2]. Elements shown: users, recursive DNS resolvers, the Internet, .nl authoritative servers, pcap files, staging, pcap-to-Parquet converter, Parquet files, daily buffer, data warehouse, Impala, Spark, analysts and applications. Storage is on HDFS in a Hadoop cluster.]

1st: File format - Apache Parquet
  - Google Dremel: optimized format for aggregation-type queries
  - Parquet: based on Dremel (Apache)
  - It combines:
      - Columnar storage: fields are stored separately
      - Partition pruning
      - Compression
  - 85 GB of DNS pcap becomes 6 GB of Parquet (some filtering too)
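The benefit of storing fields separately can be illustrated without Parquet itself. The sketch below is plain Python, not the Parquet format, and the field names (qname, src_ip, country) are illustrative assumptions. It runs the same aggregation over a row-oriented and a column-oriented layout and counts how many stored values each one has to touch.

```python
# Three DNS query records; field names and values are made up.
rows = [
    {"qname": "example.nl", "src_ip": "192.0.2.1", "country": "NL"},
    {"qname": "example.nl", "src_ip": "192.0.2.7", "country": "DE"},
    {"qname": "other.nl", "src_ip": "198.51.100.3", "country": "NL"},
]

# Row-oriented layout: every record is stored (and read) as a whole.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: each field is stored separately.
column_store = {field: [r[field] for r in rows] for field in rows[0]}


def count_by_country_rows(store):
    counts, values_read = {}, 0
    for record in store:
        values_read += len(record)  # the whole record is touched
        country = record[2]
        counts[country] = counts.get(country, 0) + 1
    return counts, values_read


def count_by_country_columns(store):
    counts = {}
    column = store["country"]       # only this one column is touched
    for country in column:
        counts[country] = counts.get(country, 0) + 1
    return counts, len(column)


print(count_by_country_rows(row_store))        # ({'NL': 2, 'DE': 1}, 9)
print(count_by_country_columns(column_store))  # ({'NL': 2, 'DE': 1}, 3)
```

Both layouts give the same answer, but the columnar one reads a third of the values here. A column of similar values also tends to compress well, which fits the size reduction reported on the slide.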

2nd: Query engine - Cloudera Impala
  - SQL support: no more awk
  - Runs daemons on each node; queries run in parallel
  - Compatible with Parquet files
Note: there were other options; please refer to the paper [2]
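As an illustration of what "SQL support: no more awk" buys, the sketch below runs a typical aggregation query. It uses Python's built-in sqlite3 module purely as a stand-in for Impala so that it runs anywhere. The table name dns_queries and its columns are hypothetical and are not ENTRADA's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dns_queries (qname TEXT, src_ip TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO dns_queries VALUES (?, ?, ?)",
    [
        ("example.nl", "192.0.2.1", "NL"),
        ("example.nl", "192.0.2.1", "NL"),
        ("example.nl", "192.0.2.7", "DE"),
        ("other.nl", "198.51.100.3", "NL"),
    ],
)

# Requests and distinct resolvers per domain, most-queried first.
top_domains = conn.execute(
    """
    SELECT qname,
           COUNT(*)               AS requests,
           COUNT(DISTINCT src_ip) AS resolvers
    FROM dns_queries
    GROUP BY qname
    ORDER BY requests DESC
    """
).fetchall()

print(top_domains)  # [('example.nl', 3, 2), ('other.nl', 1, 1)]
```

The same question asked of raw pcap files would need a parsing step plus custom counting logic; here it is a single declarative statement.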

3rd: Hadoop Cluster
  - Scalability
  - HDFS
  - Redundancy

OK, we've got the data and the platform. What's next?
[Figure: ndews architecture [3]]
  (I) Pull new domains
  (II) Classify into first-timers and re-registrations
  (III) Feature extraction
  (IV) k-means, with outputs Suspicious and Normal
  (V) Registrar notification
Work to be presented at AnNet 2016 / IEEE NOMS 2016 [3]
  - Bad domains are likely to be more popular
  - k-means clustering algorithm: unsupervised; groups domains according to their features
  - Run daily, for all domains newly added to the .nl zone

Feature selection
Empirically chosen:
  - Req: how popular the domain is
  - IPs: diversity of resolvers
  - CC: diversity of countries
  - ASes: diversity of ASes
Domains involved in phishing tend to score high on all of them. Why? Spam knows no borders.
We choose two clusters: normal and suspicious
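The four features can be sketched as simple counts over a domain's query records. This assumes Req is the number of requests and that the three diversity features are counts of distinct resolver IPs, countries, and AS numbers. The slides do not spell out the exact definitions, and the record layout and all values below are made up.

```python
from collections import defaultdict

# (domain, resolver IP, country code, AS number); hypothetical records.
queries = [
    ("domain-a.nl", "192.0.2.1", "NL", 64500),
    ("domain-a.nl", "198.51.100.9", "US", 64501),
    ("domain-a.nl", "203.0.113.5", "BR", 64502),
    ("domain-a.nl", "192.0.2.1", "NL", 64500),
    ("domain-b.nl", "192.0.2.1", "NL", 64500),
]


def extract_features(records):
    """Return {domain: (Req, IPs, CC, ASes)} from raw query records."""
    req = defaultdict(int)
    ips, ccs, ases = defaultdict(set), defaultdict(set), defaultdict(set)
    for domain, ip, cc, asn in records:
        req[domain] += 1       # Req: total requests
        ips[domain].add(ip)    # IPs: distinct resolvers
        ccs[domain].add(cc)    # CC: distinct countries
        ases[domain].add(asn)  # ASes: distinct autonomous systems
    return {d: (req[d], len(ips[d]), len(ccs[d]), len(ases[d])) for d in req}


features = extract_features(queries)
print(features)
# {'domain-a.nl': (4, 3, 3, 3), 'domain-b.nl': (1, 1, 1, 1)}
```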

Evaluation
  - 1.5+ years of DNS data on ENTRADA
  - 78B DNS requests/responses
  - The full registration database

Key                                              Value
Interval                                         Jan 1st, 2015 to Aug 30th, 2015
Average .nl zone size                            5,500,000
New domains                                      586,201
New domains - first-timers                       476,040 (81.2%)
New domains - re-registered                      110,161 (18.8%)
Total DNS requests                               32,864,402,270
DNS requests, new domains (24h)                  826,740
DNS requests, new domains - first-timers (24h)   420,362
Table: Evaluated datasets (from one .nl authoritative server)

Evaluation
Cluster      Size      Req     IPs     CC     ASes
Normal       132,425   4.31    3.06    1.64   1.43
Suspicious   2,956     55.03   27.87   4.99   7.43
Table: Mean feature values per cluster, first-timers only, excluding domains with 1 request
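A clustering step that produces per-cluster means like those in this table can be sketched with plain Lloyd's k-means and k = 2. This is not the authors' implementation. The sample points, the deterministic choice of initial centroids, and the absence of feature scaling are all assumptions made for the example.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
                continue
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical (Req, IPs, CC, ASes) vectors for five new domains.
points = [(3, 2, 1, 1), (5, 3, 2, 1), (4, 3, 1, 2), (60, 30, 5, 8), (50, 25, 5, 7)]

# Assumed initialisation: least and most popular domain as starting centroids,
# so cluster 0 plays the role of "normal" and cluster 1 of "suspicious".
centroids, labels = kmeans(points, [min(points), max(points)])

print(labels)        # [0, 0, 0, 1, 1]
print(centroids[1])  # (55.0, 27.5, 5.0, 7.5)
```

Because the features are left unscaled, the request count dominates the distance; a production system might normalise each feature first, which the slides do not discuss.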

Validation: historical data
Were those suspicious domains really malicious?
  - Very hard to verify on historical data: if they had pages, these may be gone or different by now
Results on historical data:
  - Content analysis: 148 shoe stores, 17 adult/malware
  - 19 phishing domains (out of 49 reported by Netcraft in the same period)
  - VirusTotal: 25 domains matched

Discussion
Why so many (5-10) new shoe stores per day?
  - Probably concocted websites [4]
  - Automatically created; spam-based

Why shoes?
  - Most counterfeited product: 40% of US border seizures [5]
  - Recurrent registration suggests profitability; one site going down does not affect operations
  - Online fraud in the NL: 5.3 billion EUR in 2 years, much of it via websites [6]
  - They evade the industry's tools and techniques:
      - Solutions exist for phishing and malware
      - Users are left unprotected
  - Shoes are a smart play: high demand and low penalties

Validation on current data
  - Shoe sites dominate, depending on the day
  - Adult and malware sites are also detected; we now download screenshots and content as we classify
  - False positives: rapidly popular political websites and others; we are now working on reducing these
  - Working on making it near real-time (currently a 24h delay)

Summary
  1. A DSW delivers the performance needed for ML on network traffic
     - Ours is open source: https://entrada.sidnlabs.nl
     - Test hypotheses on large datasets within seconds
  2. We presented ndews
     - Early warning system for new domains
     - Uses k-means to classify each domain based on network traffic features
     - It monitors all new domains in the .nl zone, daily
     - We notify registrars about suspicious domains
  3. Future work:
     - making it near real-time
     - incorporating time-series analysis
     - evaluating all domains, not only the new ones

Questions?
Contact: http://sidnlabs.nl, giovane.moura@sidn.nl
Thank you for your attention
Download our software at: http://entrada.sidnlabs.nl

Bibliography
[1] S. Hao, N. Feamster, and R. Pandrangi, "Monitoring the initial DNS behavior of malicious domains," in Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, ser. IMC '11. New York, NY, USA: ACM, 2011, pp. 269-278.
[2] M. Wullink, G. C. M. Moura, M. Müller, and C. Hesselman, "ENTRADA: a High Performance Network Traffic Data Streaming Warehouse," in Network Operations and Management Symposium (NOMS), 2016 IEEE (to appear), April 2016. [Online]. Available: https://www.sidnlabs.nl/downloads/sidn-noms2016_en.pdf
[3] G. C. M. Moura, M. Müller, M. Wullink, and C. Hesselman, "ndews: a New Domains Early Warning System for TLDs," in IEEE/IFIP International Workshop on Analytics for Network and Service Management (AnNet 2016), co-located with IEEE/IFIP Network Operations and Management Symposium (NOMS 2016), April 2016. [Online]. Available: https://www.sidnlabs.nl/downloads/presentations/sidn-annet2016.pdf
[4] A. Abbasi and H. Chen, "A comparison of tools for detecting fake websites," Computer, no. 10, pp. 78-86, 2009.
[5] N. Schmidle, "Inside the Knockoff-Tennis-Shoe Factory," The New York Times, 2010. [Online]. Available: http://www.nytimes.com/2010/08/22/magazine/22fake-t.html
[6] FraudHelpdesk.nl, "Ruim miljoen Nederlanders opgelicht" (in Dutch; "Over a million Dutch people defrauded"), Dec 2014. [Online]. Available: https://www.fraudehelpdesk.nl/nieuws/ruim-miljoen-nederlanders-opgelicht-2/