DNS Big Data Analytics

1 DNS Big Data Analytics
DNS-OARC Fall 2015 Workshop
October 4th 2015
Maarten Wullink, SIDN

2 SIDN
Domain name registry for the .nl ccTLD
> 5.6 million domain names
2.45 million domain names secured with DNSSEC
SIDN Labs is the R&D team of SIDN

3 DNS
> 3.1 million resolvers
> 1.3 billion queries daily
> 300 GB of PCAP data daily

4 ENTRADA
ENhanced Top-Level Domain Resilience through Advanced Data Analysis
Goal: data-driven improved security & stability of .nl
Problem: existing tools for analyzing network data do not work well with large datasets and have limited [...]
Main requirement: high-performance, near real-time data warehouse
Approach: avoid expensive pcap analysis
  Convert pcap data to a performance-optimized format (key)
  Perform analysis with tools/engines that leverage that format

5 Requirements
SQL support
Scalability
High performance
Capacity for >1 year of DNS data
Extensibility
Stability
Don't spend too much money!

6 Query Engine
Engines galore! Evaluated SQL and NoSQL solutions:
SQL (PostgreSQL)
MongoDB
Cassandra
Hadoop (HBase + Apache Phoenix or Hive)
SQL on Hadoop (HDFS + Impala + Parquet)

7 SQL on Hadoop
Best fit for our requirements
[Diagram: Hadoop nodes N, N+1 and N+2, each running Impala on top of Parquet files stored in HDFS]

8 HDFS
Distributed file system for storing large volumes of data
High availability through replication of data blocks
Scalable to hundreds of PBs and thousands of servers

9 Impala query engine
MPP (massively parallel processing)
Inspired by the Google Dremel paper
Provides low latency and high concurrency for queries on Hadoop
Excellent performance when compared to other Hadoop-based query engines

10 Impala (2)
Data formats
  Text
  Hadoop formats
  Apache Avro
  Apache Parquet
Interfaces
  Web-based GUI
  Command line (impala-shell)
  Python (Impyla)
  JDBC

11 Apache Parquet
Why not just use the PCAP files?
  Reading (compressed) PCAP data is just too slow
  Query engines cannot read PCAP files
Columnar storage format
[Figure: the same data shown row oriented and column oriented]

12 Apache Parquet (2)
Columnar storage allows for efficient encoding/compression
  Encoding schemes
  Support for Snappy compression
Partitioned data (e.g. by year, month, day and server)
  Partition pruning allows Impala to skip data we are not interested in
Other engines such as Apache Spark can use the same Parquet data
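
As a rough illustration of pruning by year, month, day and server, the sketch below mimics a Hive-style directory layout (`year=.../month=.../day=.../server=...`) and selects only the directories that match a filter before any file would be read. The paths, server names and the `prune` helper are hypothetical; Impala does this internally, and this is not its implementation.

```python
from typing import Dict, List

# Hypothetical partition directories, one per (year, month, day, server).
PARTITIONS: List[str] = [
    "year=2015/month=10/day=3/server=ns1",
    "year=2015/month=10/day=4/server=ns1",
    "year=2015/month=10/day=4/server=ns2",
    "year=2014/month=10/day=4/server=ns1",
]


def parse_partition(path: str) -> Dict[str, str]:
    """Split 'key=value/key=value' into a dict of partition columns."""
    return dict(part.split("=", 1) for part in path.split("/"))


def prune(partitions: List[str], **wanted: str) -> List[str]:
    """Keep only partitions whose columns match every requested value."""
    return [
        p for p in partitions
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]


# Only two of the four directories need to be scanned for 4 October 2015.
selected = prune(PARTITIONS, year="2015", month="10", day="4")
print(selected)
```

A query that filters on the partition columns therefore touches a fraction of the stored files, which is why the choice of partition keys matters for query speed.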

13 ENTRADA Architecture
DNS big data system
Goal: develop applications and services that further enhance the security and stability of .nl, the DNS, and the Internet at large
ENTRADA main components
  Applications and services
  Platform
  Data sources
  Privacy framework

14 ENTRADA Privacy Framework
[Diagram labels: legal and organisational; ENTRADA data platform (technical); R&D licence; ENTRADA privacy framework; template; author (application developer); draft policy; Privacy Board; policy; adjustments; PEP-U, PEP-A, PEP-S, PEP-C; security and stability services and dashboards; data analysis algorithms; database queries; storage; DNS packets (PCAP); collection; .nl name servers; DNS queries and responses; resolvers]
Download paper: http://goo.gl/gvsfzq
Policy elements:
  Purpose
  Data that is used
  Filters on the data
  Retention period
  Access to the data
  Type of application (Research vs. Production)

15 Cluster Design
Nano sized
[Diagram: units I, II and III, labelled management node, data nodes and data nodes, connected by a 2 Gb/s network]

16 Hardware
Management node
  HP ProLiant DL380
  Xeon 1.9 GHz 12-core CPU
  64 GB RAM
  3 TB storage
Data node
  HP ProLiant DL380
  Xeon 1.9 GHz 12-core CPU
  64 GB RAM
  6 TB storage
Scaling
  Vertical by adding more resources
  Horizontal by adding more data nodes

17 Workflow
[Diagram labels: name server, PCAP, staging, PCAP decode, join, filter, enrich, Parquet, import, Hadoop, Impala, query, analyst, monitoring, metrics]
Data available for analysis within 10 minutes
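
One way to picture the join step is pairing each decoded query with its response so that both end up in a single row. The sketch below assumes packets are matched on client address, client port and DNS message ID; the field names and sample records are made up for illustration, and this is not ENTRADA's actual code.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int, int]  # (client ip, client port, DNS message id)


def join_packets(packets: List[dict]) -> List[dict]:
    """Pair each response with the pending query that has the same key."""
    pending: Dict[Key, dict] = {}
    rows: List[dict] = []
    for pkt in packets:
        key = (pkt["client"], pkt["port"], pkt["id"])
        if pkt["is_response"]:
            query: Optional[dict] = pending.pop(key, None)
            if query is not None:
                rows.append({"qname": query["qname"], "rcode": pkt["rcode"]})
        else:
            pending[key] = pkt
    # Queries that never got a response are kept with an empty rcode.
    rows.extend({"qname": q["qname"], "rcode": None} for q in pending.values())
    return rows


# Hypothetical decoded packets: two queries, one of which is answered.
packets = [
    {"client": "192.0.2.1", "port": 5353, "id": 7, "is_response": False, "qname": "example.nl."},
    {"client": "192.0.2.9", "port": 4000, "id": 8, "is_response": False, "qname": "sidn.nl."},
    {"client": "192.0.2.1", "port": 5353, "id": 7, "is_response": True, "rcode": 0},
]
rows = join_packets(packets)
print(rows)
```

Storing one row per query/response pair keeps later SQL simple, since a single table scan can answer questions about both sides of the exchange.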

18 Performance
Example query: count the number of IPv4 queries per day.
  select concat_ws('-', day, month, year), count(1)
  from dns.queries
  where ipv=4
  group by concat_ws('-', day, month, year)
1 year of data is 2.2 TB of Parquet, roughly 52 TB of PCAP
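
To make the shape of this result concrete, the sketch below runs an equivalent aggregation on a few made-up rows. It uses Python's built-in sqlite3 as a stand-in, not Impala, so the table is called `queries` rather than `dns.queries` and the date string is built with `||` instead of `concat_ws`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queries (day INTEGER, month INTEGER, year INTEGER, ipv INTEGER)"
)
# Made-up sample rows: (day, month, year, IP version).
conn.executemany(
    "INSERT INTO queries VALUES (?, ?, ?, ?)",
    [(3, 10, 2015, 4), (3, 10, 2015, 4), (3, 10, 2015, 6), (4, 10, 2015, 4)],
)

# Count IPv4 queries per day, keyed by a 'day-month-year' string.
sql = """
SELECT day || '-' || month || '-' || year AS query_date, COUNT(1)
FROM queries
WHERE ipv = 4
GROUP BY query_date
ORDER BY query_date
"""
result = conn.execute(sql).fetchall()
print(result)
```

Because the query only touches the `day`, `month`, `year` and `ipv` columns, a columnar format lets the engine skip every other column in the table.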

19 ENTRADA Status
Name server feeds: 1
Queries per day: ~150M
Daily PCAP volume (gzipped): ~33 GB
Daily Parquet volume: ~6 GB
Months: 18
Total # queries stored: > 71B
Total Parquet volume: > 3 TB
HDFS (3x replication): > 9 TB
Cluster capacity: ~150B-200B tuples
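
The status figures above can be cross-checked with some back-of-the-envelope arithmetic. The sketch assumes 30-day months and uses the approximate daily volumes as given, so the results are rough estimates rather than measured values.

```python
# Approximate figures taken from the status slide.
QUERIES_PER_DAY = 150e6       # ~150M
DAILY_PCAP_GB = 33            # gzipped
DAILY_PARQUET_GB = 6
MONTHS = 18
REPLICATION = 3               # HDFS replication factor

days = MONTHS * 30            # assumption: 30-day months

total_queries = QUERIES_PER_DAY * days
total_parquet_tb = DAILY_PARQUET_GB * days / 1000
total_hdfs_tb = total_parquet_tb * REPLICATION
pcap_to_parquet = DAILY_PCAP_GB / DAILY_PARQUET_GB

print(f"queries stored : {total_queries / 1e9:.0f}B")
print(f"Parquet volume : {total_parquet_tb:.2f} TB")
print(f"HDFS volume    : {total_hdfs_tb:.2f} TB")
print(f"gzipped PCAP is {pcap_to_parquet:.1f}x larger than Parquet")
```

The estimates land in the same range as the reported totals (> 71B queries, > 3 TB of Parquet, > 9 TB on HDFS).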

20 Use Cases
Focussed on increasing the security and stability of .nl
Visualize DNS patterns (visualize traffic patterns for phishing domain names)
Detect botnets
Phishing
Statistics (stats.sidnlabs.nl)
Research with Dutch universities
Support for DNS operators

21 Examples
DNS security scoreboard
Resolver reputation

22 DNS Security Scoreboard
Goal: visualize DNS patterns for malicious domain names
How: combine external phishing feeds with DNS data

23 Architecture
[Diagram: security feeds I and II each deliver new events to the Event Analyzer, which uses Hadoop and saves enriched events in PostgreSQL; a REST API retrieves event data for the Web UI]

24 Traffic

25 Resolver Reputation (RESREP)
Goal: try to detect malicious activity by assigning scores to resolvers
How: [...] resolver behaviour

26 RESREP Concept
[Diagram: ISP resolvers exchange DNS queries and responses with the .nl registry DNS. Malicious activity shown: spam runs, botnets like Cutwail, DNS attacks]

27 RESREP Architecture
[Diagram labels: root operator, .nl, Privacy Board, ISP network, RESREP privacy policy, resolvers, RESREP service, ENTRADA platform, AbuseHUB, abuse desk, user, HTTP, child operator (example.nl)]

28 Conclusions
Technical: Hadoop HDFS + Parquet + Impala is a winning combination!
Contributions:
  Research by SIDN Labs and universities
  Identified malicious domain names and botnets
  External data feed to the Abuse Information Exchange
  Insight into DNS query data

29 Future Work
Combine data from the .nl name server with scans of the complete .nl zone and ISP data
Get data from more name servers and resolvers
Expand the Open Data program

30 Questions and Feedback
Maarten Wullink
Senior Research
https://stats.sidnlabs.nl