.nl ENTRADA. CENTR-tech 33, November 2015. Marco Davids, SIDN Labs.



2 SIDN: registry for the .nl ccTLD. ~5.6 million domain names, ~2.45 million domain names secured with DNSSEC. SIDN Labs is the R&D team.

3 ENTRADA: ENhanced Top-Level Domain Resilience through Advanced Data Analysis. > 300 GB of PCAP data daily; > 1.3 billion queries daily; > 3.1 million resolvers. Currently capturing some 10% of total.

4 Requirements: SQL support; scalability / extensibility; stability / high performance; capacity for >1 year of DNS data; reasonable budget.

5 Query Engine. Evaluated SQL and NoSQL solutions: SQL (PostgreSQL); MongoDB; Cassandra; Elasticsearch; Hadoop (HBase + Apache Phoenix or Hive); SQL on Hadoop (HDFS + Impala + Parquet).

6 SQL on Hadoop: best fit for our requirements. Diagram: Hadoop nodes N, N+1 and N+2, each running Impala on Parquet, on top of HDFS.

7 HDFS: distributed file system for storing large volumes of data. High availability through replication of data blocks. Scalable to hundreds of PBs and thousands of servers.

8 Apache Parquet. Why not just use the PCAP files? Reading (compressed) PCAP data is just too slow. Analytical engines cannot read PCAP files. Columnar storage format. Diagram: data, row oriented vs. column oriented.
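The row-oriented versus column-oriented distinction on this slide can be illustrated with a minimal Python sketch. The three DNS records and their field names (`qname`, `qtype`, `ipv`) are made up for illustration, and this only shows the layout idea, not how Parquet itself is implemented.

```python
# Hypothetical DNS query records; the field names are illustrative only.
rows = [
    {"qname": "example.nl.", "qtype": "A", "ipv": 4},
    {"qname": "sidn.nl.", "qtype": "AAAA", "ipv": 6},
    {"qname": "example.nl.", "qtype": "MX", "ipv": 4},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row oriented: counting IPv4 queries walks over every field of every record.
values_touched_rows = sum(len(record) for record in rows)
ipv4_from_rows = sum(1 for record in rows if record["ipv"] == 4)

# Column oriented: the same count only needs the "ipv" column.
values_touched_columns = len(columns["ipv"])
ipv4_from_columns = sum(1 for value in columns["ipv"] if value == 4)

print(ipv4_from_rows, ipv4_from_columns)
print(values_touched_rows, values_touched_columns)
```

Both layouts give the same answer, but the columnar one reads a third of the values here, and values of one type stored together also tend to compress better.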

9 Apache Parquet (2). Columnar storage allows for efficient encoding/compression. Encoding schemes. Support for Snappy compression. Partition data (e.g. by year, month, day and server). Partition pruning allows Impala to skip data we are not interested in. Other engines, like Apache Spark, can use the same Parquet data.
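The effect of partitioning by year, month, day and server can be sketched in a few lines of Python. The `key=value` directory layout and the server names `ns1` and `ns2` are assumptions for illustration; the slide does not show the actual layout, and a real engine performs this selection internally.

```python
def parse_partition(path):
    """Turn 'year=2015/month=11/day=1/server=ns1' into a dict of strings."""
    return dict(part.split("=", 1) for part in path.strip("/").split("/"))


def prune(partitions, **wanted):
    """Keep only the partitions whose keys match every requested value."""
    selected = []
    for path in partitions:
        keys = parse_partition(path)
        if all(keys.get(key) == str(value) for key, value in wanted.items()):
            selected.append(path)
    return selected


# Hypothetical partition directories.
partitions = [
    "year=2015/month=10/day=31/server=ns1",
    "year=2015/month=11/day=1/server=ns1",
    "year=2015/month=11/day=1/server=ns2",
    "year=2015/month=11/day=2/server=ns1",
]

# A query restricted to one day only has to open two of the four directories.
to_scan = prune(partitions, year=2015, month=11, day=1)
print(to_scan)
```

The files in the skipped directories are never read, which is where the saving comes from when a query filters on the partition columns.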

10 Impala query engine. MPP (massively parallel processing). Inspired by the Google Dremel paper. Provides low latency and high concurrency for BI/analytic queries on Hadoop. Excellent performance when compared to other Hadoop-based query engines.

11 Impala (2). Data formats: text; Hadoop formats; Apache Avro; Apache Parquet. Interfaces: web-based GUI; command line (impala-shell); Python (Impyla); JDBC.

12 ENTRADA Architecture. DNS big data system. Goal: develop applications and services that further enhance the security and stability of .nl, the DNS, and the Internet at large. ENTRADA main components: applications and services; platform; data sources; privacy framework.

13 Privacy. ENTRADA privacy framework, with a legal and organisational side and the ENTRADA data platform (technical side).
Diagram labels: R&D licence; Template; Author (Application Developer); Draft Policy; Privacy Board; Policy; Adjustments; PEP-C (collection: .nl name servers, DNS queries and responses, resolvers); PEP-S (storage: DNS packets, PCAP); PEP-A (data analysis algorithms, database queries); PEP-U (security and stability services and dashboards).
Policy elements: purpose; data that is used; filters on the data; retention period; access to the data; type of application (Research vs. Prod.).
Download paper: https://goo.gl/wec5dr

14 Workflow (diagram labels): name server, PCAP, staging; PCAP decode, join, filter, enrich; Parquet, import; Hadoop, Impala, query, analyst; monitoring, metrics. Data is available for analysis within 10 minutes.
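The join, filter and enrich steps of this workflow can be sketched as a chain of small Python generators. Everything specific here is an assumption for illustration: the packet fields, the pairing of a query with its response on resolver address and message ID, the ".nl only" filter rule and the country lookup table. The PCAP decode step is left out, so the input is a list of already decoded packets.

```python
def join_queries_and_responses(packets):
    """Join: pair each query with its response on (resolver, id)."""
    pending = {}
    for pkt in packets:
        key = (pkt["resolver"], pkt["id"])
        if pkt["is_response"]:
            query = pending.pop(key, None)
            if query is not None:
                yield {**query, "rcode": pkt["rcode"]}
        else:
            pending[key] = {
                "resolver": pkt["resolver"],
                "id": pkt["id"],
                "qname": pkt["qname"],
            }


def keep_nl(rows):
    """Filter: keep only names under .nl (an illustrative rule)."""
    return (row for row in rows if row["qname"].endswith(".nl."))


def enrich(rows, country_by_resolver):
    """Enrich: add a country from a hypothetical resolver lookup table."""
    for row in rows:
        country = country_by_resolver.get(row["resolver"], "unknown")
        yield {**row, "country": country}


# Hypothetical decoded packets (documentation address ranges).
decoded_packets = [
    {"resolver": "192.0.2.1", "id": 1, "is_response": False, "qname": "example.nl."},
    {"resolver": "192.0.2.1", "id": 1, "is_response": True, "rcode": 0},
    {"resolver": "198.51.100.7", "id": 9, "is_response": False, "qname": "example.com."},
    {"resolver": "198.51.100.7", "id": 9, "is_response": True, "rcode": 5},
]
countries = {"192.0.2.1": "NL"}

result = list(enrich(keep_nl(join_queries_and_responses(decoded_packets)), countries))
print(result)
```

Each stage consumes the previous one lazily, so rows flow through one at a time; in the real system the output of the last stage would be written as Parquet and imported.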

15 Cluster Design: nano sized. Diagram: management node (I) and data nodes (II, III), connected by a 2Gb/s network.

16 Hardware. Management node: HP ProLiant DL380, Xeon 1.9 GHz 12 core CPU, 64GB RAM, 3 TB storage. Data node: HP ProLiant DL380, Xeon 1.9 GHz 12 core CPU, 64GB RAM, 6 TB storage. Scaling: vertical by adding more resources; horizontal by adding more data nodes.

17 Performance. Example query, count # IPv4 queries/day:
select concat_ws('-', day, month, year), count(1) from dns.queries where ipv=4 group by concat_ws('-', day, month, year)
1 year of data is 2.2TB Parquet, ~52TB of PCAP. Query response times (chart).
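To make the example query concrete, the following sketch runs the same aggregation on an in-memory SQLite database as a stand-in for Impala. The four rows are made up, the column types are an assumption, and SQLite's `||` operator replaces `concat_ws`, which older SQLite versions lack. The last lines work out the size ratio implied by the 2.2TB and 52TB figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database so the table can be called dns.queries.
conn.execute("ATTACH DATABASE ':memory:' AS dns")
conn.execute(
    "CREATE TABLE dns.queries (day INTEGER, month INTEGER, year INTEGER, ipv INTEGER)"
)
# Made-up sample rows: (day, month, year, ipv).
conn.executemany(
    "INSERT INTO dns.queries VALUES (?, ?, ?, ?)",
    [(1, 11, 2015, 4), (1, 11, 2015, 4), (1, 11, 2015, 6), (2, 11, 2015, 4)],
)

sql = """
SELECT day || '-' || month || '-' || year AS query_date, COUNT(1)
FROM dns.queries
WHERE ipv = 4
GROUP BY day || '-' || month || '-' || year
ORDER BY 1
"""
per_day = conn.execute(sql).fetchall()
print(per_day)

# Size ratio from the slide: ~52TB of PCAP versus 2.2TB of Parquet per year.
pcap_tb = 52
parquet_tb = 2.2
ratio = pcap_tb / parquet_tb
print(round(ratio, 1))
```

On these figures the Parquet copy is roughly 24 times smaller than the PCAP it was derived from.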

18 Status
Name server feeds: 2
Queries per day: ~150M
Daily PCAP volume (gzipped): ~33GB
Daily Parquet volume: ~6GB
Months operational: 18
Total # queries stored: > 71B
Total Parquet volume: > 3TB
HDFS (3x replication): > 9TB
Cluster capacity: ~150B-200B tuples
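A few of these status figures can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes) and treats the approximate values on the slide as exact, so the results are rough estimates only.

```python
# Figures taken from the status slide (approximate values treated as exact).
queries_per_day = 150e6
daily_pcap_gz_bytes = 33e9
daily_parquet_bytes = 6e9
total_parquet_tb = 3
replication_factor = 3

# Average Parquet storage per stored query.
bytes_per_query = daily_parquet_bytes / queries_per_day

# How much smaller the daily Parquet output is than the gzipped PCAP input.
pcap_to_parquet_ratio = daily_pcap_gz_bytes / daily_parquet_bytes

# Raw HDFS usage once every block is stored three times.
hdfs_tb = total_parquet_tb * replication_factor

print(bytes_per_query)
print(pcap_to_parquet_ratio)
print(hdfs_tb)
```

This works out to about 40 bytes per query, a Parquet volume about 5.5 times smaller than the gzipped PCAP, and 9TB of HDFS usage, which matches the "> 9TB" line.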

19 Use Cases. Focussed on increasing the security and stability of .nl: visualize DNS patterns (visualize traffic patterns for phishing); detect botnet infections; real-time phishing detection; statistics (stats.sidnlabs.nl); research with Dutch universities; support for DNS operators.

20 Example applications: DNS security scoreboard; Resolver Reputation.

21 DNS Security Scoreboard. Goal: visualize DNS patterns for malicious activity. How: combine external phishing feeds with DNS data.
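Combining a phishing feed with DNS data could look roughly like the following Python sketch. The feed entries, the query rows and the two metrics (query count and number of distinct resolvers per listed domain) are assumptions for illustration; the slide does not say which metrics the scoreboard uses.

```python
from collections import Counter


def summarise_hits(feed, rows):
    """Count queries and distinct resolvers for every domain on the feed."""
    query_counts = Counter()
    resolvers = {}
    for row in rows:
        domain = row["qname"]
        if domain in feed:
            query_counts[domain] += 1
            resolvers.setdefault(domain, set()).add(row["resolver"])
    return {
        domain: {"queries": count, "resolvers": len(resolvers[domain])}
        for domain, count in query_counts.items()
    }


# Hypothetical phishing feed and DNS query rows.
phishing_feed = {"bad-login.example.nl."}
dns_rows = [
    {"qname": "bad-login.example.nl.", "resolver": "192.0.2.1"},
    {"qname": "bad-login.example.nl.", "resolver": "192.0.2.1"},
    {"qname": "bad-login.example.nl.", "resolver": "198.51.100.7"},
    {"qname": "sidn.nl.", "resolver": "192.0.2.1"},
]

scoreboard = summarise_hits(phishing_feed, dns_rows)
print(scoreboard)
```

A per-domain summary of this kind is the sort of enriched event that could then be stored and shown in a web UI.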

22 Architecture (diagram labels): Security feed I (new event), Security feed II (new event) -> Event Analyzer (with Hadoop) -> save enriched event -> PostgreSQL -> REST API -> Web UI (retrieve event data).

23 Traffic

24 Resolver Reputation (ResRep). Goal: try to detect malicious activity by assigning scores to resolvers. How: based on resolver behaviour.

25 ResRep Concept (diagram labels): .nl Registry; ISP; Resolvers; DNS queries and responses; DNS .nl; malicious activity: spam runs, botnets like Cutwail, attacks.

26 ResRep Architecture (diagram labels): Root operator; .nl; Child operator (example.nl); ISP network; Resolvers; ENTRADA Platform; RESREP service; RESREP Privacy Policy; Privacy Board; AbuseHUB; Abusedesk; User; HTTP.

27 Open Data program: https://stats.sidnlabs.nl

28 Open Data program (JSON files): https://stats.sidnlabs.nl

29 Conclusions. Technical: Hadoop HDFS + Parquet + Impala is a winning combination! Contributions: research by SIDN Labs and universities; identified malicious domain names and botnets; external data feed to the Abuse Information Exchange; insight into DNS query data.

30 Future Work. Combine data from .nl name servers with scans of the complete .nl zone and ISP data. Get data from more name servers and resolvers. Expand the Open Data program.

31 Please let us know Is, or can our aggregated open data (be) useful? For whom? What can we do to improve things?

32 Questions and Feedback. Marco Davids, Senior Research

DNS Big Data Analytics. DNS-OARC Fall 2015 Workshop, October 4th 2015. Maarten Wullink,