DNS Big Data Analy@cs

Similar documents
.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken

TLD Data Analysis. ICANN Tech Day, Dublin. October 19th 2015 Maarten Wullink, SIDN. Klik om de s+jl te bewerken

Big data security on.nl: infrastructure and one application

Using RDBMS, NoSQL or Hadoop?

A Privacy framework for DNS big data. November 28, 2014 Jelte Jansen

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

ENTRADA: Enabling DNS Big Data Applications

How To Handle Big Data With A Data Scientist

Malicious domains: Automatic Detection with DNS traffic analysis

Hadoop Ecosystem B Y R A H I M A.

Decoding DNS data. Using DNS traffic analysis to identify cyber security threats, server misconfigurations and software bugs

Oracle Big Data SQL Technical Update

SQream Technologies Ltd - Confiden7al

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Big Data. The Big Picture. Our flexible and efficient Big Data solu9ons open the door to new opportuni9es and new business areas

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Cost-Effective Business Intelligence with Red Hat and Open Source

Ibis: Scaling Python Analy=cs on Hadoop and Impala

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

The 3 questions to ask yourself about BIG DATA

A Performance Analysis of Distributed Indexing using Terrier

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Big Data Analytics Nokia

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Parquet. Columnar storage for the people

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

An to Big Data, Apache Hadoop, and Cloudera

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

How To Scale Out Of A Nosql Database

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Native Connectivity to Big Data Sources in MSTR 10

Open source large scale distributed data management with Google s MapReduce and Bigtable

Database Scalability and Oracle 12c

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

NERSC Data Efforts Update Prabhat Data and Analytics Group Lead February 23, 2015

Workshop on Hadoop with Big Data

Unified Big Data Processing with Apache Spark. Matei

Introduc8on to Apache Spark

Luncheon Webinar Series May 13, 2013

Big Data for Investment Research Management

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

Data Analytics Infrastructure

Moving From Hadoop to Spark

Hadoop implementation of MapReduce computational model. Ján Vaňo

Big Data Analytics - Accelerated. stream-horizon.com

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

How Companies are! Using Spark

Introduction to Big Data Training

CitusDB Architecture for Real-Time Big Data

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Big Data and Market Surveillance. April 28, 2014

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

The Internet of Things and Big Data: Intro

Mr. Apichon Witayangkurn Department of Civil Engineering The University of Tokyo

Cisco IT Hadoop Journey

Unlocking Hadoop for Your Rela4onal DB. Kathleen Technical Account Manager, Cloudera Sqoop PMC Member BigData.

Large scale processing using Hadoop. Ján Vaňo

Presenters: Luke Dougherty & Steve Crabb

Stream Deployments in the Real World: Enhance Opera?onal Intelligence Across Applica?on Delivery, IT Ops, Security, and More

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Apache Spark and the future of big data applica5ons. Eric Baldeschwieler

Real-Time Data Analytics and Visualization

Hadoop & SAS Data Loader for Hadoop

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

How To Create A Data Visualization With Apache Spark And Zeppelin

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Application Development. A Paradigm Shift

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Implement Hadoop jobs to extract business value from large and varied data sets

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Performance and Scalability Overview

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Performance and Scalability Overview

Using distributed technologies to analyze Big Data

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Jun Liu, Senior Software Engineer Bianny Bian, Engineering Manager SSG/STO/PAC

Transcription:

Klik om de s+jl te bewerken Klik om de models+jlen te bewerken! Tweede niveau! Derde niveau! Vierde niveau DNS Big Data Analy@cs Vijfde niveau DNS- OARC Fall 2015 Workshop October 4th 2015 Maarten Wullink, SIDN Wie zijn wij? Mijlpalen Organisa@e Het huidige internet Missie - Visie Diensten Referen@es SamenvaJng 1

SIDN Domain name registry for.nl cctld > 5,6 million domain names 2,45 million domain names secured with DNSSEC SIDN Labs is the R&D team of SIDN

DNS Data @SIDN > 3.1 million dis@nct resolvers > 1.3 billion query's daily > 300 GB of PCAP data daily

ENTRADA ENhanced Top- Level Domain Resilience through Advanced Data Analysis Goal: data- driven improved security & stability of.nl Problem: Exis@ng solu@ons for analyzing network data do not work well with large datasets and have limited analy@cal capabili@es. Main requirement: high- performance, near real- @me data warehouse Approach: avoid expensive pcap analysis: Convert pcap data to a performance- op@mized format (key) Perform analysis with tools/engines that leverage that

Requirements SQL support Scalability High performance Capacity for >1 year of DNS data Extensibility Stability Don t spend too much money!

Query Engine Op@ons Engines galore! Evaluated SQL and NoSQL soluions Rela@onal SQL (PostgreSQL) MongoDB Cassandra Elas@csearch Hadoop (HBASE + Apache Phoenix or Hive) SQL on Hadoop (HDFS + Impala + Parquet)

SQL on Hadoop Best fit for our requirements Hadoop Node N IMPALA Hadoop Node N+1 IMPALA Hadoop Node N+2 IMPALA PARQUET PARQUET PARQUET HDFS

HDFS Distributed file system for storing large volumes of data High availability through replica@on of data blocks Scalable to hundreds of PB s and thousands of servers

Impala query engine MPP (massively parallel processing) Inspired by Google Dremel paper Provides low latency and high concurrency for BI/analy@c queries on Hadoop Excellent performance when compared to other Hadoop based query engines.

Impala (2) Data formats Text Hadoop formats Apache Avro Apache Parquet Interfaces Web- based GUI Command line (impala- shell) Python (Impyla) JDBC

Apache Parquet Why not just use the PCAP files? Reading (compressed) PCAP data is just too slow Analy@cal engines cannot read PCAP files Columnar storage format data! row oriented! column oriented!

Apache Parquet (2) Columnar storage allows for efficient encoding/compression mul@ple encoding schemes support for Snappy compression Par@@on data (e.g. by year, month, day and server) Par@@on pruning allows Impala to skip data we are not interested in Other analy@cal engines such as Apache Spark can use the same Parquet data.

ENTRADA Architecture DNS big data system Goal: develop applica@ons and services that further enhance the security and stability of.nl, the DNS, and the Internet at large ENTRADA main components Applica@ons and services Planorm Data sources Privacy framework

ENTRADA Privacy Framework Legal and organisational ENTRADA data platform (technical) R&D licence ENTRADA privacy framework PEP- U Security and stability services and dashboards Adjustments Template Author (Application Developer) Draft Policy Privacy Board Policy PEP- A PEP- S PEP- C Data analysis algorithms Database queries Storage DNS packets (PCAP) Collection.nl name servers DNS queries and responses Resolvers Download paper: hnp://goo.gl/gvsfzq Policy elements: Purpose Data that is used Filters on the data Reten@on period Access to the data Type of applica@on (Research vs. Produc@on)

Cluster Design nano sized loca@on I loca@on II loca@on III management node data nodes data nodes 2Gb/s network

Hardware Management node! HP ProLiant DL380 Xeon 1.9 GHz 12 core CPU 64GB RAM 3 TB storage Data node! HP ProLiant DL380 Xeon 1.9 GHz 12 core CPU 64GB RAM 6 TB storage Scaling! Ver@cal by adding more resources Horizontal by adding more data nodes

Workflow name server PCAP staging PCAP decode Join Filter Hadoop Impala Analyst Enrich Parquet Monitoring Metrics Import Query data available for analysis within 10 minutes

Performance Example query, count # ipv4 queries per day. select concat_ws( -,day,month,year), count(1) from dns.queries where ipv=4 group by concat_ws( -,day,month,year) Query response @mes 1 Year of data is 2.2TB Parquet ~ 52TB of PCAP

ENTRADA Status Name server feeds Queries per day Daily PCAP volume(gzipped) Daily Parquet volume Months opera@onal Total # queries stored Total Parquet volume HDFS (3x replica@on) Cluster capacity 1 ~150M ~33GB ~6GB 18 > 71B > 3TB > 9TB ~150B- 200B tuples

Use Cases Focussed on increasing the security and stability of.nl Visualize DNS paxerns (visualize traffic paxerns for phishing domain names) Detect botnet infec@ons Real- @me Phishing detec@on Sta@s@cs (stats.sidnlabs.nl) Scien@fic research (collabora@on with Dutch Universi@es) Opera@onal support for DNS operators

Example Applica@ons DNS security scoreboard Resolver reputa@on

DNS Security Dcoreboard Goal: Visualize DNS paxerns for malicious ac@vity How: Combine external phishing feeds with DNS data

Architecture Security feed I new event Security feed II new event Hadoop Event Analyzer save enriched event PostgreSQL REST API Web UI retrieve event data

Traffic Visualiza@on

Resolver Reputa@on (RESREP) Goal: Try to detect malicious ac@vity by assigning reputa@on scores to resolvers How: fingerprin@ng resolver behaviour

RESREP Concept ISP Resolvers.nl Registry Authorita@ve DNS.nl Malicious ac@vity: Spam- runs Botnets like Cutwail DNS- amplifica@on axacks DNS ques@ons and responses

RESREP Architecture Root operator.nl Privacy Board ISP network $ RESREP Privacy Policy Resolvers & RESREP service # " % "% ENTRADA Planorm AbuseHUB Abusedesk ) ' User * HTTP ( Child operator (example.nl) www.example.nl

Conclusions Technical: Hadoop HDFS + Parquet + Impala is a winning combina@on! Contribu@ons: Research by SIDN Labs and universi@es Iden@fied malicious domain names and botnets External data feed to the Abuse Informa@on Exchange Insight into DNS query data

Future Work Combine data from.nl authorita@ve name server with scans of the complete.nl zone and ISP data. Get data from more name servers and resolvers Expand Open Data program

Ques@ons and Feedback Maarten Wullink Senior Research Engineer maarten.wullink@sidn.nl @wulliak www.sidnlabs.nl hxps://stats.sidnlabs.nl