Machine Learning for Cyber Security Intelligence
27th FIRST Conference, 17 June 2015
Edwin Tump, Senior Analyst, National Cyber Security Center
Introduction: whois Edwin Tump
- 10 yrs at NCSC.NL (GOVCERT.NL): 9 yrs as a security specialist, 1 yr as a security analyst
- Areas of special interest: information collection (e.g. Taranis), tooling, output, machine learning?
Introduction: Machine Learning
Agenda
- Current situation
- Challenges
- Desired situation
- Approach
- Machine learning
- Results
- Conclusions
Current situation: Taranis
- 1,500 sources
- Per day: 6,000 items
- Per week: 42,000 items
- Per month: 180,000 items
- Per year: 2,190,000 items
Current situation: Taranis categories
Current situation: Taranis keyword matching
- Automatic dispatch of irrelevant items, only for specific sources
- Relevant items determined based on keyword lists
- Threat keywords: cross-site scripting, SQL-injection, vulnerability, defacement, data dump
- Constituent keywords: Ministry of Security and Justice, Ministry of …, Power Company X, Bank Y
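A minimal sketch of this kind of keyword matching, assuming simple case-insensitive substring checks. The keyword lists, function names, and item texts below are illustrative, not the actual Taranis implementation:

```python
# Hypothetical Taranis-style keyword matching: item texts are checked against
# keyword lists for threats and constituents; items that match nothing could
# be dispatched automatically as irrelevant.

THREAT_KEYWORDS = {"cross-site scripting", "sql-injection", "vulnerability",
                   "defacement", "data dump"}
CONSTITUENT_KEYWORDS = {"ministry of security and justice",
                        "power company x", "bank y"}

def match_keywords(text, keywords):
    """Return the keywords that occur in the (lower-cased) text."""
    lowered = text.lower()
    return {kw for kw in keywords if kw in lowered}

def is_relevant(text):
    """An item is kept when it mentions a known threat or constituent."""
    return bool(match_keywords(text, THREAT_KEYWORDS | CONSTITUENT_KEYWORDS))
```

For example, an item titled "New SQL-injection flaw found in Bank Y portal" matches both lists, while an item with no matches at all could be dispatched automatically.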
Current situation: Taranis deduplication
- Clustering/deduplication of news items
Current situation: the 5 Taranis phases
Collect → Assess → Analyze → Write → Publish
Current situation: outputs
- Security advisory
- End-of-Shift
- End-of-Week
- Internal analysis
- Dossier
- External analysis
- Monthly monitor
- Factsheet
- Whitepaper
- End-of-Year
- Cyber Security Assessment NL
(Outputs draw on Taranis and on other sources)
Challenges
Image courtesy of Stuart Miles at FreeDigitalPhotos.net
Challenges
- Handling the growing amount of information is labor intensive:
  - determine relevant topics
  - determine new trends
  - rate the reliability of news: the good, the bad and the ugly
  - keep dossiers up-to-date
- Higher expectations: reports must be timely, complete and short
Desired situation
Desired situation
- Automatically cluster items and detect stories
- Automatically determine story relevance
- Automatically assign items to dossiers
Approach
Approach
- Agile-scrum: close contact between participants; multiple short sprints (4 sprints of 2 weeks)
- Principles: high level of trust; use of well-known concepts from the field of machine learning; open mind, not restricted by a ready-made product
- Deliverables: better understanding of the usefulness of machine learning techniques; Proof-of-Concept(s); detailed requirements for future work; paper describing the results
Approach
- Data (Taranis): source data, process, products, tools
- Expertise: text mining, machine learning, information retrieval
Machine learning (ML): main methods
- Supervised learning: classification, regression
- Unsupervised learning: clustering, association, density estimation, dimensionality reduction
- Reinforcement learning
- Semi-supervised learning
Proof-of-Concept toolset
- PostgreSQL: relational database; offline partial copy of the Taranis database (http://www.postgresql.org/)
- R: a free software environment for statistical computing and graphics (http://www.r-project.org)
- Tableau: data visualization / data analysis (http://www.tableau.com/)
- FoamTree: JavaScript tree map visualization (http://carrotsearch.com/foamtree-overview)
Overview
Taranis items feed a pipeline of:
- language detection
- preprocessing and representation
- story detection and linking (k-means) and story relevance
- classification (SVM)
- topics (LDA)
Language detection
[Chart: language distribution of items — with regular news: 38%, 53%, 13%; without regular news: 76%]
Preprocessing and representation
- Remove noisy words (stop words)
- Remove duplicates
- Remove numbers
- Remove punctuation marks
- Normalize case (upper to lower)
- Stem words, using the Porter stemming algorithm
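The steps above can be sketched as a small pipeline. The stop-word list is a tiny illustrative subset, and `crude_stem` is a toy suffix stripper standing in for the real Porter algorithm, which applies many more rules:

```python
import re
import string

# Minimal sketch of the preprocessing pipeline; stop words and stemmer are
# illustrative stand-ins, not the actual lists/algorithm used in the PoC.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "are", "for"}

def crude_stem(word):
    # Toy stemmer: strip a few common English suffixes (Porter does far more).
    for suffix in ("ing", "ers", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                   # normalize case
    text = re.sub(r"[0-9]+", " ", text)                   # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop words
    tokens = [crude_stem(t) for t in tokens]              # stem
    return sorted(set(tokens))                            # remove duplicates
```

For instance, `preprocess("Exploit delivered")` yields the stemmed tokens `["deliver", "exploit"]`.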
Example
Original: "Exploit Kit Delivers DNS Changer to Thousands of Routers. A malicious campaign deployed by cybercriminals aims at changing the Domain Name System (DNS) server settings in router configuration, responsible for retrieving the correct web pages from legitimate web servers. An attacker changing these settings can point to malicious locations, exposing the victim to a wide range of risks varying from credential stealing and ad-fraud to traffic interception and malware delivery."
After preprocessing: exploit deliv changer thousand router malicy campaign deploy cybercrimin chang domain server router configur respons retriev correct page legitim server attacker chang point malicy locat expos victim wide rang risk vary credenty steal fraud traffic intercept malwar delivery
k-means
- History: work began in 1957, but it was published in 1982 by Stuart Lloyd
- Unsupervised; widely used because of its simplicity and performance
- Taranis items represented as word vectors, calculated based on TF-IDF
- Steps:
  1. Assignment: choose k cluster centers, then assign each point to the closest cluster center
  2. Update: recompute each center
  3. Repeat until the update step makes no modifications or a threshold is reached
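A minimal sketch of Lloyd's assignment/update loop. In the Proof-of-Concept the points would be TF-IDF vectors of Taranis items; plain 2-D points and Euclidean distance keep the example self-contained:

```python
# Lloyd's k-means: alternate assignment and update steps until the update
# step makes no modifications (or an iteration limit is reached).

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[idx].append(p)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:    # no modifications -> converged
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of points, the loop converges in a couple of iterations and each group ends up in its own cluster.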
Results
Story detection and relevance
Steps:
1. Deduplication
2. Clustering in (overlapping) sliding windows: windows of 12 hrs, slices of 2 hrs
3. Clustering between time slices
4. Determine story relevance, based on:
   - volume of documents within the story
   - source reputation (derived from mail actions on items from the source)
   - cross-category stories
Important parameters: within-story (0-1) and between-story (0-1)
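One hypothetical way to combine the three relevance signals into a single score. The weights, normalizations, and function name below are invented for illustration; the slides do not give the actual formula:

```python
import math

# Illustrative story-relevance score combining the three signals named above:
# document volume, source reputation, and cross-category spread.
# All weights and caps are assumptions, not the PoC's real parameters.

def story_relevance(doc_count, source_reputations, category_count,
                    w_volume=0.5, w_reputation=0.3, w_categories=0.2):
    volume = math.log1p(doc_count) / math.log1p(100)          # saturate around 100 docs
    reputation = sum(source_reputations) / len(source_reputations)  # each in 0..1
    categories = min(category_count, 5) / 5                   # cap at 5 categories
    return (w_volume * min(volume, 1.0)
            + w_reputation * reputation
            + w_categories * categories)
```

A 50-document story from reputable sources spanning three categories then scores well above a two-document story from a single low-reputation source.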
Story relevance
The project focus: derive story relevance from how far items get through the 5 Taranis phases (Collect, Assess, Analyze, Write, Publish):
- collected only
- collected and forwarded
- collected and analyzed
- collected and used in a publication
Results
Support Vector Machines (SVM)
- Supervised classification
- SVMs take your data and draw a hyperplane dividing the dataset into groups of positive and negative observations
- Types of classification:
  - binary classification (simplest)
  - multi-category classification: one-against-one, one-against-rest
- The Cyber Security Assessment Netherlands 2013 (CSAN-3) was used as training set
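A bare-bones binary linear SVM, trained with sub-gradient descent on the hinge loss, illustrates the "draw a hyperplane" idea. This is a toy stand-in: the PoC would train on TF-IDF vectors with CSAN-3 texts as labels, presumably via an off-the-shelf implementation (e.g. in R); the 2-D points and hyperparameters here are purely illustrative:

```python
# Toy linear SVM: minimize hinge loss + L2 regularization by stochastic
# sub-gradient descent. Labels are +1 / -1; features are plain tuples.

def train_linear_svm(data, labels, lr=0.1, lam=0.01, epochs=500):
    """Return (weights, bias) of a separating hyperplane w.x + b = 0."""
    dim = len(data[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point inside or beyond the margin: hinge-loss sub-gradient.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the L2 regularization pulls w toward zero.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def predict(w, b, x):
    """Side of the hyperplane: +1 (positive) or -1 (negative)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

On linearly separable data the learned hyperplane classifies the training points correctly; one-against-one or one-against-rest schemes reduce multi-category problems to repeated runs of this binary case.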
Results
Latent Dirichlet Allocation (LDA)
- Unsupervised
- Assumption: documents are generated by latent topics, the hidden variables from which the observed documents arise
- Main motivation: uncover the underlying semantic structures (hidden topics) of a document
- Each document is a mix of topics, each defined by a collection of words
- Variables: K = number of topics, W = word, D = document, Z = topic
- Continually answer the following questions for each word:
  - how often does word W appear in topic Z elsewhere?
  - how common is topic Z in the rest of the document?
- Visualization based on our own implementation of LDAvis
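The two per-word questions above are exactly the factors that collapsed Gibbs sampling for LDA multiplies to re-sample a word's topic. A minimal sketch of that single step, with illustrative counts and smoothing constants (alpha, beta):

```python
# One LDA Gibbs-sampling ingredient: the probability of assigning the current
# word to each topic Z is proportional to
#   ("how often does W appear in topic Z elsewhere?" + beta)
# * ("how common is topic Z in the rest of this document?" + alpha).
# Counts and smoothing values are illustrative, not from the PoC.

def topic_probabilities(word_topic_counts, doc_topic_counts, alpha=0.1, beta=0.01):
    """word_topic_counts[z]: occurrences of word W under topic z elsewhere.
    doc_topic_counts[z]: words in this document assigned to topic z elsewhere.
    Returns a normalized probability per topic."""
    weights = [(wc + beta) * (dc + alpha)
               for wc, dc in zip(word_topic_counts, doc_topic_counts)]
    total = sum(weights)
    return [w / total for w in weights]
```

For a word seen 10 times under topic 0 but only once under topic 1, in a document dominated by topic 0, the sampler strongly prefers topic 0; repeating this step over all words until the counts stabilize yields the topic mixtures.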
Conclusions
Conclusions
- Overall:
  - promising
  - we already trained our system (without knowing we did)
  - the interactive approach leads to quick development of knowledge
- Proof-of-Concept:
  - stories found are understandable and relevant
  - linking items to existing dossiers is successful
  - determining new dossiers is difficult
- To do:
  - quantitative evaluation
  - decide how to move further
  - possibly work towards a working system based on the mid-level requirements
END Thanks for your attention!