Big Data Analy1cs. Radu State

Similar documents
Making historical connections: Building Application Layer Traffic Optimization (ALTO) network and cost maps from public broadband data

Concept and Project Objectives

LARGE-SCALE INTERNET MEASUREMENTS FOR DATA-DRIVEN PUBLIC POLICY. Henning Schulzrinne (+ Walter Johnston & James Miller) FCC & Columbia University

Ins+tuto Superior Técnico Technical University of Lisbon. Big Data. Bruno Lopes Catarina Moreira João Pinho

INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS

Presented by: Aaron Bossert, Cray Inc. Network Security Analytics, HPC Platforms, Hadoop, and Graphs Oh, My

White Paper. Intelligence Driven. Security Monitoring. v nexusguard.com

Measuring Broadband America

SQream Technologies Ltd - Confiden7al

2014 Measuring Broadband America Fixed Broadband Report

QoE-Aware Multimedia Content Delivery Over Next-Generation Networks

Dr. John E. Kelly III Senior Vice President, Director of Research. Differentiating IBM: Research

Internet Traffic and Content Consolidation

Service Description DDoS Mitigation Service

STEALTHWATCH MANAGEMENT CONSOLE

Workshop on Infrastructure Security and Operational Challenges of Service Provider Networks

( NBP ). 4 See

Splunk for Networking and SDN

A New Era Of Analytic

Internet Traffic Evolution

Botnet Detection Based on Degree Distributions of Node Using Data Mining Scheme

Huawei One Net Campus Network Solution

Testing & Assuring Mobile End User Experience Before Production. Neotys

Mining NetFlow Records for Critical Network Activities

Network Performance Monitoring at Minimal Capex

Complex, true real-time analytics on massive, changing datasets.

Massive Cloud Auditing using Data Mining on Hadoop

How To Handle Big Data With A Data Scientist

DDOS WALL: AN INTERNET SERVICE PROVIDER PROTECTOR

Denial of Service Attacks and Resilient Overlay Networks

An Anomaly-Based Method for DDoS Attacks Detection using RBF Neural Networks

Availability Digest. Prolexic a DDoS Mitigation Service Provider April 2013

Securing data centres: How we are positioned as your ISP provider to prevent online attacks.

Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience

Privacy- Preserving P2P Data Sharing with OneSwarm. Presented by. Adnan Malik

Multi-Datacenter Replication

ATLAS Internet Observatory Bandwidth Bandwagon: An ISOC Briefing Panel November 11, 2009, Hiroshima, Japan

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

Elevating Data Center Performance Management

PALANTIR CYBER An End-to-End Cyber Intelligence Platform for Analysis & Knowledge Management

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

The changing face of global data network traffic

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

QRadar SIEM and FireEye MPS Integration

Mobile Operator Big Data Analytics & Actions

BIG DATA What it is and how to use?

DATA ANALYSIS II. Matrix Algorithms

STATEMENT OF CRAIG LABOVITZ, PHD Co-Founder and CEO of DeepField Before the House Judiciary Committee Subcommittee on Regulatory Reform, Commercial

10 METRICS TO MONITOR IN THE LTE NETWORK. [ WhitePaper ]

Disaster Recovery Design Ehab Ashary University of Colorado at Colorado Springs

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Big Data in Enterprise challenges & opportunities. Yuanhao Sun 孙 元 浩 yuanhao.sun@intel.com Software and Service Group

2013 Measuring Broadband America February Report

Giving life to today s media distribution services

Joint ITU-T/IEEE Workshop on Next Generation Optical Access Systems. DBA & QoS on the PON - Commonalities with Switching & Routing

Craig Labovitz, Scott Iekel-Johnson, Danny McPherson Arbor Networks Jon Oberheide, Farnam Jahanian University of Michigan

Truffle Broadband Bonding Network Appliance

Image Analytics on Big Data In Motion Implementation of Image Analytics CCL in Apache Kafka and Storm

DOMINO Broadband Bonding Network

ENHANCED HYBRID FRAMEWORK OF RELIABILITY ANALYSIS FOR SAFETY CRITICAL NETWORK INFRASTRUCTURE

Protecting DNS Critical Infrastructure Solution Overview. Radware Attack Mitigation System (AMS) - Whitepaper

Internet Firewall CSIS Packet Filtering. Internet Firewall. Examples. Spring 2011 CSIS net15 1. Routers can implement packet filtering

McAfee Network Security Platform

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Timely Decisions with Big Data: Opportunities & Challenges

Glasnost or Tyranny? You Can Have Secure and Open Networks!

A Review on Zero Day Attack Safety Using Different Scenarios

Trends in Internet Traffic Patterns Darren Anstee, EMEA Solutions Architect

BITAG Publishes Report: Differentiated Treatment of Internet Traffic

BIG DATA IN BUSINESS ENVIRONMENT

How Network Operators Do Prepare for the Rise of the Machines

CHAPTER 6. VOICE COMMUNICATION OVER HYBRID MANETs

Big Graph Processing: Some Background

Data Mining for Data Cloud and Compute Cloud

Whitepaper. 10 Metrics to Monitor in the LTE Network. blog.sevone.com

How the Netflix ISP Speed Index Documents Netflix Congestion Problems

Virtual Leased Line (VLL) for Enterprise to Branch Office Communications

Peer-to-Peer Botnet Detection Using NetFlow Master Thesis

How To Create A Data Science System

DoS: Attack and Defense

DDoS Protection. How Cisco IT Protects Against Distributed Denial of Service Attacks. A Cisco on Cisco Case Study: Inside Cisco IT

ProtectWise: Shifting Network Security to the Cloud Date: March 2015 Author: Tony Palmer, Senior Lab Analyst and Aviv Kaufmann, Lab Analyst

MetroNet6 - Homeland Security IPv6 R&D over Wireless

Tim Blevins Execu;ve Director Labor and Revenue Solu;ons. FTA Technology Conference August 4th, 2015

PERFORMANCE OF MOBILE AD HOC NETWORKING ROUTING PROTOCOLS IN REALISTIC SCENARIOS

Security Solutions for the New Threads

An Elastic and Adaptive Anti-DDoS Architecture Based on Big Data Analysis and SDN for Operators

locuz.com Big Data Services

Take the Red Pill: Becoming One with Your Computing Environment using Security Intelligence

Mucho Big Data y La Seguridad para cuándo?

Akuda Labs. Leverages Peak Hosting s Operations-as-a-Service Managed Hosting Solution to Process Big Data Analytics 500 Faster without Big Costs

Detecting Network Anomalies. Anant Shah

ALTO and Content Delivery Networks dra7- penno- alto- cdn

A Mock RFI for a SD-WAN

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

The Sumo Logic Solution: Security and Compliance

VMDC 3.0 Design Overview

Practical Graph Mining with R. 5. Link Analysis

IBM: An Early Leader across the Big Data Security Analytics Continuum Date: June 2013 Author: Jon Oltsik, Senior Principal Analyst

Transcription:

Big Data Analy1cs Radu State

Short Bio Master of Science in Engineering, Johns Hopkins University, USA Senior Researcher with INRIA, France Professor, (University of Nancy 1) Research Scien1st, SnT/UL in Netlab research group 6/11/14 NETLAB Presenta1on 2

Outline What is Big Data Big Data Analysis in team Call Details record at a country level Analysis of Network traffic research done at SnT 6/11/14 NETLAB Presenta1on 3

Big Data at a glance 6/11/14 NETLAB Presenta1on 4

What are Big Data Architectures? 6/11/14 NETLAB Presenta1on 5

Phases in Big Data Processing 6/11/14 NETLAB Presenta1on 6

Map- Reduce holy grail in Big Data 6/11/14 NETLAB Presenta1on 7

Big Data is more n simple HPC 6/11/14 NETLAB Presenta1on 8

The current Eco- System 6/11/14 NETLAB Presenta1on 9

Anomaly Detec1on in large scale CDR records David Goergen, Radu State, Thomas Engel and Veena MENDIRATTA (Bell Labs, USA) One country : Ivory Cost Time Period: 01.12.2011 to 28.04.2012 5 million users 1124 base sta1ons (for mobile communica1ons) More n 3 billions entries summarizing on a hourly basis SMS and Voice Calls 50000 mobile users tracked over se months with GPS and call records 6/11/14 NETLAB Presenta1on 10

What happened in Ivory Cost in 2012? 6/11/14 NETLAB Presenta1on 11

The silent base sta1ons 6/11/14 NETLAB Presenta1on 12

Strange calling behaviors..at 2 AM. 6/11/14 NETLAB Presenta1on 13

Where is most variability in data? PCA analysis on dura1on pcacall Variances 0 20 40 60 80 100 120 140 6/11/14 NETLAB Presenta1on 14

Cloud and Service Management ALTO solves general rendezvous problem: Given a choice of resources, which one is best candidate? Normalized costs: Type: What does cost represent? ( Air-miles, hop count,...) 15 - Numerical (virtual coordinates) - Ordinal(position-based preferences) ALTO client 2 main abstractions: - Network Map Datacenter 2 - Cost Map Datacenter 1 Dynamic network information Provisioning policies ALTO service discovery Network specified in terms of Partition/Provider ID (PID): aggregation of endpoints identified by a providerdefined network location identifier. Routing Graphics sources: protocols http://pubs.vmware.com/vi301/intro/images/introduction_chapter.3.2.1.jpg 6/11/14 IIT RTC conference Datacenter 3

Op1miza1on of service provisioning (Cloud, CDNs) over large ISPs and networks FCC Dataset specifica1on (200 GB/year) FCC has embarked on a na1onwide performance study of residen1al wireline broadband service easurement points (unit_ids) hroughput (b) Latency Aim is to use raw datasets from this Cablevision study f or a nalysis a nd Figure 6: Sharpe ratios for upload throughput and latency to create ALTO Charter Comcast topology map and a cost map from this Cox dataset portfolio is modeled as specific service of interest Embarq MCI Using a c anonical M ap- Reduce ( Big D ata) eking PIDs where nodes have a high upload throughput or os for upload bandwidth merger completed, it took about 5 months for operating Mediacom computa1onal paradigm on a Hadoop TimeWarner b). figurenodes showshave of merged network to start showing a positive DsThe where low capacities latency. cluster Windstream Cablevision 0.000 0.498 0.498 0.498 0.498 0.498 0.498 0.498 0.498 Charter 0.799 0.000 0.799 0.799 0.799 0.799 0.799 0.799 0.799 Comcast 0.804 0.804 0.000 0.804 0.804 0.804 0.804 0.804 0.804 Cox 0.796 0.796 0.796 0.000 0.796 0.796 0.796 0.796 0.796 Embarq 0.377 0.377 0.377 0.377 0.000 0.377 0.377 0.377 0.377 MCI 0.700 0.700 0.700 0.700 0.700 0.000 0.700 0.700 0.700 Mediacom 0.316 0.316 0.316 0.316 0.316 0.316 0.000 0.316 0.316 TimeWarner 0.508 0.508 0.508 0.508 0.508 0.508 0.508 0.000 0.508 Windstream 0.782 0.782 0.782 0.782 0.782 0.782 0.782 0.782 0.000 Default 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Sharpe ratio (y-axis) growth again. The Sharpe ratio for foroura portfolio p, is computed by subtable III: Cost map for upload bandwidth (Cu ) Major o utcomes graphs (at y = 0) represents Calculating cost maps using Sharpe ratio (Equation acting risk-free rate of return (R f ) from rate of e investment no longer is 1) isfinancial modeling ith Sharpe ra1os to a set of all straightforward. Wewconsider P ID to be Cablevision Charter Comcast Cox Embarq MCI Mediacom TimeWarner Windstream Default e portfolio return itself (r and dividing by standard p ), ISPs build in third par1es ALTO maps that takes ompared in terms of absolute Table II. The upload throughput cost map, Cu, Cablevision 0.00 1.67 1.81 1.10 1.41 0.86 1.32 0.32 1.38 2.00 0.76 0.00 1.81 1.10 1.41 0.86 1.32 0.32 1.38 2.00 eviation of individual portfolio ( p ), as shown in l(1): both bandwidth nd atency nto account 2 Charter volution for ISPs. returns is calculated from asharpe ratio iusing Equations and 3 Comcast 0.76 1.67 0.00 1.10 1.41 0.86 1.32 0.32 1.38 2.00 s to explore here, which we below. Cox 0.76 1.67 1.81 0.00 1.41 0.86 1.32 0.32 1.38 2.00 r R Embarq 0.76 1.67 1.81 1.10 0.00 0.86 1.32 0.32 1.38 2.00 f w to create cost maps.sp = p (1) MCI 0.76 1.67 1.81 1.10 1.41 0.00 1.32 0.32 1.38 2.00 (a) Upload Throughput 12 (b) Latency Mediacom 0.76 1.67 1.81 1.10 1.41 0.86 0.00 0.32 1.38 2.00 p X any correlation between i i TimeWarner 0.76 1.67 1.81 1.10 1.41 0.86 1.32 0.00 1.38 2.00 8i 2 P ID, C = S W (2) i Figure ratios upload0.86 throughput u yhistorical among ISPs. However, Windstream 0.76 1.67 6: Sharpe 1.81 1.10 for1.41 1.32 and latency 0.32 0.00 2.00 averages alone might not be appropriate if m m=1 Ps deliver acceptable latency Table IV: Cost map for latency (Cl ) sociated standard deviations are high, because in such cases Relevant PHere, ublica1ons and submissions ent benchmark we establish. i C is upload throughput cost for each PID i, Figure 6 shows Sharpe ratios for upload bandwidth merger completed, it took about 5 months for operating e effective metric of interest is much u lower than historhroughput (Figure 6a), where (Figure 6a) and latency (Figure 6b). The figure shows I capacities of merged networkin tocstart showing a positive David G oergen, V eena B. M endirafa ( Bell L abs, U SA), R adu S tate T homas E ngel. den1fying abnormal pafern ellular which is calculated over summation of value of each alisps average. However, Sharpe ratio takes associatedi that deliver what we monthly evolution (x-axis) of Sharpe (y-axis) foreach our growth again. as a distance from default cost. The ratio latency cost for month s Sharpe ratio for PID i (S ) and multiplying by communica1on (IPTCOMM 2013) m i calculatedline in ainsimilar manner. 9 PID, ISPs.CThe graphs (at The y = computations 0) represents andard deviation in account, so it isflows a better candidate Calculating cost maps using Sharpe ratio (Equation peed. Second, a number of a weight l, ishorizontal associated with PID. To rewardofsharpe ratios Equations 2which and 3 result in: cost matrices shown in Aggrega1ng threshold atr risk-free investment no longer is, 1) is straightforward. We consider P ID to be a set of all costdavid Goergen Vijay K. Gurbani (Bell L abs, U SA), adu State Of m aps a nd c osts: large- scale broadband ent over year inaupload r approximating function. Higher Sharpe ratios are that are positive, we weigh sum by fraction ofiiimonths Tables and IVratios for Ccan Cl, respectively. In of absolute tables, ISPs in Table II. The upload throughput cost map, Cu, best option. These be compared in terms u and and Mediacom). This may be measurements or tmonths) he Applica1on Layer ratio Traffic but O p1miza1on ( ALTO) ppid rotocol. (RTC 2013) quivalent to high averages and small standard deviations. for first column source and remaining values, alsoi on temporal evolution for individual ISPs. (from last f12 that Sharpe PID isrepresents is calculated from Sharpe ratio using Equations 2 and 3 g and provisioning, although columns destination PID. Thus,toa explore host in PIDwhich Comcast There are some interesting trends here, we below. positive (i.e., it is above risk-free investment line). maller values for Sharpe ratios are due to eir high David Goergen, Vijay Gurbani (Bell Labs, USA), adu Sbefore tate or Thomas Engel. Making historical connec1ons: Building Applica1on will R always hosts discuss below prefer showing howintocomcast create first cost(cost: maps.0), ause for this. Alternatively, In ALTO, a lower value for a cost indicates a higher andard deviations, or to small averages. A negative Sharpe 12 followed by hosts in PID Mediacom (cost: 0.316) if it wants to (submifed to CNSM 2014) Layer Traffic Op1miza1on (ALTO) network and maps public broadband data X First,cost data doesfrom not show any correlation between end in Sharpe ratio for i preference forreturn traffic to be sentperform from a source to a destination. optimize upload bandwidth, or hosts in PID Time Warner 8i 2 P ID, Cui = Sm Wi (2) tio implies that risk-free rate of would upload throughput and latency among ISPs. However, mized by Time Warner and Therefore, actual cost for each PID is furr (cost: 0.32) if it wants to minimize latency. The costs serve m=1 calculated Figure 6b does show that most ISPs deliver acceptable latency etter portfolio being analyzed. ward than trend that starts around 6/11/14 to connect disconnected components of Figure according to previously risk-free investment benchmark we establish. as: Here, Cui is upload throughput cost for each PID i, 5; Figure 7 depicts links formed by peers in We create twountil costjuly maps, each cost map is specific to sustained uptick This is less case with upload throughput (Figure 6a), largest where which is calculated over summation of value of each PID is(comcast) to peers inamong or ISPs PIDs that based on what upload re a wider variation deliver we month s Sharpe ratio for PID i (S i ) and multiplying by ion of Insight Communica-Cost map, Cu is created for P2Pclass of applications. m bandwidth of Table III. The width of edges between 1 2consider i an cost 3 acceptable upload speed. Second, a number of a weight associated with PID. To reward Sharpe ratios 8i 2 P ID, Def ault_cost = dmax(c, C,..., C )e February 29, 2012 Insight u is a measure of preference; i.e., PIDs connected PIDs ass of applications that seeks out PIDs that host peers with u uisps (3) showi continuous improvement over year in upload that are positive, we weigh sum by fraction of months i est cable operator in US

Big Data and Security Research Questions: Securing Big Data architectures Which data is accessed and processed? What happens to outcome? How to protect data in motion? Big Data approaches for advanced security analytics and advanced Persistent Threat (APT) detection massive data collecting data and event tracking at large scale deeper analytics on unstructured data (honeypots, firewall, DNS); consolidated view of security-related information; real-time analysis of streaming AAA security data in order to profile human behavior in loop Predict attacker behavior on or targets J. François, S. Wang, R. State, and T. Engel, BotTrack: Tracking Botnets using NetFlow and PageRank, in IFIP/TC6 NETWORKING 2011, Springer, Ed., Valencia, Spain, May 2011 Wagner Cynthia, J. François, R. State, and T. Engel, Machine Learning Approach for IP-Flow Record Anomaly Detection, in IFIP/TC6 NETWORKING 2011, Springer, Ed., Valencia, Spain, May 2011. Hommes, Stefan State, Radu Zinnen, Andreas Engel, Thomas. Detection of Abnormal Behaviour in a Surveillance Environment Using Control Charts. IEEE International Conference on Advanced Video and Signal based Surveillance (AVSS) (2011), no. 8th, pp. 113-118 S. Wang, R. State, M. Ourdane T. Engel. Mining NetFlow Records for Critical Network Activities. AIMS 2010: 135-146 S. Wang, R. State, M. Ourdane T. Engel. FlowRank: ranking NetFlow records. IWCMC 2010: 484-488 6/11/14

Botnet tracking Jerome Francois, Radu State, Thomas Engel Botnets: Army of controlled compromised machines Powerful afack vector (spam, DDoS, espionage...) P2P Bots detec=on: Well interconnected machines to maintain underlying network Graph Analysis (PageRank/Google) Improvement by leveraging honeypots Source: hfp://www.csoonline.com/ar1cle/348317/what- a- botnet- looks- like, Scof Berinato 2012-05- 08 Network Security 18

Botnet tracking BotTrack / BotCloud (MapReduce version): Results: Stealthy botnet detec1on (1% of IP addresses) High accuracy ~ 99% Scalability (60,000 flows / second) Publica=ons: BotTrack: Tracking Botnets Using NetFlow and PageRank, François J., Wang S., State R., Thomas E., IFIP Networking 2011 BotCloud: Detec<ng Botnets Using MapReduce, François J., Wang S., Bronzi W., State R., Engel T., IEEE Interna1onal Workshop on Informa1on Forensics and Security - WIFS'11 2012-05- 08 Network Security 19

Financial Data Analysis Research Ques1ons: Model complex rela1onships for co- lending in EU banking zone Modeling of financial instruments Mining of highly unstructured data formats Integrate economical (NYSE), regulatory (SEC) and media news (chairman, board member informa1on obtained from social media Twifer, Google Trend) Assess and model risk based on graph modeling and Distributed computa1ons Analyze loan interests rates (Libor) with respect to addi1onal loan rates and economic indicators. Expected Outcomes 6/11/14 Link to local (Luxembourg) economy Impact to EU regulatory bodies Figure 1: Co-lending network for 2005. that are connected to ones it is directly connected with. Therefore, even if a bank has very few co-lending relationships itself, it may impact entire system if it is connected to a few major lenders. Since matrix L represents pairwise connectedness of all banks, we may write impact of bank i on system as following equation: x i = N j=1 Lijxj, i. This may be compactly represented as x = L x, wherex =[x 1,x 2,...,x N] R N 1 and L R N N. We pre-multiply left-hand-side of equation above by a scalar λ to get λ x = L x, i.e.,an eigensystem. The principal eigenvector in this system gives loadings of each bank on main eigenvalue and represents influence of each bank on lending network. This is known as centrality vector in sociology literature [5] and delivers a measure of systemic effect a single bank may have on lending system. Federal regulators may use centrality scores of all banks to rank banks in terms of ir risk contribution to entire system and determine best allocation of supervisory attention. The data we use comprises a sample of loans filings made by financial institutions with SEC. Our data covers a Development of such ac1vi1es at SnT 2006 2008 2007 2009 Figure 2: Co-lending networks for 2006 2009. No1ce: Figure from (1) sented in Figure 1 for 2005. We see that re are three large components of co-lenders, and three hub banks, with connections to large components. There are also satellite colenders. In order to determine which banks in network are most likely to contribute to systemic failure, we compute normalized eigenvalue centrality score described previously, and report this for top 25 banks. These are presented in Table 1. The three nodes with highest centrality are seen to be critical hubs in network se are J.P. Morgan (node 143), Bank of America (node 29), and Citigroup (node 47). They are bridges between all banks, and contribute highly to systemic risk. Figure 2 shows how network evolves in four years after 2005. Comparing 2006 with 2005 (Figure 1), we see that re still are disjointed large components connected by a few central nodes. From 2007 onwards, as financial crisis begins to take hold, co-lending activity diminished

Ques1ons? 21 6/11/14