Using NetFlow for Anomaly Detection in Operational Networks Maurizio Molina (DANTE) maurizio.molina@dante.net 1st COST TMA PhD Winter School, Torino, 10th Feb 2010
General Outline Introduction to IP flows IP flow monitoring systems and standards Understanding sampling An example deployment: GÉANT2 Using IP flows for network Anomaly Detection Preparing for the lab
Part 1/6 Introduction to IP flows What are they? How are they measured? What applications use these measurements? IP flow monitoring systems and standards Understanding Sampling An example deployment: GÉANT2 Using IP flows for network Anomaly Detection Preparing for the lab
IP flows IP Flows are groups of IP packets sharing a common characteristic, e.g. IP src/dst address src/dst ports Transport layer protocol Type Of Service (TOS) field Flows can be long lasting... or have a limited lifetime... and packets may belong to more than one flow
Measurement category IP flow monitoring is a single-point, passive network measurement Routers just observe and report info about transiting flows The receiver of IP flow info is called the Collector [Figure: Router exporting Netflow to a Collector]
IP flows measurement Reported flow information - what: src IP, dst IP, ports, Start time, End time, # packets, # bytes, other - when: at flow end, or periodically for long lasting flows [Figure: timeline of long lasting flows, flows with a limited lifetime, and packets belonging to more than one flow]
Information obtained by IP flow monitoring It's time and volume summary information: no inter-pkt arrival times, no single pkt sizes All you have is a labelled brick: Tstart, Tend, #packets, #bytes, average Bytes/Pkt, average Byte/s or Pkt/s, Src IP, dst IP, ports, Protocol [Figure: a flow drawn as a brick on a pkt size vs. time plot, labelled Volume and Duration]
So what can you do? Compose bricks having common features and build traffic profiles (Bytes/s over time, e.g. in 5 min. steps): overall, or selective (e.g. from Subnet X to Subnet Y, or from Server Z on port 80 to any address) Possibly, work with discrete time bins - flow contained in a single bin => no problems - flow spanning multiple bins => split bytes and packets linearly
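The linear split over discrete time bins can be sketched as follows. This is a minimal illustration, not the method of any specific tool: the function name `split_flow_into_bins`, the use of seconds for timestamps and the 300 s default bin are assumptions made for the example.

```python
def split_flow_into_bins(t_start, t_end, n_bytes, n_pkts, bin_size=300):
    """Split one flow's bytes/packets linearly over fixed-size time bins.

    Times are in seconds. Returns {bin_start: (bytes, packets)}.
    """
    duration = t_end - t_start
    first_bin = int(t_start // bin_size) * bin_size
    if duration <= 0:
        # Zero-duration flow (e.g. a single packet): everything in one bin.
        return {first_bin: (float(n_bytes), float(n_pkts))}
    shares = {}
    bin_start = first_bin
    while bin_start < t_end:
        # Part of the flow lifetime that falls inside this bin.
        overlap = min(t_end, bin_start + bin_size) - max(t_start, bin_start)
        fraction = overlap / duration
        shares[bin_start] = (n_bytes * fraction, n_pkts * fraction)
        bin_start += bin_size
    return shares


# A flow from t=250 s to t=400 s spans two 5 min. bins:
# one third of its volume lands in bin 0, two thirds in bin 300.
print(split_flow_into_bins(250, 400, 1500, 15))
```

Summing the per-bin shares over all flows that match a filter (e.g. Subnet X to Subnet Y) gives the traffic profile for that selection.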
Applications using IP flow info Network Planning Discovery of network usage and application patterns who talks to whom E.g. AS/AS matrixes what applications are used (if they can be recognised ) traditional IP flow monitoring just goes up to Layer 4 Traffic Engineering Billing / Accounting Anomaly Detection Security Network Operations
Applications using IP flow info (cont.) Application Time Granularity Space Granularity Traffic Engineering (minutes) Billing / Accounting (minutes-months) Network Planning (months) Security (minutes-days) Discovery of usage and application patterns (months)
Part 2/6 Introduction to IP flows IP flow monitoring systems and standards Architecture Standards and their evolution Understanding Sampling An example deployment: GÉANT2 Using IP flows for network Anomaly Detection Preparing for the lab
General architecture of an IP flow monitoring system Meter: filters packets, timestamps them and associates pkts to flow(s) Flow cache: creates/removes/updates flow records (Flow Key, Flow start time, Flow last update time, # Pkts, # Bytes, ...) Exporter: reads the flow cache, prepares and sends export packets (export header followed by flow info records) using Netflow v5/v8/v9 or IETF IPFIX Meter, flow cache and exporter are a router functionality or a dedicated probe Collector: receives export packets, interfaces to applications (database, analysis tools)
Cisco Netflow: origin and evolution 1996: initially designed at Cisco (Darren Kerr and Barry Bruins) as a switching path speedup Then realized that per-flow information also had other value http://www.cisco.com/en/us/docs/ios/solutions_docs/netflow/nfwhite.html v5: first widely implemented version Fixed export format, no aggregation: each flow is reported separately v7: Specific to 6500 and 7600 Switches v8: 11 possible aggregation schemes v9: flexible aggregation (template based). Chosen as baseline for IPFIX
Netflow: other Given names Juniper cflowd (v5, v8, v9) Huawei Netstream (v5, v8, v9) Avici Supports v5 and v9 Alcatel Supports v5 and v8
Netflow record content What info can Flow Records contain? Flow signature; Volume and Duration; Pkt treatment in Router What identifies a Flow Record? The key fields: Src IP, Dst IP, Src Port, Dst Port, Protocol (5-tuple, most common definition), plus Input If and TOS (7-tuple) Source: Cisco
Comments on Netflow record fields Start and end times are relative to the flow's first and last packet (not to the record's export time) TCP flags (S,F,A,P,U,R) are cumulative for the flow AS can be either src/dst or prev/next, not both! It's a configuration option It's obtained in the router via a routing lookup (it's not in the IP packets)! [Figure: chain of AS 101, AS 102, AS 103, AS 104, AS 105, AS 106 with a Netflow enabled router; Source: Cisco] If origin-as is configured, it will report: Src AS->101, Dst AS->106 If peer-as is configured, it will report: Src AS->103, Dst AS->105
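What "cumulative" TCP flags means can be illustrated with a small sketch: the record carries the bitwise OR of the flags byte of every packet seen for the flow. The helper names `cumulative_flags` and `flags_to_string` are hypothetical; the bit values are the standard TCP header flag bits.

```python
from functools import reduce
from operator import or_

# Standard TCP flag bits, lowest bit first.
TCP_FLAG_BITS = [("F", 0x01), ("S", 0x02), ("R", 0x04),
                 ("P", 0x08), ("A", 0x10), ("U", 0x20)]


def cumulative_flags(per_packet_flags):
    """OR together the flags byte of every packet of a flow."""
    return reduce(or_, per_packet_flags, 0)


def flags_to_string(flags):
    """Render a flags byte as the letters of the bits that are set."""
    return "".join(letter for letter, bit in TCP_FLAG_BITS if flags & bit)


# SYN, then ACK, then PSH+ACK, then FIN+ACK
flow_flags = cumulative_flags([0x02, 0x10, 0x18, 0x11])
print(flags_to_string(flow_flags))
```

Because the OR loses ordering and counts, a record with S, A and F set only says that each flag appeared at least once, not how many times or in which packet.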
Controlling the exporting Four conditions govern the expiration of flows from flow cache (and their exporting) Inactive timeout: if a flow has not been updated for more than IA_tout sec., export it Active timeout: if a flow was created more than A_tout sec. ago, export it End of flow detected: works for TCP only (FIN or RST Pkt) Internal flow cache management: if flow cache has more than X flows, or is more than Y% full, start exporting flows
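The four expiration conditions can be sketched as a single check run against each flow cache entry. This is an illustrative model only, not a router implementation: the `FlowEntry` class, the order of the checks, the 90% high-water mark and the default timeouts (15 s inactive, 30 min active) are all assumptions.

```python
from dataclasses import dataclass

FIN, RST = 0x01, 0x04
TCP = 6


@dataclass
class FlowEntry:
    first_seen: float      # time of first packet (s)
    last_seen: float       # time of last update (s)
    tcp_flags: int = 0     # cumulative OR of TCP flags
    protocol: int = TCP


def expiry_reason(flow, now, cache_used, cache_size,
                  inactive_timeout=15.0, active_timeout=1800.0,
                  high_water=0.9):
    """Return why a flow should be exported now, or None to keep it."""
    if flow.protocol == TCP and flow.tcp_flags & (FIN | RST):
        return "end-of-flow"        # works for TCP only
    if now - flow.last_seen > inactive_timeout:
        return "inactive"           # not updated for too long
    if now - flow.first_seen > active_timeout:
        return "active"             # long lasting flow, report periodically
    if cache_used / cache_size > high_water:
        return "cache-pressure"     # internal cache management
    return None


entry = FlowEntry(first_seen=0.0, last_seen=100.0, tcp_flags=0x10)
print(expiry_reason(entry, now=120.0, cache_used=10, cache_size=100))
```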
Controlling the exporting (cont.) Inactive Timeout: if too small, will split the same flow: flows with low pkt rate R are more at risk, when 1/(R·S) is comparable to or larger than IA_tout (S: sampling rate) if too high, too many flows in cache: N = (service time) / (flow interarrival time), where the service time is (flow duration + IA_tout) and is dominated by IA_tout Typical values of IA_tout: 10s - 60s [Figure: pkt size vs. time for one flow, with marks where info for this flow is exported]
Controlling the exporting (cont.) Active Timeout: will periodically report info about the same flow If too small: more burden to collect/process the info; not clear any more what a flow is If too high: collectors working on discrete time slots need to go back in time; will break some implementations Typical values of A_tout: 5min - 30min [Figure: pkt size vs. time for one flow, with marks where info for this flow is exported]
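The two inactive timeout trade-offs can be put into numbers with a short sketch. The function names are hypothetical, and the split-risk criterion used here (mean gap between sampled packets at least as large as IA_tout) is an assumed reading of the slide's condition.

```python
def sampled_packet_gap(rate_pps, sampling_rate):
    """Mean time between sampled packets of a flow: 1/(R*S) seconds."""
    return 1.0 / (rate_pps * sampling_rate)


def at_risk_of_split(rate_pps, sampling_rate, ia_tout):
    """True if the mean sampled gap reaches the inactive timeout."""
    return sampled_packet_gap(rate_pps, sampling_rate) >= ia_tout


def expected_cache_entries(flow_interarrival_s, flow_duration_s, ia_tout):
    """N = service time / flow interarrival time,
    with service time = flow duration + IA_tout."""
    return (flow_duration_s + ia_tout) / flow_interarrival_s


# A 5 pkt/s flow sampled at 1/100 shows one packet every 20 s on average,
# so a 15 s inactive timeout is likely to split it.
print(sampled_packet_gap(5, 0.01), at_risk_of_split(5, 0.01, 15))
# One new flow per ms, 5 s mean duration, 15 s timeout.
print(expected_cache_entries(0.001, 5.0, 15.0))
```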
Common configuration commands
Cisco (CLI)
    ip flow-export version <version> [origin-as | peer-as | bgp-nexthop]
    ip flow-export destination <address> <port>
    ip flow-cache timeout inactive <seconds>
    ip flow-cache timeout active <minutes>
    ip flow-cache entries <number>
Juniper (conf-file)
    cflowd collector-host-address {
        autonomous-system-type (origin | peer);
        port port-number;
        version version-number;
        (local-dump | no-local-dump);
    }
Visualizing the configuration and flow cache on routers Cisco show ip cache [verbose] flow Will show flow cache configuration and statistics, and flow details show ip flow export Will show exporting process statistics Juniper show configuration forwarding-options sampling Will show flow collection configuration monitor start sampled Equivalent of unix tail -f command on a file where the flow records are dumped (not advised to create this file in production, because of additional load on Routing Engine)
Netflow v5 Most commonly deployed version Fixed format Flow records exported in UDP packets 30 flow records in a 1500 bytes pkt
Content     Bytes   Description
srcaddr     0-3     Source IP address
dstaddr     4-7     Destination IP address
nexthop     8-11    Next hop router's IP address
input       12-13   Ingress interface SNMP ifindex
output      14-15   Egress interface SNMP ifindex
dpkts       16-19   Packets in the flow
doctets     20-23   Octets (bytes) in the flow
first       24-27   SysUptime at start of the flow
last        28-31   SysUptime at the time the last packet of the flow was received
srcport     32-33   Layer 4 source port number or equivalent
dstport     34-35   Layer 4 destination port number or equivalent
pad1        36      Unused (zero) byte
tcp_flags   37      Cumulative OR of TCP flags
prot        38      Layer 4 protocol (e.g. 6=TCP, 17=UDP)
tos         39      IP type-of-service byte
src_as      40-41   Autonomous system number of the source, either origin or peer
dst_as      42-43   Autonomous system number of the destination, either origin or peer
src_mask    44      Source address prefix mask bits
dst_mask    45      Destination address prefix mask bits
pad2        46-47   Unused (zero) bytes
Source: Cisco
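Because the v5 record layout is fixed, a single 48-byte record can be decoded with one `struct` format string. The sketch below follows the byte offsets of the v5 record table; the names `V5_RECORD`, `V5_FIELDS` and `parse_v5_record` are chosen for this example, and the 24-byte export packet header is not handled.

```python
import socket
import struct

# Network byte order; field sizes follow the v5 record layout (48 bytes).
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")
V5_FIELDS = ("srcaddr", "dstaddr", "nexthop", "input", "output",
             "dpkts", "doctets", "first", "last", "srcport", "dstport",
             "pad1", "tcp_flags", "prot", "tos", "src_as", "dst_as",
             "src_mask", "dst_mask", "pad2")


def parse_v5_record(data):
    """Decode one 48-byte NetFlow v5 flow record into a dict."""
    if len(data) < V5_RECORD.size:
        raise ValueError("need at least 48 bytes for a v5 record")
    record = dict(zip(V5_FIELDS, V5_RECORD.unpack(data[:V5_RECORD.size])))
    for key in ("srcaddr", "dstaddr", "nexthop"):
        record[key] = socket.inet_ntoa(record[key])  # bytes -> dotted quad
    return record


# Build a synthetic record (made-up values) and decode it again.
raw = V5_RECORD.pack(socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"),
                     socket.inet_aton("10.0.0.254"), 1, 2, 10, 1500,
                     1000, 2000, 1234, 80, 0, 0x1B, 6, 0,
                     64500, 64501, 24, 24, 0)
print(parse_v5_record(raw))
```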
Netflow v7 and v8 v7 Specific to 6500 and 7600 Switches Similar to v5, but without AS, Interface, TCP flag and ToS info v8 Goal: reduce exported information, and primary flow cache size, with aggregation 11 aggregation schemes : AS, Destination-Prefix, Prefix, Protocol-Port, Source Prefix, AS-ToS, Destination-Prefix-ToS, Prefix-ToS, Protocol-Port-ToS, Source Prefix-ToS, Prefix-Port Source: Cisco
Netflow v9 Previous versions all have a fixed export format To overcome the fixed format, one could always export type, length, value: a lot of overhead! Or separate type and length from value: Templates specify the type and length of the carried info, and just the data is exported in Data Flow Sets Each Data Flow Set is preceded by an identifier pointing to the template needed to decode it If templates are lost, data flow sets cannot be decoded! v9 can run over multiple transports (not just UDP) Source: Cisco
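The separation of (type, length) from value can be illustrated with a simplified decoder. This is a sketch of the idea only, not a full v9 parser: packet and flow set headers are ignored, the `templates` cache and function names are hypothetical, and only a handful of v9 field types are named.

```python
import socket

# A few NetFlow v9 field type numbers.
FIELD_NAMES = {1: "IN_BYTES", 2: "IN_PKTS", 4: "PROTOCOL",
               7: "L4_SRC_PORT", 8: "IPV4_SRC_ADDR",
               11: "L4_DST_PORT", 12: "IPV4_DST_ADDR"}

templates = {}  # template id -> list of (field type, length in bytes)


def decode_record(template, payload):
    """Decode one data record using a list of (type, length) pairs."""
    record, offset = {}, 0
    for field_type, length in template:
        raw = payload[offset:offset + length]
        offset += length
        name = FIELD_NAMES.get(field_type, "type_%d" % field_type)
        if field_type in (8, 12):
            record[name] = socket.inet_ntoa(raw)
        else:
            record[name] = int.from_bytes(raw, "big")
    return record


def decode_with_cache(template_id, payload):
    """Return the decoded record, or None if the template is unknown."""
    template = templates.get(template_id)
    if template is None:
        return None  # template lost or not yet received: cannot decode
    return decode_record(template, payload)


templates[256] = [(8, 4), (12, 4), (7, 2), (11, 2), (4, 1), (2, 4), (1, 4)]
payload = (socket.inet_aton("192.0.2.1") + socket.inet_aton("198.51.100.7")
           + (1234).to_bytes(2, "big") + (80).to_bytes(2, "big") + bytes([6])
           + (10).to_bytes(4, "big") + (1500).to_bytes(4, "big"))
print(decode_with_cache(256, payload))
print(decode_with_cache(257, payload))  # unknown template
```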
IPFIX IETF standard, chartered in 2002 to Find or develop a basic common IP Traffic Flow measurement technology to be available on (almost) all future routers Netflow v9 selected as a baseline for IPFIX, but without backward compatibility constraints Cisco is the driving force behind IPFIX, but other vendors (NEC, Hitachi) are active (or observing) Status: main RFCs approved http://www.ietf.org/dyn/wg/charter/ipfix-charter.html Read RFC 5470 (architecture) first and then RFC 5101 (protocol) But not widely implemented and used yet
IPFIX: what's new Formal definition of a large number of information elements to carry the elementary information: a big extension of the v5 table shown before E.g. absolute and delta counters, timestamps with [s], [ms], [µs], [ns] resolution Possibility to extend it and to define enterprise specific information elements Options templates and options flow records can be used to export configuration information about the metering process
IPFIX: what's new (cont.) IPFIX can use Stream Control Transmission Protocol (SCTP, RFCs 2960, 3309, 3758), TCP or UDP as transport protocols Debate in the IETF, because UDP is not congestion aware, TCP is heavy for line cards and exposes to Head of Line blocking, SCTP is new and not widely implemented PR-SCTP is the preferred transport because it is congestion aware but with a simpler state machine than TCP An SCTP association can contain multiple streams. At minimum, an IPFIX implementation MUST have two associations, one for data and one for templates Reliable transport for templates, partly reliable (e.g. limited no. of retransmissions) for data
IPFIX: what's new (cont.) Simple devices can still use UDP as a transport But templates must then be periodically refreshed Security: If TCP is the transport, use TLS If UDP or SCTP, use DTLS But mature implementations of DTLS over SCTP are missing, therefore either use TLS over TCP or use DTLS but without reliability Always use mutual X.509 certificate based authentication
Part 3/6 Introduction to IP flows IP flow monitoring systems and standards Understanding Sampling An example deployment: GÉANT2 Using IP flows for network Anomaly Detection Preparing for the lab
Sampling Most routers do deterministic 1:N sampling As long as there are a lot of flows, this is similar to randomly sampling the packets of every single flow, with probability S=1/N Sampling is independent of pkt size
Re-normalization Packets: multiply by N It's an un-biased estimator Bytes: multiply by N It's correct as long as the sampled packet population well represents the bytes/pkt distribution Flows: multiplying by N is wrong! No easy and universal formula (afaik) N. Hohn, D. Veitch - Inverting sampled traffic, IEEE/ACM Transactions on Networking, Vol. 14, No. 1 (Feb. 2006) N.G. Duffield - Sampling for Passive Internet Measurement: A Review, Statistical Science, Vol. 19, No. 3 http://www2.research.att.com/~duffield/papers/sts102.pdf
Practical sampling questions I counted h sampled packets for a flow, with a sampling rate of S; I estimate H=h/S packets in unsampled flow. How precise is this? => estimation of pkts in a flow problem A lot of (similar) flows have only one or few sampled packets. What was their original size? How many flows were missed? => estimation of no. of flows problem
Estimation of no. of packets in a flow The issue is to control the precision of the re-normalization
S = sampling rate (e.g. 1/1000)
H = true number of packets in a flow
h = sampled packets of a flow
N = true overall number of packets
n = number of overall sampled packets
Ĥ = h/S: estimated number of packets in a flow
p = H/N: true proportion of pkts of a flow in overall pkts
p' = h/n: estimated proportion of pkts of a flow in overall pkts
Result: $\hat{H} - v < H < \hat{H} + v$, where $v = \frac{z_{1-\alpha/2}}{S}\sqrt{h\,(1-p')}$
p is unknown => Worst case assumption: p' = 0
Source: T. Zseby - Deployment of Sampling Methods for SLA Validation with Non-Intrusive Measurements [2002]
Estimation of no. of packets in a flow (cont.) Absolute error: grows with Ĥ S=1/100: at least 400 sampled packets for a rel_error<10% In a 5 min bin, only flows with real rate>133pk/s fulfil this So, beware of estimates for low rate flows Consider them as qualitative Relative error: decreases with Ĥ
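The error bound can be evaluated numerically with a short sketch, assuming the half-width of the interval is v = (z/S)·sqrt(h·(1−p')) with the worst case p' = 0. The function names are hypothetical. With z = 2 (roughly a 95% confidence level) the relative error is 2/sqrt(h), which reproduces the 400 sampled packets needed for a 10% relative error.

```python
from math import ceil, sqrt


def estimate_packets(h, sampling_rate, p_est=0.0, z=1.96):
    """Return (H_hat, v): the re-normalized packet count and the
    half-width of its confidence interval."""
    h_hat = h / sampling_rate
    v = z / sampling_rate * sqrt(h * (1.0 - p_est))
    return h_hat, v


def min_sampled_packets(max_rel_error, z=1.96):
    """Smallest h such that v / H_hat <= max_rel_error (worst case p'=0)."""
    return ceil((z / max_rel_error) ** 2)


h_hat, v = estimate_packets(400, 1 / 100, z=2.0)
print(h_hat, v, v / h_hat)            # relative error close to 10%
print(min_sampled_packets(0.10, z=2.0))
# Real rate needed to collect that many samples in a 5 min. bin at S=1/100:
print(400 / (1 / 100) / 300, "pkt/s")
```

The absolute error v grows with the estimate, while the relative error v/Ĥ shrinks as more packets are sampled.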
Estimation of no. of flows General problem very difficult Practical problem of interest: how many (short) flows F were there, if I sampled f flows? Assuming all flows of interest have the same size N, I first need $P_{FS}$ = P(Flow Sampled | N) = $1-(1-S)^N$ Then I could use $P_{FS}$ for estimating $F = f / P_{FS}$ and adapt the formulas shown before for estimating the error But $P_{FS}$ depends on N, which is unknown!
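Under the equal-size assumption, the flow count estimate amounts to scaling the number of sampled flows f by 1/P_FS. A minimal sketch, with hypothetical function names, and with N treated as known purely for illustration:

```python
def p_flow_sampled(n_pkts, sampling_rate):
    """Probability that a flow of n_pkts packets has at least one
    sampled packet: 1 - (1 - S)^N."""
    return 1.0 - (1.0 - sampling_rate) ** n_pkts


def estimate_flow_count(f_sampled, n_pkts, sampling_rate):
    """Estimate the original number of flows as F = f / P_FS."""
    return f_sampled / p_flow_sampled(n_pkts, sampling_rate)


# With S = 1/100, a 10-packet flow is seen less than 10% of the time,
# so 956 observed flows correspond to roughly 10,000 original ones.
print(p_flow_sampled(10, 0.01))
print(estimate_flow_count(956, 10, 0.01))
```

The estimate is very sensitive to the assumed N: for N = 2 the same 956 sampled flows would correspond to about 48,000 original flows.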
Estimation of no. of flows (cont.) I could estimate N from the first points (k=1,2,3...) of the empirical distribution of the f sampled flow sizes:
$P(k \mid k>0, N) = \frac{P(k \mid N)}{P(k>0 \mid N)} = \frac{\frac{N!}{k!\,(N-k)!}\, S^k (1-S)^{N-k}}{1-(1-S)^N}$
E.g. S=1/100:
k    N=2      N=3      N=4      N=10
1    99.50%   99.00%   98.50%   95.54%
2    0.50%    1.00%    1.49%    4.34%
3    0.00%    0.00%    0.01%    0.12%
Tricky, isn't it? Values are so close
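The values in the S=1/100 table follow from a binomial distribution of sampled packets, conditioned on the flow being seen at all (k > 0). The sketch below recomputes them; the function name is hypothetical.

```python
from math import comb


def p_k_given_seen(k, n_pkts, s):
    """P(k sampled packets | k > 0) for a flow of n_pkts packets,
    each packet sampled independently with probability s."""
    p_k = comb(n_pkts, k) * s ** k * (1 - s) ** (n_pkts - k)
    p_seen = 1 - (1 - s) ** n_pkts
    return p_k / p_seen


for k in (1, 2, 3):
    row = ["%6.2f%%" % (100 * p_k_given_seen(k, n, 0.01)) for n in (2, 3, 4, 10)]
    print(k, " ".join(row))
```

The distributions for N=2, 3 and 4 differ by about one percentage point at k=1, which is why recovering N from a noisy empirical histogram is hard.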
Part 4/6 Introduction to IP flows IP flow monitoring systems and standards Understanding Sampling An example deployment: GÉANT The Network NetFlow collection setup The traffic view given by NetFlow (and routing) data Using IP flows for network Anomaly Detection Preparing for the lab
GÉANT Operates a Transit Network serving European NRENs Publicly funded (EU, Governments) Existing since the early '90s Evolution of RARE, COSINE, EuropaNET, TEN-34, TEN-155 History of research networking: http://www.dante.net/server/show/conwebdoc.341/viewpage/2 18 POPs in Europe - 10 Gbps Links almost everywhere Already multiple 10G in parallel in some locations Trialling 40 Gbit/s Both unusual research traffic & commodity traffic Commodity traffic via two transit providers (Telia & GBLX) Intercontinental Peerings with other research networks (Abilene, Canarie, ESNET, SINET, etc.)
GÉANT Peerings view
Netflow collection in GÉANT In GÉANT2, we collect Netflow v5 at every peering point with an external Autonomous System We use 1/100 sampling Overall GEANT traffic is 30-50 Gbit/s This produces, with 1/100 pkt sampling, 12-20 K sampled flows/s and an overall Netflow traffic to the collector of 10-20 Mbit/s Several Gbytes/day of disk space are needed to store flow records
NetFlow processing tools in GÉANT Two Netflow processing tools currently in use in GEANT NfSen (by Peter Haag) http://nfsen.sourceforge.net/ Open source Processes NetFlow only NetReflex (by Guavus Inc.) Commercial Selected after trial with two other tools Processes NetFlow, BGP and IS-IS (or OSPF) There are a lot of other Netflow processing tools: http://www.switch.ch/tf-tant/floma/software.html
NetReflex: a network wide traffic view Suppose there's a communication (flow) from Athens to London, taking a certain path in GÉANT
NetReflex: a network wide traffic view (cont.) With NetFlow only, I just know the entry point of that traffic (Athens)
NetReflex: a network wide traffic view (cont.) With NetFlow + BGP, I know the entry and exit points (Athens, London)
NetReflex: a network wide traffic view (cont.) With NetFlow + BGP + IS-IS, I know the entry and exit points, and the path in the network (Athens, Vienna, Milan, Geneva, Paris, London)
Clicking on a link you get the traffic contributed on that link by every Node to Node pair in the network NetReflex: detailed link level traffic view
NetReflex: traffic matrix view Node to Node traffic matrix Ordered by decreasing or increasing traffic Clickable time series
Part 5/6 Introduction to IP flows IP flow monitoring systems and standards Understanding Sampling An example deployment: GÉANT2 Using IP flows for network Anomaly Detection Preparing for the lab
Anomaly Detection What is an Anomaly? Unusual peak of traffic? Traffic drop? Day not abiding by the normal weekly pattern? Traffic focused on certain parts of the networks/hosts/ports? New application generating different traffic patterns? Here comes the tool choice!
A Visual example (NfSen graph) Overall traffic entering GÉANT
A Visual example (NfSen graph) - cont. Zooming on 1 router and on UDP traffic only..
The essence of Network AD In the previous example there was A daily and weekly cycle A peak, evident on disaggregated traffic but similar to a statistical variation on the overall traffic In essence, a Network AD approach must differentiate what is normal and what is not, at some aggregation level, trading off: Detection probability (true positives) False positives Scalability
NetReflex approach to AD In the GEANT deployment, the NetReflex aggregation level is the Router-Router pair (POP-POP pair): an 18x18 matrix Abnormality detection (what is normal, what's not) is based on Principal Component Analysis Metrics used are not just volume variations but also entropy variations of traffic features Multiple 18x18 matrices are used in parallel
Traffic feature entropy Measures the concentration or dispersion of the distribution of a traffic feature Four features are of particular interest pkts per src IP in 5 minutes pkts per dst IP in 5 minutes pkts per src port in 5 minutes pkts per dst port in 5 minutes
Traffic feature entropy variation during an Anomaly [Figure: two histograms of the fraction of total flows received per IP address (y axis, 0 to 0.25) vs. IP (ranked) (x axis, 1 to 26); panels: "Normal" and "Traffic more focused towards a few hosts"] The Entropy H is: $H(X) = -\sum_{i=1}^{N} \frac{n_i}{S} \log_2 \frac{n_i}{S}$, where $n_i$ is the count observed for the i-th value of the feature and $S = \sum_i n_i$ is the total count (here S is not the sampling rate) H varies between 0 ("one point takes all") and $\log_2 N$ (uniform distribution) But... can it work in practice?
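The entropy of a traffic feature can be computed directly from per-value counts (e.g. packets per dst IP in a 5 minute bin). A minimal sketch with a hypothetical function name and made-up counts, showing the two extremes and a concentrated case:

```python
from math import log2


def feature_entropy(counts):
    """Entropy (in bits) of a feature distribution.

    counts maps each observed value (e.g. a dst IP) to its count n_i.
    """
    total = sum(counts.values())
    return sum(-(n / total) * log2(n / total)
               for n in counts.values() if n > 0)


uniform = {"10.0.0.%d" % i: 100 for i in range(1, 5)}    # 4 equal hosts
one_takes_all = {"10.0.0.1": 400}
focused = {"10.0.0.1": 370, "10.0.0.2": 10, "10.0.0.3": 10, "10.0.0.4": 10}

print(feature_entropy(uniform))         # log2(4) = 2 bits, dispersed
print(feature_entropy(one_takes_all))   # 0 bits, fully concentrated
print(feature_entropy(focused))         # in between, closer to 0
```

A sudden drop of the dst IP entropy together with a rise of the dst port entropy is the kind of signature a port scan of a single target would leave.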
IP Features entropies proof of concept on GÉANT Network (2007) 10 days of GÉANT traffic TCP features entropies UDP features entropies IP feature entropies (after linear filtering)
Drilling down on a peak - Concentration of SRC and DST IPs and SRC ports - Dispersion of DST ports Portscan from 4 hosts, 29 bytes packets, one target -The bounce is just the result of a linear filter
End of theory part Thanks for your attention!
Part 6/6 Introduction to IP flows IP flow monitoring systems and standards Understanding Sampling An example deployment: GÉANT2 Using IP flows for network Anomaly Detection Preparing for the lab
Goal of the lab Let you drill down on anomalies detected by NetReflex in the GÉANT network Understand if they're true or false (there ARE still some false positives) Do you agree with the classification? Can you get more information than what is shown by default? Imagine you are a security engineer in a CERT and must collect intelligence to share it with the CERT of the attacking or victim network
NetReflex views to be used Use both the anomaly analysis and the search view of NetReflex The search view (aka "Query Engine") has data for the last 15 days only (i.e. from the 27th of Jan)
Analysing an anomaly - steps 0) access NetReflex Split in groups of 3 per PC Open Firefox or I.E. Go to URL shown on paper copy you received Login to one of the logins provided using credentials you received
Analysing an anomaly - steps 1) Familiarize with the anomaly analysis view Pie chart on the right refers to all days of the bar chart Change slider to 15 days back and verify Click on a single day bar Pie chart will restrict to that day Click again to return back to multiple days Click on a single type of anomalies Tab in the bottom will restrict to those Navigate to prev.-next. Tab page Note: heavy UDP/TCP transfers, traffic drop, unknown are not security-related Click on columns Ordering changes but only within displayed page!
Analysing an anomaly steps (cont.) 1) Familiarize (cont.) Check the other Pie chart tabs (ingress, egress, path) Click on a single portion of the pie Check how the tab in the bottom varies AS view: same story but less relevant for this lab. NetReflex detects on a POP by POP basis 2) Focus on a single anomaly (row of tab) Before double clicking: are there other anomalies close in time of the same type? Point to multipoint (or multipoint to point) anomalies may be flagged on several POP-POP pairs
Analysing an anomaly - steps (cont.) 2) Focus (cont.) Now double click on a row (e.g. 47390, 28 Jan.) Details tab: Duration is often not precise (e.g. here it is visually several hours, not 5 minutes!) A POP-POP time series will open: normally an anomaly appears as a peak either in the bytes, packets or flows view, but rarely in all of them Try to switch to the other views (pk/bytes/flows) and filters (UDP, TCP, SSH, RPC, SMB, SQL...) Note: flows are NOT re-normalized, packets and bytes are
Analysing an anomaly - steps (cont.) 2) Focus (cont.) Now click on the blue anomaly dot (or anywhere else on the time series) Heavy hitter tab opens Top 5 src/dst IP and ports for a 5 min. interval Packets only => renormalized value in the 5 min. bin Move to prev/next 5 minutes, or open another heavy hitter tab How do the dominant dst IPs and dst ports change within and outside the anomaly? IPs in the heavy hitter tab are clickable Whois query => try it!
Analysing an anomaly - steps (cont.) 3) Familiarize with the Query engine Select 28 Jan. and a random 5 min. slot between 3.00 am and 5.45 am Select the src IP of anomaly 47390 as src IP Change page size to 10,000 Hit search Quite a few hosts contacted, aren't there? Why? Add tcp flags to the selected columns Remove other columns if needed for a more compact view Change query to dst IP, with the same address IPs are responding: it looks like more than a scan Brute force pwd guessing attempt?
Analysing an anomaly - steps 4) Now your turn: choose anomalies in the last 15 days Make your own choice or pick some from the suggested list Refer to the next slides for tips on how to investigate anomalies Pls restrict QE searches to 15 min. maximum Double check date and hour before you hit submit If you want to query longer periods, pls ask! Be patient and wait for the query to end: do not login-logout Also be patient when time series are loading Imagine you are a security engineer in a CERT and must collect intelligence to share it with the CERT of the attacking or victim network Write down additional things you understood during the analysis 5) Have fun! And remember you signed an NDA ;-)
Analysis tips
Network Scans Scanner Targets Typical: keep a /16 or /24 fixed and vary lsb Single (or few) dst port Same packet structure Size, flags Can also be connection attempts Entry-exit points depend on target variation NetReflex may signal anomalies on multiple POP- POP pairs
Network Scans - analysis Keep the initial filter simple (e.g. src_ip == scanner address ). Do not filter on the destination port (even if the anomaly detection tool has identified one), or on the protocol (TCP or UDP or ICMP). Look at TCP flags (for TCP flows). Try to reverse the search, (e.g. dst_ip == scanner address ) to see if there is any return traffic to the scanner, if it is from the same (scanned) networks evidenced with the forward filter, and what the TCP flags for this return traffic are. Look at the number of packets and their average size
Port Scans Scanner Typical: Single target, scanned on multiple (all) ports Entry-exit points fixed Target
Port Scans - analysis Most obvious filter is src_ip == scanner address && dst_ip == scanned address Look for return traffic, if any
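The difference between a network scan (many targets, one or few dst ports) and a port scan (one target, many dst ports) can be checked on exported flow records with a simple count of distinct destinations. The sketch below is illustrative only: flow records are assumed to be dicts with `src_ip`, `dst_ip` and `dst_port` keys, and the threshold `many=50` is an arbitrary assumption, not a NetReflex parameter.

```python
def scan_profile(flows, scanner_ip, many=50):
    """Summarise what a suspected scanner contacted."""
    dst_ips, dst_ports = set(), set()
    for flow in flows:
        # Simple initial filter: src_ip == scanner address, nothing else.
        if flow["src_ip"] == scanner_ip:
            dst_ips.add(flow["dst_ip"])
            dst_ports.add(flow["dst_port"])
    if len(dst_ips) >= many and len(dst_ports) < many:
        kind = "network scan"    # targets vary, port(s) fixed
    elif len(dst_ips) < many and len(dst_ports) >= many:
        kind = "port scan"       # target fixed, ports vary
    else:
        kind = "unclear"
    return {"dst_ips": len(dst_ips), "dst_ports": len(dst_ports), "kind": kind}


# Synthetic example: one host probing port 22 on 200 addresses.
flows = [{"src_ip": "192.0.2.9", "dst_ip": "198.51.100.%d" % i, "dst_port": 22}
         for i in range(1, 201)]
print(scan_profile(flows, "192.0.2.9"))
```

Running the same count with the roles reversed (dst_ip == scanner address) is one way to look for return traffic.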
(D)DoS Syn floods Victim Sources Sources frequently spoofed Even if not spoofed, sources do not send the final ACK of the threeway TCP handshake on purpose Entry-exit points depend on sources variation NetReflex may signal anomalies on multiple POP- POP pairs
(D)DoS syn floods - analysis Set up the filter dst_ip == target address, and check if the majority of the flows are single packet flows with the SYN flag set Check if/how the target replied: No reply at all (SYN floods have been filtered by some intelligent firewall). Reset or ACK/Reset: the target doesn't have a service running on the dst port Some SYN/ACK replies, but in a much lower number than the incoming floods: this may indicate some SYN flood rate limiting firewall in front of the target, or that the target is saturated and can only sustain that reply rate (in that case the attack would have been successful)
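Both SYN flood checks (share of single-packet SYN-only flows towards the target, and the kind of replies coming back) can be sketched over flow records. The record layout (dicts with `src_ip`, `dst_ip`, `prot`, `packets`, `tcp_flags`) and the function names are assumptions for this example. With 1/100 packet sampling many legitimate flows also show up as single-packet records, so the fraction is only an indication.

```python
SYN, RST, ACK = 0x02, 0x04, 0x10
TCP = 6


def syn_only_fraction(flows, target):
    """Fraction of TCP flows to the target that are one packet, SYN only."""
    to_target = [f for f in flows if f["dst_ip"] == target and f["prot"] == TCP]
    if not to_target:
        return 0.0
    hits = sum(1 for f in to_target
               if f["packets"] == 1 and f["tcp_flags"] == SYN)
    return hits / len(to_target)


def reply_profile(flows, target):
    """Count the target's TCP replies by type."""
    profile = {"syn_ack": 0, "reset": 0, "other": 0}
    for f in flows:
        if f["src_ip"] != target or f["prot"] != TCP:
            continue
        flags = f["tcp_flags"]
        if flags & RST:
            profile["reset"] += 1        # no service on the dst port
        elif flags & SYN and flags & ACK:
            profile["syn_ack"] += 1      # target (still) answering
        else:
            profile["other"] += 1
    return profile


victim = "198.51.100.10"
flows = [{"src_ip": "203.0.113.%d" % i, "dst_ip": victim, "prot": TCP,
          "packets": 1, "tcp_flags": SYN} for i in range(1, 91)]
flows += [{"src_ip": victim, "dst_ip": "203.0.113.%d" % i, "prot": TCP,
           "packets": 1, "tcp_flags": SYN | ACK} for i in range(1, 6)]
print(syn_only_fraction(flows, victim), reply_profile(flows, victim))
```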
(D)DoS UDP floods Victim Sources Concentrated or distributed Sometimes target well known ports used by TCP services (ssh, http) To bypass poorly configured firewalls DNS uses UDP on port 53 Some false positives But also true positives, likely DNS cache poisoning attempts
(D)DoS UDP floods - analysis Look at average packet size (malicious floods frequently use small packets). Check simultaneous presence of TCP traffic May be an indication of a false positive: tcp control of udp transfers Presence of UDP return traffic May be an indication of a false positive But if the port is 53... Watch out! Use of well-known ports for bandwidth tests: 5001, 14233 Use of well-known TCP ports (22, 80, 443...) and presence of ICMP return traffic ICMP in response to UDP may indicate an unwilling target But often ICMP is filtered, so you won't always see that! Context information (DNS or whois queries on the end points) Do the end points look like research?