Network monitoring in high-speed links Algorithms and challenges Pere Barlet-Ros Advanced Broadband Communications Center (CCABA) Universitat Politècnica de Catalunya (UPC) http://monitoring.ccaba.upc.edu pbarlet@ac.upc.edu NGI course, 30 Nov. 2010 1 / 36
Outline 1 Network monitoring 2 Addressing the technological barriers 3 Use cases 4 Addressing the social barriers 2 / 36
Outline 1 Network monitoring Introduction Active monitoring Passive monitoring Technological and social barriers 2 Addressing the technological barriers 3 Use cases 4 Addressing the social barriers 3 / 36
Introduction to network monitoring Process of measuring network systems and traffic Routers, switches, servers,... Network traffic volume, type, topology,... Monitoring is crucial for network operation and management Traffic engineering, capacity planning, BW management Fault diagnosis, troubleshooting, performance evaluation Accounting, billing, security,... Network measurements are important for networking research Design and evaluation of protocols, applications,... Traffic modeling and characterization Network monitoring is very challenging and datasets are scarce From both a technological and a social standpoint 4 / 36
Classification Classification of network monitoring tools and methods Hardware vs. software Online vs. offline LAN vs. WAN Protocol level Active vs. passive 5 / 36
Active monitoring Active tools are based on traffic injection Probe traffic generated by a measurement device Response to probe traffic is measured Pros: Flexibility Devices can be deployed at the edge (e.g., end-hosts) No instrumentation at the core is needed Measurement does not directly rely on existing traffic Cons: Intrusiveness Probe traffic can degrade network performance Probe traffic can impact the measurement itself Main usages Performance evaluation (e.g., ping) Bandwidth estimation (e.g., pathload) Topology discovery (e.g., traceroute) 6 / 36
Passive monitoring Traffic collection from inside the network Routers and switches (e.g., Cisco NetFlow) Passive devices (e.g., libpcap, DAG cards, optical taps) Pros: Transparency Network performance is not affected No additional traffic is injected Useful even with a single measurement point Cons: Complexity Requires administrative access to network devices Requires explicit presence of traffic under study Online collection and analysis is hard (e.g., sampling) Privacy concerns Multiple and diverse usages Traffic analysis and classification,... Anomaly and intrusion detection,... 7 / 36
Technological and social barriers Datasets and platforms for research purposes are limited Outdated datasets Academic traffic only Anonymized and without payloads Technological barriers Internet was designed without monitoring in mind Collection of Gb/s and storage of TB/day Link speeds increase at a faster pace than processing speeds Building monitoring apps is error-prone and time-consuming Social barriers Lack of coordination between projects ISPs have no incentive to share information Privacy and competition concerns Monitoring hardware is expensive and difficult to manage 8 / 36
Outline 1 Network monitoring 2 Addressing the technological barriers Bloom filters Bitmap algorithms Direct bitmaps and variants Bitmaps over sliding windows 3 Use cases 4 Addressing the social barriers 9 / 36
Technological challenges Few ns per packet Interarrival times: 8 ns (40 Gb/s), 32 ns (10 Gb/s) Memory access times: < 10 ns (SRAM), tens of ns (DRAM) Obtaining simple metrics becomes extremely challenging Approaches based on hash tables do not scale Core of most monitoring algorithms E.g., active flows, flow size distribution, heavy hitter detection, delay, entropy, sophisticated sampling,... Probabilistic approach: trade accuracy for speed Extremely efficient compared to computing the exact answer Fit in SRAM, 1 access/pkt Probabilistic guarantees (bounded error) 10 / 36
Bloom filters 1 Space-efficient data structure to test set membership Based on hashing (e.g., pseudo-random hash functions) Examples of usage in network monitoring Replace hash tables to check if a flow has already been seen Definition of flow is flexible Advantages Small memory (SRAM) is needed compared to hash tables Limitations False positives are possible Removals are not possible (counting variants can support them) 1 B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7), 1970. 11 / 36
Bloom filters Parameters k: #hash functions m: size of the bitmap p: false positive rate n: #elements in the filter (max) Figure: Example of a Bloom filter 2 2 A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4), 2005. 12 / 36
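To make the data structure concrete, a minimal Python sketch of a Bloom filter follows, using the slide's parameters (k hash functions over an m-bit array). The double-hashing construction and the flow-key format are illustrative choices, not something prescribed by the slides.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m, k):
        self.m = m                    # number of bit positions
        self.k = k                    # number of hash functions
        self.bits = bytearray(m)      # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k pseudo-random positions from two hash values (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # No false negatives; false positives are possible.
        return all(self.bits[pos] for pos in self._positions(item))

# Usage: check whether a flow (5-tuple) has already been seen.
seen = BloomFilter(m=8 * 1024, k=4)
flow = "10.0.0.1,10.0.0.2,1234,80,TCP"
if flow not in seen:
    seen.add(flow)    # first packet of this flow
```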
Direct bitmaps (linear counting) 3 Space-efficient algorithms to count the number of unique items E.g., useful to count the number of flows over a fixed time interval Basic idea Each flow hashes to one position (and all its packets) Counting the number of 1s is inaccurate due to collisions Count the number of unset positions instead E.g., 20KB to count 1M flows with 1% error Estimate formulae A flow hashes to a given bit: p = 1/b No flow hashes to a given bit: p_z = (1 − p)^n ≈ (1/e)^(n/b) Expected number of unset bits: E[z] = b·p_z ≈ b·(1/e)^(n/b) Estimated number of flows: n̂ = b·ln(b/z) 3 K.-Y. Whang et al. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2), 1990. 13 / 36
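A minimal sketch of a direct bitmap that applies the estimate formula above; the CRC32 hash and the flow-key encoding are illustrative assumptions.

```python
import math
import zlib

class DirectBitmap:
    """Linear counting sketch: estimate the number of distinct flows from the
    number of positions that remain unset in a b-bit bitmap."""

    def __init__(self, b):
        self.b = b
        self.bits = bytearray(b)      # one byte per bit, for simplicity

    def update(self, flow_id):
        # All packets of a flow hash to the same position.
        self.bits[zlib.crc32(flow_id.encode()) % self.b] = 1

    def estimate(self):
        z = self.bits.count(0)        # number of unset positions
        if z == 0:
            return float("inf")       # bitmap saturated: b was dimensioned too small
        return self.b * math.log(self.b / z)   # n_hat = b * ln(b / z)

# Usage: one update per packet, keyed by the flow's 5-tuple.
dbm = DirectBitmap(b=20 * 1024)
for pkt_flow in ["10.0.0.1,10.0.0.2,1234,80,TCP"] * 3 + ["10.0.0.3,10.0.0.2,4321,443,TCP"]:
    dbm.update(pkt_flow)
print(dbm.estimate())   # close to 2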
Bitmap variants 4 Direct bitmaps scale linearly with the number of flows Variants: Virtual, multiresolution, adaptive, triggered bitmaps,... 4 C. Estan, G. Varghese, M. Fisk. Bitmap algorithms for counting active flows on high speed links. IEEE/ACM Trans. Netw. 14(5), 2006. 14 / 36
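As a flavor of these variants, a small sketch of the virtual-bitmap idea follows: flows are hashed over a large virtual space, only a slice of it is stored physically, and the linear-counting estimate for that slice is scaled back up. The class name, parameters and hash choice are illustrative, not the authors' implementation.

```python
import math
import zlib

class VirtualBitmap:
    """Virtual bitmap sketch: hash flows over a virtual space of v = b * ratio
    positions, but store bits only for the first b of them; the linear-counting
    estimate of the stored slice is then scaled up by `ratio`."""

    def __init__(self, b, ratio):
        self.b = b                    # physical bits actually stored
        self.ratio = ratio            # integer ratio between virtual and physical size
        self.v = b * ratio            # virtual bitmap size
        self.bits = bytearray(b)

    def update(self, flow_id):
        pos = zlib.crc32(flow_id.encode()) % self.v
        if pos < self.b:              # only flows falling in the stored slice are recorded
            self.bits[pos] = 1

    def estimate(self):
        z = self.bits.count(0)        # unset positions in the physical slice
        if z == 0:
            return float("inf")       # slice saturated: b (or ratio) too small
        return self.ratio * self.b * math.log(self.b / z)
```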
Bitmaps over sliding windows Timestamp Vector (TSV) 5 Vector of timestamps (instead of bits) O(n) query cost Countdown Vector (CDV) 6 Vector of small timeout counters (instead of full timestamps) Independent query and update processes O(1) query cost 5 H. Kim, D. O'Hallaron. Counting network flows in real time. In Proc. of IEEE Globecom, Dec. 2003. 6 J. Sanjuàs-Cuxart et al. Counting flows over sliding windows in high speed networks. In Proc. of IFIP/TC6 Networking, May 2009. 15 / 36
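A minimal sketch of the TSV idea, assuming one timestamp per position and a full scan of the vector at query time. The CDV, which replaces the timestamps with small countdown counters decremented by an independent process, is not shown here.

```python
import math
import time
import zlib

class TimestampVector:
    """TSV sketch: like a direct bitmap, but each position keeps the timestamp
    of the last packet that hashed to it, so the flow count can be estimated
    over any sliding window of w seconds (at O(b) cost per query)."""

    def __init__(self, b):
        self.b = b
        self.ts = [0.0] * b           # 0.0 means "never touched"

    def update(self, flow_id, now=None):
        pos = zlib.crc32(flow_id.encode()) % self.b
        self.ts[pos] = time.time() if now is None else now

    def estimate(self, w, now=None):
        now = time.time() if now is None else now
        # A position counts as unset if no packet hashed to it within the window.
        z = sum(1 for t in self.ts if now - t > w)
        if z == 0:
            return float("inf")
        return self.b * math.log(self.b / z)   # linear-counting estimate over the window
```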
CDV and TSV performance 30-min trace, 271 Mbps, 1 query/s, 50K flows/10 s to 1.8M flows/min 7 Figures: relative error (average and 95th percentile, for windows w = 10 s, 300 s and 600 s) vs. counter initialization value; memory usage (bytes) and memory accesses/s vs. window size for TSV, TSV (extension) and CDV 7 A hash table would require several MBs with these settings 16 / 36
Outline 1 Network monitoring 2 Addressing the technological barriers 3 Use cases Load shedding Lossy difference aggregator 4 Addressing the social barriers 17 / 36
Overload problem Previous solutions focus on a particular metric They are not valid for any (arbitrary) monitoring application Monitoring systems are prone to dramatic overload situations Link speeds, anomalous traffic, bursty nature of traffic... Complexity of traffic analysis methods Overload situations lead to uncontrolled packet loss Severe and unpredictable impact on the accuracy of applications... when results are most valuable!! Load Shedding Scheme Efficiently handle extreme overload situations Over-provisioning is not feasible 18 / 36
Case study: Intel CoMo CoMo (Continuous Monitoring) 8 Open-source passive monitoring system Framework to develop and execute network monitoring applications Open (shared) network monitoring platform Traffic queries are defined as plug-in modules written in C Contain complex computations Traffic queries are black boxes Arbitrary computations and data structures Load shedding cannot use knowledge of the queries 8 http://como.sourceforge.net 19 / 36
Load shedding approach 9 Working scenario Monitoring system supporting multiple arbitrary queries Single resource: CPU cycles Approach: Real-time modeling of the queries' CPU usage 1 Find correlation between traffic features and CPU usage Features are query agnostic with deterministic worst case cost 2 Exploit the correlation to predict CPU load 3 Use the prediction to decide the sampling rate 9 P. Barlet-Ros, G. Iannaccone, J. Sanjuàs-Cuxart, and J. Solé-Pareta. IEEE/ACM Transactions on Networking, 2010. 20 / 36
System overview Figure: Prediction and Load Shedding Subsystem 21 / 36
Traffic features vs CPU usage Figure: CPU usage compared to the number of packets, bytes and 5-tuple flows (time series over 100 seconds) 22 / 36
Traffic features vs CPU usage Figure: CPU usage versus the number of packets per batch, for different ranges of new 5-tuple flows (new_5tuple_hashes < 500, 500-700, 700-1000, >= 1000) 23 / 36
Prediction methodology 10 Multiple Linear Regression (MLR) Y_i = β_0 + β_1·X_1i + β_2·X_2i + ... + β_p·X_pi + ε_i, i = 1, 2,..., n Y_i = n observations of the response variable (measured cycles) X_ji = n observations of the p predictors (traffic features) β_j = p regression coefficients (unknown parameters to estimate) ε_i = n residuals (OLS minimizes the SSE) Feature selection Variant of the Fast Correlation-Based Filter (FCBF) Removes irrelevant and redundant predictors Significantly reduces the cost and improves the accuracy of the MLR 10 P. Barlet-Ros et al. Load Shedding in Network Monitoring Applications, In Proc. of USENIX Annual Technical Conf., 2007. 24 / 36
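A minimal numpy sketch of the MLR fit and prediction step. It uses plain ordinary least squares on a toy feature matrix; the FCBF feature selection and the online re-fitting used by the real system are omitted, and the example numbers are illustrative.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares fit of y = beta_0 + X @ beta + residuals.
    X: (n, p) matrix of traffic features, y: (n,) vector of measured CPU cycles."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes the SSE
    return beta                                     # [beta_0, beta_1, ..., beta_p]

def predict_cycles(beta, x):
    """Predicted CPU cycles for one batch, given its feature vector x."""
    return beta[0] + float(np.dot(beta[1:], x))

# Toy data: 200 batches with 3 features (e.g., packets, bytes/100, new 5-tuple flows).
rng = np.random.default_rng(0)
X = rng.uniform(0, 3000, size=(200, 3))
y = 5e4 + X @ np.array([300.0, 20.0, 900.0]) + rng.normal(0, 1e4, size=200)
beta = fit_mlr(X, y)
print(predict_cycles(beta, np.array([2500, 1500, 800])))
```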
Load shedding scheme 11 Prediction and Load Shedding subsystem 1 Each 100ms of traffic is grouped into a batch of packets 2 The traffic features are efficiently extracted from the batch (multi-resolution bitmaps) 3 The most relevant features are selected (using FCBF) to be used by the MLR 4 MLR predicts the CPU cycles required by each query to run 5 Load shedding is performed to discard a portion of the batch 6 CPU usage is measured (using TSC) and fed back to the prediction system 11 P. Barlet-Ros et al. Robust network monitoring in the presence of non-cooperative traffic queries. Computer Networks, 53(3), 2009. 25 / 36
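A simplified, self-contained sketch of steps 4 to 6 for a single query: pick a sampling rate from the predicted cost, run the query on the sampled batch and measure its actual cost. The batch format, the stand-in query and the use of perf_counter_ns in place of the TSC are all illustrative assumptions.

```python
import random
import time

def extract_features(batch):
    # Query-agnostic features per batch: packets, bytes, distinct 5-tuple flows
    # (these would feed the MLR model of the previous sketch).
    return [len(batch),
            sum(p["len"] for p in batch),
            len({p["flow"] for p in batch})]

def run_query(packets):
    # Stand-in for an arbitrary plug-in query; returns its measured cost
    # (wall-clock ns here, where the real system reads the TSC).
    t0 = time.perf_counter_ns()
    _ = {p["flow"] for p in packets}
    return time.perf_counter_ns() - t0

def shed_and_run(batch, predicted_cost, budget):
    """Pick a packet sampling rate so the predicted cost fits the CPU budget,
    run the query on the sampled batch, and return (rate, measured cost)."""
    srate = min(1.0, budget / predicted_cost) if predicted_cost > 0 else 1.0
    sampled = [p for p in batch if random.random() < srate]
    used = run_query(sampled)
    return srate, used   # used / srate would be fed back to the predictor

# Toy batch of 1000 packets from 50 flows; the budget is half the predicted cost.
batch = [{"len": 60 + (i % 1400), "flow": f"flow{i % 50}"} for i in range(1000)]
print(extract_features(batch))
print(shed_and_run(batch, predicted_cost=2.0e6, budget=1.0e6))   # sampling rate 0.5
```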
Results: Load shedding performance Figure: Stacked CPU usage with Predictive Load Shedding, from 09 am to 05 pm (CoMo cycles, load shedding cycles, query cycles, predicted cycles and the CPU limit) 26 / 36
Results: Load shedding performance Figure: CDF of the CPU usage per batch (Predictive, Original and Reactive), compared to the CPU limit per batch 27 / 36
Results: Packet loss Figure: Link load and packet drops from 09 am to 05 pm: (a) Original CoMo, (b) Reactive Load Shedding, (c) Predictive Load Shedding (total traffic, DAG drops and unsampled packets) 28 / 36
Results: Error of the queries
Query                  original          reactive          predictive
application (pkts)     55.38% ±11.80     10.61% ±7.78      1.03% ±0.65
application (bytes)    55.39% ±11.80     11.90% ±8.22      1.17% ±0.76
counter (pkts)         55.03% ±11.45     9.71%  ±8.41      0.54% ±0.50
counter (bytes)        55.06% ±11.45     10.24% ±8.39      0.66% ±0.60
flows                  38.48% ±902.13    12.46% ±7.28      2.88% ±3.34
high-watermark         8.68%  ±8.13      8.94%  ±9.46      2.19% ±2.30
top-k destinations     21.63  ±31.94     41.86  ±44.64     1.41  ±3.32
Figure: Error (%) of the Original, Reactive and Predictive schemes
29 / 36
One-way delay measurement Traditional approaches are expensive Probing traffic (intrusive) Trajectory sampling Constant overhead per sample Alternative: LDA (Lossy Difference Aggregator) 12 Send only sums of timestamps Deal with packet loss (sampling + partitioning the input stream) 12 R. Kompella et al. Every microsecond counts: tracking fine-grain latencies with a lossy difference aggregator. SIGCOMM, 2009. 30 / 36
Lossy Difference Aggregator (LDA) Worked example (delay = 1): host a sends packets at times 1, 2,..., 5; they cross the network and arrive at host b at times 2, 3,..., 6
Naive approach: average the per-packet differences, ((2-1) + (3-2) + ...)/n = 1, which requires sending n timestamps
The average of the differences equals the difference of the sums divided by n: ((2+3+...) - (1+2+...))/n = (t_b - t_a)/n = 1, so each endpoint only keeps a <timestamp sum, packet count> pair: a keeps <15, 5>, b keeps <20, 5>, and the result is (20 - 15)/5 = 1
Packet loss breaks this: if a keeps <15, 5> but b only sees <17, 4>, the packet counts mismatch and the sums can no longer be compared, so the LDA must protect against packet loss
Solution: hash packets into several buckets, each with its own <sum, count> pair; e.g., a holds <1,1>, <2,1>, <7,2>, <5,1> (total <15, 5>) and b holds <2,1>, <3,1>, <9,2>, <6,1> (total <20, 5>), again giving (20 - 15)/5 = 1
Under loss, only the buckets whose packet counts match on both sides are kept: if the packet of the second bucket is lost (b sees <0, 0>), the usable totals become <13, 4> and <17, 4>, and the estimate is still (17 - 13)/4 = 1
To limit how many buckets get invalidated by loss, packets can additionally be sampled (consistently at both endpoints) before being aggregated into the buckets
31 / 36
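A toy sketch of the structure described above: one bank of <timestamp sum, packet count> buckets per endpoint, an optional consistent sampling decision, and a combine step that uses only buckets whose counts match. The hashing details and pkt_id (standing in for a packet digest that is invariant across the path) are illustrative assumptions.

```python
import zlib

class LDA:
    """One bank of a Lossy Difference Aggregator: b buckets of
    (timestamp_sum, packet_count), with optional packet sampling."""

    def __init__(self, b, sample_prob=1.0, seed=0):
        self.b = b
        self.p = sample_prob
        self.seed = seed
        self.sums = [0.0] * b
        self.counts = [0] * b

    def _bucket(self, pkt_id):
        h = zlib.crc32(f"{self.seed}:{pkt_id}".encode())
        if (h & 0xffff) / 65536.0 >= self.p:   # hash-based sampling, identical at both ends
            return None                        # packet not sampled
        return h % self.b

    def add(self, pkt_id, timestamp):
        i = self._bucket(pkt_id)
        if i is not None:
            self.sums[i] += timestamp
            self.counts[i] += 1

def average_delay(sender, receiver):
    """Combine the two banks: use only buckets whose packet counts match,
    i.e., buckets that did not lose (or gain) any packet."""
    num = den = 0.0
    for i in range(sender.b):
        if sender.counts[i] == receiver.counts[i] and sender.counts[i] > 0:
            num += receiver.sums[i] - sender.sums[i]
            den += sender.counts[i]
    return num / den if den else None

# The slide's example: 5 packets sent at t = 1..5 and received at t = 2..6.
a, b = LDA(b=4), LDA(b=4)
for t in range(1, 6):
    a.add(pkt_id=t, timestamp=t)
    b.add(pkt_id=t, timestamp=t + 1)
print(average_delay(a, b))   # -> 1.0
```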
Dimensioning the packet sampling rate 13 The packet sampling rate presents a classical tradeoff: the higher it is, the more packets the LDA aggregates, but the higher the chance of counter invalidation We wish to maximize the expected effective sample size E[S] = (1 - r)·p·n·e^(-n·r·p/b) (p: sampling rate, r: loss rate, n: packets, b: counters per bank), which is maximized with p = b/(n·r) The original analysis obtained p ≈ 0.5·b/(n·r); our analysis doubles the sampling rate and improves the sample size by about 20% Best algorithm in terms of network overhead for loss rates up to 25% 13 J. Sanjuàs-Cuxart et al. Validation and improvement of the Lossy Difference Aggregator to measure packet delays. TMA, 2010. 32 / 36
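A direct transcription of the formula above into code; the function names and example numbers are illustrative.

```python
import math

def effective_sample_size(p, n, r, b):
    """Expected effective sample size E[S] = (1 - r) * p * n * exp(-n*r*p/b)."""
    return (1 - r) * p * n * math.exp(-n * r * p / b)

def optimal_sampling_rate(n, r, b):
    """Sampling rate that maximizes E[S]: p = b / (n * r), capped at 1."""
    return min(1.0, b / (n * r))

# Example: 1M packets, 10% loss, 1024 counters per bank.
n, r, b = 1_000_000, 0.1, 1024
p_new = optimal_sampling_rate(n, r, b)     # b / (n * r)
p_old = 0.5 * b / (n * r)                  # rate from the original analysis
gain = effective_sample_size(p_new, n, r, b) / effective_sample_size(p_old, n, r, b)
print(p_new, p_old, gain)                  # gain ~ 1.21, i.e. about 20% more samples
```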
Outline 1 Network monitoring 2 Addressing the technological barriers 3 Use cases 4 Addressing the social barriers CoMo-UPC initiative 33 / 36
The problem Several researchers working on: Anomaly Detection (AD) Traffic Classification (TC) Real packet traces are needed to test novel AD and TC methods Several AD and TC algorithms require: Unanonymized IP addresses (or prefix-preserving) Payload inspection (e.g., IDS, DPI,... ) Traditional solution: Anonymized traffic traces Examples: NLANR, CAIDA, CRAWDAD,... Anonymization is not the right solution! Data owners: Privacy concerns Researchers: Not enough data Lack of recent publicly available traces 34 / 36
CoMo-UPC 14 Move the code to the data Instead of publishing anonymized data traces Significantly lowers the privacy concerns Traffic data do not leave the provider's premises Data providers keep the ownership of the data The source code can be inspected by the data owner Researchers have (blind) access to unanonymized traffic IP addresses and payloads can be processed... but not stored or exported The CoMo system is based on this model Implement AD and TC methods as CoMo modules 14 http://monitoring.ccaba.upc.edu/como-upc 35 / 36
UPC traffic UPC access link 1 GE full-duplex (~900 Mbps, 60K flows/s) Connects 40 departments and 25 faculties (10 campuses) 12 traffic traces are also available 36 / 36