Optimal NetFlow Deployment in IP Networks




Hui Zang (1) and Antonio Nucci (2)

(1) Sprint Advanced Technology Laboratories, 1 Adrian Court, Burlingame, CA, USA, hzang@sprintlabs.com
(2) Narus, Inc., 5 Logue Avenue, Mountain View, CA, USA, anucci@narus.com

Abstract. This paper investigates the problem of deploying NetFlow with optimized coverage and cost in an IP network. Deploying a network-wide monitoring infrastructure in operational networks is necessary for practical reasons, and Cisco NetFlow is a promising solution. However, several cost factors are associated with enabling NetFlow given the current conditions in such a network. We argue that enabling NetFlow to cover a major portion of the traffic, instead of all of it, achieves significant cost savings while still giving operators sufficient monitoring capability. We therefore aim to solve the Optimal NetFlow Location Problem (ONLP) for a given coverage ratio. We analyze the various cost factors of enabling NetFlow in such a network and model the problem as an Integer Linear Program (ILP). Although we are able to obtain optimal solutions for Sprint's North America network by solving the ILP, we also develop two heuristic algorithms, Max-Plus (MP) and Least-Minus (LM), to cope with larger problems, given the NP-hard nature of ONLP. The performance of the ILP and the heuristics is demonstrated by numerical results; the LM heuristic achieves sub-optimal solutions within 1-2% of the optimal solutions in a mixed router environment. We observe that a 55% cost saving can be achieved by covering 95% instead of 100% of the network traffic. The problem and the proposed methodology can be generalized to the optimal deployment of new services and features in any type of network.

Keywords: NetFlow, Integer Linear Programming, Optimal Placement

1 Introduction

Operating a large IP network without detailed, network-wide knowledge of the traffic demands is challenging.
An accurate view of the traffic demands is crucial for a number of important tasks, such as failure diagnosis, capacity planning and forecasting, routing and load-balancing policy optimization, and attack identification. Operators now recognize that network monitoring and traffic measurement are a necessity, and Cisco's NetFlow [1] has emerged as a viable solution to this problem.

Note: This work was done while Antonio Nucci was at Sprint Advanced Technology Laboratories.

NetFlow has received attention from both industry and academic researchers. For example, NetFlow data has been used to examine the accuracy of traffic matrix estimation techniques [2]. Prior work on NetFlow has focused on performance issues: reference [3] compares NetFlow to SNMP and packet-level data collection, while [4] proposes new sampling techniques that can be used by NetFlow. In this paper, we study the issues that arise in the deployment of NetFlow.

NetFlow is a set of features available on Cisco routers and other switching devices that provide network operators with access to IP flow information from their data networks [1]. The NetFlow infrastructure consists of two main components: NetFlow Data Export (NDE) and NetFlow Collector (NFC). The NDE is a module configured on routers that captures each IP flow traversing a router (footnote 3). When a timer expires or the NetFlow cache becomes full, IP flow statistics, such as the number of IP flows, the number of packets and bytes associated with each flow, source/destination AS numbers, and source/destination prefix masks, are exported to an NFC as UDP packets.

IP networks usually contain a large diversity of routers, and not all interfaces on all routers can run NetFlow. Although NetFlow can be configured on a per-interface basis, the capability to support NetFlow is determined by the linecard and the router. There are three types of linecards in terms of NetFlow support: 1) linecards that support NetFlow in most traffic conditions, 2) linecards that do not support NetFlow, and 3) linecards that support NetFlow only in certain (light) traffic conditions. Care must be taken with type 3) linecards, since turning on NetFlow could degrade the linecard's packet-forwarding performance (causing losses and large latency) or generate inaccurate flow statistics. Linecards of types 2) and 3) can usually be upgraded to a newer configuration that supports NetFlow.
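The export behaviour described above (flows identified by seven key fields, statistics exported when a timer expires or the cache fills) can be illustrated with a small Python sketch. This is a toy model only: the names `FlowKey`, `FlowCache`, `max_entries` and `timeout` are hypothetical, the cache is flushed wholesale rather than per flow, and nothing here reflects Cisco's actual implementation or export format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FlowKey:
    """Seven fields identifying an IP flow (field names are illustrative)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    tos: int
    input_ifindex: int


class FlowCache:
    """Toy flow cache: flushes all records when full or when a timer expires."""

    def __init__(self, max_entries: int, timeout: float) -> None:
        self.max_entries = max_entries   # assumed cache capacity
        self.timeout = timeout           # assumed export timer, in seconds
        self.last_export = 0.0
        self.stats: Dict[FlowKey, List[int]] = {}  # key -> [packets, bytes]

    def export(self, now: float) -> List[Tuple[FlowKey, int, int]]:
        """Return all cached records and empty the cache."""
        records = [(k, p, b) for k, (p, b) in self.stats.items()]
        self.stats.clear()
        self.last_export = now
        return records

    def observe(self, key: FlowKey, size: int, now: float):
        """Account for one packet; return any records exported as a result."""
        exported = []
        if now - self.last_export >= self.timeout:      # timer expired
            exported = self.export(now)
        if key not in self.stats and len(self.stats) >= self.max_entries:
            exported += self.export(now)                # cache is full
        entry = self.stats.setdefault(key, [0, 0])
        entry[0] += 1
        entry[1] += size
        return exported
```

In a real deployment the returned records would be packed into UDP export packets and sent to an NFC; the sketch stops at producing the records.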
Enabling NetFlow at specific router interfaces is not enough. The IP flow statistics exported by the NDE modules at each router must be collected by NFCs, and operators process the data stored in the NFCs to gather the information they need (footnote 4). Two issues arise when NFCs are considered. First, only a limited number of routers can be served by the same NFC. Second, carriers prefer to collocate NFCs with the NDEs that they serve, to avoid flooding large amounts of information over long-haul IP links. Therefore, in order to enable NetFlow and use the data properly, operators need to identify: 1) a proper configuration for each router enabled to support NetFlow (NDE); and 2) a proper location for each NetFlow Collector (NFC).

The goal of this paper is to provide a methodology and a set of recommendations for optimizing the NetFlow deployment process. More precisely, we are interested in identifying which routers, and which linecards on those routers, should be NetFlow-enabled so as to cover a major portion of the network traffic while minimizing the total capital investment required. We refer to the problem of covering a given fraction of traffic on the selected routers while minimizing the total cost as the Optimal NetFlow Location Problem (ONLP). The solution to this problem assists an operator in two situations: i) for an operator who has decided to deploy NetFlow, it identifies the routers on which to enable NetFlow at the lowest capital investment; ii) for an operator who has not yet decided to deploy NetFlow, it yields a partial NetFlow deployment that achieves the best coverage for a limited investment, so that the operator can examine the functions and benefits of NetFlow.

Footnote 3: An IP flow is identified by the combination of seven fields: source and destination IP addresses, source and destination port numbers, IP protocol type, ToS byte, and input logical interface (ifindex).
Footnote 4: NetFlow Data Analyzer (NDA) is a NetFlow-specific traffic analysis tool that enables users/operators to retrieve, display, and analyze NetFlow data collected from several NFC modules.

SPRINT ATL RESEARCH REPORT RR5-ATL-61624 - JUNE 25

We formulate ONLP as an Integer Linear Program (ILP) and also propose two efficient heuristic algorithms to solve it. We present results from both solving the ILP and applying the heuristics, and show that substantial cost benefits can be achieved by carefully choosing the locations of NetFlow deployment. We target NetFlow and IP networks in this paper to demonstrate how location optimization for a given network functionality should be pursued; the methodology remains general and can also be applied to other services and features in other types of networks.

The rest of the paper is organized as follows. In Section 2, we formally state ONLP and propose an Integer Linear Programming model that can be solved for the optimal solutions of ONLP. In Section 3, we introduce two efficient heuristic algorithms to solve ONLP for larger networks, for which the optimal solutions are too expensive to compute. Numerical results are presented in Section 4, demonstrating the trade-off between traffic coverage and required investment and the performance achieved by both the ILP and the heuristics. Section 5 provides recommendations on a NetFlow deployment strategy with the best coverage-cost trade-off and concludes the paper.

2 Optimal NetFlow Location Problem: Problem Statement and ILP Formulation

We consider a network with a set of routers and a set of interfaces on these routers. Our goal is to monitor a portion of the traffic switched by these interfaces by enabling NetFlow on the linecards that these interfaces reside on.
The solution to the NetFlow location problem can then be applied to a set of interfaces in a given network that switch a particular traffic type independently, i.e., for any two identified routers $R_1$ and $R_2$, there is no flow $f \in T$ that is switched by both $R_1$ and $R_2$ in the same ingress/egress direction. We can formally state the Optimal NetFlow Location Problem (ONLP) as follows.

Given:
- The routers in a network $R = \{R_1, R_2, \ldots, R_N\}$, and for each router $R_i \in R$, a set of linecards $I^{R_i} = \{I^{R_i}_1, I^{R_i}_2, \ldots, I^{R_i}_{S_i}\}$.
- A set of PoPs $P = \{1, 2, \ldots, L\}$, and for each PoP $i$, the set of associated routers $P_i \subseteq R$, with $P_i \cap P_j = \emptyset$ for all $i, j$ with $i \neq j$, and $\bigcup_{1 \le i \le L} P_i = R$.
- A traffic volume $t_{i,j}$ associated with each linecard $I^{R_i}_j$ on each router $R_i$.
- A cost function $F$ for any router $R_i$ to have NetFlow enabled on a subset of linecards $I'^{R_i} \subseteq I^{R_i}$, $F : R \times I \to Z^+ \cup \{0\}$, where $Z^+$ denotes the set of positive integers and $I = \bigcup_{1 \le i \le N} I^{R_i}$.
- A cost function $C$ for the collectors deployed at PoP $i$ when $n$ ($n \ge 0$) routers in $P_i$ are NetFlow-enabled, $C : Z^+ \cup \{0\} \to Z^+ \cup \{0\}$.
- A coverage ratio $D$: $0 < D \le 1$.

We need to find a subset of routers $R' \subseteq R$ such that for each $R_i \in R'$, NetFlow is enabled on a non-empty subset of linecards $I'^{R_i} \subseteq I^{R_i}$, and at least a fraction $D$ of the total traffic is covered by NetFlow:

\[ \sum_{i: R_i \in R'} \; \sum_{j: I^{R_i}_j \in I'^{R_i}} t_{i,j} \;\ge\; D \sum_{1 \le i \le N} \; \sum_{1 \le j \le S_i} t_{i,j}, \]

while at the same time minimizing

\[ \sum_{i: R_i \in R'} F(R_i, I'^{R_i}) + \sum_{1 \le j \le L} C(|R' \cap P_j|), \]

where $|\cdot|$ denotes the cardinality of a set.

We formulate the Optimal NetFlow Location Problem (ONLP) as an Integer Linear Program (ILP). Different constraints may apply to different routers. We consider Cisco GSR routers [5] and 7500 routers [6] in this exercise. In Appendix C, we discuss the details of their capability to support NetFlow; we also set up a testbed to study the impact of NetFlow on 7500 routers and to determine the need for and cost of upgrading a 7500 linecard. Although totally different methods are applied to obtain cost figures for the two families of routers, from the modeling perspective the main differences between them are the following. First, when upgrading a 7500 linecard, only the processor and memory are upgraded and the interfaces on the linecard remain unchanged, whereas the entire linecard is replaced when upgrading a GSR linecard, which implies that the number of interfaces on the linecard may change with the upgrade. Second, a router consists of a Route Switch Processor (RSP) and a number of linecards. When upgrading a 7500 router's linecards, we sometimes need to upgrade the RSP on this router as well. When upgrading a GSR router's linecards, however, we do not need to upgrade the GSR's RSP, because most of the processing is done by the linecards. These differences are reflected in the ILP formulation.

2.1 Notation

Let $G_{7500}$ and $G_{GSR}$ be the sets of all 7500 and GSR routers, respectively. Let $P = \{1, 2, \ldots, L\}$ be the set of all PoPs in the network and let $P_i$ represent the set of routers belonging to PoP $i$. A router is present in one and only one PoP. For a router $g$, let $S(g)$ be the set of slots on router $g$, whose cardinality is denoted by $|S(g)|$. Let $t(g, s)$ be the traffic processed at slot $s$ on router $g$.
We define the specific notation for 7500 routers, GSRs, collectors, and traffic coverage in turn.

7500 routers. Let $c(g)$ be the minimal cost to upgrade the current configuration of router $g$ to one that supports NetFlow; $c(g) = 0$ if the current configuration supports NetFlow. Binary parameter $r(g) = 1$ if such an upgrade is available, and $r(g) = 0$ otherwise. Let $c(g, s)$ be the minimal cost to upgrade the current configuration at slot $s$ of router $g$ to one that supports NetFlow; $c(g, s) = 0$ if the current configuration supports NetFlow. Binary parameter $r(g, s) = 1$ if such an upgrade is available, and $r(g, s) = 0$ otherwise.

GSR routers. Let $T$ be the set of all linecard types present on the routers in $G_{GSR}$. For each $g \in G_{GSR}$ we define $T(g)$ as the set of linecard types present on router $g$. Each linecard type $t \in T$ may or may not be upgradable to another linecard type that supports NetFlow. Let $r(t)$ be a binary parameter that equals 1 if linecard type $t$ can be upgraded to a new version supporting NetFlow, and 0 otherwise. Let $c(t)$ represent the cost of the upgrade if $r(t) = 1$. For each router $g \in G_{GSR}$ and each $t \in T(g)$, we define $V_g(t)$ as the set of slot indices where a linecard of type $t$ is present. Let $p_{g,s}(t)$ represent the number of used ports of the linecard of type $t \in T(g)$ in slot $s$ on router $g$ in the current configuration. Let $a_g(t)$ denote the number of available ports in the upgraded version of linecard type $t \in T$.

Collectors. Let $C$ represent the cost of a single collector. Let $N$ be the maximum number of routers that can be supported by a single collector. According to [1], N = 5 and varies with traffic and the NetFlow sampling rate. In this study, we assume $N$ to be constant, since so far there has been no public documentation on how $N$ varies. The model can easily be extended to incorporate different constraints on $N$.

Traffic coverage. We define $D$ ($0 < D \le 1$) as the minimum fraction of traffic that needs to be covered by NetFlow.

2.2 Decision Variables

The following decision variables are to be solved for:
- Binary variable $\eta(g, s)$, for $g \in G_{GSR} \cup G_{7500}$ and $s \in S(g)$, equals 1 if slot $s$ on router $g$ is selected to run NetFlow, and 0 otherwise.
- Binary variable $\gamma(g)$, for $g \in G_{7500}$, equals 1 if router $g$ is selected to run NetFlow, and 0 otherwise.
- Integer variable $\nu_g(t)$ is the number of linecards of type $t \in T(g)$ on router $g \in G_{GSR}$ that need to be upgraded to run NetFlow.
- Integer variable $NC_i$ is the number of collectors needed at PoP $i$ to cover all the routers that have NetFlow enabled.

2.3 Objective

The objective of ONLP is to minimize the total cost $F = F_{7500} + F_{GSR} + F_{Col}$, where

\[ F_{7500} = \sum_{g \in G_{7500}} \Big( c(g)\,\gamma(g) + \sum_{s \in S(g)} c(g, s)\,\eta(g, s) \Big), \]
\[ F_{GSR} = \sum_{g \in G_{GSR}} \; \sum_{t \in T(g)} \nu_g(t)\, c(t), \]
\[ F_{Col} = \sum_{i \in P} NC_i \cdot C. \]
2.4 Constraints

Relationship between variables $\gamma$ and $\eta$ for 7500 routers:

\[ \gamma(g) \;\le\; \sum_{s \in S(g)} \eta(g, s) \;\le\; \gamma(g)\,|S(g)| \qquad \forall g \in G_{7500} \tag{1} \]

Constraint (1) links the variable $\gamma$ associated with each router to the variables $\eta$ associated with each of its slots. The left inequality in (1) forces $\gamma(g)$ to be 0 if none of its slots has been selected to run NetFlow. The right inequality in (1) forces $\gamma(g)$ to be 1 if one or more of its slots have been selected to run NetFlow.

Relationship between $r(g)$ and $\gamma(g)$, and between $r(g, s)$ and $\eta(g, s)$:

\[ r(g) \ge \gamma(g) \qquad \forall g \in G_{7500} \tag{2} \]
\[ r(g, s) \ge \eta(g, s) \qquad \forall g \in G_{7500} \cup G_{GSR},\; s \in S(g) \tag{3} \]

Constraints (2) and (3) guarantee that a router/slot can be selected to have NetFlow enabled only if its current configuration supports NetFlow or it can be upgraded to another configuration that supports NetFlow.

Number of interfaces on GSR routers:

\[ a_g(t)\,\nu_g(t) \;\ge\; \sum_{s \in V_g(t)} \eta(g, s)\, p_{g,s}(t) \qquad \forall g \in G_{GSR},\; t \in T(g) \tag{4} \]

Constraint (4) guarantees that we invest in the minimum number of linecards necessary according to the selection we made. For example, if router $g$ has two linecards of type $t$ with one port being used on each, and the upgraded version of linecard type $t$ has four ports available, then Constraint (4) implies that only one upgraded linecard of type $t$ is necessary, i.e., $\nu_g(t) \ge 1$. When the total cost is minimized by the objective function, $\nu_g(t)$ is forced to be 1.

Fraction of the total traffic to be covered by enabling NetFlow on specific routers and slots:

\[ \sum_{g \in G_{7500}} \sum_{s \in S(g)} t(g, s)\,\eta(g, s) + \sum_{g \in G_{GSR}} \sum_{s \in S(g)} t(g, s)\,\eta(g, s) \;\ge\; D \sum_{g \in G_{7500} \cup G_{GSR}} \; \sum_{s \in S(g)} t(g, s) \tag{5} \]

Constraint (5) ensures that the final solution selected must cover at least a fraction $D$ of the total traffic. It is clear that the larger $D$ is, the larger will be the number of slots enabled to support NetFlow and the associated deployment cost.

The number of collectors needed per PoP:

\[ N \cdot NC_i \;\ge\; \sum_{g \in P_i} \gamma(g) \;\ge\; NC_i \qquad \forall i \in P \tag{6} \]

Constraint (6) ensures that for any PoP, if there are routers with NetFlow enabled, the number of collectors in this PoP will be sufficient to cover all these routers. At the same time, no collectors should be placed at any given PoP where no router is enabled with NetFlow.
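As an illustration of how the objective and Constraints (1), (4), (5) and (6) interact, the following Python sketch solves a toy instance by exhaustive search. It is not the ILP model itself: it assumes 7500-style routers only (a per-router cost plus a per-slot cost), enumerates every slot selection instead of calling an ILP solver, and all names and data are hypothetical. A separate helper shows the linecard count implied by Constraint (4) under cost minimization.

```python
from itertools import product
from math import ceil


def solve_onlp_bruteforce(slots, router_cost, pop_of, coll_cost, coll_cap, D):
    """Exhaustively solve a toy ONLP instance (7500-style routers only).

    slots: list of (router, traffic, upgrade_cost), one entry per slot.
    router_cost: router -> c(g); pop_of: router -> PoP id.
    coll_cost, coll_cap: collector cost C and capacity N.
    Returns (best_cost, eta), where eta[k] == 1 if slot k runs NetFlow.
    """
    total = sum(t for _, t, _ in slots)
    best = None
    for eta in product((0, 1), repeat=len(slots)):
        covered = sum(t for (_, t, _), e in zip(slots, eta) if e)
        if covered < D * total:                      # Constraint (5)
            continue
        # Constraint (1): gamma(g) = 1 exactly when some slot of g is selected.
        enabled = {g for (g, _, _), e in zip(slots, eta) if e}
        cost = sum(c for (_, _, c), e in zip(slots, eta) if e)
        cost += sum(router_cost[g] for g in enabled)
        # Constraint (6): fewest collectors with N * NC_i >= enabled routers.
        per_pop = {}
        for g in enabled:
            per_pop[pop_of[g]] = per_pop.get(pop_of[g], 0) + 1
        cost += sum(ceil(n / coll_cap) * coll_cost for n in per_pop.values())
        if best is None or cost < best[0]:
            best = (cost, eta)
    return best


def min_upgraded_linecards(used_ports, ports_per_upgraded_card):
    """Smallest integer nu satisfying Constraint (4) for one linecard type."""
    return ceil(sum(used_ports) / ports_per_upgraded_card)


# Hypothetical data: two routers in two PoPs, two slots each.
slots = [("r1", 50, 10), ("r1", 30, 40), ("r2", 15, 8), ("r2", 5, 100)]
router_cost = {"r1": 0, "r2": 0}
pop_of = {"r1": "A", "r2": "B"}
print(solve_onlp_bruteforce(slots, router_cost, pop_of, 20, 5, 0.75))
```

The enumeration grows as 2 to the power of the number of slots, so it is only useful for checking small cases by hand; realistic instances need the ILP or the heuristics of Section 3.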

3 Heuristic Algorithms

We can prove that the Optimal NetFlow Location Problem (ONLP) is NP-hard by reducing the NP-complete Knapsack problem [7] to ONLP (Appendix A), which means that there exist problem instances that are unlikely to be solved within a reasonable amount of time. For example, the size of the network studied, changes in the network traffic distribution, and changes in the pricing of the upgrade options are all factors that can make the ILP model hard to solve to optimality. Therefore, heuristic algorithms are needed.

We develop two heuristic algorithms in this section. To simplify the discussion, we assume that there is no need to upgrade 7500 RSPs. This assumption is supported by our network data, which shows that the current CPU utilization on 7500 RSPs is extremely low (Appendix C). Hence we only consider three types of cost in the heuristics, associated respectively with: i) collectors, ii) GSR linecard upgrades, and iii) 7500 linecard upgrades. The heuristics can easily be extended if the 7500 RSP cost were to be included.

The input and output of the two heuristics are the same as those of the ILP model. We remind the reader that $t(g, s)$ and $c(g, s)$ are the traffic and the upgrade cost associated with slot $s$ on router $g$, respectively. In addition, the following notation is used in the heuristics:
- $T_{total}$, the total traffic under consideration. The target is to cover $D \cdot T_{total}$ by NetFlow.
- $T_{covered}$, the variable representing the traffic that is covered by NetFlow.
- $C_{total}$, the variable representing the total cost of deployment, which is the objective in the ILP.

To keep the presentation concise, we assume all linecards are upgradable to support NetFlow. The heuristics can easily be generalized to cover the other case.

We first develop a heuristic called Max-Plus (MP); a formal specification is given in Algorithm 1.
In MP, we start with a network with no NetFlow and keep adding NetFlow-enabled router slots until the required traffic coverage is met. Collectors are added as needed. The admissibility of a slot is based on the traffic flowing through the slot and the associated cost of enabling NetFlow, including any necessary collector deployment. At each step, the slot with the currently largest traffic/cost ratio is added as NetFlow-enabled.

The second heuristic, called Least-Minus (LM), approaches the problem from the opposite direction; a formal specification can be found in Algorithm 2. In LM, we start with a network with full NetFlow coverage and keep removing NetFlow-enabled router slots and collectors until the traffic coverage is at or below the required threshold. The admissibility of a slot for NetFlow removal is also based on the associated traffic and the cost of enabling NetFlow on this slot, including both the upgrade cost and a fair share of the collector cost at the PoP. At each step, the slot with the currently lowest traffic/cost ratio is removed. At the end, if the resulting traffic coverage is below the requirement, the last slot that was removed (and its associated collector, if applicable) is added back.
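The Max-Plus selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: every slot carries a single upgrade cost (no separate GSR linecard-type accounting), and the function name, argument names and toy data are hypothetical.

```python
def max_plus(slots, pop_of, coll_cost, coll_cap, D):
    """Greedy Max-Plus sketch.

    slots: dict (router, slot) -> (traffic, upgrade_cost).
    pop_of: router -> PoP id; coll_cost, coll_cap: collector cost and capacity.
    Returns (enabled slots, total cost, covered traffic).
    """
    t_total = sum(t for t, _ in slots.values())
    t_covered, c_total = 0, 0
    enabled = set()      # selected (router, slot) pairs
    routers_on = set()   # routers with at least one selected slot
    collectors = {}      # PoP -> collectors deployed so far
    served = {}          # PoP -> NetFlow-enabled routers so far
    while t_total * D - t_covered > 0:
        t_remaining = t_total * D - t_covered
        best = None
        for (g, s), (t, c) in slots.items():
            if (g, s) in enabled or t <= 0:
                continue
            p = pop_of[g]
            spare = collectors.get(p, 0) * coll_cap - served.get(p, 0)
            # A new collector is needed only for a new router with no spare room.
            new_collector = g not in routers_on and spare <= 0
            extra = coll_cost if new_collector else 0
            cost_per_bit = (c + extra) / min(t, t_remaining)
            if best is None or cost_per_bit < best[0]:
                best = (cost_per_bit, g, s, new_collector)
        if best is None:         # nothing left to enable
            break
        _, g, s, new_collector = best
        t, c = slots[(g, s)]
        p = pop_of[g]
        if new_collector:
            collectors[p] = collectors.get(p, 0) + 1
            c_total += coll_cost
        if g not in routers_on:
            routers_on.add(g)
            served[p] = served.get(p, 0) + 1
        enabled.add((g, s))
        t_covered += t
        c_total += c
    return enabled, c_total, t_covered


# Hypothetical data: two routers in two PoPs, two slots each.
slots = {("r1", 0): (50, 10), ("r1", 1): (30, 40),
         ("r2", 0): (15, 8), ("r2", 1): (5, 100)}
pop_of = {"r1": "A", "r2": "B"}
print(max_plus(slots, pop_of, coll_cost=20, coll_cap=5, D=0.75))
```

Dividing by the smaller of the slot traffic and the remaining target means a slot gets no credit for traffic beyond what is still needed, which keeps a large, expensive slot from being picked just to close a small gap.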

Algorithm 1 Heuristic I - Max-Plus (MP)

1. Initialize $T_{covered} = 0$ and $C_{total} = 0$. Set
\[ T_{remaining} = T_{total} \cdot D - T_{covered} \tag{7} \]
1.1 Examine all slots without NetFlow enabled. For each slot $s$ on router $g$ at PoP $p$, calculate $C_{collector}(g, s)$ as the additional collector cost at PoP $p$ if slot $s$ were to be selected to enable NetFlow:
\[ C_{collector}(g, s) = \begin{cases} 0 & \text{if router } g \text{ has NetFlow on, or if collectors at PoP } p \text{ can support one more router} \\ C & \text{otherwise} \end{cases} \tag{8} \]
\[ CostPerbit(g, s) = \frac{c(g, s) + C_{collector}(g, s)}{\min(t(g, s),\, T_{remaining})} \tag{9} \]
1.2 Enable NetFlow on the slot $s$ at router $g$ with the smallest $CostPerbit(g, s)$. Set $T_{covered} = T_{covered} + t(g, s)$ and $C_{total} = C_{total} + c(g, s) + C_{collector}(g, s)$. Update $T_{remaining}$ by Eqn. (7).
1.3 Repeat Steps 1.1 through 1.2 until $T_{remaining} \le 0$, then return.

4 Numerical Results

In this section, we present numerical results obtained by applying the ILP model and the heuristics to Sprint's North America IP backbone network (SNAIB-NET) with real traffic. We consider traffic carried on all links between gateway (GW) routers and backbone (BB) routers. We choose to enable NetFlow on gateway routers because, as we found by going through the router configurations, it is more cost-effective to upgrade gateway routers than backbone routers.

4.1 Platform and Speed

We solve the ILP models using CPLEX [8] running on a 2.4 GHz Xeon processor with 1 GB of RAM. The time it takes to solve the ILP models for SNAIB-NET gateway routers ranges from a few seconds to 3 minutes. Note that we solved for several hundred routers, which is a subset of SNAIB-NET. Therefore, for networks of up to a few hundred routers, it is feasible to use the ILP model to find an optimal solution for ONLP. The heuristics run much faster: each heuristic takes from under a second to a few seconds to solve the problem for all coverage ratios.
4.2 The ILP Model and the Heuristics

In this subsection, we present the solutions from the ILP and the two heuristics and compare the performance achieved by the heuristics with the optimal solution obtained from the ILP model. Figure 1(a) shows the normalized cost obtained from solving the ILP model and from the two heuristics for coverage ratios from 50% to 100%. The costs are normalized by the cost required to provide 100% coverage, which is the same for all three methods. The cost to achieve 95% coverage is only about 45% of the cost required for 100% coverage, i.e., a saving of about 55%.

In Fig. 1(b), we plot the relative difference, i.e., the cost difference normalized by the optimal value, between the results obtained by each heuristic and those obtained by solving the ILP. The two heuristics perform differently in terms of optimality: LM performs significantly better than MP. At 50% coverage, the solution from MP is 7% higher than the optimal solution, while the difference between the results from LM and the optimal results is less than 2% at 50% coverage and consistently less than 1% for coverage ratios greater than 50%. LM performs better because it uses an amortized collector cost when deciding which slot should have NetFlow disabled. MP, by contrast, can only consider the full collector cost when a collector is to be added; an amortized cost cannot be obtained in the same way because it is not known how many slots will be enabled later on.

Algorithm 2 Heuristic II - Least-Minus (LM)

2. For each slot $s$ on router $g$ at PoP $p$, enable NetFlow. Set
\[ C_{total} = \sum_{g} \sum_{s \in S(g)} c(g, s) + \sum_{p} NC(p) \cdot C \tag{10} \]
\[ T_{covered} = \sum_{g} \sum_{s \in S(g)} t(g, s) \tag{11} \]
\[ T_{extra} = T_{covered} - T_{total} \cdot D \tag{12} \]
2.1 Go to Step 2.3 if $T_{extra} \le 0$. Otherwise, examine all slots with NetFlow enabled. For each slot $s$ on router $g$ at PoP $p$, calculate $C_{collector}(g, s)$ as its share of the collector cost at PoP $p$. Let $N_r(p)$ denote the number of routers with NetFlow enabled at PoP $p$ and $N_s(g)$ denote the number of slots with NetFlow enabled at router $g$.
\[ C_{collector}(g, s) = \frac{NC(p) \cdot C}{N_r(p) \cdot N_s(g)} \tag{13} \]
\[ CostPerbit(g, s) = \frac{c(g, s) + C_{collector}(g, s)}{t(g, s)} \tag{14} \]
2.2 Find a slot with the largest $CostPerbit(g, s)$ and remove NetFlow at this slot. Assume this slot is slot $s$ on router $g$ in PoP $p$. Calculate the reduction in the number of collectors at PoP $p$ as
\[ \Delta_{collector}(p) = \begin{cases} 1 & \text{if router } g \text{ has no other NetFlow-enabled slots, and the remaining routers at PoP } p \text{ can be served with one less collector} \\ 0 & \text{otherwise} \end{cases} \tag{15} \]
Update $NC(p) = NC(p) - \Delta_{collector}(p)$, $C_{total} = C_{total} - c(g, s) - C \cdot \Delta_{collector}(p)$, and $T_{covered} = T_{covered} - t(g, s)$. Recalculate $T_{extra}$ by Eqn. (12). Go back to Step 2.1.
2.3 If $T_{extra} = 0$, return. Otherwise, return after re-enabling NetFlow on the slot picked by the last execution of Step 2.2 and updating the number of collectors.
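The Least-Minus procedure can be sketched in Python as follows. This is an illustrative simplification rather than the authors' code: instead of tracking the collector reduction of Eqn. (15) incrementally, it recomputes the number of collectors per PoP as the ceiling of enabled routers over collector capacity, which gives the same count as long as collectors are kept minimal. It also assumes every slot has a single upgrade cost; all names and data are hypothetical.

```python
from math import ceil


def least_minus(slots, pop_of, coll_cost, coll_cap, D):
    """Greedy Least-Minus sketch.

    slots: dict (router, slot) -> (traffic, upgrade_cost).
    Returns (enabled slots, total cost, covered traffic).
    """
    t_total = sum(t for t, _ in slots.values())
    enabled = set(slots)                      # start with NetFlow everywhere

    def routers_per_pop():
        count = {}
        for g in {g for g, _ in enabled}:
            count[pop_of[g]] = count.get(pop_of[g], 0) + 1
        return count

    def total_cost():                         # Eqn. (10), minimal collectors
        n_coll = sum(ceil(n / coll_cap) for n in routers_per_pop().values())
        return sum(slots[k][1] for k in enabled) + n_coll * coll_cost

    def covered():                            # Eqn. (11)
        return sum(slots[k][0] for k in enabled)

    last_removed = None
    while enabled and covered() - t_total * D > 0:       # T_extra > 0
        n_r = routers_per_pop()
        worst = None
        for (g, s) in sorted(enabled):        # sorted for deterministic ties
            t, c = slots[(g, s)]
            p = pop_of[g]
            n_s = sum(1 for g2, _ in enabled if g2 == g)
            share = ceil(n_r[p] / coll_cap) * coll_cost / (n_r[p] * n_s)  # (13)
            cost_per_bit = (c + share) / t if t > 0 else float("inf")     # (14)
            if worst is None or cost_per_bit > worst[0]:
                worst = (cost_per_bit, (g, s))
        last_removed = worst[1]
        enabled.remove(last_removed)
    # Step 2.3: undo the last removal if coverage fell below the target.
    if last_removed is not None and covered() - t_total * D < 0:
        enabled.add(last_removed)
    return enabled, total_cost(), covered()


# Hypothetical data: two routers in two PoPs, two slots each.
slots = {("r1", 0): (50, 10), ("r1", 1): (30, 40),
         ("r2", 0): (15, 8), ("r2", 1): (5, 100)}
pop_of = {"r1": "A", "r2": "B"}
print(least_minus(slots, pop_of, coll_cost=20, coll_cap=5, D=0.75))
```

Because each slot is charged only its share of the collector cost, a lightly loaded router that is the sole reason for a collector at its PoP becomes an early candidate for removal, which is the amortization effect discussed in Section 4.2.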
Note that in practice, it is not trivial to determine the feasibility of running NetFlow on a 7500 linecard at a particular network location or to obtain the upgrade cost of a 7500 linecard. We refer the reader to Appendix B for details.

Fig. 1. Results from the ILP model and the heuristic algorithms. (a) Normalized cost (%) obtained from the ILP and the heuristics (MP, LM) versus traffic cover ratio D (%). (b) Relative difference (%) between the heuristic results (MP, LM) and the ILP results versus traffic cover ratio D (%).

5 Conclusions

In this paper, we studied the optimization problem for NetFlow deployment in an IP network. Specifically, we considered a partial NetFlow deployment that achieves the lowest cost for a given coverage ratio, which we call the Optimal NetFlow Location Problem (ONLP). We developed an ILP model and two heuristic algorithms that select the routers and slots to support NetFlow, together with the associated configurations, such that a given amount of network traffic is covered at minimum cost. We solved ONLP for Sprint's IP backbone network in North America and presented numerical results from applying the ILP model and the two heuristics. We demonstrated that it is possible to achieve significant cost savings by adopting a partial NetFlow deployment strategy, i.e., by covering a major portion of the network traffic instead of all of it. A good coverage ratio is suggested as 95%, with 55% cost reduction.

Although our discussion focused on Cisco NetFlow and the results were collected from Sprint's operational IP backbone network only, the results can serve as a reference for similar practices, and the proposed methodology can be extended and applied to a wide variety of network location problems to enable different features and services. Besides NetFlow from Cisco, other vendors also support similar flow-based monitoring services, such as sFlow [9], and our methodology can be applied to the deployment of sFlow as well. In addition, as ongoing work, we are extending our approach to network monitoring functions of finer granularity, such as packet trace collection.

Acknowledgment

We thank Travis Dawson and Beng-Ong Lee at Sprint ATL for their support in the 7500 router testing and for answering our various NetFlow-related questions.

References

1. NetFlow Services Solutions Guide, Cisco white paper.
2. A. Soule, A. Nucci, R. Cruz, E. Leonardi, and N. Taft, How to Identify and Estimate the Largest Traffic Elements in a Dynamic Environment, Proceedings of ACM Sigmetrics, New York, USA, July 2004.
3. R. Sommer and A. Feldmann, NetFlow: Information Loss or Win?, Proceedings of Internet Measurement Workshop, Marseille, France, Nov. 2002.
4. C. Estan, K. Keys, D. Moore, and G. Varghese, Building a Better NetFlow, Proceedings of ACM Sigcomm, Portland, OR, USA, August 2004.
5. Cisco 12000 Series Router, Cisco white paper.
6. Cisco 7500 Series Router, Cisco white paper.
7. M. R. Garey and D. S. Johnson, Computers and Intractability, A Guide to the Theory of NP-Completeness, Bell Telephone Laboratories, Inc., 1979.
8. http://www.ilog.com/products/cplex.
9. http://www.sflow.org.
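The reduction used in Appendix A below maps each Knapsack item to one router slot. The following is a minimal Python sketch of that mapping, assuming a brute-force solver for both problems (only practical for tiny instances). The function names and the dictionary layout are illustrative and are not part of the original report.

```python
from itertools import combinations


def knapsack_decision(sizes, values, B, K):
    """Brute-force Knapsack decision: is there a subset of items with
    total size <= B and total value >= K?"""
    items = range(len(sizes))
    return any(
        sum(sizes[i] for i in sub) <= B and sum(values[i] for i in sub) >= K
        for r in range(len(sizes) + 1)
        for sub in combinations(items, r)
    )


def knapsack_to_dnlp(sizes, values, B, K):
    """Build the restricted DNLP instance: one slot per item, with
    cost c = s(u) and traffic t = v(u); cost bound C = B, traffic target T = K."""
    slots = [{"cost": s, "traffic": v} for s, v in zip(sizes, values)]
    return slots, B, K


def dnlp_decision(slots, C, T):
    """Brute-force restricted DNLP: can we upgrade a set of slots with
    total cost <= C that covers traffic >= T?"""
    return any(
        sum(s["cost"] for s in sub) <= C and sum(s["traffic"] for s in sub) >= T
        for r in range(len(slots) + 1)
        for sub in combinations(slots, r)
    )


if __name__ == "__main__":
    sizes, values = [3, 4, 5], [4, 5, 6]
    for B, K in [(7, 9), (7, 10)]:
        slots, C, T = knapsack_to_dnlp(sizes, values, B, K)
        print(B, K, knapsack_decision(sizes, values, B, K), dnlp_decision(slots, C, T))
```

Both decision procedures return the same answer on every instance because the mapping only renames size to cost and value to traffic.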

SPRINT ATL RESEARCH REPORT RR5-ATL-61624 - JUNE 25

Appendix A: ONLP is NP-Hard

In this section, we prove that the Optimal NetFlow Location Problem (ONLP) is NP-hard.

First, we show that the following decision (yes/no) version of the NetFlow Location Problem reduces to ONLP: Given a traffic amount $T$ and a cost $C$, is it possible to upgrade the network with cost no more than $C$ and cover at least traffic amount $T$? We name this decision version of the NetFlow Location Problem DNLP. If ONLP is solved with the optimal cost $C^*$ to cover traffic amount $T$, then for any cost $C \ge C^*$ the answer to DNLP is yes, and for $C < C^*$ the answer is no. Therefore, DNLP is solvable if ONLP is solvable.

We then prove that DNLP is NP-hard by transforming a known NP-hard problem, the Knapsack problem, to DNLP. A formal statement of the Knapsack problem is as follows [7].

Instance: A finite set $U$, a size $s(u) \in Z^+$ and a value $v(u) \in Z^+$ for each $u \in U$, a size constraint $B \in Z^+$, and a value goal $K \in Z^+$.

Question: Is there a subset $U' \subseteq U$ such that $\sum_{u \in U'} s(u) \le B$ and $\sum_{u \in U'} v(u) \ge K$?

We restrict DNLP to the case in which the cost of a collector is zero, the cost of a 7500 RSP is zero, and there is no GSR in the network. We can then focus on 7500 router slots, since they are the sole source of upgrade cost. Each slot has an associated traffic $t$ and upgrade cost $c$. There is a one-to-one mapping from Knapsack to DNLP: for each $u \in U$ with size $s(u)$ and value $v(u)$, construct a router slot with traffic $t = v(u)$ and cost $c = s(u)$, and set the cost bound $C = B$ and the traffic target $T = K$. With this one-to-one mapping, Knapsack is solvable if and only if the restricted version of DNLP is solvable. Since Knapsack is known to be NP-complete [7], DNLP is NP-hard. Therefore, the optimization version, ONLP, is NP-hard.

Appendix B: Generating Inputs

In this section, we generate the input for SNAIB-NET to the ONLP ILP and heuristics. The input includes the traffic at all GSR/7500 slots, and the upgradability and upgrade cost to a NetFlow-supporting configuration for all GSR/7500 slots and 7500 routers.
For both GSRs and 7500 routers, the traffic processed at a slot can be obtained by processing the SNMP data, which records the load of each interface at 5-minute intervals. Since GSR linecard upgrades follow a one-to-one mapping based on the engine type, which we already know, it is straightforward to generate the input data for the GSRs. For the 7500 routers, in order to determine the upgradability and upgrade cost for a router/slot configuration to support NetFlow, we set up a TestBed (Appendix C) and investigate the impact of enabling NetFlow on CPU and memory utilization for the most common router configurations, with different RSP and VIP models under different traffic scenarios. Our findings can be summarized in a few rules.

Let $c^{(r)}(g)$ represent the RSP CPU utilization at 7500 router $g$ before NetFlow is enabled. For an RSP configuration to support NetFlow, $c^{(r)}(g)$ must not exceed the threshold $R_{CPU}$: $c^{(r)}(g) \le R_{CPU}$, where $R_{CPU}$ can be as high as 99%. There is no memory constraint for RSPs.

Let $c^{(v)}(g, s)$ represent the VIP CPU utilization at slot $s$ on 7500 router $g$ before NetFlow is enabled. For a VIP configuration to support NetFlow, $c^{(v)}(g, s)$ must not exceed the threshold $TH_{CPU}$: $c^{(v)}(g, s) \le TH_{CPU}$, where $TH_{CPU}$ is at most 90%, since enabling NetFlow can increase the CPU utilization by 10%. In practice, we may want $R_{CPU}$ and $TH_{CPU}$ to be even lower to accommodate temporal burstiness in CPU utilization.

Let $m^{(v)}(g, s)$ represent the total VIP memory capacity at slot $s$ on 7500 router $g$, and let $m^{(v)}_{req}(g, s)$ be the memory required when NetFlow is enabled. For slot $s$ to support NetFlow, $m^{(v)}(g, s)$ must be greater than or equal to $m^{(v)}_{req}(g, s)$: $m^{(v)}(g, s) \ge m^{(v)}_{req}(g, s)$.

Let $\Delta_{MEM}$ be the difference between $m^{(v)}_{req}(g, s)$ and the VIP memory usage before NetFlow is enabled at slot $s$ on router $g$. $\Delta_{MEM}$ is a step function of $m^{(v)}(g, s)$, as shown in Appendix C. A higher VIP version always provides the same or higher memory capacity options.

Let $\Delta_{CPU}$ be the difference in CPU utilization between two adjacent VIP families under exactly the same configuration and traffic conditions; $\Delta_{CPU} = 20\%$ according to Appendix C. For example, upgrading a VIP2-50 to a VIP4-50, or a VIP4-80 to a VIP6-80, reduces the CPU utilization by 20%.

Given the constraints above, if we know the current CPU utilization and memory usage of a router/slot configuration, we can calculate, for every upgrade option, the CPU utilization and memory usage under the current traffic conditions. Therefore, to obtain the upgradability and upgrade cost to support NetFlow, we enumerate all possible configurations for a given router/slot and choose the one that satisfies the above constraints at minimum cost. The router/slot cannot be upgraded if no possible configuration supports NetFlow. By processing router data, we found that all current router configurations support NetFlow, since $c^{(r)}(g)$ is always low.
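The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used to generate the inputs: the upgrade options and their costs are hypothetical, the 20% reduction per VIP family step is read as percentage points, memory usage is assumed unchanged by an upgrade, and the NetFlow cache sizes are taken from the DRAM table reported for the TestBed in Appendix C.

```python
TH_CPU = 90.0      # VIP CPU threshold (%), from the rule above
DELTA_CPU = 20.0   # CPU reduction per VIP family step, read as percentage points (assumption)
FAMILIES = ["VIP2", "VIP4", "VIP6"]  # adjacent VIP families, lowest to highest

# NetFlow cache size (MB) as a step function of total VIP DRAM (MB), per Appendix C.
NETFLOW_CACHE_MB = {256: 16.0, 128: 8.0, 64: 4.0, 32: 2.0, 16: 0.125}


def supports_netflow(cpu_now, mem_used_mb, cur_family, option):
    """Check the CPU and memory constraints for one candidate configuration."""
    steps = FAMILIES.index(option["family"]) - FAMILIES.index(cur_family)
    if steps < 0:
        return False  # downgrades are not considered
    cpu_after_upgrade = cpu_now - DELTA_CPU * steps
    mem_required = mem_used_mb + NETFLOW_CACHE_MB[option["mem_mb"]]
    return cpu_after_upgrade <= TH_CPU and option["mem_mb"] >= mem_required


def cheapest_upgrade(cpu_now, mem_used_mb, cur_family, options):
    """Return the minimum-cost feasible option, or None if the slot cannot be upgraded."""
    feasible = [o for o in options
                if supports_netflow(cpu_now, mem_used_mb, cur_family, o)]
    return min(feasible, key=lambda o: o["cost"]) if feasible else None


if __name__ == "__main__":
    # Hypothetical options for a slot that currently holds a VIP2 with 128 MB (cost 0 = keep as is).
    options = [
        {"family": "VIP2", "mem_mb": 128, "cost": 0},
        {"family": "VIP4", "mem_mb": 128, "cost": 5},
        {"family": "VIP4", "mem_mb": 256, "cost": 8},
        {"family": "VIP6", "mem_mb": 256, "cost": 12},
    ]
    print(cheapest_upgrade(95.0, 60.0, "VIP2", options))
```

Including the current configuration at zero cost lets the same function report that no upgrade is needed.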
We are interested in the CPU and memory utilization on the slots. To get an accurate measurement, we collect CPU and memory utilization during peak hours for five consecutive weekdays. We use the collected data to calculate the minimum, maximum, average, and 95th-percentile statistics, and plot their cumulative distribution functions (CDFs) in Fig. 2. We can see from Fig. 2(a) that the four CPU curves are far apart, so we expect that using each statistic as input to the problem would produce a different solution (see footnote 5). No significant difference is observed among the four memory statistics (Fig. 2(b)), so we later use the average memory statistic as the input to the problem.

Since 7500 and GSR routers are characterized by different constraints and requirements, we apply our methodology, both the ILP and the heuristics, in three different scenarios: i) the 7500 case, ii) the GSR case, and iii) the combined case. In the 7500 (respectively, GSR) case, only 7500 (respectively, GSR) routers and their associated traffic are considered, while in the combined case, both types of routers and all traffic are considered.

Appendix C: NetFlow Support by Cisco GSR and 7500 Routers

For both 12000 series (GSR) and 7500 series routers, NetFlow can be enabled at the interface level, but the NetFlow-supporting capability is determined by the linecard and the router.

Footnote 5: The minimum CPU utilization should not be used in practice because it does not reflect the actual requirement. We show the results of using this metric only for comparison purposes.

[Figure 2: CDF(x) of (a) CPU utilization (%) and (b) memory usage (MB), each plotted for the minimum, average, 95th-percentile, and maximum statistics.]

Fig. 2. CDF plots for CPU and memory utilizations on 7500 router slots.

We summarize the different factors in supporting NetFlow by the two router families and argue that the GSR series contributes a major fraction of the upgrade cost.

7500 series routers potentially support NetFlow. However, proper functioning of NetFlow is determined by the following factors: i) traffic load in terms of bits per second (bps) and packets per second (pps), ii) number of active flows, iii) RSP (Route Switch Processor, the central processor of the router) type and memory capacity, and iv) VIP (Versatile Interface Processor, the processor of a 7500 linecard) type and memory capacity. Therefore, the decision of whether or not a 7500 router or its linecards need an upgrade depends on both the router/linecard configuration and the traffic conditions. The traffic load information can be obtained through SNMP. However, it is difficult to obtain the number of active flows without turning on NetFlow. Therefore, we use packet traces that have been collected from several links in the network to identify the typical number of active flows going through a certain interface type on a 7500 series router. By testing combinations of traffic load, number of active flows, RSP type/memory, and VIP type/memory, we can determine whether or not a certain router/linecard configuration supports NetFlow at a given network location.

For GSRs, the capability of supporting NetFlow is determined by the engine type. Some engines fully support NetFlow (Engine 3 and 4+), some do not support it (Engine 4), and some support it with limitations (Engine 0, 1, and 2; see footnote 6). Our strategy is to upgrade all non-supporting and supporting-with-limitation linecards to fully supporting ones.
One important constraint is that we must keep the same interface speed during an upgrade. As a consequence, some linecards equipped with certain low-speed interfaces do not have any corresponding upgrade option.

Therefore, whether a GSR linecard needs an upgrade is based solely on its Engine type, while the capability of 7500 routers to support NetFlow depends on both the configuration and the traffic conditions at the routers. We need to identify the traffic conditions under which a certain router configuration can support NetFlow without experiencing any degradation in the packet forwarding process, and from that determine whether a router configuration needs an upgrade. We do so by setting up a TestBed and running several experiments. We use the results to formalize a set of requirements that each router configuration must satisfy to support NetFlow properly. These requirements are used to generate the input to the Integer Linear Program (ILP) or the heuristics.

Footnote 6: These linecards either cannot support NetFlow together with other desired features or have a performance limitation (e.g., low pps) that worsens when NetFlow is enabled.

5.1 TestBed Configuration

We set up a TestBed using the configuration shown in Fig. 3. Traffic is generated by an Agilent Router Tester and routed by a 7513 router, which is a member of the 7500 family. The Router Tester is connected to the 7513 router using multiple POS OC3 and Fast Ethernet links. To provide useful results, we configured the 7513 router with one of the most common 7500 router configurations in the operational network. Note that although we only test a single 7513 router, with the help of the Router Tester we are actually emulating a real network environment with both customer routers (Cus) and backbone routers (BB).

A 7500 router has a central processor and several linecards. The central processor is referred to as a Route Switch Processor (RSP). There are various RSP models with different processing power. The most common RSP on Sprint gateway routers is the RSP4, so we examine the impact of NetFlow on the RSP4 in our testing.

The linecards on 7500 routers are called Versatile Interface Processors (VIPs) [6]. Each VIP has its own processor. There are different VIP families with different levels of processing power. About 90% of the VIPs on 7500 routers in the Sprint network are VIP2-50s, which can have up to 128 MB of memory (DRAM) [6]. A good upgrade candidate for them is the VIP4-80, which has higher CPU power and up to 256 MB of memory. Therefore, we focus on these two types of VIPs in our testing: VIP2-50 and VIP4-80. A VIP can be used on different types of routers belonging to the 7500 family. Therefore, although we only use a specific 7513 router in our TestBed, we expect similar results on other 7500 family routers, since they all use similar types of VIPs with similar configurations.

[Figure 3: The Agilent Router Tester emulates customer routers (Cus), attached to the 7513 router over Fast Ethernet (FE) links, and backbone routers (BB), attached over OC3 links.]

Fig. 3. Testbed setup.
In order to create a set of representative testing scenarios, we use traces collected from three different OC3 links on Sprint's gateway routers. The traces capture diverse temporal behaviour on the three links, since they were collected over a 6-month period. We plot the typical number of active flows and the packet size distributions in Fig. 4. The number of active flows is the major factor in determining CPU utilization and memory usage on VIPs, while the packet size distribution is an important factor in packet forwarding performance.

[Figure 4: (a)-(c) Number of active flows versus time of day (HH:MM UTC) on Links 1, 2, and 3; (d)-(f) packet size distribution (percentage of packets versus packet size in bytes, with ticks at 40, 576, and 1500) on Links 1, 2, and 3. Copyright (c) 2002-2004 Sprint ATL.]

Fig. 4. Active flows and packet size distribution on three links.

As shown in Figs. 4(a) through (c), the typical number of active flows on an OC3 gateway link ranges from a few thousand to 35k. We use 60k in our testing as a worst-case working scenario and expect the actual number of active flows to be lower. We can observe from Figs. 4(d) through (f) that the packet size distributions have three modes: 40-byte, 576-byte, and 1500-byte. However, the distribution around the three modes differs from link to link. We cover all three packet sizes in our testing.

Based on the discussion above, we use the following configurations in our tests:

- RSP4
- VIP2-50 and VIP4-80
- Up to 60k flows
- Packet sizes of 40, 576, and 1500 bytes, and a proper mix of them as observed on the three OC3 links

5.2 NetFlow Impact on Forwarding Performance

In Fig. 5 we show the impact of enabling NetFlow on the maximum forwarding rate for a wide range of packet sizes, from 40 bytes to 1500 bytes. We can see that for packet sizes below 256 bytes, NetFlow affects the maximum forwarding rate on both VIP2-50 and VIP4-80, while no significant impact is observed for larger packet sizes.
This is because, for small packet sizes, the maximum forwarding rate is limited by the packets-per-second (pps) rate, which is reduced when NetFlow is enabled. For larger packet sizes, the maximum forwarding rate is limited by the interface speed, which is NOT affected by NetFlow.

[Figure 5: Normalized maximum forwarding rate versus packet size (bytes), for VIP2-50 and VIP4-80 with NetFlow off and on.]

Fig. 5. NetFlow impact on maximum forwarding rate.

Since it is the actual limiting factor, we focus on the CPU utilization of a VIP when assessing the feasibility of enabling NetFlow, which is studied in Section 5.4.

5.3 NetFlow Impact on RSP

Figures 6(a) and 6(b) show the behavior of an RSP4 before and after enabling NetFlow. 60k flows are tested at an aggregate rate of 2 Mbps. NetFlow was enabled at approximately 19:3. The figures demonstrate that there is NO negative impact on the RSP when enabling NetFlow. More specifically:

- NetFlow's impact on RSP CPU is negligible. We can safely conclude that unless an RSP CPU is utilized at its peak, i.e., 99%, turning on NetFlow will not have any negative impact on the RSP CPU.
- NetFlow's impact on RSP memory usage is negligible. The main reason is that, by implementation, NetFlow can move the execution of some tasks, such as Access Control Lists (ACLs), from the RSP to the corresponding VIP. Therefore, there is no RSP memory constraint for enabling NetFlow, but we expect a significant impact on the VIP memory usage.

5.4 NetFlow Impact on VIPs

In this section we study the impact of enabling NetFlow on VIP cards in terms of CPU utilization and memory usage. As a first test, we vary the number of flows from 2.5k to 60k and fix the packet size at a relatively small 256 bytes. As we can see from Fig. 7, NetFlow increases the CPU utilization on a VIP2-50 by 5-10%. We also observe in the tests that when the CPU utilization on a VIP is at 99%, NetFlow does not record flow information correctly and therefore does not export correct flow data. There is no forwarding performance degradation.

As a second experiment, we fix the number of flows at 60k and vary the packet size distribution according to the three-mode empirical distributions observed in the traces collected from the OC3 links. We summarize the results in Table 1. Our objective is to see how VIP4-80 outperforms VIP2-50 in terms of CPU utilization.

[Figure 6: (a) CPU utilization on an RSP4; (b) memory usage on an RSP4; (c) memory usage on a VIP2-50.]

Fig. 6. NetFlow impact on an RSP4 and a VIP2-50.

We observe first that, for both VIP2-50 and VIP4-80, enabling NetFlow increases the CPU utilization by about 5%, and the increase never exceeds 10%. Second, VIP4-80 supports NetFlow with about 20% less CPU than VIP2-50 under the same traffic conditions. We point out that some traffic load configurations were not tested on all distributions because they exceed the maximum forwarding rate for that particular average packet size. We would have liked to extend this study by testing the performance of VIP6-80, but could not do so for lack of resources. In the following, we assume that the CPU utilization reduction from upgrading a VIP4-80 to a VIP6-80 is again 20%. This number can be adjusted in the ILP model if future tests reveal a discrepancy.

[Figure 7: CPU utilization (%) versus number of flows (x 10^4) on a VIP2-50, at 50% and 25% OC3 load with NetFlow off and on, together with the on/off differences.]

Fig. 7. NetFlow impact on CPU utilization on VIP2-50.
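The two observations above can be checked directly against the Table 1 measurements. The short Python sketch below does so, under the assumption that the rows are transcribed correctly from Table 1; the Link 2 row is omitted because only one pair of values is reported for it. The variable and function names are illustrative.

```python
# Rows transcribed from Table 1 (60k flows):
# (link_id, oc3_load_pct, vip2_off, vip2_on, vip4_off, vip4_on), all CPU values in %.
TABLE1 = [
    (1, 25, 85, 90, 63, 69),
    (1, 50, 94, 98, 83, 87),
    (3, 25, 48, 54, 33, 37),
    (3, 50, 69, 73, 49, 54),
    (3, 75, 90, 94, 65, 69),
]


def netflow_increases(rows):
    """CPU increase (percentage points) caused by enabling NetFlow, per VIP type."""
    vip2 = [on - off for _, _, off, on, _, _ in rows]
    vip4 = [on - off for _, _, _, _, off, on in rows]
    return vip2, vip4


def vip4_savings(rows):
    """CPU saved (percentage points) by VIP4-80 relative to VIP2-50, with NetFlow on."""
    return [v2_on - v4_on for _, _, _, v2_on, _, v4_on in rows]


if __name__ == "__main__":
    vip2, vip4 = netflow_increases(TABLE1)
    savings = vip4_savings(TABLE1)
    print("NetFlow increase on VIP2-50:", vip2)
    print("NetFlow increase on VIP4-80:", vip4)
    print("Largest increase:", max(vip2 + vip4))
    print("Mean VIP4-80 saving:", sum(savings) / len(savings))
```

On these rows the per-row savings range from 11 to 25 percentage points, so the 20% figure is best read as a rounded typical value rather than a guaranteed reduction.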

Table 1. CPU test results on traffic with 60K flows and different packet size distributions.

Link ID   OC3 Load   VIP2-50, NetFlow      VIP4-80, NetFlow
                     OFF        ON         OFF        ON
1         25%        85%        90%        63%        69%
1         50%        94%        98%        83%        87%
2         25%        97%        98%        -          -
3         25%        48%        54%        33%        37%
3         50%        69%        73%        49%        54%
3         75%        90%        94%        65%        69%

According to Cisco's document [1], memory usage on a VIP will increase when NetFlow is enabled because an extra amount of memory is allocated as the NetFlow cache. The size of the NetFlow cache allocated is determined by the total size of the VIP memory. Table 2 summarizes the increase of memory usage on VIP2-50s when enabling NetFlow.

Table 2. Required DRAM by NetFlow on Cisco 7500 VIPs.

DRAM Capacity   Default NetFlow Cache Entries   DRAM Required by NetFlow Cache
256 MB          256K                            16 MB
128 MB          128K                            8 MB
64 MB           64K                             4 MB
32 MB           32K                             2 MB
16 MB           2K                              128 KB

We verify the memory impact of NetFlow on our TestBed. Figure 6(c) shows that the memory usage increases by 8 MB when NetFlow is enabled on a VIP2-50 with 128 MB of memory. Therefore, to have NetFlow safely enabled, we have to make sure that the free memory under normal working conditions without NetFlow is larger than 8 MB.

To conclude what we have learned from the TestBed:

- NetFlow has negligible impact on RSP CPU and memory utilization. As a consequence, an RSP CPU loaded up to 99% can still support NetFlow without any upgrade.
- VIP CPU utilization is increased by less than 10% by NetFlow.
- The VIP memory increase caused by NetFlow is a step function of the total VIP memory capacity, as shown in Table 2.
- Upgrading to the closest higher VIP type (e.g., from 2-50 to 4-80, or from 4-80 to 6-80) reduces the CPU utilization by 20%.

We would like to thank Beng-Ong Lee and Travis Dawson for completing most of the testbed experiments.
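The memory rule from Table 2 can be expressed as a small lookup. The Python sketch below assumes binary units (1K = 1024 entries, 1 MB = 1024 KB), which is an assumption not stated in the report; under that reading every row of Table 2 works out to 64 bytes per cache entry. The function names are illustrative.

```python
KB = 1024
MB = 1024 * KB

# (total VIP DRAM in MB, default NetFlow cache entries, DRAM required by the cache in bytes),
# transcribed from Table 2 with binary units assumed.
TABLE2 = [
    (256, 256 * 1024, 16 * MB),
    (128, 128 * 1024, 8 * MB),
    (64, 64 * 1024, 4 * MB),
    (32, 32 * 1024, 2 * MB),
    (16, 2 * 1024, 128 * KB),
]


def bytes_per_entry():
    """Cache bytes divided by cache entries for each DRAM capacity."""
    return {dram: cache_bytes / entries for dram, entries, cache_bytes in TABLE2}


def netflow_cache_mb(dram_mb):
    """Step function: DRAM (MB) required by the NetFlow cache for a given VIP DRAM size."""
    for dram, _, cache_bytes in TABLE2:
        if dram == dram_mb:
            return cache_bytes / MB
    raise ValueError(f"no Table 2 entry for {dram_mb} MB")


def can_enable_netflow(dram_mb, used_mb):
    """True if the free memory before NetFlow covers the NetFlow cache."""
    return dram_mb - used_mb >= netflow_cache_mb(dram_mb)


if __name__ == "__main__":
    print(bytes_per_entry())
    print(netflow_cache_mb(128))
    print(can_enable_netflow(128, 100), can_enable_netflow(128, 121))
```

For the 128 MB VIP2-50 measured on the TestBed, the lookup returns the same 8 MB increase seen in Fig. 6(c).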