Data Streaming Algorithms for Estimating Entropy of Network Traffic

Ashwin Lall, University of Rochester
Vyas Sekar, Carnegie Mellon University
Mitsunori Ogihara, University of Rochester
Jun (Jim) Xu, Georgia Inst. of Technology
Hui Zhang, Carnegie Mellon University

ABSTRACT

Using entropy of traffic distributions has been shown to aid a wide variety of network monitoring applications such as anomaly detection, clustering to reveal interesting patterns, and traffic classification. However, realizing this potential benefit in practice requires accurate algorithms that can operate on high-speed links, with low CPU and memory requirements. In this paper, we investigate the problem of estimating the entropy in a streaming computation model. We give lower bounds for this problem, showing that neither approximation nor randomization alone will let us compute the entropy efficiently. We present two algorithms for randomly approximating the entropy in a time and space efficient manner, applicable for use on very high speed (greater than OC-48) links. The first algorithm for entropy estimation is inspired by the structural similarity with the seminal work of Alon et al. for estimating frequency moments, and we provide strong theoretical guarantees on the error and resource usage. Our second algorithm utilizes the observation that the performance of the streaming algorithm can be enhanced by separating the high-frequency items (or elephants) from the low-frequency items (or mice). We evaluate our algorithms on traffic traces from different deployment scenarios.

Categories and Subject Descriptors
C.2.3 [Computer Systems Organization]: Computer-Communication Networks: Network Operations, Network Monitoring

Supported in part by grants Xerox/NYSRAT #C43 and NSF-EIA-256. Supported in part by NSF grant NETS-NBD and NSF CAREER Award ANI. Supported in part by grants NSF CNS and ANI and U.S. Army Research Office contract number DAAD.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMETRICS/Performance'06, June 26-30, 2006, Saint Malo, France. Copyright 2006 ACM.

General Terms
Algorithms, Measurement, Theory

Keywords
Traffic Analysis, Data Streaming

1. INTRODUCTION

In network traffic flow analysis there has been a shift of focus from simple volume-based analysis to network flow distribution-based analysis. Much work has been published on making inferences about the network status from such statistics [2, 7, 24]. Intrinsically, distribution-based analysis could capture the network status more succinctly than volume-based analysis would, but it requires appropriate metrics to encapsulate and capture features of the underlying traffic distribution. The standard quantities in assessing distributions are the moments (the mean, standard deviation, skewness, kurtosis, etc.). A number of recent empirical studies [7, 7, 23, 24] have suggested the use of entropy as a succinct means of summarizing traffic distributions for different applications, in particular, in anomaly detection and in fine-grained traffic analysis and classification.

With respect to anomaly detection [7], the use of entropy for tracking changes in traffic distributions provides two significant benefits. First, the use of entropy can increase the sensitivity of detection to uncover anomalous incidents that may not manifest as volume anomalies.
Second, using such traffic features provides additional diagnostic information into the nature of the anomalous incidents (e.g., making distinction among worms, DDoS attacks, and scans) that is not available from just volume-based anomaly detection. With respect to fine-grained traffic analysis and traffic classification [24], the entropy of traffic feature distributions offers useful information to measure distance among (traffic) clusters.

While these recent studies demonstrate that using the entropy of traffic distributions has tremendous value for network monitoring applications, realizing the potential benefit requires efficient algorithms for computing the entropy. In general, computing traffic statistics on high-speed links is a hard task, because it is infeasible for traditional methods to keep up with the line-rates, due to constraints on available processing capacity. In addition, constraints imposed on memory make it almost impossible to compute the statistics per flow, or even to maintain per-flow state. Then, the use of sampling comes as a natural solution. Sampling based methods [5, 6] have been shown to be able to reduce the processing and memory requirements, and to be suitable for capturing some traffic statistics. However, one must trade off accuracy for efficiency: the estimates obtained from sampled data may have large errors [].

One may then naturally wonder whether there are efficient methods for accurately estimating the entropy. In particular, we ask the following questions: What amount of resources (time and space) do we provably need to capture the entropy of a stream of packets on a high-speed link? Are there efficient algorithms for entropy computation that can operate on high-speed links which have low memory and CPU costs?

To address these questions, data streaming algorithms assume significance. Data streaming algorithms [9] for computing different statistics over input streams have recently received tremendous interest from the networking and theory communities. Data streaming algorithms have the desirable property that both the computational and memory requirements are low. This property makes them ideal for such high-speed monitoring applications. They are also guaranteed to work with any distribution, which makes them useful in dealing with data for which the distribution is not known.

The contribution of this paper is the investigation and application of streaming algorithms to compute the entropy over network traffic streams. The challenge is to design algorithms for estimating entropy that are lightweight in terms of both memory and computational complexity. We present two algorithms for computing the entropy in a streaming model. The first algorithm is based on the insight that estimating the entropy shares structural similarity with the well-known problem of estimating the frequency moments [2]. Despite the apparent structural similarity, providing theoretical approximation and resource guarantees for entropy estimation is a challenging task.
Our contributions are the identification of appropriate estimator functions for calculating the entropy accurately, and providing proofs of approximation guarantees and resource usage. The theoretical guarantees hold for arbitrary streams, without making any assumptions regarding the underlying distributions and structural properties of their distribution.

(Footnote: This approach has also been independently proposed by Chakrabarti et al. []. We will discuss this and other approaches in Section 8, highlighting that while our intellectual trails cross each other on some results, our approaches and evaluations differ substantially in others.)

Network traffic data-streams have considerable underlying structure (e.g., they may have a Zipfian or power-law distribution), which suggests that we can optimize algorithms further by leveraging this fact. Our second algorithm builds on the basic streaming algorithm, but can substantially improve the efficiency based on techniques for separating the large (elephant) flows from the small (mice) flows. We use a lightweight sampling method that enables sieving out the elephant flows from the stream, and extend the earlier algorithm to utilize this separation to achieve better performance in practice.

We evaluate our algorithms on real traffic traces collected from three different deployment scenarios. The first streaming algorithm outperforms traditional sampling based approaches, and provides much lower estimation errors while using similar (or lesser) memory resources. Interestingly, we notice that the observed errors are an order of magnitude smaller than the theoretical error guarantees. While it has proved difficult to provide rigorous theoretical (i.e., worst-case) guarantees for the second algorithm (which makes use of the elephant-mice separation), we find that the observed errors are further reduced with this approach.

The remainder of this paper is organized as follows. We introduce the notation that we will use and formally define the problem in Section 2.
In Section 3 we prove that any (deterministic) approximation algorithm or (exact) randomized algorithm must use a linear amount of space. Section 4 outlines the basic streaming algorithm and provides theoretical approximation guarantees, while Section 5 provides improvements based on the technique of separating the elephant and mice flows. We evaluate our algorithms on real-world traces in Section 6, confirming the effectiveness of our approaches. We discuss some features of our algorithms in Section 7 and related work in Section 8, before concluding in Section 9.

2. PROBLEM FORMULATION

We first outline the notation used in the remainder of the paper, and formulate the problem of estimating entropy in a streaming context. Throughout this paper we will assume that all items coming over the stream are drawn from the set $[n] = \{1, 2, 3, \ldots, n\}$. For example, if we are interested in measuring the entropy of packets over various application ports, then $n$ is the number of ports (bounded by the maximum number of ports for each protocol). Similarly, if we are interested in measuring the entropy of packets over unique source or destination addresses in the traffic stream, then $n$ would have a maximum value of $2^{32}$ for 32-bit IPv4 addresses.

We will denote the frequency of item $i \in [n]$ (e.g., the number of packets seen at port $i$) by $m_i$ and the total number of items in the stream by $m$, i.e., $m = \sum_{i=1}^{n} m_i$. The $j$th item observed in the stream will be denoted by $a_j \in [n]$. We define $n_0$ to be the number of distinct items that appear in the stream, since it is possible that not all $n$ items are present.

As a simple example consider a stream drawn from a set of $n = 4$ different possible objects {A, B, C, D}. Let the stream X = (A, A, B, B, C, A, B, A, C). For this stream, the total number of items is $m = 9$, with the number of distinct items $n_0 = 3$. Note that all our analysis is in terms of $m$, rather than $n$, since in general $n \gg m$.

The natural definition of entropy (sometimes referred to as sample entropy) in this setting is the expression

$$H \equiv -\sum_{i=1}^{n} \frac{m_i}{m} \log\left(\frac{m_i}{m}\right).$$
Intuitively, the entropy is a measure of the diversity or randomness of the data coming over the stream. The entropy attains its minimum value of zero when all the items coming over the stream are the same and its maximum value of $\log m$ when all the items in the stream are distinct. Unless otherwise specified, all logarithms in this paper are to the base 2 and we define $0 \log 0 = 0$. For our example stream X, the entropy is $H(X) = -(4/9) \log(4/9) - (3/9) \log(3/9) - (2/9) \log(2/9) = 1.53$. Often it is useful to normalize this number to compare entropy estimates across different measurement epochs. For this purpose, we define the standardized entropy to be $H / \log m$. In our example, the standardized entropy is $1.53 / \log 9 = 0.48$.
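The worked example above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `sample_entropy` and `standardized_entropy` are our own, and the code keeps exact per-item counts, which is precisely what the streaming algorithms in this paper are designed to avoid.

```python
from collections import Counter
from math import log2


def sample_entropy(stream):
    """Exact sample entropy H = -sum_i (m_i/m) log2(m_i/m) of a finite stream."""
    m = len(stream)
    counts = Counter(stream)  # m_i for every distinct item
    return -sum((c / m) * log2(c / m) for c in counts.values())


def standardized_entropy(stream):
    """Entropy divided by its maximum possible value, log2(m)."""
    m = len(stream)
    if m <= 1:
        return 0.0  # log2(1) = 0, so the ratio is defined to be 0 here
    return sample_entropy(stream) / log2(m)


# The example stream X = (A, A, B, B, C, A, B, A, C) from the text.
X = list("AABBCABAC")
H = sample_entropy(X)
H_std = standardized_entropy(X)
print(round(H, 2), round(H_std, 2))
```

Rounded to two decimals, the two values agree with the 1.53 and 0.48 quoted in the text.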

To compute the entropy,

$$H = -\sum_{i=1}^{n} \frac{m_i}{m} \log\left(\frac{m_i}{m}\right) = -\frac{1}{m}\left[\sum_i m_i \log m_i - \sum_i m_i \log m\right] = \log m - \frac{1}{m}\sum_i m_i \log m_i,$$

it suffices to compute $S \equiv \sum_i m_i \log m_i$, since we can keep a count of $m$ exactly with $\log m$ bits. For the remainder of this paper we will concern ourselves with estimating the value $S$. The measure of accuracy we use to evaluate our estimates is the notion of relative error, which is defined to be $|S - \hat{S}|/S$, where $\hat{S}$ is the estimated value and $S$ the true value. For practical applications in traffic monitoring, we require that the relative error be low (say less than 2-3%), so that the accuracy of applications such as anomaly detection and traffic clustering is not affected.

An accurate estimate of $S$ may not necessarily give an accurate estimate of $H$. In particular, when $H$ is very small and $S$ is close to its maximum value, a small relative error estimate of $S$ may not correspond to a small relative error estimate of $H$. Let $\hat{S}$ be the estimated value of $S$ and $\hat{H}$ the estimated value of $H$ computed from $\hat{S}$, i.e., $\hat{H} = \log m - \hat{S}/m$. Suppose we have an algorithm to compute $S$ with relative error at most $\epsilon$. Then, the relative error in estimating $H$ can be bounded as follows:

$$\frac{|H - \hat{H}|}{H} = \frac{|\log m - S/m - \log m + \hat{S}/m|}{H} = \frac{|S - \hat{S}|}{mH} \le \frac{\epsilon S}{mH}.$$

Note that the relative error in $H$ actually depends on the ratio $S/(mH)$, which can theoretically become arbitrarily high if $H$ is close to zero. However, given reasonable lower bounds for how small $H$ can get, an algorithm that can give an approximation of $S$ with relative error at most $\epsilon$ can be converted to one that gives an approximation of $H$ with relative error $\epsilon' = \Theta(\epsilon)$. Specifically, since we know that $S \le m \log m$, if we assume a lower bound of $\alpha \log m$ for $H$ (for some constant $\alpha$) then the relative error in estimating $H$ is at most $\epsilon' = \epsilon/\alpha$. Thus any approximation scheme for $S$ can be converted to one for $H$ if we can assume a lower bound on the entropy. Our evaluations (Section 6.2) confirm that the errors for $H$ and $S$ are comparable.

3. LOWER BOUNDS

In this paper we will present a randomized approximation algorithm that uses $O(\log m)$ space for computing the value $S$ of a stream. Before we do this, we would like to answer the first question of how much effort is required to estimate the entropy of a given traffic distribution. We will demonstrate that any exact randomized algorithm or any deterministic approximation algorithm needs at least linear (in the length of the stream) space. This motivates the need to use both randomization and approximation.

We first demonstrate that any randomized algorithm to compute $S$ must use $\Omega(m)$ space by reducing the communication complexity problem of set intersection to it. Using communication complexity is a common way to prove lower bounds for streaming algorithms [2, 8]. We show here how to apply it to the computation of $S$ (and hence the entropy $H$).

In the communication complexity model two parties (typically called Alice and Bob), who have non-overlapping but jointly complete parts of the input, wish to compute some function of the input. The communication complexity of the function at input size $n$ is then the largest number of bits that the parties have to communicate using the best protocol to compute the function, for any input of size $n$. There are no bounds on the computational power of either party and the only resource being measured is the number of bits communicated.

For the problem of set intersection, Alice and Bob have subsets $A$ and $B$ of $\{1, \ldots, N\}$ as input. The question is then whether the sets $A$ and $B$ have any elements in common. It is known that the deterministic communication complexity of this problem is $\Theta(N)$ [5]. It was shown by Kalyanasundaram and Schnitger in [] that any communication complexity protocol for set intersection that has probability of error at most $\delta$, for any $\delta < 1/2$, must use $\Omega(N)$ bits of communication. We make use of this result in the proof.

Theorem 1. Any randomized streaming algorithm to compute the exact value of $S$ when there are at most $m$ items must use $\Omega(m)$ bits of space.

Proof.
Let us assume that we have a randomized streaming algorithm that computes $S = \sum_i m_i \log m_i$ for any stream exactly using $s$ bits of space. This gives rise to a communication complexity protocol, using $\Theta(s)$ bits of communication, for computing set intersection that works as follows. Suppose that Alice and Bob have as input subsets of the set $\{1, \ldots, m/2\}$. Alice simulates the algorithm using her set (in any arbitrary order) as input into the algorithm and sends the saved state of the algorithm (at most $\Theta(s)$ bits) to Bob. Bob then restarts the algorithm, starting with that saved state, and enters his entire set. At the end of this run, Bob checks the output of the algorithm: if the output is zero, he outputs "disjoint", otherwise he outputs "not disjoint".

The above protocol relies on the fact that any items that have frequency at most one do not count toward the sum $S$ (since $0 \log 0 = 1 \log 1 = 0$). So, the value of $S$ computed is exactly twice the size of the intersection. If we find that the intersection has size zero then we know that Alice and Bob's sets are disjoint, otherwise they have something in common.

Hence, even if the streaming algorithm is randomized, it must use $s = \Omega(m)$ bits. If it used fewer bits it would lead to a randomized protocol for set intersection with less than $\Omega(N)$ communication, which we know from [] to be impossible.

Theorem 2. Any deterministic streaming algorithm to approximate $S$ with relative error less than $1/3$ must use $\Omega(m)$ bits of space.

Proof. The proof that any (non-randomized) approximation algorithm is inefficient is similar to the proof of Proposition 3.7 in [2]. Let $\mathcal{G}$ be a family of $2^{\Theta(m)}$ subsets of $\{1, \ldots, 2m\}$, such that each subset has cardinality $m/2$ and any pair of distinct subsets has at most $m/4$ elements in common. (It is possible to show such a $\mathcal{G}$ exists using the probabilistic method.)

Let us assume for a contradiction that there exists a deterministic streaming algorithm that estimates $S$ with relative error at most $1/3$, using less than linear (in $m$) space. For every pair of elements $G_1, G_2 \in \mathcal{G}$, let $A(G_1, G_2)$ be the sequence of length $m$ consisting of the elements of $G_1$ in sorted order followed by the elements of $G_2$ in sorted order. By the pigeonhole principle, if the memory used by the algorithm has less than $\log |\mathcal{G}| = \Omega(m)$ bits, then at least two distinct subsets $G_i, G_j \in \mathcal{G}$ result in the same memory configuration when their contents are entered into the algorithm. Hence, the algorithm cannot distinguish between the streams $A(G_i, G_i)$ and $A(G_j, G_i)$. For the input $A(G_i, G_i)$ we have that $S = (m/2)(2 \log 2) = m$, but for $A(G_j, G_i)$, $S \le (m/4)(2 \log 2) = m/2$. Now, if the relative error for $A(G_i, G_i)$ is less than $1/3$, its estimated value is more than $2m/3$, but if the relative error for $A(G_j, G_i)$ is less than $1/3$ its estimated value is less than $2m/3$. Therefore, the algorithm makes a relative error of at least $1/3$ on at least one of these inputs. This tells us that any non-randomized algorithm must either use $\Omega(m)$ space or have a relative error of at least $1/3$.

Thus, we see that if we use only randomization or only approximation we cannot hope to use a sublinear amount of space. As a result, the following algorithms that we present are both randomized and approximate. Fortunately, when we allow for these two relaxations we get algorithms that are sublinear (in particular, polylogarithmic) in space and time per item.

4. A STREAMING ALGORITHM

In this section we present our first algorithm and show guarantees on the performance and the size of the memory footprint. The basic algorithm is based on the key insight that estimating $S$ is structurally similar to estimating the frequency moments [2]. The advantage of this technique is that it gives an unbiased estimate of the entropy, with strong theoretical guarantees on the space consumption based upon the desired accuracy of the algorithm.
We then show how the assumptions and analysis of the algorithm can be further tightened.

4.1 Algorithm

As demonstrated in the previous section, randomization and approximation alone do not allow us to estimate $S$ efficiently. Hence, we present an algorithm that is an $(\epsilon, \delta)$-approximation of $S$. An $(\epsilon, \delta)$-approximation algorithm is one that has a relative error of at most $\epsilon$ with probability at least $1 - \delta$, i.e., $\Pr(|X - \hat{X}| \le X\epsilon) \ge 1 - \delta$, where $X$ and $\hat{X}$ are the real and estimated values, respectively. This algorithm uses the idea of the celebrated Alon-Matias-Szegedy frequency moment estimation algorithm [2].

Conceptually, the algorithm can be divided into three stages. In the first stage we select random locations in the stream. These locations decide the set of counters that the algorithm tracks during the online stage. In the second stage, the online stage, we keep track of the number of occurrences of items that appear at the randomly selected locations. For each selected item, we keep an exact counter for the number of subsequent occurrences of that item. For example, if position $k$ in the stream was selected, we keep an exact counter for the item at position $k$ (denoted as $a_k$) for the remainder of the stream (i.e., between locations $k$ and $m$). In the third and final stage the algorithm uses the various counters it has tracked to obtain an estimator for the $S$ value of the stream.

Algorithm 1: The streaming algorithm

    Pre-processing stage
        z := 32 log m / ε^2,  g := 2 log(1/δ)
        choose z·g locations in the stream at random
    Online stage
        for each item a_j in the stream do
            if a_j already has one or more counters then
                increment all of a_j's counters
            if j is one of the randomly chosen locations then
                start keeping a count for a_j, initialized at 1
    Post-processing stage
        // View the g·z counts as a matrix c of size g × z
        for i := 1 to g do
            for j := 1 to z do
                X_{i,j} := m · (c_{i,j} log c_{i,j} − (c_{i,j} − 1) log(c_{i,j} − 1))
        for i := 1 to g do
            avg[i] := the average of the Xs in group i
        return the median of avg[1], ..., avg[g]
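The three stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `estimate_S`, `estimator_at`, `exact_S`, and `entropy_from_S` are ours, the random positions are drawn with replacement, $z$ and $g$ are rounded up to integers, the stream is held in a list so that $m$ is known in advance, and every counter of an item is updated on each arrival (the simplest reading of the online stage).

```python
import math
import random
from collections import Counter, defaultdict
from statistics import median


def xlogx(c):
    """c * log2(c), with the convention 0 log 0 = 0."""
    return c * math.log2(c) if c > 0 else 0.0


def exact_S(stream):
    """Reference value S = sum_i m_i log2(m_i), using exact per-item counts."""
    return sum(xlogx(c) for c in Counter(stream).values())


def entropy_from_S(S, m):
    """H = log2(m) - S/m."""
    return math.log2(m) - S / m


def estimator_at(stream, pos):
    """Value of X when the single sampled position is `pos`."""
    c = stream[pos:].count(stream[pos])  # occurrences from pos to the end
    return len(stream) * (xlogx(c) - xlogx(c - 1))


def estimate_S(stream, eps, delta, rng):
    """Median-of-averages estimate of S; requires len(stream) >= 2."""
    m = len(stream)
    z = math.ceil(32 * math.log2(m) / eps ** 2)  # estimators per group
    g = math.ceil(2 * math.log2(1 / delta))      # number of groups
    # Pre-processing: draw z*g positions uniformly at random (with replacement).
    draws_at = defaultdict(list)
    for k in range(z * g):
        draws_at[rng.randrange(m)].append(k)
    # Online stage: counter k counts occurrences of its item from its position on.
    counts = [0] * (z * g)
    active = defaultdict(list)  # item -> indices of counters tracking it
    for j, a in enumerate(stream):
        for k in active[a]:
            counts[k] += 1
        for k in draws_at.get(j, ()):
            counts[k] = 1
            active[a].append(k)
    # Post-processing: one X per counter, average per group, median of averages.
    X = [m * (xlogx(c) - xlogx(c - 1)) for c in counts]
    avgs = [sum(X[i * z:(i + 1) * z]) / z for i in range(g)]
    return median(avgs)


if __name__ == "__main__":
    rng = random.Random(0)
    stream = [i % 50 for i in range(1000)]
    rng.shuffle(stream)
    S_hat = estimate_S(stream, eps=0.25, delta=0.25, rng=rng)
    print(S_hat, exact_S(stream), entropy_from_S(S_hat, len(stream)))
```

Averaging `estimator_at` over every position of a stream reproduces `exact_S` up to floating-point error, which is the unbiasedness property the algorithm relies on. The sketch trades efficiency for clarity; it does not use the single-record-per-item data structure.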
The goal of the post-processing or estimating stage is to obtain an estimate of $S$ that is unbiased and whose error is provably low.

We present the pseudocode for this algorithm in Algorithm 1. In the pre-processing stage we need to choose $z \cdot g$ locations in the stream. Note that for this stage we need to know the length of the stream to both compute $z$ and to choose the random locations. The choice of the random locations can be deferred as described in [2], and to compute $z$ we can use a safe overestimate for $\log m$ without increasing the space too much. In the online stage, for each such position we keep a counter $c$ for that item from that position on. We update at most one record per item during the online stage, using a data structure described in the following section. In the post-processing stage, for each of the tracked counters we compute an unbiased estimator for $S$ as follows: $X = m\,(c \log c - (c - 1) \log (c - 1))$. These $g \cdot z$ unbiased estimators are then divided into $g$ groups each containing $z$ variables. First we compute the average over each of the $g$ groups, and then obtain the median of the groups as our returned estimate for $S$.

Intuitively, the estimator variable $X$ provides us an unbiased estimate of $S$, but does not give good guarantees on the variance, and hence the relative error. By computing many such estimates, and obtaining the median over the averages of multiple groups, we can provide rigorous guarantees on the error as we will see in Section 4.3.

4.2 Implementation Details

One major advantage of this algorithm is that it is lightweight. For any item in the stream, the algorithm has to update its count if the item is being counted. Checking whether the item is being counted can be done very quickly using a hash table. However, it is possible that a single item has multiple records for it. In the worst case, we would need to update every record for each item. We could greatly improve the efficiency of the algorithm by instead keeping a single record for every unique item. This can be implemented by only updating the most recent record for that item and maintaining a pointer to the next most recent record. When the entire stream has been processed, the counts for the older records can be reconstructed from those of the newer ones.

The record data structure that we suggest is illustrated in Figure 1. Each record in our implementation would require 200 bits because we would need to store the item label ITEM_LABEL (100 bits), the counter for the item COUNTER (32 bits), a pointer CHAINING_PTR (32 bits) to resolve hash collisions if we use chaining, and another pointer PREV_PTR (32 bits) to point to the older records for the item. We use a conservative estimate of 100 bits for each item label, assuming that we would store all 5 main IP packet header fields, i.e., srcaddr, dstaddr, srcport, dstport, protocol.

Figure 1: The record data structure. Fields: CHAINING_PTR (32 bits), ITEM_LABEL (~100 bits), COUNTER (32 bits), PREV_PTR (32 bits).

At the end of each epoch the algorithm needs to perform the operations of averaging and finding the median of a list. However, both these operations only need to be done in the post-processing step. If we make an epoch sufficiently large, then these computations need be done relatively infrequently.

4.3 Theoretical Guarantees

We present analysis that shows we can give strong guarantees while using very little space. The proof is along the lines of the one in [2] and the main contribution here is to show how the variance of the variable $X$ can be bounded to give such a small space requirement. The proof requires the assumption that $S \ge m$ or, equivalently, that $H \le \log m - 1$. We show in Section 4.5 why this assumption is reasonable.

Theorem 3. If we assume that $S \ge m$, then Algorithm 1 is an $(\epsilon, \delta)$-approximation algorithm for $S$ that uses $O(\log m \log(1/\delta)/\epsilon^2)$ records.

Proof. We will first show that the variable $X$ is an unbiased estimator for $S$.
We will then make use of Chebyshev's inequality to bound the probability of having a relative error greater than $\epsilon$. Next, we show that if we average $z = 32 \log m/\epsilon^2$ variables, this probability is at most $1/8$. We can then use Chernoff bounds to show that if we take $g = 2 \log(1/\delta)$ such averages, with probability at least $1 - \delta$ more than half of them have less than $\epsilon$ relative error. In this case, the median of the averages must have relative error less than $\epsilon$.

We first observe that the expected value of each variable $X$ is an unbiased estimate of our desired quantity $S$:

$$E[X] = \frac{m}{m} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \big(j \log j - (j-1) \log (j-1)\big) = \sum_{i=1}^{n} m_i \log m_i = S.$$

To make use of Chebyshev's inequality, we need to bound the variance of $X$ from above, in terms of $S^2$. The bound proceeds as follows:

$$Var(X) = E(X^2) - E(X)^2 \le E(X^2) = \frac{m^2}{m} \sum_{i=1}^{n} \sum_{j=2}^{m_i} \big(j \log j - (j-1) \log (j-1)\big)^2.$$

Now we observe that

$$n \log n - (n-1) \log (n-1) = \log \frac{n^n}{(n-1)^{n-1}} \le \log \frac{n^n}{n^{n-2}} = 2 \log n, \qquad (1)$$

where the inequality comes from the facts that the logarithm function is monotonically increasing and that for all $n > 1$, $n^{n-2} \le (n-1)^{n-1}$, which is proven as follows. For $n = 2$ the fact can easily be checked. For all other $n$, $n > e$, so

$$\frac{n^{n-2}}{(n-1)^{n-1}} = \frac{1}{n}\left(\frac{n}{n-1}\right)^{n-1} = \frac{1}{n}\left(1 + \frac{1}{n-1}\right)^{n-1}.$$

This is at most $e/n$. So, the inequality holds. Now, substituting (1) into the bound on the variance, we get that

$$Var(X) \le m \sum_{i=1}^{n} \sum_{j=2}^{m_i} (2 \log j)^2 \le 4m \sum_{i=1}^{n} m_i \log^2 m_i \le 4m \log m \Big(\sum_i m_i \log m_i\Big) \le 4S \log m \Big(\sum_i m_i \log m_i\Big) = 4S^2 \log m,$$

where for the last inequality we make use of our assumption that $S \ge m$.

Let the average of the $i$th group be $Y_i$. We know that $Var(Y_i) = Var(X)/z$ and that it is also an unbiased estimator of $S$. Applying Chebyshev's inequality, we get that for each $Y_i$,

$$\Pr(|Y_i - S| > \epsilon S) \le \frac{Var(Y_i)}{\epsilon^2 S^2} \le \frac{4S^2 \log m}{z \epsilon^2 S^2} = \frac{4 \log m}{z \epsilon^2} \le \frac{1}{8}.$$

Now, by Chernoff bounds we get that with probability at least $1 - \delta$, at least $g/2$ of the averages have at most $\epsilon$ relative error. Hence, the median of the averages has relative error at most $\epsilon$ with probability at least $1 - \delta$.

Note that if we had chosen $z = \log m/(\epsilon^2 \delta)$ we could have guaranteed an error probability of at most $\delta$ with just this one bigger group. While the analysis in the proof works well for smaller $\delta$ (i.e., $\delta \le 1/128$), for practical applications we may want to use larger $\delta$. Because of the independence of each run, with $\delta = 10\%$ we detect anomalous entropy values within one epoch with 90% certainty, within two epochs with 99% certainty and so on. For the case where $\delta$ is greater than $1/128 \approx 0.8\%$ we can use the average of a single group of $z = \log m/(\epsilon^2 \delta)$ estimators for our estimate.

The total space (in bits) used by this algorithm is

$$O\left(\frac{\log m \log(1/\delta)}{\epsilon^2} (\log n + \log m)\right).$$

For fixed $\delta$ and $\epsilon$ this algorithm uses $O(\log m)$ records of size $O(\log m + \log n)$ bits.

Numerical Illustration: To put this into a practical perspective, let us consider an example where we have a stream whose length $m$ is in the millions, with $n$ = 6 million distinct items. To compute the entropy exactly, we could have to maintain counts for each item using 6 million item labels and counters (132 bits/record × 6 million records = 94 MB). Using Algorithm 1 we could approximate the entropy with at most 25% relative error at least 75% of the time with 54 thousand records or 1.4 MB, using 200-bit records as discussed earlier.

4.4 Exact Space Bounds

In practical settings we want to know the exact values of the parameters of the above algorithm so that we use as little space as possible. We tighten the bound on the number of groups needed by making the observation that

$$\frac{j^j}{(j-1)^{j-1}} = j\left(1 + \frac{1}{j-1}\right)^{j-1} < ej.$$

Here is a tighter (nonasymptotic) analysis for the bound on the variance:

Theorem 4. If we assume that $S \ge m$, then Algorithm 1 can be modified to use exactly $\dfrac{(16 \log m + 64) \log(1/\delta)}{\epsilon^2}$ records.

Proof.
$$
\begin{aligned}
E(X^2) &= m \sum_{i=1}^{n} \sum_{j=1}^{m_i} \big(j \log j - (j-1) \log (j-1)\big)^2 \\
&\le m \sum_{i=1}^{n} \sum_{j=1}^{m_i} \log^2 (ej) \\
&= m \Big(\sum_{i=1}^{n} \sum_{j=1}^{m_i} \big(\log^2 j + \log^2 e + 2 \log e \log j\big)\Big) \\
&\le m \Big(\sum_{i=1}^{n} m_i \log^2 m_i + m \log^2 e + 2S \log e\Big) \\
&\le S \Big(\sum_{i=1}^{n} m_i \log^2 m_i + m \log^2 e + 2S \log e\Big) \qquad (2) \\
&\le S \big(S \log m + m \log^2 e + 2S \log e\big) \\
&= S^2 \big(\log m + m \log^2 e / S + 2 \log e\big) \\
&\le S^2 \big(\log m + \log^2 e + 2 \log e\big) \qquad (3) \\
&\le S^2 (\log m + 5),
\end{aligned}
$$

where (2) and (3) require the assumption that $S \ge m$. Hence we have that the variance $Var(X) = E(X^2) - (E(X))^2 \le S^2 (\log m + 4)$. So, we see that $z = \dfrac{8 \log m + 32}{\epsilon^2}$ suffices.

Numerical Illustration: Returning to our example of a stream of size 67 million, the above improvements would drop the number of records for the case of at most 25% error with 75% probability to just 16 thousand (400 KB).

4.5 A Note on Assumptions

For the above analysis, we needed to make the assumption that $S \ge m$. It is not hard to see (and prove) that we need some kind of lower bound on the value of $S$ to protect ourselves from the case that we are trying to distinguish two streams of low $S$ value. If one stream has all unique elements (so that $S = 0$) and another has only one repeated element, then it is very hard to distinguish them. However, we must distinguish them to have less than 100% relative error.

Assuming that $S \ge m$, or that $H \le \log m - 1$, is reasonable because $H$ attains its maximum value at $\log m$. We now show some other conditions that give us that $S \ge m$, thereby making them reasonable assumptions to make.

Theorem 5. If $m \ge 2n_0$ then $S \ge m$.

Proof. It is easy to show using Lagrange multipliers that $S$ attains its minimum value when all the items in the stream have the same count. Hence, a lower bound for $S$ is

$$S \ge n_0 \cdot \frac{m}{n_0} \log (m/n_0) = m \log (m/n_0).$$

Since we have assumed that $m \ge 2n_0$, this gives us that $S \ge m \log (m/n_0) \ge m \log 2 = m$.

Hence, we need only assume that each item in the stream appears at least twice on average. This assumption protects us from the case described earlier and in any setting where $S$ can get arbitrarily small. We feel that in any practical setting this simple assumption is very reasonable.
For example, on all the traces that we experimented on, the factor $m/n_0$ was in the range of 5 to 3.
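The record counts given by Theorems 3 and 4 can be tabulated with a short Python sketch. The helper names (`records_theorem3`, `records_theorem4`, `records_to_bytes`) are ours, and two inputs are assumptions made for illustration: the stream length is taken to be $m = 2^{26}$ (about 67 million), and each record is taken to occupy 200 bits. Results are rounded up to whole records.

```python
import math


def records_theorem3(m, eps, delta):
    """z * g with z = 32 log2(m)/eps^2 and g = 2 log2(1/delta)."""
    z = 32 * math.log2(m) / eps ** 2
    g = 2 * math.log2(1 / delta)
    return math.ceil(z * g)


def records_theorem4(m, eps, delta):
    """(16 log2(m) + 64) * log2(1/delta) / eps^2."""
    return math.ceil((16 * math.log2(m) + 64) * math.log2(1 / delta) / eps ** 2)


def records_to_bytes(records, bits_per_record=200):
    """Memory footprint in bytes; the 200-bit record size is an assumption."""
    return records * bits_per_record / 8


if __name__ == "__main__":
    m = 2 ** 26  # assumed stream length, roughly 67 million items
    for name, fn in (("Theorem 3", records_theorem3),
                     ("Theorem 4", records_theorem4)):
        r = fn(m, eps=0.25, delta=0.25)
        print(f"{name}: {r} records, {records_to_bytes(r) / 1e6:.2f} MB")
```

Under these assumptions the tighter analysis reduces the requirement from roughly 53 thousand records to roughly 15 thousand for 25% relative error with 75% probability, the same order as the figures quoted in the numerical illustrations.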

4.6 A Constant-space Solution

As it turns out, if we make a stronger (but still reasonable) assumption on how large the entropy can get, we can make the space usage of the algorithm independent of $m$ (assuming fixed sized records). Upper bounding the entropy is reasonable to do since even during abnormal events (e.g., worm attacks), when the randomness of the distributions is increased, there will still be a sufficiently large amount of legitimate activity to offset the increased randomness.

Recall that $H$ attains its maximum at $\log m$, when each of the items in the stream appears exactly once. We will assume that $H \le \beta \log m$. This gives us the following bound on $S$:

$$S = m \log m - mH \ge m \log m - \beta (m \log m) = (1 - \beta)\, m \log m.$$

We can now apply this to decrease the space usage of our algorithm:

Theorem 6. If we assume that $H \le \beta \log m$, then Algorithm 1 can be modified to use exactly $\dfrac{64 \log(1/\delta)}{(1 - \beta)\epsilon^2}$ records.

Proof. We once again bound the variance:

$$
\begin{aligned}
Var(X) = E(X^2) - E(X)^2 \le E(X^2) &= \frac{m^2}{m} \sum_{i=1}^{n} \sum_{j=2}^{m_i} \big(j \log j - (j-1) \log (j-1)\big)^2 \\
&\le m \sum_{i=1}^{n} \sum_{j=2}^{m_i} (2 \log j)^2 \\
&\le 4m \sum_{i=1}^{n} m_i \log^2 m_i \\
&\le 4m \log m \Big(\sum_i m_i \log m_i\Big) \\
&\le 4S^2/(1 - \beta).
\end{aligned}
$$

Hence, we need only $z = \dfrac{32}{(1 - \beta)\epsilon^2}$ groups, which is independent of $m$. The desired bound on the number of records follows from this.

Numerical Illustration: For a stream with 67 million packets, if we make the simple assumption that the entropy never goes above 90% of its maximum value then we need 21 thousand records (525 KB), and if we assume that it never exceeds 75% of its maximum value then we only need 8,200 records (205 KB). Note that these space bounds will not increase with the size of the stream; they depend only on the error parameters. Hence, we can use a few hundred kilobytes for arbitrarily large streams, as long as we can safely make an assumption about how large its standardized entropy can get.

5. SEPARATING THE ELEPHANTS FROM THE MICE

The algorithm described in the previous section provides worst-case theoretical guarantees independent of the structure of the underlying traffic distributions.
In practice, however, most network traffic streams have significant structure. In particular, a simple but useful insight [6] is that traffic distributions often have a clear demarcation between large flows (or elephants) and smaller flows (or mice). A small number of elephant flows contribute a large volume of traffic, and for many traffic monitoring applications it may often suffice to estimate the elephants accurately.

In our second algorithm (see Algorithm 2) we make use of the idea of separating the elephants from the mice in the stream. By separately estimating the contribution of the elephants and mice to the entropy we can further improve the accuracy of our results, thereby also decreasing the space usage of the algorithm. We believe that such a sieving idea has much broader applicability: other streaming algorithms for estimating different traffic statistics can potentially benefit from it. Intuitively, the amount of space needed by the first algorithm is directly proportional to the variance of the estimator X (see Section 4.3), and by sieving out the high-count items we can significantly decrease the variance of the estimator and hence the space required.

For this algorithm we change the method of sampling slightly. Rather than pre-computing positions in the stream (which requires foreknowledge of the length of the stream), we sample each location with some small probability. After an item is sampled, an exact count is maintained for it, similar to the Sample and Hold algorithm described in [6]. If an item is sampled exactly once, then we consider it a mouse and compute the entropy of the mice using the previous algorithm. If an item is sampled more than once, we consider it an elephant and estimate its exact value. Note that this method is different from [9] in that we are looking for items that are sampled multiple times, not necessarily in consecutive samples.

Once an item is sampled a second time, it is considered an elephant. To estimate its exact value (i.e.,
to compensate for the number of times the item appeared before it was first sampled), we simply add the count between the first and second sampling. Intuitively, the number of occurrences of the item between successive samples should be equal if it is evenly distributed. This method of approximating the exact count of the elephant was empirically found to be a good estimator.

The record data structure for this sampling method is similar to that used by Algorithm 1. The main difference is that we no longer need a pointer to older copies of an item since we only maintain a single count for each unique item. To be able to tell whether the item has been sampled before or not (to determine whether it should be promoted to an elephant) we require just a single additional bit. Thus, we see that this sampling method requires minimal overhead to separate the elephants from the mice.

The sieving algorithm assumes that every flow that is sampled twice is elevated to the status of an elephant. Rather than choose the elevation threshold arbitrarily, we evaluated different values of the threshold before arriving at the number two. Figure 2 shows the relative error in estimating S (of the destination address distribution) as a function of k, the threshold for promoting mice to elephants, for three different packet traces. The next section provides further details on the traces used in our evaluations. We observe that the lowest error is achieved with a value of k = 2. Intuitively, a higher strike-threshold decreases the number of elephants, and we do not achieve the desired elephant-mice separation.
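The sampling and promotion scheme described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `sieve_online` and `elephant_contribution` are hypothetical, the counter is assumed to include the sampled occurrence itself, the compensation term is taken to be the counter value just before the second sampled occurrence, and logarithms are base 2. The mice contribution, which the text delegates to Algorithm 1, is deliberately left out.

```python
import math
import random


def sieve_online(stream, p, rng=None):
    """Online stage of a sieving-style sampler with threshold k = 2.

    Each stream position is sampled with probability p. The first sample of
    an item starts an exact counter for it; a second sample promotes the item
    to elephant status. Returns (elephants, mice), both dicts of item -> count.
    """
    rng = rng or random.Random()
    held = {}   # item -> occurrences counted since (and including) first sample
    bonus = {}  # elephant -> compensation for occurrences missed before first sample
    for item in stream:
        sampled = rng.random() < p
        if item in held:
            if sampled and item not in bonus:
                # Second sample: assume the unseen prefix is about as long as
                # the gap between the first and second sample (an assumption).
                bonus[item] = held[item]
            held[item] += 1
        elif sampled:
            held[item] = 1
    elephants = {x: held[x] + bonus[x] for x in bonus}
    mice = {x: c for x, c in held.items() if x not in bonus}
    return elephants, mice


def elephant_contribution(elephants):
    """S_e = sum of c * log2(c) over the estimated elephant counts."""
    return sum(c * math.log2(c) for c in elephants.values())


if __name__ == "__main__":
    stream = ["a"] * 500 + ["b", "c", "d"] * 20 + ["a"] * 500
    elephants, mice = sieve_online(stream, p=0.05, rng=random.Random(7))
    print(elephants, mice, elephant_contribution(elephants))
```

Passing the random source in as `rng` keeps the sketch reproducible; the single `bonus` entry per elephant plays the role of the one extra bit of state mentioned in the text, plus the stored compensation.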

Algorithm 2: The sieving algorithm

    Online stage
    for each item in the stream do
        if the item is sampled then
            if the item is already being counted then
                promote the item to elephant status
            else
                allocate space for a counter for this item
        else
            increment the counter for this item, if there is one

    Post-processing stage
    S_e := 0
    for each elephant (with estimated count c) do
        S_e := S_e + c log c
    estimate the contribution of the mice S_m from the remaining counts using Algorithm 1
    return S_e + S_m

Figure 2: Selecting the threshold k for Sieving. (Plot of relative error against the threshold for elephant status, k, for Trace 1, Trace 2, and Trace 3.)

A natural question with such a sieving algorithm is one regarding the relative weights of the two different contributing factors. Intuitively, if either the elephant or the mice flows are not substantial contributors, then we can potentially reduce the space usage further by ignoring the contribution of the insignificant one. We empirically confirmed the need for accurate estimation of both the elephant and the mice flows. Figure 3 shows the relative contribution of the elephant and mice flows to the S estimate (for the destination address distribution on Trace 1). We observe that both elephant and mice flows have substantial contributions to the overall estimation, and ignoring one of them can yield inaccurate results for estimating S, and hence H. The results across different traces and across different traffic distributions of interest were similar and are omitted for brevity.

6. EVALUATION

We first describe the datasets used in this paper. We then present a comparison of the two streaming algorithms introduced in this paper with other sampling-based approaches. There are two natural metrics for characterizing the performance of the streaming algorithm for entropy computation: resource usage and error.
The resource usage is related to the number of counters used by different algorithms, which directly translates into the total memory (SRAM) requirements of the algorithm, and the total CPU usage. For the following evaluations, we use the notion of relative error to determine the accuracy of different algorithms.

Figure 3: Confirming that estimating both elephants and mice is necessary. (Plot of the relative contribution of elephants and mice to the S estimate, by epoch.)

Datasets: We use three different packet-header traces for evaluating the accuracy of our algorithms. We provide a brief description of each.²

University Trace (Trace 1): Our first packet trace is an hour-long packet trace collected from USC's Los Nettos collecting facility on Feb 2, 2004. We bin the trace into 1-minute epochs, with each epoch containing roughly .7 million TCP packets, 3267 distinct IP addresses, and 565 ports per minute. We refer to this as Trace 1 in the following discussion.

Department Trace (Trace 2): We use a 5-hour long packet trace collected on Aug 5, 2003 at the gateway router of a medium-sized department with approximately 35 hosts. We observe all traffic to and from the 35 hosts behind the access router to the commercial Internet, other non-department university hosts and servers. We bin the dataset into 5-minute epochs for our evaluation, with each epoch observing 5 TCP packets, 2587 distinct addresses, and 4672 distinct ports on average. We refer to this as Trace 2 in the following discussion.

University Trace (Trace 3): The third trace we use is an hour-long trace collected at the access link of the university to the rest of the Internet, at UNC on Apr 24, 2003. We bin this trace into 1-minute epochs as with Trace 1. Each epoch contains on average 2.5 million packets, distinct IP addresses, and 88 unique application ports. We refer to this as Trace 3.

Distributions of Interest: We focus on two main types of distributions for our evaluation.
The number of distinct source and destination addresses observed in a dataset, and the distribution of traffic across destinations, are typically affected by network attacks, including DDoS and worm attacks. We track the distribution of traffic across different addresses for the source and destination addresses.

Understanding the application mix that traverses a network can usually be mapped into a study of the distribution of traffic on different application ports. The distribution of traffic across different ports can also be indicative of scanning attacks or the emergence of new popular applications.

In each case we are interested in the distribution of the number of packets observed at each port or address (source or destination) within the measurement epoch. Lakhina et al. [17] give an overview of different types of network events and distributions that each would affect.

² The university traces are available on request from the respective universities. The department trace is a private dataset from CMU.

For Algorithm 1 we use the assumption that m/n ≥ 2 since it is both a weak assumption (i.e., weaker than the one made in Section 4.6) and easy to check. To confirm that this assumption holds for our traces and distributions, we present the ratio m/n for them here. For Trace 1 the ratio is roughly 55 for the addresses and 5 for the ports. For Trace 2, m/n is around 93 for the addresses and 95 for the ports. Lastly, the ratio is around 97 for the addresses and 3 for the ports in Trace 3. Thus we see that in all of our traces the assumption is satisfied.

6.1 Comparison with Sampling Algorithms

We first evaluate the accuracy of estimation of our streaming algorithms by comparing them against the following:

1. Sampling: This is the well-known uniform packet sampling approach used in most commercial router implementations [20]. Given a sampling probability p, the sampling approach will pick each packet independently with probability p. The estimation of S and H is performed over the set of sampled packets, after normalizing the counts by 1/p.

2. Sample and Hold: This is the sampling approach proposed by Estan and Varghese [6]. Here, given a sampling probability p, the algorithm picks each item in the stream with probability p and keeps an exact count for that item from that point on. Each sample is appropriately renormalized (incrementing by a factor 1/p) to account for occurrences of the item before it was sampled.

The Sieving algorithm introduced in Section 5 is also conceptually a sampling algorithm, similar to Sample and Hold, which selects a sampling probability p a priori.
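The uniform packet sampling baseline can be illustrated with a short sketch. It assumes S = Σ m_i log m_i and H = log m − S/m (the relation between S and H used in Section 4.6), base-2 logarithms, and hypothetical function names; it is meant only to show where the 1/p normalization enters, not to reproduce the evaluated implementations.

```python
import math
import random
from collections import Counter


def s_value(counts):
    """S = sum of c * log2(c) over per-item counts (zero counts are skipped)."""
    return sum(c * math.log2(c) for c in counts if c > 0)


def entropy_from_counts(counts):
    """H = log2(m) - S / m, where m is the total of the counts."""
    counts = list(counts)
    m = sum(counts)
    if m <= 0:
        raise ValueError("at least one item is needed")
    return math.log2(m) - s_value(counts) / m


def sampled_counts(stream, p, rng=None):
    """Keep each packet independently with probability p, then scale counts by 1/p."""
    rng = rng or random.Random()
    kept = Counter(x for x in stream if rng.random() < p)
    return {x: c / p for x, c in kept.items()}


def relative_error(estimate, truth):
    """Relative error of an estimate with respect to the true value."""
    return abs(estimate - truth) / abs(truth)


if __name__ == "__main__":
    stream = ["a"] * 600 + ["b"] * 300 + [f"x{i}" for i in range(100)]
    true_h = entropy_from_counts(Counter(stream).values())
    est_h = entropy_from_counts(sampled_counts(stream, 0.1, random.Random(1)).values())
    print(true_h, est_h, relative_error(est_h, true_h))
```

Items that are never sampled are simply absent from the estimate, which is one intuition for why plain sampling tends to misjudge the contribution of the many small flows.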
In order to perform a fair comparison of the performance across the different algorithms, we normalize the number of records to keep track of to be the same. For the following experiments we fix the sampling probability p, and pick the (ε, δ) values for Algorithm 1 such that the number of counters used across the different algorithms is the same.

We implemented and tested our algorithms on commodity hardware (Intel Xeon 3. GHz desktops with GB of RAM). We found that the total CPU utilization for the streaming algorithms was very low. Even though we used a preliminary implementation with very few code optimizations, each measurement epoch took less than seconds to process when the epoch length was an entire minute. This demonstrates that our algorithm can comfortably run in real time. We also found that the post-processing step consumed a negligible fraction of the time of each run.

Since all of the algorithms are randomized or sampling-based, for the following results we present the mean relative errors and estimates over 5 independent runs. We found that the standard deviations were very small, and do not present the deviations for clarity of presentation.

Figure 4 compares the performance of different algorithms across different traces, using a sampling rate of p =. for the different algorithms, and using ε and δ for Algorithm 1 such that the number of counters used in all four algorithms is roughly the same. The figures show the CDF of the relative error in estimating the entropy of destination addresses observed across different measurement epochs. The streaming algorithms consistently outperform the sampling-based approaches. For example, on Trace 1 we observe that the worst-case relative error with the sampling-based approaches can be as high as 8%, whereas the streaming algorithms guarantee an error of at most 6% (Algorithm 1) and 4% (Sieving). We also find that the sieving algorithm provides substantially more accurate estimates for the same space usage compared to the basic streaming algorithm.
The sieving algorithm has a worst-case error of at most 2-5%, which bodes well for the practical utility of the algorithms for traffic monitoring applications.

For the rest of the discussion, for brevity we only present the results from Trace 1, and summarize the results from Trace 2 and Trace 3. Figure 5 compares the CDF of relative error across measurement epochs, for different distributions of interest from Trace 1. We observe a similar trend across algorithms: the sieving algorithm is consistently better than Algorithm 1, which again is substantially more accurate than the sampling-based approaches. Both the streaming algorithms have a worst-case error of 7% and mean error of less than 3% across all the different traffic metrics of interest, which is a tolerable operating range for typical monitoring applications, confirming the practical utility of our approaches. We summarize the results for the other two traces in Table 1 and Table 2.

Table 1: Trace 2: Mean relative error in S estimate. Columns: Distribution, Sample, Sample&Hold, Algo. 1, Sieving. Rows: DSTADDR, SRCADDR, DSTPORT, SRCPORT.

Table 2: Trace 3: Mean relative error in S estimate. Columns: Distribution, Sample, Sample&Hold, Algo. 1, Sieving. Rows: DSTADDR, SRCADDR, DSTPORT, SRCPORT.

Error in estimating entropy

Recall from our discussion in Section 2 that an accurate estimate of S does not necessarily translate into an accurate estimate of H. However, we find from our evaluations that the streaming algorithms can yield very accurate estimates of H as well. Figures 6(a) and 6(b) compare the relative error in estimating S to the relative error in estimating H, for Algorithm 1 and the sieving algorithm respectively. We observe that, across different traces and distributions, the relative error in estimating H is very low as well (less than 3% mean error with the sieving algorithm). Figure 7 also provides visual confirmation of the utility of the different algorithms in tracking the standardized entropy for the destination address distribution.
The sieving algorithm once again appears to have greatest accuracy, which can be confirmed with visual inspection. We summarize the results for the other two traces in Table 3 and Table 4, and we observe that in each case the error in H is comparable to (or less than) the corresponding error in S (Tables 1 and 2 respectively).

Figure 4: Comparing performance of different traces for estimating destination address entropy. Panels: (a) Trace 1, (b) Trace 2, (c) Trace 3. Each panel plots the fraction of measurement epochs for Sampling, Sample and Hold, Algorithm 1, and the Sieving Algorithm.

Figure 5: Comparing different distributions, using Trace 1. Panels: (a) Destination Address, (b) Source Address, (c) Destination Port, (d) Source Port. Each panel plots the fraction of measurement epochs for Sampling, Sample and Hold, Algorithm 1, and the Sieving Algorithm.

Figure 6: Error in S vs. relative error in H. Panels: (a) Algorithm 1, (b) Sieving algorithm. Each panel plots the relative error in H and in S by epoch.

Last, we vary the memory consumption of the algorithm, and show how the mean and maximum relative errors (for destination address entropy on Trace 1) vary as a function of the memory usage in Figure 8. We observe that the streaming algorithms have an order of magnitude lower error than the sampling algorithms, and can achieve very high accuracy (< 2% mean error), even with as low as KB of SRAM usage. Note that even though the sampling algorithms also can give reasonably low errors at higher memory consumption (> 8 KB), the corresponding sampling rates are much higher (> in 2 packet sampling) than what is feasible for very high-speed links.

Table 3: Trace 2: Mean relative error in H estimate. Columns: Distribution, Sample, Sample&Hold, Algo. 1, Sieving. Rows: DSTADDR, SRCADDR, DSTPORT, SRCPORT.

Table 4: Trace 3: Mean relative error in H estimate. Columns: Distribution, Sample, Sample&Hold, Algo. 1, Sieving. Rows: DSTADDR, SRCADDR, DSTPORT, SRCPORT.

7. DISCUSSION

One interesting observation from our evaluations is that the observed errors on the traffic traces are much smaller than the theoretical guarantees for Algorithm 1. In particular, we observe that the empirical error is at least one order of magnitude smaller than the theoretical error guarantee. This is because the algorithm must guarantee the error bound for any stream with any distribution. Real-world packet traces have considerable underlying structure that the algorithm cannot directly take advantage of.

Figure 7: Verifying the accuracy in estimating the standardized entropy. Panels: (a) Algorithm 1, (b) Sieving algorithm. Each panel plots the actual and estimated standardized entropy by epoch.

Figure 8: Error in estimating H vs. memory usage for different algorithms. Panels: (a) Mean error, (b) Maximum error. Each panel plots relative error against space usage (in KB) for Sample, Sample&Hold, Algorithm 1, and the sieving algorithm.

It now follows that one way to tighten the bounds on the space/error tradeoff is to make reasonable assumptions about the distribution of the stream and have our algorithms take advantage of them. In Section 4.6 we demonstrate this by making the simple assumption that the standardized entropy of the stream never goes above some fixed constant. This gives us an algorithm that needs a fixed number of records, independent of the size of the stream. Such additional assumptions can help in tightening our space bounds. However, in order to be as general and trace-independent as possible, in our algorithms and evaluations we use very weak assumptions (i.e., m ≥ 2n).

It is a common observation that network packets have a skewed Zipfian distribution. We took advantage of this fact by separating out the few high-count elephants to facilitate the estimation of the remainder more accurately. In doing so, however, we do not make any assumption about the nature of the stream. Algorithm 2 has the property that if there are no elephants in the stream, then it should perform comparably to Algorithm 1. Hence, we expect that, in general, Algorithm 2 should perform better for highly-skewed distributions, but no better than Algorithm 1 when the skew is less pronounced.

8. RELATED WORK

Many of today's network monitoring applications use the traffic volume, in terms of flow, packet, and byte counts, as the primary metric of choice. These are especially of interest for anomaly detection mechanisms to flag incidents of interest.
Some of the well-known methods include signal analysis (e.g., [3]), forecasting (e.g., [4, 21]), and other statistical approaches (e.g., [16, 25]).

There has been a recent interest in using entropy and traffic distribution features for different network monitoring applications. Lakhina et al. [17] use the entropy to augment anomaly detection and network diagnosis, within their PCA framework. Others have suggested the use of such information measures for tracking malicious network activity [7, 23]. Xu et al. [24] use the entropy as a metric to automatically cluster traffic, to infer patterns of interesting activity. For detecting specific types of attacks, researchers have suggested the use of entropy of different traffic features for worm [23] and DDoS detection [7].

Streaming algorithms have received a lot of interest in the algorithms and networking community. The seminal work is that of Alon et al. [2], who provide a framework for estimating frequency moments. Since then, there has been a huge body of literature produced on streaming algorithms, and this is well surveyed in [19]. Kumar et al. use a combination of counting algorithms and Bayesian estimation for accurate estimation of flow size distributions [13, 14]. Streaming algorithms have also been used for identifying heavy-hitters in streams [22, 26]. While the entropy can theoretically be estimated from the flow size distribution [13], computing the flow size distribution conceptually provides much greater functionality than that required for an accurate estimate of the entropy. The complexity of estimating the flow size distribution is significantly higher than the complexity of estimating the entropy, requiring significantly more memory and effort in post-processing.

We are aware of two concurrent efforts in the streaming algorithms community for estimating entropy. Chakrabarti et al. [1] independently proposed an algorithm to estimate S that is similar to Algorithm 1.
In this paper we show how this algorithm can be modified such that the memory usage is independent of the size of the stream if we make a simple assumption on how large the standardized entropy can get. McGregor et al. [8] outline algorithms estimating entropy and other information-theoretic measures in the streaming context. However, our algorithms provide unbiased estimates of the entropy, and do not make strong assumptions regarding the underlying distribution. We also provide extensive empirical validation of the utility and accuracy of our algorithms on real datasets, and observe that our sieving approach actually outperforms Algorithm 1.

9. CONCLUSIONS

In this paper, we addressed the need for efficient algorithms for estimating the entropy of network traffic streams, for enabling several real-time traffic monitoring capabilities. We presented lower bounds for the problem of estimating the entropy of a stream, demonstrating that for space-efficient estimation of entropy, both randomization and approximation are necessary. We provide two streaming algorithms for the problem of estimating the entropy. The first algorithm is based on the key insight that the problem shares structural similarity with the problem of estimating frequency moments over streams. By virtue of the strong bounds that we obtain on the variance of the estimator variable, we are able to limit the space usage of the algorithm to polylogarithmic in the length of the stream. Under some practical assumptions on the size of the entropy, we also give an algorithm that samples a number of flows that is independent of the length of the stream. We also identified a method for increasing the accuracy of entropy estimation by separating the elephants from the mice. Our evaluations on multiple packet traces demonstrate that our techniques produce very accurate estimates, with very low CPU and memory requirements, making them suitable for deployment on routers with multi-gigabit per second links.

Acknowledgments

We would like to thank A. Chakrabarti, K. Do Ba, and S. Muthukrishnan for their useful discussion and for kindly sharing the most recent version of their paper [1] with us. We thank Minho Sung for helping us with the datasets used in this paper. We would also like to acknowledge the useful feedback provided by the anonymous reviewers.

10. REFERENCES

[1] A. Chakrabarti, K. Do Ba, and S. Muthukrishnan. Estimating entropy and entropy norm on data streams. In Proceedings of the 23rd International Symposium on Theoretical Aspects of Computer Science (STACS), 2006.
[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of ACM Symposium on Theory of Computing (STOC), 1996.
[3] P. Barford, J. Kline, D. Plonka, and A. Ron. A Signal Analysis of Network Traffic Anomalies. In Proceedings of ACM SIGCOMM Internet Measurement Workshop (IMW), 2002.
[4] J. D. Brutlag. Aberrant behavior detection in time series for network monitoring. In Proceedings of USENIX Large Installation System Administration Conference (LISA), 2000.
[5] N. Duffield, C. Lund, and M. Thorup. Estimating flow distributions from sampled flow statistics. In Proceedings of ACM SIGCOMM, 2003.
[6] C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM, 2002.
[7] L. Feinstein, D. Schnackenberg, R. Balupari, and D. Kindred. Statistical approaches to DDoS attack detection and response. In Proceedings of the DARPA Information Survivability Conference and Exposition, 2003.
[8] S. Guha, A. McGregor, and S. Venkatasubramanian. Streaming and sublinear approximation of entropy and information distances. In Proceedings of ACM Symposium on Discrete Algorithms (SODA), 2006.
[9] F. Hao, M. Kodialam, and T. V. Lakshman. ACCEL-RATE: a faster mechanism for memory efficient per-flow traffic estimation. In Proceedings of ACM SIGMETRICS, 2004.
[10] N. Hohn and D. Veitch. Inverting sampled traffic. In Proceedings of ACM/USENIX Internet Measurement Conference (IMC), 2003.
[11] B. Kalyanasundaram and G. Schnitger. The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics, 5(4), 1992.
[12] V. Karamcheti, D. Geiger, Z. Kedem, and S. Muthukrishnan. Detecting malicious network traffic using inverse distributions of packet contents. In Proceedings of ACM SIGCOMM Workshop on Mining Network Data (MineNet), 2005.
[13] A. Kumar, M. Sung, J. Xu, and J. Wang. Data streaming algorithms for efficient and accurate estimation of flow distribution. In Proceedings of ACM SIGMETRICS/IFIP WG 7.3 Performance, 2004.
[14] A. Kumar, M. Sung, J. Xu, and E. Zegura. A data streaming algorithm for estimating subpopulation flow size distribution. In Proceedings of ACM SIGMETRICS, 2005.
[15] E. Kushilevitz and N. Nisan. Communication complexity. Cambridge University Press, New York, NY, USA, 1997.
[16] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, 2004.
[17] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. In Proceedings of ACM SIGCOMM, 2005.
[18] K. Levchenko, R. Paturi, and G. Varghese. On the difficulty of scalably detecting network attacks. In Proceedings of ACM Conference on Computer and Communications Security (CCS), 2004.
[19] S. Muthukrishnan. Data streams: algorithms and applications.
[20] Cisco Netflow. /Tech/np/netflow/index.shtml.
[21] M. Roughan, A. Greenberg, C. Kalmanek, M. Rumsewicz, J. Yates, and Y. Zhang. Experience in measuring internet backbone traffic variability: Models, metrics, measurements and meaning. In Proceedings of International Teletraffic Congress (ITC), 2003.
[22] S. Venkataraman, D. Song, P. B. Gibbons, and A. Blum. New Streaming Algorithms for Fast Detection of Superspreaders. In Proceedings of Network and Distributed System Security Symposium (NDSS), 2005.
[23] A. Wagner and B. Plattner. Entropy Based Worm and Anomaly Detection in Fast IP Networks. In Proceedings of IEEE International Workshop on Enabling Technologies, Infrastructures for Collaborative Enterprises, 2005.
[24] K. Xu, Z.-L. Zhang, and S. Bhattacharya. Profiling internet backbone traffic: Behavior models and applications. In Proceedings of ACM SIGCOMM, 2005.
[25] Y. Zhang, Z. Ge, M. Roughan, and A. Greenberg. Network anomography. In Proceedings of ACM/USENIX Internet Measurement Conference (IMC), 2005.
[26] Y. Zhang, S. Singh, S. Sen, N. Duffield, and C. Lund. Online identification of hierarchical heavy hitters: algorithms, evaluations, and applications. In Proceedings of ACM/USENIX Internet Measurement Conference (IMC), 2004.