Traffic classification-based spam filter


Ni Zhang (1,2), Yu Jiang (3), Binxing Fang (1), Xueqi Cheng (1), Li Guo (1)
(1) Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
(2) Graduate School of Chinese Academy of Sciences, Beijing, China
(3) Computing Center, Bureau of Statistics of Heilongjiang Provincial Government, Harbin, China

Abstract

We propose an unsupervised spam filter called Bulk Mail Traffic Classification (BMTC) for filtering junk mails from the perspective of ISPs. Our insight is that spammers generally send mass unsolicited emails with few alterations to a common message content, a pattern that can be observed in an extensive traffic environment. In our approach, we classify email delivery traffic into different categories by the similarity of message contents. We then decide whether a particular email category is spam by the number of similar mails in that category, and take measures to filter it. We also design a simulator, two sketch data structures, and a series of algorithms to support our method. We have applied BMTC to email traffic data captured at one of the largest commercial Internet service providers in China, and the experimental result indicates that a 70.4% reduction of emails can be achieved with our method. The results also show that BMTC is practical: it can be implemented in a high-volume traffic environment handling millions of mails every day with small memory consumption.

1 Introduction

The rapid increase in the volume of unsolicited bulk commercial emails, also known as junk mails (or spam), has become a serious threat not only to the Internet but also to our society. A recent study by Brightmail found that more than 65% of global email traffic is spam. Furthermore, AOL and MSN, two large ISPs, report that they block a total of 2.4 billion junk mails from reaching their customers per day. This traffic corresponds to about 80% of daily incoming emails at AOL [1].
A number of approaches have been proposed to alleviate the impact of spam, the majority of which are designed to identify spam after it has already been delivered to the intended recipient's mailbox. However, even if these techniques are successful, so that spam never reaches the target recipient's eyes, nothing prevents junk mail senders from wasting a significant amount of bandwidth and delaying the delivery of good mails. Therefore, a new point of view is emerging: protecting network resources from abuse by spam, not just protecting end users. In this paper we are interested in the feasibility and effectiveness of stopping or reducing spam traffic from the perspective of ISPs (Internet Service Providers). As a prerequisite to stopping or reducing spam traffic at an ISP, this paper proposes a novel technique to filter junk mails from bulk email traffic in a high-volume traffic environment.

The main contributions of our study are: (1) a classification of email delivery traffic into categories, which makes it possible to handle bulk emails; and (2) the design and implementation of a simulator with two sketch data structures and a series of methods for detecting junk mails effectively.

Our approach exploits the fact that junk mail senders generally send mass unsolicited emails with few alterations to a common message content. Thus, we classify email traffic into different categories by the similarity of message contents. If the number of similar emails in any category exceeds the spam threshold, which is a tunable parameter, we mark that category as spam. If a new email is classified into a spam category, we can then take measures to filter it. We apply our method to email traffic data captured at one of the largest commercial ISPs in China, and the experimental result indicates that a 70.4% reduction of junk mail traffic can be achieved.

The rest of the paper is organized as follows.
Section 2 presents related work in the spam filtering field and discusses its limitations in order to clarify the motivation of this research. We describe the key techniques of our Bulk Mail Traffic Classification (BMTC) in Section 3. Section 4 presents the design and implementation of our mechanism. Section 5 reports the experimental results and compares BMTC with some other spam filtering methods. Finally, Section 6 concludes the paper.

2 Related work

The flood of spam has become a serious problem for both the Internet and society, and a number of methods have been proposed for filtering it. In general, spam filters can be grouped into three types: access filtering, economic filtering, and content-based filtering.

Access filtering, which merely verifies and authenticates the header information of an email without disclosing users' private content, can be divided further into three categories: blocking, delaying and temporary failure. Blocking, which is generally accepted and widely employed, means refusing to accept mail from a particular sender. Unfortunately, according to [2], blocking methods are only partially effective, mainly because it is easy for junk mail senders to conform to the heuristics and to change their IP addresses frequently. Delaying is a method for decreasing the throughput of junk mail sending by injecting delay into the Simple Mail Transfer

Protocol (SMTP) connection between junk mail senders and email servers. Most mechanisms [3] assume that spam is sent at a high rate, and that slowing it down will reduce the amount of spam received. However, these mechanisms are unlikely to be effective, since junk mail senders already send at a low rate [2].

The third kind of access filtering is temporary failure. Since SMTP is considered an unreliable transport protocol, the mechanism for handling temporary failures is documented in the corresponding Request for Comments (RFC). Any well-behaved email server should retry if it encounters a temporary failure during a delivery attempt. However, most spamming software today does not retry, in order to obtain high throughput. This is the premise of Greylisting. A server running Greylisting keeps a list of triplets consisting of the sender's IP address, the sender's email address and the recipient's email address. If the server has never seen a triplet before, it refuses this delivery, and any others that arrive within a certain period of time, with a temporary failure. Since spamming software does not retry, this vastly reduces the amount of spam accepted by the mail server, while good mail is subject to some delay. The weakness of this approach is that if (and when) junk mail senders implement retries, the technique will become less effective [2].

Since the main attraction of spamming is that sending large numbers of small email messages is relatively cheap compared to other marketing techniques, the idea behind economic filtering is to make sending high volumes of email more expensive. The two main categories of economic solutions are computing-time-based systems [4] and money-based systems [5]. The former force the junk mail sender to spend considerable computing resources to send a single spam message, while the latter charge a small amount of money for every email sent.

Content-based filtering happens after a message is fully received.
In this case, the filter can be implemented by a variety of means, such as rule-based filtering, Naive Bayesian classification [6], support vector machines (SVM) [7], memory-based approaches, and checksum methods in a collaborative setting [8]. Although these filtering mechanisms can effectively reduce the impact of spam on an individual user, they do nothing to protect the network resources wasted by spam. In addition, all the filters stated above suffer from the problem of incorrectly classifying email, and they must be continually maintained and updated for as long as junk mail senders develop new means to evade them.

The work of Yoshida et al. [9], which describes a density-based spam detector, is the most similar to ours. They use document space density and design an unsupervised learning engine with a direct-mapped cache to identify spam. However, when we implemented their algorithm, we found that although it is an effective mechanism, its vector representation and update mechanism perform poorly when detecting bulk emails. In contrast, we develop an algorithm that is based on the analysis of email traffic, a more effective fingerprint technique, a fast similarity check, and a flexible update mechanism, and we design two sketch data structures to support our method and a simulator to test our algorithm. Our method does not require any parsing, aggregation, or tokenizing of the input traffic, never blocks good emails, and does not increase the delay to good mails.

3 The key techniques of BMTC

There are many design choices in developing a spam filter for a high-volume traffic environment. The primary design criteria and operating objectives of such an anti-spam system include:

1) Automatic hands-free deployment and an online update mechanism requiring little or no human interaction.
2) Accuracy in detecting true spam with a very low false positive rate.
3) Efficient operation in a high-volume traffic environment with little or no impact on network throughput and latency.
These objectives are difficult to meet concurrently, yet they do suggest an approach that may balance these competing criteria for a spam filter.

A. Fingerprint technique

Our goal is, for each input email, to quickly decide whether it is similar to some earlier emails. To do this, one can often design an approximate data structure that maintains a small sketch of the large object rather than an exact representation. In our mechanism, we adapt a fingerprint technique developed by Manber [10] for finding similar files in a large file system and applied by Broder [11] to detect similar Web documents.

Fingerprints are integers generated by a one-way function applied to a set of bytes. A good fingerprint algorithm generates well-distributed fingerprints, which have the property that if two fingerprints are different the corresponding objects are certainly different, and there is a low probability that two different objects have the same fingerprint. (The latter event is called a collision.)

We view each email $M$ as a sequence of bytes $b_1 b_2 \dots b_s$. The contiguous bytes $b_{i+1} \dots b_{i+l}$ contained in $M$ are called a window of length $l$. One method of generating the representative fingerprints for an email is to compute a Rabin fingerprint [12] for every window of length $l$ by the following expressions, where $p$ and $\delta$ are constant integers.

\[ F(M, i, l) = (b_i p^{l} + b_{i+1} p^{l-1} + \dots + b_{i+l-1} p + b_{i+l}) \bmod \delta \tag{1} \]

\[ F(M, i+1, l) = [(F(M, i, l) - b_i p^{l})\, p + b_{i+l+1}] \bmod \delta \tag{2} \]

We can therefore compare the representative fingerprints of different emails to estimate their similarity. Computing the Rabin fingerprints is fairly fast since $p$ and $\delta$ are constant, and advancing the fingerprint only requires a subtraction, a multiplication and an addition, as shown in (2), rather than generating a new one from scratch.

We also find that it is impractical to keep every computed fingerprint, since doing so would exhaust memory. We

can simply select the first $m$ fingerprints of an email. To our surprise, this technique has worked very well in our experiments, as we describe in the following sections. In this case, we can associate every email $M$ with a set of fingerprints $P_m(M)$, which contains $m$ fingerprints.

\[ P_m(M) = \{ F(M, i, l),\ F(M, i+1, l),\ \dots,\ F(M, i+m-1, l) \} \tag{3} \]

B. The similarity of emails and the email category

If two sets of representative fingerprints share at least $k$ elements, we say that the emails represented by these sets are similar.

\[ \mathrm{sizeof}(P_m(M_1) \cap P_m(M_2)) \ge k \ \Rightarrow\ M_1 \sim M_2 \tag{4} \]

If two emails are similar, we say that they belong to the same mail category $C$.

\[ M_1 \sim M_2,\ M_1 \in C \ \Rightarrow\ M_2 \in C \tag{5} \]

We also find that ordinary users seldom send more than 50 similar emails, while junk mail senders often send the same spam in numbers far greater than that. So if the number of similar emails in any category exceeds a spam threshold, which is a tunable parameter with a value such as 50, we mark this category as spam.

C. Two sketch structures and our algorithm

When handling hundreds of emails per second, a million previous emails may have to be checked in order to handle a single new email. We therefore need an efficient lookup mechanism and similarity check. Moreover, a hot category is more important than one that is seldom visited, so we need a flexible update mechanism that drops unimportant entries and keeps important ones in order to make space for new entries. To solve these problems, we have developed a new type of unsupervised learning engine which uses the two sketch data structures shown in Figure 1:

(1) The fingerprint database (FD) is a hash bucket structure which stores all distributed fingerprints. The bucket index $h_i$ is determined by applying the mod $\theta$ operation ($\theta$ is a constant) to the fingerprint $f_i$.

\[ h_i = f_i \bmod \theta \tag{6} \]

If two fingerprints have the same $h_i$, we simply add an entry and never overwrite the former one. The FD also stores the pointer to the related mail category entry in the mail database.
(2) The mail database (MD) stores the information of all mail categories, which includes the number of similar mails, the first mail ID and the last mail ID of the category, and the pointers to all fingerprints belonging to the category.

While our spam detection algorithm is running, the two sketches, MD and FD, store the most recent information. For every newly arriving email, the algorithm first generates the representative set of fingerprints. Each fingerprint in this set is checked against the FD. If it matches, the algorithm updates the related mail category in the MD, for example by increasing the number of similar emails and updating the last mail ID. Otherwise, it creates a new category in the MD and inserts the set of fingerprints into the FD. If the number of similar emails in any category exceeds a given spam threshold, we mark that category as a spam entry.

Fig. 1. Two sketch data structures: the mail database (MD), whose entries hold the number of similar mails, the ID of the first mail, the ID of the last mail and pointers to the related fingerprints; and the fingerprint database (FD), whose hash buckets hold each fingerprint and a pointer to its related mail category.

D. Memory management for sketch data structures

In the mail database (MD), we should keep the most notable mail categories and delete unimportant ones in order to make space for new entries. To do this, our algorithm first assigns a unique ID (an integer) to every input email in order of arrival. Given the characteristics of traffic-based spam filtering, a spam category is much more important than a good category in the MD, because it will be visited frequently over a very long period. Hence, we introduce the average distance of a category, by which we can drop outdated categories. We use the following notation: let $C_i$ be the $i$-th category in the MD, $I$ be the maximum mail ID, $I_s$ and $I_e$ be the first mail ID and last mail ID of $C_i$, $\mathrm{sizeof}(C_i)$ be the number of similar mails of $C_i$, and $D(C_i)$ be the average distance of $C_i$.
Then:

\[ d_i = I - I_e \tag{7} \]

\[ D(C_i) = \begin{cases} \dfrac{I_e - I_s}{\mathrm{sizeof}(C_i)} & \text{if } \mathrm{sizeof}(C_i) > 1 \\ \text{[...]} & \text{if } \mathrm{sizeof}(C_i) = 1 \end{cases} \tag{8} \]

The algorithm then drops entries according to the following two rules.

(1) For spam categories and suspect categories (described in Section 4): $d_i > 10\,C_i$
(2) For good categories: $d_i >$ [...]

4 System configuration and implementation

We describe the implementation of our mechanism in terms of the trace data, the simulator design, and the selection of algorithm parameters. Since we are not allowed to run experiments in real networks, we have performed an offline simulation. The simulator must generate the original email traffic as it occurred in reality. Furthermore, the simulator should be easy to use and provide a platform for comparing different filters.

A. Trace data

TABLE 1 Trace data (columns: Trace, Date, Mails in trace, Usage; one trace used for training and one for testing)

We collected two traces in June 2005, captured at one of the largest commercial ISPs in China and kept in tcpdump format. For this analysis, the traces contain only SMTP traffic, consisting of all packets in the client-to-server direction. We use trace 1 to tune the algorithm parameters and trace 2 to test our algorithm. The summary information of the trace data is shown in Table 1.

B. Design of simulator

Figure 2 shows the system structure of the simulator we have designed. The structure is chosen to satisfy the above requirements and consists of four parts.

The first part is the traffic generator, which reads packets from the trace data and calls the packet sending function provided by the libnet library to generate email traffic.

The second part is the wrapper, a platform that bridges the generator and the spam filter. Since our trace includes unanswered SYN packets, which are most likely generated by port scanners or by resetting SMTP connections, the wrapper must select valid connections and then pre-process their SMTP traffic into email text. The wrapper is also responsible for receiving intermediate results from the filter and keeping them in a log file.

The third part is the spam filter, which classifies the email texts provided by the wrapper into different categories by looking up the FD, and then updates the MD and FD. The filter also sends the classification result for each new email to the wrapper.

The last part is the statistical unit, which accesses the log file and the MD to gather information and output the statistics.

Fig. 2. The system structure of the simulator: the trace file feeds the traffic generator, which sends mail to the wrapper; the wrapper passes mail text to the spam filter and writes results to the log; the spam filter looks up the fingerprint database and updates the mail database; the statistical unit reads the log and the mail database.

C. Algorithm parameters

Our algorithm includes various control parameters, such as the length of the window ($l$), the size of the fingerprint set ($m$), and the spam threshold, which are determined in the training phase.
The training data includes 7583 good mails, distributed in 2352 categories, and 7090 spam mails, distributed in 23 categories. The effects of altering these parameters are shown in Figure 3 to Figure 6.

1) Spam threshold and suspect threshold

The threshold value is an indicator of spamminess, used to mark a category as good or spam. If we keep the threshold very low, the chance of false positives increases. If we choose a large value, we may miss some spam. The distributions of the number of similar emails in good and spam categories are shown in Figure 3 and Figure 4. From these figures we can see that 81.6% of spam categories have more than 50 similar emails, while 98% of good categories have fewer than 50 similar emails, the exceptions being some categories composed of bounced mail messages sent by email servers. Luckily, most email servers use the null sender <> or postmaster as the sender address of bounced mail, which makes it fairly simple to work around by integrating a simple rule-based method into our algorithm.

We also find that some categories whose number of similar emails lies between 30 and 50 appear in both distributions, which decreases the performance of BMTC in traffic classification. Therefore, we introduce another parameter, the suspect threshold. A category whose number of similar emails lies between the suspect threshold and the spam threshold must be classified by another method, which is left for future work. We also use the total number of emails in suspected categories as a criterion to evaluate the performance of traffic classification. We define the suspected ratio as follows:

\[ r_s = \frac{N_s}{N} \tag{9} \]

where $N_s$ is the total number of suspected emails and $N$ is the total number of emails. For our test, a spam threshold of 50 and a suspect threshold of 30 are employed.

2) Parameters $l$ and $m$

These two parameters are determined by the capability of detecting spam and by memory consumption.
Usually a junk mail sender sends multiple copies of an email with few alterations. If $l$ is too large, only large regions are matched, which increases the average quality of matches but decreases the number of potential spams that are detected. If $l$ is too small, the average quality of matches is sacrificed, since the probability of two fingerprints colliding increases. There is also a trade-off with $m$ in terms of how well each email is sampled. Large values of $m$ increase the likelihood of finding a match for a given email but need more memory to store more fingerprints, and vice versa. Figure 5 shows the accuracy of detecting spam in trace 1 and Figure 6 shows the memory consumption for different values of $l$ and $m$. We can see that the middle values of $m$ and $l$ are the most effective. We choose $l = 70$ and $m = 60$ for our test.

5 Experimental results

This section summarizes some important statistics. The testing is done using an entirely different trace of emails.

A. Results on spam through trace 2

In the test phase, we analyzed the trace data with the simulator described in Section 4. A server with dual Pentium IV 2.4 GHz processors and 4 GB of memory, running Linux 9.0, is used as the platform of the spam filter in the experiment. In trace 2, 73% (143,563) of the mails are spam and they are distributed in only 115 spam categories, while 36,039 good categories include just 27% (48,971) good mails.
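As a rough check on these counts, the mean category sizes follow directly from the figures above. This is a back-of-the-envelope calculation rather than a result reported by the authors:

```latex
\[
\frac{143{,}563}{115} \approx 1248 \ \text{mails per spam category},
\qquad
\frac{48{,}971}{36{,}039} \approx 1.36 \ \text{mails per good category}.
\]
```

The two averages differ by nearly three orders of magnitude, which is why a fixed threshold on category size can separate the two populations.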

Fig. 3. The distribution of the number of similar mails of good categories (cumulative percentage of categories, %, against number of similar mails).

Fig. 4. The distribution of the number of similar mails of spam categories (cumulative percentage of categories, %, against number of similar mails).

Fig. 5. Accuracy of detecting spams (%) as a function of $l$ at different values of $m$ (curves for m = 30, 40, 50, 60, 70, 80).

Fig. 6. Memory consumption (Kbytes) as a function of $l$ at different values of $m$ (curves for m = 30, 40, 50, 60, 70, 80).

Fig. 7. Number of emails classified to three categories (spam, good, suspect) against the number of mails processed by BMTC (x10000).

Fig. 8. Memory consumption (Mbytes) of the fingerprint DB, the mail DB and in total, against the number of mails processed by BMTC (x10000).

These results show that similarity is a good index for distinguishing spam from good mails. After every ten thousand mails processed by BMTC, we record the number of mails classified into the three categories: good, spam, and suspected. As shown in Figure 7, the share of emails classified into suspected categories decreases markedly as the number of processed mails increases. After more than 150,000 emails have been processed, the total number of suspected emails increases slowly and approaches a constant (3193 emails). In our test, $r_s = 0.02$. That is, almost 98% of emails can be classified by BMTC.

Any category in which the number of similar mails exceeds 50 is marked as spam, and subsequent mails classified into this category are then treated as spam. As a result, our method cannot identify the first fifty emails of a spam category in real time. The total number of identified junk mails is 137,813, which is 70.4% of the total number of emails. This means that a 70.4% reduction of email traffic can be achieved after a short online learning time.
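The figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below assumes that the spam, good and suspected counts partition trace 2, which is consistent with the 195,727 emails the paper reports as processed; the variable names are illustrative only.

```python
# Counts taken from the text of Section 5.A.
spam_mails = 143_563
good_mails = 48_971
suspected_mails = 3_193
spam_categories = 115
spam_threshold = 50

# Assumption: the three groups partition the test trace.
total = spam_mails + good_mails + suspected_mails

# Suspected ratio r_s = N_s / N, equation (9).
suspected_ratio = suspected_mails / total

# The first `spam_threshold` mails of each spam category pass before it is marked.
missed_while_learning = spam_categories * spam_threshold
identified = spam_mails - missed_while_learning
reduction = identified / total

print(f"total mails          : {total}")
print(f"suspected ratio r_s  : {suspected_ratio:.3f}")
print(f"identified junk mails: {identified}")
print(f"traffic reduction    : {reduction:.1%}")
```

Under this assumption the suspected ratio comes out at about 0.016, which the paper rounds to 0.02, and the reduction at about 70.4%.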
A category with fewer than 30 similar mails is a good category, and one whose number of mails is between 30 and 50 is considered a suspected category that cannot be classified by BMTC. As shown in Figure 7, the suspected ratio (defined in formula 9) decreases greatly as more mails are processed, which means that the traffic classification of BMTC improves with volume. We therefore conclude that BMTC performs better in a high-volume traffic environment. Figure 8 shows the memory consumption of our method. The non-optimized algorithm consumes only 21 MB, within a reasonable CPU time, when processing 195,727 emails, which is encouraging.

B. Performance Comparison and Discussion

As reported above, our method can identify 115 spam categories and 137,813 junk mails ($= 143{,}563 - 115 \times 50$). None of the known content-based spam filters seems able to handle bulk mails with limited memory and reasonable CPU time without human maintenance. As far as we know, only the work of Yoshida et al. (DMC) [9], which describes a density-based spam detector, is similar to ours. Table 2 shows some comparisons between the two algorithms, including the performance of DMC on our email data. The first 60,000 emails of our trace 2 are used for this experiment. In this comparison, both algorithms have the same online learning threshold (50), which means that they cannot identify junk mails until they have found 50 similar emails. In [9], the implementation employed a hash function provided by the Linux C library to represent vectors, a direct-mapped cache for similarity checks, and an overwrite

mechanism to control entries in the hash database. When all the hash values in the cache are overwritten by later emails, DMC deletes the entry in the hash database. Therefore, it is possible for hot categories to be deleted. As shown in Table 2, DMC deleted 65 categories because their hash values were overwritten, and detected 89 spam categories and 25,432 spams. Seven of those categories are bounced mail, wrongly identified as junk mail by DMC.

TABLE 2 Comparison with DMC

                               DMC      BMTC
Spam threshold (# mails)       50       50
Suspect threshold (# mails)    [...]    30
m                              [...]    60
Number of spam categories      89       82
Bounce mail categories         0        7
Number of spam                 25,432   [...]
Accuracy                       96%      100%
Recall rate                    58%      74%
Overwritten spam categories    65       0
Total CPU time (seconds)       22       117

With a good fingerprint algorithm and two sketch data structures, BMTC never overwrites old fingerprints and never drops hot categories. Therefore, BMTC achieves a desirable result, identifying 82 spam categories and [...] spams within a reasonable amount of CPU time and with little memory consumption. We have checked samples from all 82 spam categories and found that they are all junk emails. We also checked the other [...] mails, which shows that there are still junk mails missed by both methods. The reason is that neither BMTC nor DMC can identify spam categories that contain fewer similar mails than the spam threshold. As shown in Table 2, BMTC achieves a 74% recall rate, which is much better than that of DMC.

Our method also adopts a simple rule-based policy to detect bounced mail messages sent by email servers (as described in Section 4), which can effectively distinguish between real spam categories and non-spam categories whose size exceeds 50. As a result, our method marks 82 spam categories and achieves better accuracy and recall.

An apparent advantage of DMC is the CPU time required. DMC can handle 2,727 (= 60000/22) emails per second, while BMTC requires about five times as much CPU time as DMC because it never overwrites old fingerprints. BMTC thus achieves a better result at the cost of processing speed. The test shows that both policies are practical and reasonable.
In fact, the capability of handling 512 (= 60000/117) emails per second is also sufficient for a live environment. In addition, an optimized version of BMTC may address this issue.

Filters based on supervised learning methods such as Naive Bayes and SVM require maintenance tasks. In contrast, apart from a short online learning time, BMTC needs no supervisor for learning or for decoding message content. This implies that no one is required to inspect message content manually, so user privacy is inherently protected.

6 Conclusion and Future Work

In this paper, we have presented a new technique, BMTC, for detecting spam in bulk mail traffic. Our technique classifies mail delivery traffic into different categories by the similarity of message contents. If the number of similar mails in any category exceeds a spam threshold, we mark that category as spam. We also design two sketch data structures and a series of methods to support our method. BMTC has three distinct advantages:

1) Automatic hands-free deployment and an online update mechanism.
2) Identification of spam in bulk email traffic with high accuracy.
3) Handling of large volumes of mail with small memory consumption within a reasonable CPU time.

In addition, a distinguishing feature of our method is that it not only protects end users from excessive volumes of unsolicited mail, but can also cut off spam in the network, and thus makes effective use of network bandwidth and reduces the delay to good mails. The experimental results indicate that BMTC is effective and practical, and that a 70.4% reduction of junk mails may be achieved by our method. It can be implemented in a high-volume traffic environment without modification to existing code.

As the work is in progress and the results described in this paper are preliminary, in future work we will evaluate our method with respect to its sensitivity to parameters and the dependence of the results on the data set used.
Further, we plan to build an online system to filter spam and to evaluate our method in live environments. In addition, the evaluation of misclassification, e.g., the false alarm rate, and countermeasures to spam attacks also remain for future work.

REFERENCES

[1] L. Gomes, C. Cazita, J. Almeida. Characterizing a Spam Traffic, Proc. of the IMC '04, Oct. 2004, Taormina, Sicily, Italy.
[2] D. Twining, M. Williamson, J. Miranda, et al. Email Prioritization: reducing delays on legitimate mail caused by junk mail, Proc. of the 2004 USENIX Annual Technical Conference.
[3] M. Williamson. Design, implementation and test of an email virus throttle, Proc. of the 2003 ACSAC Security Conference, Las Vegas, Nevada, December.
[4] G. Joshua, R. Robert. Stopping Outgoing Spam, Proc. of the EC'04, May 17-20, 2004, New York, USA.
[5] S. Hird. Technical Solution for Controlling Spam, Proc. of AUUG 2002, Melbourne, Sept.
[6] M. Sahami, S. Dumais, D. Heckerman. "A Bayesian approach to filtering junk e-mail", Proc. of AAAI Workshop on Learning for Text Categorization, 1998.
[7] [...]
[8] DCC (Distributed Checksum Clearinghouse).
[9] K. Yoshida, F. Adachi. Density-based spam detection, Proc. of the KDD 2004, Aug., Seattle, Washington, USA.
[10] U. Manber. Finding similar files in a large file system. Proc. of USENIX Winter 1994 Technical Conference, Jan.
[11] A. Broder. On the resemblance and containment of documents. Proc. of Compression and Complexity of Sequences (SEQUENCES 97), Mar.
[12] N. T. Spring and D. Wetherall, "A protocol-independent technique for eliminating redundant network traffic," Proc. of ACM SIGCOMM 2000, Aug.