Multi-Protocol Content Filtering

Matthew Johnson <mwj99@doc.ic.ac.uk>
MEng Individual Project

Why filter content?
- Information overload
- Specific personal interests
- General signal-to-noise ratio, affected by unwanted content, usually commercial or advertisement-based

Notes:
Information overload: too much content, or too many content items to handle, but nothing we specifically don't want to know about (cf. message digests of mailing lists, Kernel Traffic and friends).
Specific personal interest: lots of content, much of it about things we are not interested in, yet which may be interesting to other people. E.g. on a developers' mailing list, we may be interested in bugfixes and problems with a Linux version, but couldn't care less about VMS.
General signal-to-noise ratio (SNR): the ratio of content in which the userbase is interested to content in which it is not interested.

Why is spam such a problem?
- Not just email: Usenet / Netnews also suffers.
- Email: significant increase, 48% in the last 12 months

Notes:
Half of all emails monitored by MessageLabs in May 2003 were spam. June seems lower so far, but we are not at the end of the month yet.

Email filtration options
- Killfiles / Blacklists: simplistic header-based filter
  Spammers regularly spoof headers, so not much help.
- Precise hash matches (e.g. Vipul's Razor)
  Spammers regularly insert hashbusters into their content.
  But collaborative filtering is not without merit...
- Regexp-based content matching and server blacklisting (e.g. SpamAssassin)
  Very effective, but suffers due to static heuristic rules.

Notes:
Killfiles are still used because users understand them REALLY well, despite their lack of effectiveness. They remain useful for deliberately blocking posts from contributors who rub you up the wrong way, but are useless for spam.
Matching content is a step in the right direction, but not foolproof. Spamware has the ability to insert hashbusters, which are deliberately designed to throw off trivial hash-collision detection methods. Note the benefits of collaborative filtering though; when it works, it's good.
Discuss SpamAssassin rules: body and header matching, e.g. Nigerian spam, mail-client spoofing. Effectiveness is excellent, but errors remain possible. Quite a lot of confirmation emails (e.g. Easyjet, Ryanair) get misclassified because they match the heuristics. Equally, if spam comes along which doesn't match the static rules, it's not detected.
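The hashbuster problem noted for precise hash matching can be illustrated with a minimal sketch. It assumes a plain SHA-1 digest of the message body as the "precise hash"; this is an illustrative choice and not a description of how Vipul's Razor computes its signatures. The names `exact_digest` and `add_hashbuster` are hypothetical.

```python
import hashlib
import random
import string


def exact_digest(body: str) -> str:
    """Return a SHA-1 hex digest of the body (illustrative choice of hash)."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


def add_hashbuster(body: str, rng: random.Random, length: int = 8) -> str:
    """Append a random token, as spamware might, so the digest changes."""
    token = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{body}\n{token}"


rng = random.Random(0)
spam = "Click here for a great offer!"
known_spam_digests = {exact_digest(spam)}

variant = add_hashbuster(spam, rng)

# The unmodified copy matches the stored digest; the copy carrying a
# hashbuster does not, even though a human would call it the same message.
original_caught = exact_digest(spam) in known_spam_digests
variant_caught = exact_digest(variant) in known_spam_digests
print(original_caught, variant_caught)
```

Any change to the input, however small, yields an unrelated digest, which is why exact matching alone is easy to evade.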

The dynamic solution
- Static rules can make an educated guess as to what the user thinks may be spam...
- ...but the only way to find out precisely is to have the user tell us.
- The user's wishes are unlikely to be codifiable as a set of static rules; we must find a different way.

Project Objectives
- Implementation of a content filter for mail and news, controlled and influenced by the individual user.
- Content filtration by statistical classification and distribution of content hashes
- Investigation of statistical classification as applied to news

System Architecture
[Figure: architecture diagram. Components shown: Mail Handler, News Handler, Spam Handler, Collab Handler, Content Handler, Management Interface (with Mgmt Clients), and Core (Bayesian Classifier, Collaborative Filter). Flows labelled: Incoming Mail, Incoming News, Filtered Mail, Filtered News, Collab Messages.]

Statistical filtering
- Analyze a set of examples which the user tells us are either spam or non-spam.
- Calculate the prior probability of each word in the examples based on how often it appears in spam content.
- e.g. "Click" appears in 939 out of 2,355 spam examples and 113 out of 4,787 non-spam examples.

$$p_{\text{spam}} = \frac{\frac{939}{2355}}{\frac{113}{4787} + \frac{939}{2355}} = 0.9441$$
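The per-word calculation on the statistical filtering slide can be sketched as follows. This is a minimal illustration, not the project's code: the function name `word_spam_probability` and the neutral value returned for a word seen in neither corpus are assumptions.

```python
def word_spam_probability(spam_hits: int, spam_total: int,
                          ham_hits: int, ham_total: int) -> float:
    """Per-word spam probability from the word's relative frequency
    in the spam examples and in the non-spam examples."""
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        return 0.5  # assumption: a word seen in neither corpus is neutral
    return spam_freq / (ham_freq + spam_freq)


# The "Click" example from the slide: 939 of 2,355 spam examples
# and 113 of 4,787 non-spam examples.
p_click = word_spam_probability(939, 2355, 113, 4787)
print(round(p_click, 4))  # 0.9441
```

Using relative frequencies rather than raw counts keeps the estimate from being skewed by the two example sets having different sizes.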

The Naïve Bayesian Classifier
- To test a content item, look up the probability of every word in the new content in the table we created.
- Find the n most extreme probabilities (those closest to 0 or 1).
- Use the word probabilities as likelihood indicators for the new content being spam.

$$P_{\text{spam}} = \frac{\prod_{k=1}^{n} p_k}{\prod_{k=1}^{n} p_k + \prod_{k=1}^{n} (1 - p_k)}$$

Collaborative filtration
- Users are generally in some form of community
- The same spam content may reach more than one member of the community
- Time delay in mail handling works to our advantage
- Can we share knowledge within communities to reduce the amount of spam a user sees?
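The combining step on the Naïve Bayesian Classifier slide can be sketched as below. The slides do not give a value for n, so the default of 15 is an assumption, as are the function name `combined_spam_probability` and the neutral result returned when the denominator is zero.

```python
import math


def combined_spam_probability(word_probs, n=15):
    """Combine the n most extreme word probabilities into one spam score.

    'Extreme' means furthest from 0.5, i.e. closest to 0 or 1.
    n=15 is an assumed default; the slides leave n unspecified.
    """
    extreme = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod_spam = math.prod(extreme)                  # product of p_k
    prod_ham = math.prod(1.0 - p for p in extreme)  # product of (1 - p_k)
    denom = prod_spam + prod_ham
    if denom == 0:
        return 0.5  # assumption: contradictory evidence treated as neutral
    return prod_spam / denom


# Hypothetical word probabilities for one message; with n=3 the
# neutral 0.5 entry is the one left out.
score = combined_spam_probability([0.9441, 0.2, 0.99, 0.5], n=3)
print(score)
```

Restricting the product to the most extreme words means the many near-neutral words in a message do not enter the calculation at all.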

Better content matching
- Current hash-detection systems fail too readily
- Need a function such that: if content a and b are substantively similar, values α and β are arithmetically similar.
- A fuzzy hash: a hash where two hashes are quantitatively comparable.

Using fuzzy hashing in collaboration
- Alice receives an email, which is detected as spam.
- Alice's mail filter hashes the content, notes the hash, and sends it on to any interested collaborators.
- Bob's mail filter receives a collaborative message regarding the new spam. It notes the hash.
- Bob then receives an email. The email is hashed, and compared with those it knows about.
- Bob's mail filter discovers the new mail is a 98% match with the spam Alice told us about.
- Bob has set his hash match threshold to 70%, so the mail is detected as spam.
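The Alice and Bob exchange can be sketched with a stand-in for the fuzzy hash. The slides do not specify the project's hashing algorithm, so this sketch swaps in a different, simple technique: a signature made of short hashes of overlapping character windows, compared by Jaccard similarity. The names `fuzzy_signature`, `similarity`, and `CollabFilter`, the window width, and the sample messages are all hypothetical.

```python
import hashlib


def fuzzy_signature(text: str, width: int = 5) -> frozenset:
    """Stand-in signature: short hashes of every overlapping character window."""
    normalized = " ".join(text.lower().split())
    count = max(1, len(normalized) - width + 1)
    windows = (normalized[i:i + width] for i in range(count))
    return frozenset(
        hashlib.sha256(w.encode("utf-8")).hexdigest()[:8] for w in windows
    )


def similarity(sig_a: frozenset, sig_b: frozenset) -> float:
    """Jaccard similarity in [0, 1]; 1.0 means identical signatures."""
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


class CollabFilter:
    """Holds signatures reported by collaborators and tests new mail."""

    def __init__(self, threshold: float = 0.70):
        self.threshold = threshold
        self.known_spam = []

    def receive_report(self, signature: frozenset) -> None:
        self.known_spam.append(signature)

    def is_spam(self, text: str) -> bool:
        sig = fuzzy_signature(text)
        return any(similarity(sig, known) >= self.threshold
                   for known in self.known_spam)


# Alice's filter detects spam and shares only its signature.
alice_spam = "Click here now to claim your free holiday! Limited offer, act today."
bob = CollabFilter(threshold=0.70)
bob.receive_report(fuzzy_signature(alice_spam))

# Bob later receives the same spam carrying a hashbuster.
bob_mail = alice_spam + " xqzjv"
print(bob.is_spam(bob_mail))  # True
print(bob.is_spam("Minutes of the kernel developers meeting are attached."))  # False
```

Because only a small share of the windows changes when a hashbuster is appended, the two signatures stay well above Bob's 70% threshold, whereas an exact digest would not match at all.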

Implementation Challenges
- Homogenization of content from various protocols: abstract message format
- PGP integration for trustworthy collaboration
- News protocol implementation

Results
- Like-for-like testing:
  My filter: 75% accuracy with no false positives
  SpamAssassin: 90% accuracy with no false positives
- Hard to test collaborative filtering
- Reasonable performance, but not really comparable with the bleeding edge

Demonstration

Further Work
- Optimization of configuration variables
  Token thresholds, number of tokens used in testing.
- Optimization of fuzzy hash matching algorithm
  Slow due to attempted rolling window matches
- Addition of other protocols
  Web-based bulletin boards?
- User interface extensions
  Provide a usable mail/news client
- SpamAssassin for news, meta-filtration
  Infrastructure could apply SpamAssassin to news; refactor to allow multiple content testing methods.

Summary
- A content filter which functions acceptably
- Bayesian filtering and fuzzy hash matching are useful
- Sole use of these technologies may not be sufficient
- Combining filters is likely to be the best solution

Any further questions?