The Open Source Stack: One approach to spam filtering Chris St. Pierre Unix Systems Administrator Nebraska Wesleyan University
Breaks Administrivia
Administrivia Can turn your cell phone off.
Terminology Spam isn't an abbreviation or acronym. UCE (Unsolicited Commercial Email) and UBE (...Bulk...) Spam is more than spam: phishing, 419 scams, lottery scams, pump and dump, viruses, etc. Things to avoid: False positives (FPs): legit email marked spam False negatives (FNs): Spam marked legit
Goals Make your users happy Users with control are happier than users without control An FP is always worse than an FN
The Stack Approach There's no magic bullet that will kill all spam Zeno's Paradox Every tool we use will get rid of a little more spam Cost-benefit analysis
Other Approaches Pay someone a lot of money Pure whitelisting Challenge-response (C & R) Pray
Disclaimer This is just one approach to spam filtering. There are many other approaches that may be just as effective. Your anti-spam solution must be tailored to fit your environment, not mine. If something I recommend doesn't work for you, ditch it!
The Stack 1. Honeypots 2. RBLs 3. Greylisting 4. HELO (and other) restrictions 5. Tarpitting 6. ClamAV 7. SpamAssassin 8. End-user tools 9. Statistics
Order is important If you can discard or reject messages before accepting them, this saves you valuable resources Never accept a message you don't have to
Basics NEVER bounce spam or viruses Don't be a jerk and cause backscatter! Reject with a 5xx error code Discarding is also bad, but sometimes we do it anyway NEVER forward to off-site addresses before filtering You will get blacklisted for spamming
1. Honeypots Create a fake address and publicize it; ban anyone who sends to it Remarkably ineffective Better approach: honeypot MX
Aside: secondary MXes Just Say No.
2. RBLs Realtime Blackhole List (or DNSBL: DNS Blacklist) Someone else has done all the work for you. Yay! Run a caching nameserver When blocking based on RBL, you must avoid FPs http://www.usenix.org/publications/login/2006-12/pdfs/josephsen.pdf The big question: what RBLs to use?
Live RBL Revue! Only a few are worth considering: zen.spamhaus.org is excellent. Includes SBL, XBL, and PBL. Costs some cash for nonpersonal use; cbl.abuseat.org is free, and is one of their sources SpamCop got a bad reputation early on, but they're doing a great job now (bl.spamcop.net) The Passive Spam Block List (psbl.surriel.com) works much better than you might suspect Nothing else I've found or heard of is worth using
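As a rough illustration of what an RBL check does under the hood, here is a minimal Python sketch. It assumes IPv4 only and uses the conventional DNSBL query form (reversed octets prepended to the list's zone). The function names are made up for this example, and a real MTA does this lookup for you.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers with an A record for the address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No record, or the lookup failed: treat as "not listed" (fail open),
        # since an FP is worse than an FN.
        return False
```

For example, `dnsbl_query_name("192.0.2.99", "zen.spamhaus.org")` yields `99.2.0.192.zen.spamhaus.org`. A production check would probably also inspect the returned address, since lists use different return codes for different sublists.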
3. Greylisting Overview: Greylisting identifies each message with a unique triplet: sender, recipient, originating server. The first time it sees a given triplet, it gives a 4xx (tempfail) code Legitimate servers will retry, at which point the triplet will be recognized and accepted Spammers don't waste resources on retries Can block a lot of spam
3. Greylisting, continued Greylist on the /24 netblock of the originating server Retry time doesn't matter, because spammers don't retry. (5 minutes is sort of the standard.) Auto-whitelist and auto-blacklist
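The triplet logic can be sketched in a few lines of Python. This is a toy in-memory model, not how Policyd or SQLGrey are implemented: the class name, the reply strings, and the 300-second default are assumptions for illustration, and it handles IPv4 only. The client address is reduced to its /24 so that retries from a neighboring server in the same pool still match.

```python
import ipaddress


class Greylist:
    """Toy in-memory greylist keyed on (sender, recipient, client /24)."""

    def __init__(self, delay_seconds: int = 300):
        self.delay = delay_seconds
        self.first_seen = {}  # triplet -> timestamp of the first attempt

    @staticmethod
    def triplet(sender: str, recipient: str, client_ip: str):
        # Collapse the client address to its /24 network (IPv4 assumed).
        net = ipaddress.ip_network(f"{client_ip}/24", strict=False)
        return (sender.lower(), recipient.lower(), str(net.network_address))

    def check(self, sender: str, recipient: str, client_ip: str, now: float) -> str:
        key = self.triplet(sender, recipient, client_ip)
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "250 OK"  # known triplet, delay has passed: accept
        return "451 4.7.1 Greylisted, please try again later"  # tempfail
```

A first attempt at time 0 gets the 451 reply. A retry from any address in the same /24 at or after 300 seconds is accepted. Auto-whitelisting and expiry of old triplets are left out of the sketch.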
3. Greylisting, continued Find a greylisting server with a sizable preconfigured whitelist If you have >1 MX, look for a greylisting server that supports a shared database Policyd is wonderful, but is Postfix-only SQLGrey is quite nice and works with both Postfix and Exim RelayDelay is the closest I've found for Sendmail
4. HELO (and other) restrictions Lots of fun stuff! Site-specific whitelists/blacklists Reject non-fqdn HELOs and HELOs with bad syntax Reject mail to unknown recipients! Reject HELOs that resolve to bogons http://www.cymru.com/documents/bogon-bnagg.txt
4. HELO restrictions, continued HELO Randomization Protection (HRP) Reject mail when the HELO name has no MX or A record? Well-configured HELO restrictions can drop about 25% of your spam
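To make the "non-FQDN and bad syntax" checks concrete, here is a small Python sketch of a syntax-only HELO test. It is an approximation written for this example (the function name and reason strings are invented). It does no DNS lookups, so the bogon and MX/A-record checks are out of scope, and it deliberately lets address literals such as `[192.0.2.1]` through unchecked.

```python
import re

# One DNS label: 1-63 letters, digits, or hyphens, not starting or ending with "-".
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def helo_problem(helo: str):
    """Return a reason string if the HELO name looks rejectable, else None."""
    name = helo.rstrip(".")
    if name.startswith("[") and name.endswith("]"):
        return None  # address literal; not validated in this sketch
    labels = name.split(".")
    if len(labels) < 2:
        return "not fully qualified"
    if len(name) > 253 or not all(_LABEL.match(label) for label in labels):
        return "bad syntax"
    return None
```

In practice the MTA's own restriction options do this job; the sketch only shows what the checks amount to. A bare dotted-quad IP passes this test, so it should be rejected separately if desired.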
5 (or 0). Tarpitting Make a connection very slow (or just pause) Spammers are impatient Claims of 80% block rates Two ways to implement: Pre-MTA wrapper Within the MTA (e.g., milter) Most connections are dropped after about a minute
5 (or 0). Tarpitting, continued Two years ago, this presentation had this line: "Tarpitting is fairly new, so software is rare as of this writing." Tarpitting never really caught on, so software is still fairly rare. Implementations: GreetPause (sendmail) OpenBSD SpamD Several commercial products
Changeup! Up to here, we've been talking about discarding messages After this, we'll assume you've already accepted the message This is filtering, and it's expensive
Aside: What about filtering integrators? Amavis, MailScanner, etc. Generally, not worth it Not a lot of supplementary functionality of consequence, but that's changing They remove you one step from your component configuration, and whether they make the integration any easier is up for debate
Aside: What about filtering integrators? Cost: additional complexity of setup and maintenance; one more thing to break Benefit: Some (often minor) features Conclusion: Getting more useful every year
6. ClamAV ClamSMTPD is a great integrator Not just antivirus; anti-phishing par excellence In addition to the standard rules, use http://www.sanesecurity.com/clamav Exclude the SpamDomain rulesets Keep it updated and ClamAV will Just Work Drop viruses on the floor
7. SpamAssassin This could be a class of its own. We'll cover: a)basics b)bayes c)checksumming systems (Razor2, DCC, Pyzor) d)uribl e)sare rulesets f) Plugins g)miscellaneous score adjustments h)alternatives?
a. Basics SpamAssassin does not filter spam SA scores mail with a bunch of tests. Each test can add a few points to the score or subtract a few from it. If the mail has over a certain number of points, it gets marked as spam, not filtered. The default required_hits value is 5, which tends to work well Keep your rules up to date! SA 3.1+ includes sa-update
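The scoring model is simple enough to show in a few lines of Python. This is only a sketch of the idea, not SpamAssassin's code: the rule names and scores are hypothetical, and comparing with `>=` against the threshold is an assumption of this example.

```python
REQUIRED_SCORE = 5.0  # the default threshold mentioned above


def classify(hits, scores, required=REQUIRED_SCORE):
    """Sum the scores of the rules that hit and compare to the threshold.

    hits   -- names of the rules that matched this message
    scores -- mapping of rule name to score (may be negative)
    Returns (total, is_spam). The message is only marked, never dropped.
    """
    total = sum(scores.get(rule, 0.0) for rule in hits)
    return total, total >= required


# Hypothetical rules and scores, for illustration only.
example_scores = {"EXAMPLE_URIBL": 3.0, "EXAMPLE_BAYES_HIGH": 2.5, "EXAMPLE_WHITELIST": -1.0}
```

With these made-up scores, a message hitting `EXAMPLE_URIBL` and `EXAMPLE_BAYES_HIGH` totals 5.5 and is marked as spam. If `EXAMPLE_WHITELIST` also hits, the total drops to 4.5 and the message is not marked. What happens to a marked message is up to the delivery stage.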
b. Bayes You can keep your Bayesian database in either flat files, or in a real DB Use a real database if you have >1 MX Let your users report FPs and FNs, and train Bayes on it Use bayes_auto_learn to ensure a constant feed
b. Bayes, continued Train train train! DO NOT train Bayes on public corpora DO NOT train Bayes on your outgoing mail The SA Bayes engine isn't the greatest One solution (?): crm114 plugin http://mschuette.name/wp/crm114-spamassassin-plugin/
c. Checksumming systems Razor2, DCC, Pyzor They're all free now Razor2 rawks hard DCC gives lots of FPs, because it just measures bulkiness, not spamminess Both Razor2 and Pyzor have very low FP rates
d. URIBL Checks the URLs in an email against a blacklist This is wonderful Crank these scores If none of your top ten rules are URIBL_*, something is wrong
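A rough Python sketch of the first half of a URIBL check: pull hostnames out of the message body and form the names that would be queried. The zone `uribl.example` is a placeholder, not a real list, and the function names are invented. Real implementations also reduce each hostname to its registered domain before querying, which this sketch skips.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL matcher; real mail needs far more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_hosts(body: str) -> set:
    """Return the lowercase hostnames of all http(s) URLs found in body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL (e.g. a broken IPv6 literal): skip it
        if host:
            hosts.add(host.lower())
    return hosts


def uribl_query_names(body: str, zone: str) -> list:
    """Build the DNS names to look up for each hostname, in sorted order."""
    return sorted(f"{host}.{zone}" for host in url_hosts(body))
```

Each resulting name would then be resolved in the same way as an RBL lookup; an answer means the domain is listed.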
e. Third-Party Rulesets Additional rules that block lots of stock scams, image spam, etc. SpamAssassin Rule Emporium (SARE) Howto: http://daryl.dostech.ca/sa-update/sare/sare-sa-update-howto.txt http://www.rulesemporium.com/rules.htm Most rulesets have 2-4 options, increasing in aggressiveness KAM http://www.peregrinehw.com/downloads/spamassassin/contrib/kam.cf
e. Third-Party Rulesets, cont'd Extra rules from SpamAssassin http://wiki.apache.org/spamassassin/customrulesets See especially the Sought ruleset Sets for other languages
f. Plugins There are lots out there, but four major ones you need to know: Botnet: tries to identify mail from botnets Lots of FPs, not a lot of real positives http://people.ucsc.edu/~jrudd/spamassassin/ PDFInfo: ImageInfo for PDF attachment spam http://www.rulesemporium.com/plugins.htm
f. Plugins, continued ImageInfo: looks for broken or suspicious image attachments Together with the SARE rules, is very good at stopping image spam Doesn't use OCR or other processor-intensive tests Consider it a necessity Included in SA 3.2+ http://www.rulesemporium.com/plugins.htm
f. Plugins, continued Custom plugins are beyond the scope of this tutorial Try to write rules instead of plugins Check out http://wiki.apache.org/spamassassin/dumptextplugin for a good sample plugin and a nice place to start
g. Miscellaneous score adjustments Tweak and frob scores to suit your environment Track: Which rules are hitting frequently and what they're hitting on (ham or spam) Which rules give you frequent FPs and FNs
g. Miscellaneous score adjustments Many rules are disabled (score = 0). Enable all tests initially to see if any of the disabled rules hit reliably: egrep 'score.*\s0$' /usr/share/spamassassin/50_scores.cf | awk '{print $1, $2, "0.1"}' > all-rules.cf
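For anyone who would rather do this step in a script, here is a Python sketch of the same idea: find `score` lines whose last field is `0` and emit them with a small test score. The function name is invented. Unlike the one-liner above, it requires `score` to be the first field, so commented-out lines are skipped. The rule names in the usage example are hypothetical.

```python
def enable_zero_scored(lines, test_score="0.1"):
    """Return 'score RULE <test_score>' lines for rules whose last score is 0."""
    out = []
    for line in lines:
        fields = line.split()
        # Expect at least: score RULE_NAME <value...>, ending in a literal 0.
        if len(fields) >= 3 and fields[0] == "score" and fields[-1] == "0":
            out.append(f"score {fields[1]} {test_score}")
    return out


# Usage with made-up rule names:
sample = [
    "score EXAMPLE_DISABLED 0",
    "score EXAMPLE_ACTIVE 1.5",
    "# score EXAMPLE_COMMENTED 0",
]
new_rules = enable_zero_scored(sample)
```

Reading `50_scores.cf` and writing the result to `all-rules.cf` is then a matter of ordinary file I/O.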
h. Alternatives? Dspam, Bogofilter, others Dspam and Bogofilter violate the stack model; they only use Bayes SA uses Bayes, plus other plugins and rule-based tests
8. End-user tools Clients must, at a minimum, be able to report FPs and FNs Learn (with Bayes) and automatically whitelist/blacklist per-user based on what they report Let your clients configure their own filtering levels Forget quarantining Policies
8. End-user tools Let your clients configure their own whitelists and blacklists Ideally, whitelisting a sender should get them past RBLs, tarpitting, greylisting, etc., for the recipient(s) who whitelisted them Really really difficult Also ideally, generate whitelists from address books Whitelisting can be dangerous, since it relies on addresses, not Received: headers
9. Statistics You need statistics for four reasons: 1.Everyone likes pretty pictures 2.Track the effectiveness of your filters 3.Plan for and justify growth 4.Spot anomalies
9. What kind of statistics? Both graphs/charts and hard numbers General mail statistics are a prerequisite What is your ratio of ham to spam? How much spam are you delivering to mailboxes? How many viruses are you getting? How much is filtered out by tarpitting/greylisting/rbls/etc.?
9. What kind of statistics? What are your spam scores? (Min/max/avg) Are there any trends? How long does it take to scan a message? What is your average time-to-delivery? What SA rules are hitting the most? (On ham? On spam?) Which are the best or most reliable rules? What viruses is ClamAV finding?
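Several of these questions can be answered from a list of per-message scan records. The Python sketch below assumes a made-up record shape (`score`, `is_spam`, `rules`); how those records are extracted from your logs is site-specific and not shown.

```python
from collections import Counter
from statistics import mean


def summarize(messages):
    """Summarize scores and rule hits for a non-empty list of scan records.

    Each record is a dict with keys: score (float), is_spam (bool),
    rules (list of rule names that hit). This shape is an assumption.
    """
    scores = [m["score"] for m in messages]
    hits = {"ham": Counter(), "spam": Counter()}
    for m in messages:
        hits["spam" if m["is_spam"] else "ham"].update(m["rules"])
    spam_count = sum(1 for m in messages if m["is_spam"])
    return {
        "min": min(scores),
        "max": max(scores),
        "avg": mean(scores),
        "spam_ratio": spam_count / len(messages),
        "hits": hits,  # per-class rule hit counts
    }
```

`summary["hits"]["spam"].most_common(10)` then lists the rules hitting most often on spam, and the same call on `"ham"` shows which rules are candidates for FP trouble.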