Major topics.

Technical Info:


 

MyRBL / Smart Filter -- New Spam Handling Process

 

How it works - Summary

Each incoming message is analyzed for as many significant 'features' as can be found. The pre-generated feature processor rule file then combines those features to estimate the probability that the message is spam. If the message reaches a minimum level, it is placed in the user's 'spam' folder and the sender is notified with a URL (protected by a CAPTCHA). This allows a real sender to bypass the 'spam' folder and ensure delivery.

The user can also access their spam folder via IMAP or the SurgeWeb email interface if they suspect a message may have been misclassified.

This system has several key benefits:

  1. No messages are blocked or dropped. Failed deliveries due to filtering mistakes are enormously annoying, and no filter is ever accurate enough to justify that kind of heavy-handed mechanism.
  2. The user does not 'see' the filtered messages unless they wish to.
  3. A real sender can easily get their message delivered directly to the user's inbox by proving they are human.
  4. By identifying many different 'features' using different modules/methods, the results are more robust than any single mechanism for identifying spam.
  5. System admins can add their own 'features' and the system will correctly analyze the 'value' of those features, so the system's reliability will not be degraded by one badly chosen rule/score.

Filter Module Summaries:


Some more technical details

 

How sf_mfilter.txt feeds feature_gen.dat (note: aspam_mfilter.txt is replaced with sf_mfilter.txt in this version)

The rule file sf_mfilter.txt now produces a list of significant features for any email message. The features are then analyzed using the rules in feature_gen.dat to come up with a 'score', so the scores are not 'hard coded' into sf_mfilter.txt.

The file feature_gen.dat is created by analyzing sample messages from your own server. Say we have a feature 'blob' which correlates 98% with spam on your server and 20% with spam on my server. In other words, an email with the feature 'blob' is spam 98 times out of a hundred on your server, while on my server it is not spam 80 times out of a hundred. Then on your server the score in the X-SpamDetect header will be something like 'plus 10' for an email with 'blob', and on my server it will be 'minus 4'.

The feature 'blob' might relate to something like the length of the 'To' header, or whether or not the SPF tests passed, etc.
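As a rough illustration of the per-server idea, the sketch below computes the fraction of sample messages carrying each feature that are spam. It is not SurgeMail's actual training code: the function name feature_spam_rates and the sample data are hypothetical, chosen to reproduce the 98% and 20% figures from the 'blob' example.

```python
from collections import Counter

def feature_spam_rates(samples):
    """samples: iterable of (features, is_spam) pairs.
    Returns {feature: (spam_rate, message_count)}."""
    seen = Counter()   # messages carrying each feature
    spam = Counter()   # spam messages carrying each feature
    for features, is_spam in samples:
        for name in set(features):
            seen[name] += 1
            if is_spam:
                spam[name] += 1
    return {name: (spam[name] / seen[name], seen[name]) for name in seen}

# Hypothetical sample sets for two different servers.
your_server = [({"blob"}, True)] * 98 + [({"blob"}, False)] * 2
my_server = [({"blob"}, True)] * 20 + [({"blob"}, False)] * 80

print(feature_spam_rates(your_server)["blob"])  # (0.98, 100)
print(feature_spam_rates(my_server)["blob"])    # (0.2, 100)
```

The same feature ends up with a very different spam rate on each server, which is why the score is derived from local samples rather than written into the rule file.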

Then, in addition to simple rules, the automatic process generates combinational rules based on your sample messages. For example, it might notice that a message which is from Yahoo and has a long 'To' header is always spam. These 'combined' rules are also used to further increase accuracy.
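One way to picture combined rules is to count, for every pair of features, how often each present/absent combination turns out to be spam. The sketch below does this for a small hypothetical sample. The function name, the feature names and the min_count cutoff are assumptions for illustration only; the real generator may select and weight pairs quite differently.

```python
from itertools import combinations, product

def combined_rule_rates(samples, min_count=20):
    """Return {"a,b" style label: spam_rate} for feature pairs.
    A leading '!' on a name means the feature must be absent."""
    names = sorted({f for feats, _ in samples for f in feats})
    rules = {}
    for a, b in combinations(names, 2):
        for want_a, want_b in product((True, False), repeat=2):
            hits = [is_spam for feats, is_spam in samples
                    if (a in feats) == want_a and (b in feats) == want_b]
            if len(hits) >= min_count:  # skip pairs with too few samples
                label = f"{'' if want_a else '!'}{a},{'' if want_b else '!'}{b}"
                rules[label] = sum(hits) / len(hits)
    return rules

# Hypothetical samples: Yahoo plus a long 'To' header is always spam here.
samples = (
    [({"from_yahoo", "long_to"}, True)] * 30
    + [({"from_yahoo"}, False)] * 40
    + [({"long_to"}, False)] * 25
    + [({"long_to"}, True)] * 5
)
rules = combined_rule_rates(samples)
print(rules["from_yahoo,long_to"])   # 1.0
print(rules["from_yahoo,!long_to"])  # 0.0
```

Neither feature is decisive on its own in this sample, but the pair is, which is the extra information a combined rule captures.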

 

Built-in RBL / Reputation system

SurgeMail now includes its own RBL (Realtime Blocking List) system and Reputation system. This is a two-level database: a local database on each server, plus a reporting system and DNS-based query system that merges data between all SurgeMail servers in the world.

This system classifies each IP address into one of the following categories:

Unknown = 98% spam
Blue = less than 10 days old, nothing significant known (typically 70% spam)
Brown = 95% spam
Orange = 40% spam
Yellow = 20% spam
Black = 99% spam
White = less than 4% spam

As most 'real' email comes from servers you talk to all the time, this system quickly identifies the trusted mail servers that never send spam, so that messages from those servers are very unlikely to be accidentally classified as spam.

 

The advantages this has over traditional RBL services (which should also be used, of course):

It is free to use

It makes use of the many users clicking 'spam/not spam' on messages as they read them to help identify spam more accurately.

It provides both positive and negative indications for the spam filter. This is much more valuable than the purely negative responses given by many RBL services, because the bulk of spam comes from 'unknown' transient IP addresses, so the significant information is the list of known mail servers that regularly send non-spam.

It is also a long-term reputation RBL: instead of automatically forgetting everything every 2 days, like many RBL systems, we try to store a long-term record of stats for each IP address. This database can be searched here: http://reputation-email.com/reputation/index.htm


Management commands/settings etc

 

tellmail commands:

tellmail sf_train - Rebuild feature_gen.dat from sf_mfilter.txt using local data in 'train' subdirectories

tellmail sf_compare - Test feature_gen.dat against the 'train' subdirectories.

tellmail friends_url - Show a sample URL for unblocking a message. Use it to test that your web access/ports are set correctly.

New Optional settings:

g_myrbl_share "true" - Share IP reputation information with netwinsite.com (strongly recommended, this setting really helps contribute to the wide area rbl which all customers benefit from)
g_sf_generate "true" - Generate feature_gen.dat locally rather than using a standard generic one from NetWin. This is worth setting once you have a reasonable sample collected (surgemail automatically collects sample messages within a few days)

g_friends_lang_auto "true" - Guess the user's language(s) by observing messages from each user's friends, then add a tag if the user receives a message which is primarily in a language that the user does not have listed. The user's language settings are prefixed with the word 'Auto,' when this setting is used, so users who have manually set their language(s) will not be adjusted.

Brief outline of how a message is processed

    1. Get color from RBL/Myrbl/Surbl etc...
    2. Run sf_mfilter.txt to find 'features' of message
    3. Score message using feature_gen.dat and then bounce with url or give to user.
    4. Run mfilter.rul file
    5. If from friend accept
    6. If exceeds friends setting then bounce message and store in 'spam' folder.
    7. Deliver to inbox
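
The outline above can be read as a simple decision function. The sketch below is only one interpretation of those steps, not SurgeMail's implementation: every callable passed in (get_color, find_features, score, run_local_rules, is_friend), the spam_threshold value and the idea of treating the color as an "X-myrbl:" feature are assumptions made for illustration.

```python
def process_message(msg, get_color, find_features, score,
                    run_local_rules, is_friend, spam_threshold):
    """Return 'inbox' or 'spam' for one message, following the outline."""
    color = get_color(msg["ip"])                           # step 1: reputation color
    features = find_features(msg) | {f"X-myrbl:{color}"}   # step 2: features
    total = score(features)                                # step 3: score features
    run_local_rules(msg)                                   # step 4: local rule file
    if is_friend(msg["to"], msg["from"]):                  # step 5: friends pass
        return "inbox"
    if total > spam_threshold:                             # step 6: too spammy
        return "spam"   # the sender would also be sent the unblock URL here
    return "inbox"                                         # step 7: deliver

# Hypothetical stand-ins so the sketch can be run on its own.
stubs = dict(
    get_color=lambda ip: "unknown",
    find_features=lambda m: {"long_to"},
    score=lambda feats: 7.8,
    run_local_rules=lambda m: None,
    spam_threshold=6.0,
)
msg = {"ip": "192.0.2.1", "from": "a@example.com", "to": "b@example.net"}
print(process_message(msg, is_friend=lambda to, frm: False, **stubs))  # spam
print(process_message(msg, is_friend=lambda to, frm: True, **stubs))   # inbox
```

The point of the ordering is that a known friend is accepted regardless of score, while everything else is compared against the user's threshold.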

Features to stop cracking local accounts and sending out spam

  1. g_breakin_enable "true" - used to stop a spammer sending from multiple (3+) ip addresses. (g_breakin_white can be used in rare problem cases, e.g. g_breakin_white "user1@domain.com,user2@domain2.com,*@domain3.com")
  2. g_user_send_warning - alert manager when user sends too many messages.
  3. g_user_send_max max="500" - limit users to a modest daily total
  4. g_safe_smtp "true" - stops a user logging into surgemail to send email without first logging into IMAP or POP. This will stop most (but not all) spammers in their tracks even after they hack into an account. It won't usually cause people problems, but it might on rare occasions.

Explanation of the X-SpamDetect header

Here is an example header:

*******: 7.8 sd=7.8 [194]99%13.1(!9,46) [126]10%-7.2(!33,108) [38]87%5.4(X-myrbl:unknown)"

This shows a score of 7.8, then a list of the rules that were applied, separated by spaces. There are two sorts of rules: simple rules and combination rules.

A combination rule looks like this: [rulenumber]percent%score([!]a,[!]b)

rulenumber = this rule number as listed in feature_gen.dat

percent = The percentage of messages matching this rule that are spam

score = The score generated from the percent. Anything over 50% generates a positive score; anything below 50% generates a negative score.

(a,b) = The two rules which were true that made this combined rule true; ! signs are used to indicate 'not'.

A simple rule looks like this: [38]87%5.4(X-myrbl:unknown)

rulenumber = this rule number as listed in feature_gen.dat

percent = The percentage of messages matching this rule that are spam

score = The score generated from the percent. Anything over 50% generates a positive score; anything below 50% generates a negative score.

(a:b) = The header and value that were 'matched' that made this rule true. If no header is specified then it's a feature as defined in sf_mfilter.txt

The total sd=7.8 is not a simple sum of the rule scores but an 'average' of the rules that matched, offset by 4: sum(scores)/n + 4. For the example header above, (13.1 - 7.2 + 5.4)/3 + 4 is approximately 7.8.
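
To check the arithmetic, the sketch below pulls the rule entries out of the example header value and recomputes the total from the sum(scores)/n + 4 formula. The regular expression and the function names parse_rules and total_score are assumptions based only on the layout shown here; the real header may contain forms this pattern does not cover.

```python
import re

# [rulenumber]percent%score(details), as described above.
RULE = re.compile(r"\[(\d+)\](\d+)%(-?\d+(?:\.\d+)?)\(([^)]*)\)")

def parse_rules(header_value):
    """Return a list of (rulenumber, percent, score, details) tuples."""
    return [(int(num), int(pct), float(score), details)
            for num, pct, score, details in RULE.findall(header_value)]

def total_score(rules, offset=4.0):
    """Average of the matched rule scores, shifted by the offset.
    Assumes at least one rule matched."""
    scores = [score for _, _, score, _ in rules]
    return sum(scores) / len(scores) + offset

example = ("7.8 sd=7.8 [194]99%13.1(!9,46) [126]10%-7.2(!33,108) "
           "[38]87%5.4(X-myrbl:unknown)")
rules = parse_rules(example)
for rule in rules:
    print(rule)
print(round(total_score(rules), 1))  # 7.8
```

Entries whose details contain a comma are combination rules, and the entry with a header:value pair is a simple rule.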

 


How to enable the new system (for those upgrading)


How to disable the new system (if you must)

To disable the new Smart Filter mechanisms and return to the old behaviour!

Only use these settings if you really must :-)

g_myrbl_disable "true"
g_sf_disable "true"
g_friends_byemail "true"
g_spf_byemail "true"


How to TUNE it for your users/system (help I've got more spam or more bounces)


How to convert your local.rul file to sf_mfilter_local.txt

You can still tailor your own rules with this new system. However, we suggest you try the built-in rules first and see how they perform.

 

If you are going to use 'local' rules then you must first enable local training:

g_sf_generate "true"

When adding rules (e.g. converting an existing local.rul file) you will need to change the actions, choosing from these possibilities:

  1. Add a manual score - call feature_manual(0.8, "Manual addition")
  2. Add a self tuning score based on your spam sample - call feature_add(1.4,"featurename")

In the second case the score '1.4' is ignored. So the recommended method is to convert all call spamdetect(x,y) statements to call feature_add(x,y)
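
For a large local.rul file the recommended conversion can be done mechanically. The sketch below only renames the call and leaves the arguments untouched. It assumes the statements are written as 'call spamdetect(' in the way shown above; the sample rule line is hypothetical, and the result should be reviewed by hand before being saved as sf_mfilter_local.txt.

```python
import re

def convert_rules(text):
    """Rewrite 'call spamdetect(' as 'call feature_add(' and leave
    everything else in the rule text unchanged."""
    return re.sub(r"\bcall\s+spamdetect\s*\(", "call feature_add(", text)

old_rule = 'call spamdetect(1.4, "featurename")'  # hypothetical rule line
print(convert_rules(old_rule))  # call feature_add(1.4, "featurename")
```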

In the first case the value 0.8 is NOT the value added to the spam score; it is the probability that such a message is spam, so a value of 0.99 might add 12 to the spam score. A value above 0.5 will add a positive value to the spam score, and a value below 0.5 will decrease the spam score.

You should only use 'manual' rules when the feature is so rare that your sample data does not give useful figures on it. In that case the rule is probably of little or no value, so we suggest you don't use it at all :-) But there are exceptions where the sample spam messages will tend to give the wrong result (as the sample is not entirely random), so a manual rule might make sense.

 

A good example of when to use manual rules is an RBL service: you already know the probability of the message being spam is very high, so you can improve results by using the manual rule with a nice high value like 0.9.

 

Then run the following commands:

tellmail sf_train

tellmail sf_compare

The first will generate a feature_gen.dat rule file and the second will use it to compare results with the sample spam folders.

If you examine feature_gen.dat after the sf_train command you will be able to see what surgemail thought of each feature and how significant it was (sig = the number of messages with the feature). A feature with a probability near 0.5, or one that occurs fewer than 20 times in the sample, is probably of little use. Near 0.0 means the feature implies the message is not spam; near 1.0 means the feature correlates with spam.
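
The rule of thumb above can be written as a small filter. The sketch below does not read feature_gen.dat (its file format is not described here) and instead works on hypothetical (name, probability, sig) values. The dead_zone width of 0.1 around 0.5 is an assumption standing in for 'near 0.5'.

```python
def worth_keeping(probability, sig, min_sig=20, dead_zone=0.1):
    """True if the feature is both common enough and far enough from 0.5."""
    return sig >= min_sig and abs(probability - 0.5) > dead_zone

# Hypothetical features: (name, probability of spam, sig).
features = [
    ("long_to", 0.93, 410),    # strong spam indicator
    ("has_html", 0.52, 900),   # common, but says almost nothing
    ("rare_word", 0.99, 7),    # too few samples to trust
    ("spf_pass", 0.08, 350),   # strong non-spam indicator
]
useful = [name for name, p, sig in features if worth_keeping(p, sig)]
print(useful)  # ['long_to', 'spf_pass']
```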

We are always interested in new features you make up that prove useful. Be cautious that some features can give misleading results due to the nature of the sample messages.

 

 

Tips for users to avoid spam:

Never put your email address on a web page, instead use a service like this one: http://www.emailmeform.com/

What if it doesn't work at all?

If the scoring is completely blank or if you see this text in the headers:

X-SpamDetect: : 0.0 sd=0 feature_gen.net (or .dat) is blank or missing, update from netwinsite failed see netwinsite.com/surgemail/help/myrbl.htm for help

It might mean you are running a new build with the new spam handling mechanism, and most likely it has failed to pick up its main rule file, so it is not applying any rules at all.

It might fail if you don't have updates, or if you have a firewall blocking port 80 outgoing connections from your server. Once you fix the problem you can 'trigger' an update automatically by deleting aspam_update.done and restarting surgemail.

The two files you need are:

sf_mfilter.txt
feature_gen.net

They should be fetched automatically from netwinsite, but that can fail if your firewall is blocking port 80 connections. In that case you can download them manually and then restart surgemail.

http://netwinsite.com/surgemail/sf_mfilter.txt
http://netwinsite.com/surgemail/feature_gen.net

Or you can disable the new system with this setting: g_sf_disable "true"