Weekly Report for week ending 6 May 2011




Further refined the spam classification for my existing dataset based on
the SpamAssassin logs. Building the state machine for the new data shows
every flow tagged as spam going through the same set of transitions
(corresponding to 550 errors and exiting), which makes sense seeing as
anything considered spam gets rejected. From that point on it is very
clear which flows are spam and which aren't, but the small amount of spam
left in the dataset isn't enough to differentiate any of the preceding
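
As a rough illustration of the state machine idea above, here is a minimal
sketch that counts transitions between SMTP reply codes across flows. It
assumes each flow has already been reduced to an ordered list of server
reply codes; the function name, the START/END markers and the example flows
are all hypothetical and are not taken from the actual generation code.

```python
from collections import Counter


def build_transitions(flows):
    """Count (source, destination) state transitions across all flows.

    Each flow is assumed to be an ordered list of SMTP reply codes.
    Synthetic START and END states mark where a flow begins and finishes.
    """
    transitions = Counter()
    for codes in flows:
        path = ["START"] + list(codes) + ["END"]
        # Pair each state with the one that follows it.
        for src, dst in zip(path, path[1:]):
            transitions[(src, dst)] += 1
    return transitions


# Hypothetical flows: an accepted message and one rejected with a 550.
ham_flow = [220, 250, 250, 250, 354, 250, 221]
spam_flow = [220, 250, 250, 550, 221]

counts = build_transitions([ham_flow, spam_flow, spam_flow])
for (src, dst), n in sorted(counts.items(), key=str):
    print(src, "->", dst, n)
```

With flows like these, the rejected ones all share the transition into 550
followed by the exit, while everything before that point overlaps with the
accepted flow.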

Started looking at some of our recent ISP traces to build a larger dataset
with more spam flows. The data is more useful with some idea as to which
flows are spam and which are ham, so I've used the Spamhaus block lists to
get an approximate classification. The data is current enough that the
block lists should be fairly relevant and accurate, and if this looks
promising I can capture new data or perhaps try to get access to mail
server logs. At the moment the state machine generation code is being run
over approximately 1400 SMTP flows (of which one third are spam) to see
how this differs from my old dataset.
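
One way the block list classification could be done is with DNSBL queries,
sketched below. This is an assumption: the report does not say whether the
lists were queried over DNS or used in some other form. The zone name
zen.spamhaus.org is the combined Spamhaus zone; the function names and the
example address are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNSBL query name for an IPv4 address.

    DNSBLs are queried by reversing the address octets and appending the
    list's zone, e.g. 192.0.2.99 becomes 99.2.0.192.<zone>.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone="zen.spamhaus.org"):
    """Return True if the sending address appears on the block list.

    A listed address resolves to an answer in 127.0.0.0/24; an unlisted
    one does not resolve. Other answers are treated as "not listed" here,
    which is an assumption made to avoid counting error responses as hits.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.0.0.")


# Example: the name that would be looked up for a documentation address.
print(dnsbl_query_name("192.0.2.99"))
```

Labelling each flow by the sending address in this way only gives an
approximate spam/ham split, since it reflects the sender's current listing
rather than the content of the message.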

Also made a few updates to the KAREN weathermap and spent some time on
documentation covering how to make similar updates.