Automatically determining interesting RSS feed posts

By | March 15, 2009

One of the interesting applications of automatic categorization of message items is the categorization of feed postings. Feed aggregations like Planet Mozilla often have many more posts than is convenient for most people to keep up with. How do you decide what to read, and what to skip?

Over the last few months I have generalized the Bayesian classifier that ships with Thunderbird and SeaMonkey so that it can be used for this kind of categorization, rather than being limited to spam recognition as originally implemented. I can demonstrate its use with my TaQuilla extension, which allows automatic tagging of message items using the Bayesian classifier. The release available on AMO works with Thunderbird 3 beta 2, but supports only email messages, not RSS feeds. I am extending the mailnews backend in bug 471833 to also apply Bayesian filters to RSS messages. My private patch now does that, so I thought I would demonstrate how it works.

I’ll use my Planet Mozilla feed items from Thunderbird, and apply the tag “Interesting” to posts I want to read. First, I need to do a little training to show the Bayes classifier what is Interesting and what is Not Interesting. So I picked posts from two people and flagged them as Interesting, and posts from two others as Not Interesting. (Note this is a demonstration only, no disrespect is meant to the poor saps whose posts I’m calling Not Interesting here!) For Interesting, I chose posts from Mitchell Baker, and the Rumbling Edge posts from Gary Kwong. For Not Interesting, I chose the Mozilla Community meeting note posts from bsmedberg, and the European Community marketing efforts posts from William Quiviger. I trained on two posts from each author, for a total of eight trained posts. After training, I reran the Bayes filter on the folder to see how effective it was at categorizing posts similar to those I had trained.

So, how did we do with this minimal amount of training?

The Bayes filter assigns each post a score from 0 to 100: 100 for posts that most closely match the category, and 0 for those that match most poorly.
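To make the idea of a 0 to 100 score concrete, here is a minimal sketch of a two-category naive Bayes scorer in Python. It is not the Thunderbird classifier, whose internals are not described here. The class name `TinyBayes`, the tokenizer, the add-one smoothing, and the equal priors are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into a set of lowercase word tokens (illustrative tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


class TinyBayes:
    """Hypothetical two-category naive Bayes scorer, not Thunderbird's code."""

    def __init__(self):
        # Per-category count of trained posts that contain each token.
        self.counts = {True: Counter(), False: Counter()}
        # Number of trained posts per category.
        self.docs = {True: 0, False: 0}

    def train(self, text, interesting):
        self.docs[interesting] += 1
        self.counts[interesting].update(tokenize(text))

    def score(self, text):
        """Return 0..100, where 100 means a strong match for Interesting."""
        log_odds = 0.0  # equal priors assumed
        for tok in tokenize(text):
            # Add-one smoothing so unseen tokens stay neutral.
            p_pos = (self.counts[True][tok] + 1) / (self.docs[True] + 2)
            p_neg = (self.counts[False][tok] + 1) / (self.docs[False] + 2)
            log_odds += math.log(p_pos / p_neg)
        # Clamp to avoid overflow in exp() on very long posts.
        log_odds = max(-50.0, min(50.0, log_odds))
        return round(100 / (1 + math.exp(-log_odds)))


clf = TinyBayes()
clf.train("fixed bugs in thunderbird nightly builds", True)
clf.train("thunderbird nightly fixed mail bugs", True)
clf.train("community meeting notes agenda", False)
clf.train("marketing meeting notes europe", False)

print(clf.score("nightly thunderbird bugs"))  # high score
print(clf.score("meeting notes"))             # low score
```

A post made up only of tokens the sketch has never seen scores 50, which corresponds to the indeterminate case.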

For the Rumbling Edge posts, I had 7 total in my sample, and used 2 for training. For the remaining 5 untrained posts, 4 had a score of 100 showing a strong match, while 1 had a score of 84. Looking at that post, it was quite different from the others, which were mostly lists of fixed bugs.

For the Mitchell Baker posts, of the 5 untrained posts, 4 had scores of 90 to 100, and one had a score of 49, which is indeterminate. So close, but not perfect.

The quality of the match was similar for the “Not Interesting” posts, though of course here our goal is a score of 0 instead of a score of 100. For William Quiviger’s posts, the 7 untrained posts had scores of 0 or 1. For bsmedberg’s Meeting Notes posts, all but 1 of 19 untrained posts had scores of 0, and one had a score of 9.

So overall, this is pretty good categorization with minimal training. In the future, all new posts that arrive will be automatically categorized, with posts having a score above 50 tagged as Interesting, and posts scoring below 50 left untagged. If I am unhappy with the tagging, I can train additional messages by simply applying or removing the tag on the incorrectly categorized post (which is a single keystroke in Thunderbird). To display posts that I want to read, I can set up a saved search folder that shows only Unread Interesting posts, and sort that by the percent match.
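The tagging rule and the saved search can be sketched in a few lines of Python. This is only an illustration of the logic, not how Thunderbird or TaQuilla implement it. The `Post` type and function names are hypothetical, and treating a score of exactly 50 as untagged is an assumption.

```python
from dataclasses import dataclass, field

THRESHOLD = 50  # scores above this get the Interesting tag


@dataclass
class Post:
    """Hypothetical stand-in for a feed item."""
    title: str
    score: int
    unread: bool = True
    tags: set = field(default_factory=set)


def apply_tag(post):
    """Tag or untag a post based on its score (exactly 50 is left untagged)."""
    if post.score > THRESHOLD:
        post.tags.add("Interesting")
    else:
        post.tags.discard("Interesting")


def unread_interesting(posts):
    """Mimic a saved search: unread Interesting posts, best match first."""
    hits = [p for p in posts if p.unread and "Interesting" in p.tags]
    return sorted(hits, key=lambda p: p.score, reverse=True)


posts = [
    Post("Meeting notes", 0),
    Post("Fixed bugs list", 100),
    Post("Borderline essay", 49),
    Post("Already read", 95, unread=False),
    Post("Project update", 84),
]
for p in posts:
    apply_tag(p)

for p in unread_interesting(posts):
    print(p.score, p.title)
```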

When I do this, and apply it to all 578 posts that I currently have in my Planet Mozilla folder, here’s what I get for the top 17 posts:

So I see all but 1 of the Rumbling Edge posts, and all but 1 of the Mitchell Baker posts.

What about the other Interesting posts? The Burning Edge and SeaMonkey posts are very similar in intent to the Rumbling Edge posts, so it is understandable that they are also rated as interesting. Daniel Glazman’s post mentions “messages”, “Thunderbird”, and “mail”, so it is also understandable. The Aza Raskin hit seems more random: the matched tokens are common words like “how” and “they”, which would get averaged away with more training.
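Why would common words get averaged away? A rough illustration, assuming a simple add-one smoothed estimate of how often a token appears in each category (an assumption for this sketch, not necessarily what the Thunderbird classifier computes): with only a handful of trained posts, a word like “how” can land in one category by chance and look like strong evidence, but once it has been seen about equally often in both categories its ratio moves toward 1, which is neutral. The function name and the counts below are made up for the example.

```python
def token_ratio(pos_hits, pos_docs, neg_hits, neg_docs):
    """Ratio of smoothed token frequencies: above 1 favors Interesting,
    below 1 favors Not Interesting, and 1 is neutral."""
    p_pos = (pos_hits + 1) / (pos_docs + 2)
    p_neg = (neg_hits + 1) / (neg_docs + 2)
    return p_pos / p_neg


# Eight trained posts: "how" happened to appear in 2 of the 4 Interesting
# posts and in none of the 4 Not Interesting posts.
early = token_ratio(2, 4, 0, 4)

# Hypothetical later state: 100 trained posts per category, with "how"
# appearing in half of each.
later = token_ratio(50, 100, 50, 100)

print(early)  # roughly 3, looks like evidence for Interesting
print(later)  # 1.0, the token no longer moves the score
```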

So overall, I think this will be a useful tool to add to Thunderbird to assist users with sorting of massive planet-style feed aggregations. I expect that the features demonstrated here will be available beginning with Thunderbird 3 beta 3 (in late April) when the TaQuilla extension is loaded.