
I have recently been inspired to write spam filters in JavaScript, Greasemonkey-style, for several websites I use that are prone to spam (especially in comments). In considering how to go about this, I realize I have several options, each with pros and cons. My goal for this question is to expand on the list I have created, and hopefully determine the best approach to client-side spam filtering with JavaScript.

As for what makes a spam filter the "best", I would say these are the criteria:

  • Most accurate

  • Least vulnerable to attacks

  • Fastest

  • Most transparent

Also, please note that I am trying to filter content that already exists on websites that aren't mine, using Greasemonkey Userscripts. In other words, I can't prevent spam; I can only filter it.

Here is my attempt, so far, to compile a list of the various methods along with their shortcomings and benefits:

Rule-based filters:

What it does: "Grades" a message by assigning a point value to different criteria (i.e. all uppercase, all non-alphanumeric, etc.) Depending on the score, the message is discarded or kept.


Pros:

  • Easy to implement

  • Mostly transparent


Cons:

  • Transparent: it's usually easy to reverse-engineer the code to discover the rules, and thereby craft messages that won't be picked up

  • Hard to balance point values (false positives)

  • Can be slow; multiple rules have to be executed on each message, often using regular expressions

  • In a client-side environment, server interaction or user interaction is required to update the rules
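
Here is a minimal sketch of such a scorer; the specific rules, weights, and threshold are illustrative assumptions and would need tuning against real samples:

    // A rule-based scorer: each matching rule adds points, and a message
    // whose total reaches the threshold is treated as spam.
    const rules = [
      { test: (msg) => msg === msg.toUpperCase() && /[A-Z]/.test(msg), score: 3 }, // all caps
      { test: (msg) => (msg.match(/https?:\/\//g) || []).length > 2,   score: 4 }, // many links
      { test: (msg) => /\b(viagra|casino|free money)\b/i.test(msg),    score: 5 }, // spammy words
      { test: (msg) => msg.length < 10,                                score: 1 }, // very short
    ];

    const THRESHOLD = 5; // too low causes false positives

    function isSpam(message) {
      return rules.reduce((sum, r) => sum + (r.test(message) ? r.score : 0), 0) >= THRESHOLD;
    }

    // In a userscript, flagged comments could simply be hidden:
    // document.querySelectorAll('.comment').forEach((el) => {
    //   if (isSpam(el.textContent)) el.style.display = 'none';
    // });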

Bayesian filtering:

What it does: Analyzes word frequency (or trigram frequency) and compares it against the data it has been trained with. (A minimal sketch follows the pros/cons list below.)


Pros:

  • No need to craft rules

  • Fast (relatively)

  • Tougher to reverse engineer


Cons:

  • Requires training to be effective

  • Trained data must still be accessible to JavaScript, usually in the form of human-readable JSON, XML, or a flat file

  • Data set can get pretty large

  • Poorly designed filters are easy to confuse with a good helping of common words to lower the "spamacity" rating

  • Words that haven't been seen before can't be accurately classified, sometimes resulting in incorrect classification of the entire message

  • In a client-side environment, server interaction or user interaction is required to update the rules
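
Here is a minimal sketch of the word-frequency approach. The shape of the training data is an assumption; in a userscript it would likely ship as pre-trained JSON:

    // Naive Bayes with Laplace (add-one) smoothing, so words the
    // classifier has never seen don't zero out the whole product.
    const training = {
      spam: { messages: 100, totalWords: 600, words: { cheap: 40, pills: 35, click: 25 } },
      ham:  { messages: 100, totalWords: 900, words: { hello: 50, thanks: 30, article: 20 } },
    };

    function tokenize(message) {
      return message.toLowerCase().match(/[a-z']+/g) || [];
    }

    function spamProbability(message) {
      const vocabulary = new Set([
        ...Object.keys(training.spam.words),
        ...Object.keys(training.ham.words),
      ]).size;

      const total = training.spam.messages + training.ham.messages;
      let logSpam = Math.log(training.spam.messages / total); // prior P(spam)
      let logHam  = Math.log(training.ham.messages / total);  // prior P(ham)

      for (const word of tokenize(message)) {
        logSpam += Math.log(((training.spam.words[word] || 0) + 1) /
                            (training.spam.totalWords + vocabulary));
        logHam  += Math.log(((training.ham.words[word] || 0) + 1) /
                            (training.ham.totalWords + vocabulary));
      }
      // Convert the log-odds back to a probability between 0 and 1
      return 1 / (1 + Math.exp(logHam - logSpam));
    }

    // e.g. spamProbability('cheap pills, click here') -> close to 1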

Bayesian filtering- server-side:

What it does: Applies Bayesian filtering server-side by submitting each message to a remote server for analysis. (A minimal sketch follows the pros/cons list below.)


Pros:

  • All the benefits of regular Bayesian filtering

  • Training data is not revealed to users/reverse engineers


Cons:

  • Heavy traffic

  • Still vulnerable to uncommon words

  • Still vulnerable to adding common words to decrease "spamacity"

  • The service itself may be abused

  • To train the classifier, it may be desirable to allow users to submit spam samples for training; attackers may abuse this service as well
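
A minimal sketch of the remote round-trip; the endpoint URL and response shape are hypothetical. In a Greasemonkey script, GM_xmlhttpRequest would be the usual way around the same-origin policy; plain fetch() is shown for brevity:

    async function classifyRemotely(messages) {
      const response = await fetch('https://example.com/api/classify', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // Batching all of a page's comments into one request softens the
        // "heavy traffic" drawback noted above.
        body: JSON.stringify({ texts: messages }),
      });
      return response.json(); // assumed shape: [{ spam: true }, ...]
    }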


Blacklisting:

What it does: Applies a set of criteria to a message or some attribute of it. If one or more (or a specific number of) criteria match, the message is rejected. A lot like rule-based filtering, so see its description for details. (A short sketch follows below.)
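
A minimal sketch of the difference from the weighted scorer above: here a single matching criterion rejects the message outright (the criteria shown are illustrative):

    const criteria = [
      (msg) => /\[url=/i.test(msg),       // BBCode link markup
      (msg) => /(.)\1{9,}/.test(msg),     // any character repeated 10+ times
      (msg) => msg.trim().length === 0,   // empty message
    ];

    const isRejected = (msg) => criteria.some((test) => test(msg));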

CAPTCHAs, and the like:

Not feasible for this type of application. I am trying to apply these methods to sites that already exist; Greasemonkey will be used to do this, and I can't start requiring CAPTCHAs in places where they didn't exist before someone installed my script.

Can anyone help me fill in the blanks? Thank you!

1 Answer


Here are 6 modern solutions to protect web forms from spam, which include adding fields that only spam bots can see and fill in ("honeypot" fields), among others; the honeypot idea is sketched below.
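
A minimal sketch of the honeypot idea, assuming a form you control (which, as the question notes, is not the userscript scenario); the field name and form selector are hypothetical:

    const form = document.querySelector('#comment-form');

    const honeypot = document.createElement('input');
    honeypot.type = 'text';
    honeypot.name = 'website';        // a tempting name for bots
    honeypot.style.display = 'none';  // invisible to human visitors
    honeypot.autocomplete = 'off';
    form.appendChild(honeypot);

    form.addEventListener('submit', (event) => {
      // Humans never see the field, so any value in it means a bot filled it.
      if (honeypot.value !== '') event.preventDefault();
    });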

You can also refer to this link for building a spam-free contact form without using any CAPTCHAs.

