Sunday, March 27, 2011

Algorithm for separating nonsense text from meaningful text

I provided some of my programs with a feedback function. Unfortunately, I forgot to include any spam protection, so users can send anything they want to my server, where every feedback message is stored in a huge database.

In the beginning I periodically checked the feedback, keeping what was usable and deleting the garbage. The problem is that I get 900 feedback messages per day. Only 4-5 are really useful; the rest are mostly two types of gibberish:

  • nonsense: jfvgasdjkfahs kdlfjhasdf (People smashing their heads on the keyboard)
  • languages I don't understand

What I did so far:

  1. I installed a filter to delete any feedback containing "asdf", "qwer" etc... -> only 700 per day

  2. I installed a word filter to delete anything containing bad language -> 600 per day (don't ask - but there are many strange people out there)

  3. I filter out any messages containing letters that are not used in my language -> 400 per day

But 400 per day is still way too much. So I'm wondering if anybody has dealt with such a problem before and knows some sort of algorithm to filter out senseless messages.

Any help would really be appreciated!

From stackoverflow
  • Look up Claude Shannon and Markov models. These lead to a statistical technique for assessing probabilities that letter combinations come from a specified language source.

    Here are some relevant course notes from Princeton University.
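
A minimal sketch of the Markov-model idea, not taken from the answer itself: train a character-bigram model on known-good text, then score a message by its average log-probability per character transition. The names `SAMPLE_CORPUS`, `train_bigram_model` and `avg_log_prob` are made up for illustration, the corpus is a toy stand-in for real feedback, and add-one smoothing is just one possible choice.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + " "

# Toy training text (an assumption); real use would need far more known-good text.
SAMPLE_CORPUS = (
    "the quick brown fox jumps over the lazy dog "
    "this program is great and i would like to see more features "
    "thanks for the feedback function it works well on my computer "
    "please add an option to change the language in the settings "
)

def normalize(text):
    """Lowercase the text and keep only a-z and spaces."""
    return "".join(c for c in text.lower() if c in ALPHABET)

def train_bigram_model(corpus):
    """Count character pairs and how often each character starts a pair."""
    text = normalize(corpus)
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts

def avg_log_prob(message, model):
    """Average log P(next char | current char), with add-one smoothing."""
    pair_counts, first_counts = model
    text = normalize(message)
    if len(text) < 2:
        return float("-inf")
    total = 0.0
    for a, b in zip(text, text[1:]):
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + len(ALPHABET))
        total += math.log(p)
    return total / (len(text) - 1)

MODEL = train_bigram_model(SAMPLE_CORPUS)
```

Messages whose score falls below a cutoff chosen from real data would be treated as nonsense; keyboard mashing tends to score lower than ordinary sentences because most of its letter pairs never occur in the training text.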

  • How about just using an existing implementation of a Bayesian spam filter instead of implementing your own? I have had good results with DSpam.

  • If you're only expecting (or care about) English comments, then why not simply count the number of valid words (with respect to some dictionary) in the uploaded feedback? If the count passes some threshold, accept the feedback. If not, trash it. This simple heuristic could be extended to other languages by adding their dictionaries.

    masfenix : Viagra! Cheap cheap Viagra!
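
The dictionary heuristic could look roughly like the sketch below. `ENGLISH_WORDS` is a tiny stand-in set and the 0.5 threshold is arbitrary; both are assumptions, and a real version would load a full word list. The sketch uses the share of valid words rather than the raw count, so that long messages are not favoured.

```python
import re

# Tiny stand-in dictionary (an assumption); load a real word list in practice.
ENGLISH_WORDS = {
    "the", "a", "is", "it", "this", "program", "great", "and", "to", "of",
    "in", "please", "add", "feature", "works", "not", "thanks", "for", "i", "like",
}

def valid_word_ratio(message, dictionary=ENGLISH_WORDS):
    """Fraction of the words in the message that appear in the dictionary."""
    words = re.findall(r"[a-z']+", message.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)

def looks_meaningful(message, threshold=0.5):
    """Accept the message if enough of its words are known."""
    return valid_word_ratio(message) >= threshold
```

As the comment above points out, this only rejects nonsense; spam written in real words would still pass.
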
  • A slightly different approach would be to set up a system to email the feedback messages to an account and use standard spam filtering. You could send them through gmail and let their filtering take a shot at it. Not perfect, but not too much effort to implement either.

    Rob : Oooh, quick and dirty, hackish and somehow thoroughly disgusting...I love it! :D
    Ross : Upvoted for the uniqueness :)
    Christian Nunciato : +1 for piggybacking off Gmail -- that's probably what I'd do, too; their spam filtering is excellent and as a quick (and quite easy) fix it's definitely worth trying as a first effort. Nice practical and uncomplicated suggestion.
    Alex Fort : +1 from me too. That's the programmer spirit, right there :P
    : But would Gmail really filter out a message that says "qwerty"? Even if so, they also look at the sender, subject, server it's mailed from etc, which would all be the same for his application (they are all sent from this one form to the Gmail account).
    Darius Bacon : If the 'from' address in this scheme is always the same, there's a danger of Gmail just deciding that *that address is a spammer* since it sends so much spam.
  • I had a spamming problem in a guestbook function on one of my sites a (quite long) while ago. My solution was simply to add a little captcha-like Q&A field asking the user "Are you a spamming robot?" Any answer containing the word "no" (letting through "no, i'm not", "nope" and "not at all" too, just for fun...) permitted the user to post...

    The reason I chose not to use captcha was simply that my users wanted a more "cozy" feel to the site, and a captcha felt too formal. This was more personal =)
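
The check described here is presumably just a case-insensitive substring test. A sketch, with the hypothetical function name `passes_robot_question`:

```python
def passes_robot_question(answer):
    """Accept any answer that contains "no", ignoring case."""
    return "no" in answer.lower()
```

Because it is a substring match, "nope" and "not at all" pass, and so does any other answer that happens to contain those two letters.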

  • The simplest method would be to count the occurrences of each letter. E is the most common letter in English, so it should occur the most. You could also check word and digraph frequencies. Have a look here to get lists of the most frequently used letters, digraphs, and words in English.

    0xA3 : This would be good for detecting the language and filtering out unwanted languages. But unfortunately this would not filter nonsense text.
    Marius : It would filter nonsense text, because nonsense text does not have the right statistics. If you randomly hit the keyboard, then E won't be the most typed letter.
    RexE : Statistically, this works for long strings, but not always for short strings. (Note the previous sentence doesn't contain an "E", but that doesn't mean you should mark it as spam.)
    Marius : That is right, but it contains a lot more t's and i's than q's and z's. As long as you have at least a sentence or two, it should work.
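
One way to turn the letter-frequency idea into a score, as a sketch under assumptions: compare the message's letter counts with a reference table using cosine similarity. The percentages in `ENGLISH_FREQ` are commonly quoted approximations, and both cosine similarity and the function name `english_similarity` are choices made for this example, not something the answer specifies.

```python
import math
from collections import Counter

# Approximate relative letter frequencies in English text, in percent.
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.1, "z": 0.07,
}

def english_similarity(message):
    """Cosine similarity between the message's letter counts and ENGLISH_FREQ."""
    counts = Counter(c for c in message.lower() if c in ENGLISH_FREQ)
    if not counts:
        return 0.0
    dot = sum(counts[c] * f for c, f in ENGLISH_FREQ.items())
    norm_msg = math.sqrt(sum(n * n for n in counts.values()))
    norm_ref = math.sqrt(sum(f * f for f in ENGLISH_FREQ.values()))
    return dot / (norm_msg * norm_ref)
```

A score near 1 means the letter mix resembles English. As RexE notes, the statistic is unreliable for very short messages, so any cutoff should only be applied once a message has a sentence or two of text.
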
  • You might try the Bayesian algorithm used by many spam filters.

    Better Bayesian Filtering

    Wikipedia explanation

    Some open Source

  • The preceding answers about hooking up some Bayesian-inspired spam-filter classifier are a good idea. For your application, since you seem to get a lot of long nonsense words, it would be best to turn on an option in your parser to train on bigrams and trigrams; otherwise, many of the nonsense words will just be treated as "never seen before", which is not the most useful parse in your case.
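
To illustrate why n-gram features help, here is a small naive Bayes sketch that reads "bigrams and trigrams" as character n-grams. That reading is an assumption, as are the class name `NgramNaiveBayes`, the labels "ham" and "junk", and the toy training messages; it is not the option any particular spam filter exposes.

```python
import math
from collections import Counter

def char_ngrams(text, sizes=(2, 3)):
    """All character bigrams and trigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]

class NgramNaiveBayes:
    """Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self):
        self.counts = {"ham": Counter(), "junk": Counter()}
        self.docs = Counter()

    def train(self, text, label):
        self.counts[label].update(char_ngrams(text))
        self.docs[label] += 1

    def classify(self, text):
        vocab = len(set(self.counts["ham"]) | set(self.counts["junk"]))
        scores = {}
        for label, counts in self.counts.items():
            total = sum(counts.values())
            # Start from the log prior, then add one log-likelihood per n-gram.
            score = math.log(self.docs[label] / sum(self.docs.values()))
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / (total + vocab))
            scores[label] = score
        return max(scores, key=scores.get)

classifier = NgramNaiveBayes()
for msg in ["this program is great thanks", "please add more features",
            "the feedback function works well"]:
    classifier.train(msg, "ham")
for msg in ["jfvgasdjkfahs kdlfjhasdf", "asdf qwer asdfjkl", "kjhsdf lkjasdf hjkl"]:
    classifier.train(msg, "junk")
```

A never-before-seen nonsense word still shares fragments such as "sdf" or "jkl" with earlier junk, so it is scored on those fragments instead of being lumped in with every other unknown token.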

  • Fidelis Assis and I have been adapting the spam filter OSBF-Lua so that it can easily be adapted to other applications including web applications. This spam filter won the TREC spam contest three years running. (I don't mind bragging because the algorithm is Fidelis's, not mine.)

    If you want to try things out, we have "nearly beta" code at

    git clone http://www.cs.tufts.edu/~nr/osbf-lua-temp
    

    We are still a long way from having a tidy release, but the code should build provided you install automake 1.9. Either of us would be happy to advise you on how to use it to clean your database and to integrate it into your application.

  • Yes, like people pointed out, you could look at spam filters or Markov Models.

    Something simpler would be to just count the different words in each response and sort by frequency. If words like the following are not at the top then it's probably not valid text:

    the, a, in, of, and, or, ...

    They are the most frequently used words in any usual English text.
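
A sketch of that check, using only the six words listed above. The function name `top_words_look_english` and the choice of looking at the three most frequent words are assumptions.

```python
import re
from collections import Counter

# The common words named in the answer; the list there continues beyond these.
COMMON_WORDS = {"the", "a", "in", "of", "and", "or"}

def top_words_look_english(message, top_n=3):
    """True if any of the message's most frequent words is a common English word."""
    words = re.findall(r"[a-z']+", message.lower())
    top = [word for word, _ in Counter(words).most_common(top_n)]
    return any(word in COMMON_WORDS for word in top)
```

For very short messages every word occurs once, so the "most frequent" words are simply the first few; the check is more meaningful on longer feedback.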

  • Just store comments in a pending state, pass them through Akismet or Defensio, and use the response to mark them as potential spam or mark them active.

    http://akismet.com/

    http://defensio.com/

    I personally prefer Defensio's API but they both work fantastically well.
