Why is SPAM so hard to detect?


Whether you opt for the nerdy designation of SPAM, prefer politically correct euphemisms such as unwanted messages, favor Apple's designation of junk mail or even, as we crazy Frenchmen do, nickname them pourriels, you have most certainly had the joy of dealing with unwanted pieces of data muddying the stream of information you swim in online.

It used to be that SPAM was a mostly pornographic problem: unless one was actually looking for "hardcore t**n action," one was unlikely to see such garbage popping up in search results. Yet today, as I was looking up information for a client on a Paris meeting place, Google suggested I have a look at a video enticingly called "Huge Teen Orgasm TJMaxx Gift Certificate."

Well, of course, that was unlikely to answer my questions and, time being short, I confess I did not inquire further about the contents of the video. (If the preview thumbnail is anything to go by, it seems to be about bedroom decor or bed linens or something of the sort. Maybe someone has more information.)

What strikes me here is not that there is pornographic content on the Internet or even that pornography popped up in a search for business centers in Paris. That's all part of the daily enjoyment of modern technology.

What does strike me is that this title makes absolutely no sense. Certainly, there is a one-in-a-million chance that a desperate TJMaxx would offer discount coupons on something related to huge orgasms, but one would think that chance remote enough for the page in question to be penalized. That Google Video cannot actually look inside a video seems good and fair; that it is misled and tricked by a title, tags or even inbound links that do not fit the bill is also perfectly understandable. Google already does a great job of looking for tell-tale signs of SPAMmy content on the Internet, swiftly removing it or demoting it in search pages.

Unfortunately, for all our craftiness at analyzing page structure, code correctness and the date at which domains were registered, we still seem unable to teach our computers about proper grammar. That a video is called "I can has Macintosh" is slightly disturbing for the school system, but it does not necessarily imply SPAM. When a video is given a title as grammatically and semantically ludicrous as "Huge Teen Orgasm TJMaxx Gift Certificate," one would think a computer could notice.

Of course, we're only dealing with English here. Gauging and rating language correctness in order to fight SPAM is not only computationally intensive, it is also locale-specific, and it probably requires that different techniques be employed depending on the language.
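
To make the idea a little more concrete, here is a deliberately naive sketch of what a per-language check could look like. It only measures how many of a title's words are common function words (articles, pronouns, prepositions and the like), on the assumption that a title made purely of stacked keywords contains none. The word lists, the function names and the four-word minimum are all my own illustrative choices; this is certainly not how Google or Mail actually do it, and a real system would need far more than this.

```python
# Toy heuristic: share of function words in a title, per locale.
# The word lists below are tiny, hand-picked assumptions for illustration only.
FUNCTION_WORDS = {
    "en": {"a", "an", "the", "i", "you", "it", "is", "are", "has", "have",
           "can", "to", "of", "in", "on", "for", "and", "with", "my", "your"},
    "fr": {"le", "la", "les", "un", "une", "des", "je", "tu", "il", "est",
           "a", "de", "du", "en", "sur", "pour", "et", "avec", "mon", "ton"},
}


def function_word_ratio(title: str, locale: str = "en") -> float:
    """Return the fraction of words in `title` that are known function words."""
    # Lowercase each word and trim surrounding punctuation.
    words = [w.strip(".,!?\"'").lower() for w in title.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    known = FUNCTION_WORDS[locale]
    return sum(w in known for w in words) / len(words)


def looks_like_keyword_stuffing(title: str, locale: str = "en",
                                min_words: int = 4) -> bool:
    """Flag titles of at least `min_words` words with no function word at all."""
    long_enough = len(title.split()) >= min_words
    return long_enough and function_word_ratio(title, locale) == 0.0


if __name__ == "__main__":
    for title in ("Huge Teen Orgasm TJMaxx Gift Certificate",
                  "I can has Macintosh"):
        print(title, "->", looks_like_keyword_stuffing(title))
```

With these assumed word lists, the keyword pile is flagged while "I can has Macintosh," bad grammar and all, is not, since three of its four words are function words. The need for a separate list per locale is exactly the cost mentioned above, and plenty of perfectly legitimate titles (a bare product name, say) would trip a check this crude.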

There is also the possibility that SPAMmers will start writing proper English. Fortunately for us, however, writing original content with proper syntax and spelling is verry touf. Requiring that something be moderately well written, just like we require that web pages adhere to standards, could raise the bar to a frustrating extent and rule out a lot of basic robot activity.

In the prehistoric times of Mac OS X, the positively brilliant Kim Silverman and his team, whom I once had the pleasure of interviewing for O'Reilly, were working on analyzing language patterns to detect SPAM, and this is the very technology that powers Mail's junk detection features. I am sure Google does some of that to their search results, and we all know how much Amazon has invested in such research to categorize books and interests.

When, however, will the breakthrough come? When will my computer understand that "Huge Teen Orgasm TJMaxx Gift Certificate" cannot possibly, no matter how much of a shopping-addicted undercover pervert I potentially am, hold any interest?

Read More Entries by FJ de Kermadec.

1 Comment

Peter Jaros said:

With apologies for being pedantic, SPAM is quite easy to detect. Spam, on the other hand, is trickier.

http://www.spam.com/legal/spam/
