Wednesday, October 10, 2007

Has CAPTCHA been captured?

CNet is reporting that spammers have hijacked YouTube's email-a-friend feature to send out phishing emails. The thing I found most notable in the article is the reference to software that automatically decodes and defeats CAPTCHA. This is a new and troubling development for anyone who uses viral marketing or web forms.

CAPTCHA is a system that requires a web visitor to retype a set of letters into a form field in order to submit the form. Those letters are presented in a somewhat distorted or garbled image. The idea is to prevent automated scripts from being able to hijack the form. From an online branding perspective, the email-a-friend forms, blog comment forms and contact forms that CAPTCHA protects are critical to developing an effective online brand conversation. If these systems cannot be adequately protected, we may lose some of our most effective tools for online branding and marketing.

If CAPTCHA is indeed in danger, my question to ponder is: "what will take its place?" What type of system can simultaneously be easy to use for website visitors, while being difficult or (improbably) impossible for the hackers and spammers to get around?

Leave a comment. I think this is an important discussion to have now, before our online forms become as compromised as our junk mail folders already are!

4 comments:

Anonymous said...

CAPTCHAs have always been an arms race. The New York Times wrote about this in June, and there are a number of discussions out there about alternate test methods. It seems like the most secure method is to write your own questions, since automated cracking methods are designed to deal with the types of questions they see most often. Avoiding the monoculture avoids malware just as much here as with computer viruses.

JA said...

Non-text CAPTCHA images are an area where the CAPTCHA test is still strong. Rather than skewing text, which is susceptible to OCR techniques, this type of test displays "real" images. While image recognition is not new, it requires much more sophisticated algorithms to break.

I have built an ASP.NET component called HTMLCaptcha that was designed on this very premise.

www.htmlcaptcha.com

It outputs small icon sized images of the user's choosing. It also employs some neat tricks, primarily that it converts the images to HTML so the image is that much more difficult to separate from the textual code.

This type of CAPTCHA has many more miles of travel left, whereas with skewed text CAPTCHAs, I'm not so sure.

Jeff Greenhouse said...

The HTMLCaptcha approach is interesting. I agree that non-text-based CAPTCHA has a lot of potential. At the same time, this approach takes up a lot of extra space, especially if it is going to be reasonably safe from brute-force approaches.

To explore the topic of brute force by the numbers, the website suggests that the possible solutions are:

(number of images in database) to the power of (number of selections to be made by web user)

In fact, the real formula is:

(number of options presented for each selection) to the power of (number of selections to be made by web user).

In the example on the HTMLCaptcha website, they use two selections with 3 options each. This means that the probability of randomly guessing the right combination is 1 out of 3^2, or one in nine. A brute-force attack that just picks the first option of each set would have a decent chance of getting it right after only 10 or 15 attempts (if not faster). Adding a third image and more options per image would certainly make this stronger, but it would also add extra steps for the end-user.
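To put rough numbers on "10 or 15 attempts", here is a minimal Python sketch. It assumes every attempt gets a freshly randomized challenge and that each blind guess is independent, which may not match how a real deployment behaves; the function names are mine, not part of HTMLCaptcha.

```python
def guess_space(options_per_selection: int, selections: int) -> int:
    """Number of possible answers: options ** selections."""
    return options_per_selection ** selections


def success_within(attempts: int, space: int) -> float:
    """Chance that at least one of `attempts` independent blind guesses
    is correct, assuming a 1-in-`space` chance per attempt."""
    return 1.0 - (1.0 - 1.0 / space) ** attempts


# Two selections with three options each, as in the example above.
space = guess_space(3, 2)
print(f"possible answers: {space}")

for attempts in (10, 15):
    chance = success_within(attempts, space)
    print(f"{attempts} attempts: {chance:.0%} chance of at least one hit")
```

Under those assumptions, a bot gets through roughly two times out of three within 10 attempts, and more than four times out of five within 15.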

If we compare this to the number of possible solutions in a 5-character text CAPTCHA (with upper, lower and numbers), we might be looking at 62^5 possible options each time the CAPTCHA is loaded.
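For a sense of scale, the sketch below counts the possible 5-character strings, assuming each character is picked uniformly and independently. It shows both a letters-only alphabet (52 symbols) and letters plus digits (62 symbols), next to the two-selection, three-option image example. The helper name is hypothetical, and the count only describes blind guessing; it says nothing about OCR-based attacks.

```python
import string


def combinations(alphabet: str, length: int) -> int:
    """Number of distinct strings of `length` characters from `alphabet`."""
    return len(alphabet) ** length


letters_only = combinations(string.ascii_letters, 5)                    # 52 symbols
letters_and_digits = combinations(string.ascii_letters + string.digits, 5)  # 62 symbols
image_example = 3 ** 2  # two selections, three options each

print(f"letters only:       {letters_only:,}")
print(f"letters and digits: {letters_and_digits:,}")
print(f"image example:      {image_example:,}")
```

Either way, the text CAPTCHA has hundreds of millions of possible answers against nine for the image example, so random guessing is not the realistic threat to text CAPTCHAs.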

I agree with Jennifer that variability (including text, images, logic, etc) in CAPTCHA strategies is critical to our end of the arms race.

JA said...

In fact, the real formula is:

(number of options presented for each selection) to the power of (number of selections to be made by web user).

... means that the probability of randomly guessing the right combination is 1 out of 3^2, or one in nine.


Actually, this is correct only if the spamming software is capable of discerning the descriptions as descriptions. For that, the spammer would have to study the HTML of the site displaying the CAPTCHAs and code software to identify the selections. (Not an economically viable option, since with a library like HTMLCaptcha, it would be easy for the attacked site to modify their CAPTCHA; also, the unresolved CAPTCHA problem is one of creating an A.I. smart enough to solve the test universally.) So this falls under the "difficult to parse" scenario, which is compounded in the example by the inclusion of JavaScript output.

Writing a spambot that can universally parse through arbitrary HTML and separate the text that serves as "CAPTCHA descriptors" from other text is as hard as, or harder than, writing a bot that can apply OCR techniques to an image.

Also, it's important to understand that HTMLCaptcha is a library, and allows you to create CAPTCHA tests in many different ways. You could, in fact, quite easily create the "standard" concept of CAPTCHA with HTMLCaptcha by using images of fonts and randomly stringing them together. This would provide one improvement over existing techniques -- the CAPTCHA image would be embedded as HTML.

The HTMLCaptcha approach is as Jennifer brought up: CAPTCHAs can and will be broken as A.I. improves. You need a tool that will allow you to adapt.