A new method for digitizing books could help to win the war against spammers.
As it always seems, it took some really smart, brainy-type folks at Carnegie Mellon University to figure out how to foil spammers while also bringing a collection of digitized books and other printed material to the interweb. Unfortunately, it will take a little time, and it will need us humans to do a few things to help the computers. Me, work for a computer? That's unheard of!
A software system called reCAPTCHA will use the eyeballs (and connected brains) of thousands of web surfers to help computers identify text better. The original CAPTCHA system was developed by the university for Yahoo in the hopes of preventing drone computers from creating bogus e-mail accounts. In fact, we're all familiar with the basic system of distorted text that we (the lowly humans) must correctly identify in order to access a particular web page and/or service.

The reCAPTCHA system works along the same lines, but with a twist. It presents the user with two warped, distorted, obscured words. The first word is the CAPTCHA, a word that the computer already knows. The second word is one that has stumped some optical character recognition (OCR) software. The human (that's you) must identify both words. Getting the first word right convinces the system that you are a human being and not a net bot. The system has no answer key for the second word, so it checks your result against that of other humans. If 99 percent of all results for that second word are the same, then it implies that humans have correctly identified it.
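The two-word check described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the `MIN_VOTES` floor, and the case-insensitive comparison are all assumptions made for the example. Only the 99 percent agreement figure comes from the article.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.99  # the agreement figure quoted in the article
MIN_VOTES = 5               # hypothetical floor so one answer can't decide a word


def is_human(control_answer, known_word):
    """The first word has a known answer, so it can be checked directly."""
    return control_answer.strip().lower() == known_word.lower()


def submit(known_word, control_answer, unknown_answer, votes):
    """Record a guess for the unknown word only if the known word was right."""
    if not is_human(control_answer, known_word):
        return False
    votes.append(unknown_answer.strip().lower())
    return True


def consensus(votes, threshold=AGREEMENT_THRESHOLD, min_votes=MIN_VOTES):
    """Return the agreed-upon word, or None if humans don't agree (yet)."""
    if len(votes) < min_votes:
        return None
    word, count = Counter(votes).most_common(1)[0]
    return word if count / len(votes) >= threshold else None
```

A guess for the second word only counts when the first word was typed correctly, which is what keeps bots from stuffing the vote.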
"It is estimated that 60 million or more CAPTCHAs are solved each day, with each test taking about 10 seconds," director of the Internet Archive, Brewster Kahle, said in a statement. "That's more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with reCAPTCHAs."
As for beating spammers at their own game, "Many sites display e-mails like bmaurer [at] foo [dot] com or use hacks with tables, javascript or encodings to get the same effect," blogged Ben Maurer, an undergraduate student on the project. "Spammers are getting smarter and figuring out these tricks."
But the new system could disguise a person's e-mail address in the form of 'A?????@b???.com'. By clicking on the question marks, the user would be directed to a reCAPTCHA. If they can correctly identify both words, then they would be granted access to the previously obscured e-mail address. It's a very effective method of foiling bots. Certainly, a spammer isn't going to surf the web all day filling in CAPTCHAs just to get a few e-mail addresses. That just wouldn't be cost effective on their part.
It can also be used to help digitize text from books, magazines and the like. The system can rely upon the OCR abilities of a given computer to identify the majority of the text. But when it encounters a word and/or character it can't identify, it asks humans to help. By having humans essentially vote on what the word is, it allows the OCR software to be better taught how to identify certain strings of text and characters. Just imagine: the next time you fill in a CAPTCHA on some web site, it could be a reCAPTCHA. That second word could be some portion of text from a book that no OCR system can identify. We humans fill in the blank, and not only do we foil some spammer, but we also help to bring more digitized, searchable text to the Internet.
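A rough sketch of that kind of masking is below. The article only shows the 'A?????@b???.com' pattern, so the exact rule used here (keep the first character of the name and of the domain, keep the ending) is a guess, and `mask_email` and `reveal_email` are hypothetical names.

```python
def mask_email(address: str) -> str:
    """Hide all but the first character of the name and of the domain."""
    local, _, domain = address.partition("@")
    name, dot, tld = domain.rpartition(".")
    if not local or not name or not tld:
        raise ValueError(f"not a simple e-mail address: {address!r}")
    hidden_local = local[0] + "?" * (len(local) - 1)
    hidden_name = name[0] + "?" * (len(name) - 1)
    return f"{hidden_local}@{hidden_name}{dot}{tld}"


def reveal_email(address: str, captcha_passed: bool) -> str:
    """Hand over the real address only after the reCAPTCHA was solved."""
    return address if captcha_passed else mask_email(address)


print(mask_email("bmaurer@foo.com"))  # prints "b??????@f??.com"
```

In a real site the full address would have to stay on the server until the check passes. A mask applied in the page itself would still leave the address in the page source for a bot to read.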
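The hand-off between OCR and humans might look something like the sketch below. It assumes the OCR step marks each word it can't read as `None`, which is a simplification made for the example. Both function names and the "[?]" placeholder are made up.

```python
from typing import Optional


def words_needing_humans(ocr_words: list[Optional[str]]) -> list[int]:
    """Positions of the words the OCR software gave up on (marked None)."""
    return [i for i, word in enumerate(ocr_words) if word is None]


def merge_human_answers(ocr_words: list[Optional[str]],
                        answers: dict[int, str]) -> list[str]:
    """Fill the gaps with human-agreed words; leave "[?]" where none exists yet."""
    return [
        answers.get(i, "[?]") if word is None else word
        for i, word in enumerate(ocr_words)
    ]


page = ["the", None, "brown", None]
print(words_needing_humans(page))               # prints [1, 3]
print(merge_human_answers(page, {1: "quick"}))  # prints ['the', 'quick', 'brown', '[?]']
```

Each position in the first list is a word that could be shown to web surfers as the second word of a reCAPTCHA.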
"This is an example of why having open collections in the public domain is important," Kahle said. "People are working together to build a good, open system." The reCAPTCHA project is being run by Carnegie Mellon with the help of server hardware donated by Intel and Suse Linux Enterprise Server support subscriptions donated by Novell.
I'm confuzzled, this was written very tech-geeky. Can someone please put it into layman's terms?
the second word is the recaptcha, something that humans can read but OCR software can't.
this recaptcha may be a scanned word from a book or a newspaper article.
humans will identify the recaptcha and type it in. say 90% of all users type in the same word. the recaptcha system then knows the word because humans told it what it is. the system can then tell what that word from the aforementioned book or newspaper scan is.
the system has done two things. it's foiled OCR bots trying to harvest e-mail addresses for spam mailers, and it's helped to identify a word from some printed text. that text can then be more easily digitized, with very little human work, and placed on the internet.