Neoseeker : News : Digitized books and spam actually have something in common.

Digitized books and spam actually have something in common.
J. Micah Grunert - Friday, May 25th, 2007 | 3:37PM (PT)


A new method for digitizing books could help to win the war against spammers.

As always seems to be the case, it took some really smart, brainy folks at Carnegie Mellon University to figure out how to foil spammers while also bringing a collection of digitized books and other print to the interweb. Unfortunately, it will take a little time, and it will need us humans to do a few things to help computers. Me, work for a computer? That's unheard of!

A software system called reCAPTCHA will use the eyeballs (and connected brains) of thousands of web surfers to help computers identify text better. The original CAPTCHA system was developed by the university for Yahoo in the hope of preventing drone computers from creating bogus e-mail accounts. We're all familiar with the basic system of distorted text that we (the lowly humans) must correctly identify in order to access a particular web page or service.

But the reCAPTCHA system works somewhat differently, though along the same lines. The system presents the user with two warped, distorted, obscured words. The first word is the CAPTCHA, a word that the computer knows. The second word is one that has stumped optical character recognition (OCR) software. The human (that's you) must identify both words. The first word lets the system conclude that you are a human being and not a net bot. The system cannot check the second word directly, so it compares your answer against those of other humans. If 99 percent of all answers for that second word are the same, the system takes it that humans have correctly identified it.
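The two-word check described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names, the minimum vote count, and the in-memory vote counter are assumptions made for the example. Only the two-word structure and the 99 percent agreement figure come from the description above.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.99  # figure quoted in the article
MIN_VOTES = 5               # assumption: wait for several answers before deciding


def normalize(word):
    """Compare answers case-insensitively, ignoring surrounding spaces."""
    return word.strip().lower()


def check_response(control_answer, control_truth, unknown_answer, votes):
    """Return True if the user passes as human.

    Only the control word decides pass or fail. The answer for the
    unknown word is recorded as a vote when the control word was right.
    """
    if normalize(control_answer) != normalize(control_truth):
        return False
    votes[normalize(unknown_answer)] += 1
    return True


def resolved_word(votes):
    """Return the agreed word, or None if humans have not converged yet."""
    total = sum(votes.values())
    if total < MIN_VOTES:
        return None
    word, count = votes.most_common(1)[0]
    return word if count / total >= AGREEMENT_THRESHOLD else None
```

Counting a vote only after the control word has been answered correctly is a design choice in this sketch: it keeps bots that fail the known word from polluting the tally for the unknown one.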

"It is estimated that 60 million or more CAPTCHAs are solved each day, with each test taking about 10 seconds," director of the Internet Archive, Brewster Kahle, said in a statement. "That's more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with reCAPTCHAs."

As for beating spammers at their own game, "Many sites display e-mails like bmaurer [at] foo [dot] com or use hacks with tables, javascript or encodings to get the same effect," blogged Ben Maurer, an undergraduate student on the project. "Spammers are getting smarter and figuring out these tricks."

But the new system could disguise a person's e-mail address in the form of 'A?????@b???.com'. By clicking on the question marks, the user would be directed to a reCAPTCHA. If they correctly identify both words, they are granted access to the previously obscured e-mail address. It's a very effective method of foiling bots. Certainly, a spammer isn't going to surf the web all day filling in CAPTCHAs just to get a few e-mail addresses; that just wouldn't be cost-effective for them.
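As a rough illustration of the disguised form, the masking step might look like the Python sketch below. The function name and the exact masking rule (keep the first character of the user name and of the domain name, plus the top-level domain) are assumptions based on the 'A?????@b???.com' example, not the project's published behavior.

```python
def mask_email(address):
    """Return a disguised form of a simple e-mail address.

    Keeps the first character of the user name and of the domain name,
    keeps the top-level domain, and replaces everything else with '?'.
    """
    user, _, domain = address.partition("@")
    name, _, tld = domain.rpartition(".")
    if not user or not name or not tld:
        raise ValueError(f"not a simple e-mail address: {address!r}")
    masked_user = user[0] + "?" * (len(user) - 1)
    masked_name = name[0] + "?" * (len(name) - 1)
    return f"{masked_user}@{masked_name}.{tld}"


print(mask_email("bmaurer@foo.com"))  # prints b??????@f??.com
```

In a real deployment the full address would stay on the server and be sent to the browser only after the challenge is solved; masking the text on the page alone would not hide it from a bot reading the page source.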

But it can also be used to help digitize text from books, magazines and the like. The system can rely on the OCR abilities of a given computer to identify the majority of the text. When it encounters a word or character it can't identify, it asks humans to help. By having humans essentially vote on what the word is, it allows the OCR software to be taught to better identify certain strings of text and characters. Just imagine: the next time you fill in a CAPTCHA on some web site, it could be a reCAPTCHA. That second word could be some portion of text from a book that no OCR system can identify. We humans fill in the blank, and not only do we foil some spammer, but we also help to bring more digitized, searchable text to the Internet.
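The division of labor between OCR and humans could be sketched like this in Python. Everything here is hypothetical: the article does not say how the software decides that a word has stumped it, so the confidence scores, the cutoff value, and the function names are assumptions for illustration only.

```python
CONFIDENCE_CUTOFF = 0.80  # assumption: below this, the OCR result is not trusted


def triage(ocr_words, cutoff=CONFIDENCE_CUTOFF):
    """Split OCR output into a draft transcript and a queue for human review.

    ocr_words is a list of (text, confidence) pairs in reading order.
    Words the OCR is unsure about are left as None in the draft, and
    their positions are queued so a human answer can be filled in later.
    """
    draft, human_queue = [], []
    for position, (text, confidence) in enumerate(ocr_words):
        if confidence >= cutoff:
            draft.append(text)
        else:
            draft.append(None)
            human_queue.append(position)
    return draft, human_queue


def fill_in(draft, position, agreed_word):
    """Store the word that human voters agreed on at its original position."""
    draft[position] = agreed_word
    return draft
```

For example, `triage([("the", 0.98), ("qu1ck", 0.41), ("fox", 0.93)])` leaves a gap at position 1, which `fill_in` closes once enough people have agreed on the word.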

"This is an example of why having open collections in the public domain is important," Kahle said. "People are working together to build a good, open system." The reCAPTCHA project is being run by Carnegie Mellon with the help of server hardware donated by Intel and Suse Linux Enterprise Server support subscriptions donated by Novell.


Comments:

  • 0 thumbs!
    Mr Gray | May 25, 07 | quote
    I can dig it. To an extent. Who's checking those relays? Never know what agency could twist it into their favor. :\


    (Conspiracy theorist^)
  • 0 thumbs!
    jmicahg | May 25, 07 | quote
    Yeah, but if it's run by a university, you can probably bet that it's on the up and up. They want to keep their good name, as does Yahoo and the other tech/web companies involved. If someone would use the system to take advantage, then they'd probably get shut down pretty quick.
  • 0 thumbs!
    SillyPuddee | May 28, 07 | quote
    There is a problem with this though. What if some unintelligent people purposely fail the CAPTCHAs?
  • 0 thumbs!
    jmicahg | May 28, 07 | quote
    Law of averages. The majority of people will answer correctly. Therefore, the system will, through that higher average, yield a positive result. Besides, there may always be a couple of humans checking up on the machine, just to make sure all the results are on the up and up.
  • 0 thumbs!
    Sungod Okami | May 30, 07 | quote
    So this is basically the exact same thing as the old CAPTCHA system, except it can identify new words... For what?

    I'm confuzzled, this was written very tech-geeky. Can someone please put it into layman's terms?
  • 0 thumbs!
    jmicahg | May 30, 07 | quote
    The first captcha is a word that humans and ocr software can read.

    the second word is the recaptcha, something that humans can read but OCR software can't.

    this recaptcha may be a scanned word from a book or a newspaper article.

    humans will identify the recaptcha and type it in. say 90% of all users type in the same word. the recaptcha system then knows the word because humans told it what it is. the system can then say what the word from the aforementioned book or newspaper scan is.

    the system has done two things. it's foiled OCR bots trying to harvest e-mail addresses for spam mailers, and it's helped to identify a word from some printed text. that text can then be more easily digitized, with very little human work, and placed on the internet.
- This news story is archived and is closed to new comments now -
