TECH.BLORGE.com

May 28, 2007

Web users unwittingly aid in digitization of books

By George Gardner





You’ve all seen those “are you human” boxes at the bottom of web sites, usually in registration sections, that are meant to differentiate between a computer and a human. The boxes, filled with an image of text, require the user to type the text shown in the image to prove he or she is human. These boxes, known as CAPTCHAs, have found a genius new use through the clever thinking of Carnegie Mellon University computer scientist Luis von Ahn.

CAPTCHAs (the name is an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart) can be found at the bottom of registration forms on sites such as PayPal, Wikipedia, and Yahoo, and on just about any other site that needs to keep automated programs out.

Working with a team, including computer science professor Manuel Blum, undergraduate student Ben Maurer and research programmer Mike Crawford, Von Ahn invented a new version of the tests, called reCAPTCHAs (shown in illustration).

Instead of randomly generated text, reCAPTCHAs show words taken from scanned images of old books, newspapers, and other printed materials. As hundreds of thousands of users type in this text every day to answer the “are you human” test, they will turn it into online digital text.

“It is estimated that 60 million or more CAPTCHAs are solved each day, with each test taking about 10 seconds,” Von Ahn said. “That’s more than 150,000 precious hours of human work that are lost each day, but that we can put to good use with reCAPTCHAs.”    
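The arithmetic behind that estimate is easy to check. The short Python sketch below uses only the two figures from the quote; the variable names are ours, chosen for illustration.

```python
# Figures quoted in the article (lower bounds: "60 million or more")
captchas_per_day = 60_000_000   # CAPTCHAs solved each day
seconds_per_captcha = 10        # roughly 10 seconds per test

# Total human time spent per day, converted from seconds to hours
total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600

print(f"{total_hours:,.0f} hours per day")  # prints "166,667 hours per day"
```

That works out to roughly 166,667 hours a day, which is consistent with the “more than 150,000” hours in the quote.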

The optical character recognition (OCR) software that normally deciphers scanned text usually fails when it comes across fuzzy or underlined text, scribbled handwriting, or poor printing. The new process, using reCAPTCHAs, will rely on the collective power of the Internet to judge exactly what the text says, thereby making it searchable by computers and available for new uses that require a digital format.

“I think it’s a brilliant idea — using the Internet to correct OCR mistakes,” said Brewster Kahle, director of the Internet Archive. ReCAPTCHAs will speed the digitization process while also helping to improve OCR methods and perhaps extend them to additional languages, he said. “This is an example of why having open collections in the public domain is important,” he added. “People are working together to build a good, open system.”  

Intel Corporation is helping to spread reCAPTCHAs by developing a web-based service that lets webmasters easily install them on their sites.

Because the computer does not know what the scanned text actually says, the reCAPTCHA system needs a way to make certain that people are deciphering it correctly. It will therefore have website visitors type in two words, one of which the system already knows.

Each unknown word will be shown to multiple visitors, and their answers compared, to make sure the transcription is correct.
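The two-word check and the comparison across visitors can be sketched in a few lines of Python. This is only an illustration of the idea as the article describes it, not the actual reCAPTCHA code: the function names, the sample words, and the thresholds (`min_votes`, `min_share`) are all assumptions made for the example, since the article does not say how many visitors must agree.

```python
from collections import Counter


def passes_control(known_word, typed_word):
    """A visitor counts as human if they typed the known word correctly."""
    return typed_word.strip().lower() == known_word.strip().lower()


def consensus(votes, min_votes=3, min_share=0.7):
    """Return the agreed transcription, or None if there is no agreement yet.

    min_votes and min_share are illustrative thresholds, not values
    taken from the article.
    """
    total = sum(votes.values())
    if total < min_votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count / total >= min_share else None


# Hypothetical sample data: each visitor types (known word, unknown word).
known_word = "upon"
submissions = [
    ("upon", "mornlng"),
    ("upon", "morning"),
    ("upan", "moming"),   # fails the known word, so this guess is ignored
    ("upon", "morning"),
    ("upon", "morning"),
]

votes = Counter()
for typed_known, typed_unknown in submissions:
    if passes_control(known_word, typed_known):
        votes[typed_unknown.strip().lower()] += 1

result = consensus(votes)
print(result)  # prints "morning"
```

Only guesses from visitors who got the known word right are counted, so a program typing random text cannot influence the transcription of the unknown word.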

An audio version of reCAPTCHA, which will transcribe portions of radio programs that have defied speech recognition programs, will also be available for blind Web users.  
