Google recently acquired reCAPTCHA, one of the leading CAPTCHA providers on the Internet. The key strength of reCAPTCHA's implementation of the CAPTCHA test is that it pairs a word already known to the server with an unknown word that an OCR scan has failed to recognise. This allows reCAPTCHA to crowdsource the digitisation of scanned books, such as those in the Google Books project, as Google outlines in its blog post on the acquisition:
“This technology also powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we’ll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process.”
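The known-word and unknown-word pairing described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not reCAPTCHA's actual code: the class name `PairedWordChallenge`, the `agreement_threshold` parameter and the word identifiers are all hypothetical, and the real service's rules for accepting a transcription are not described in this post.

```python
from collections import Counter, defaultdict


class PairedWordChallenge:
    """Toy model of a two-word CAPTCHA: one control word, one unknown word."""

    def __init__(self, agreement_threshold=3):
        # Assumption: a transcription is accepted once this many users
        # who passed the control word have typed the same answer.
        self.agreement_threshold = agreement_threshold
        self.votes = defaultdict(Counter)  # unknown word id -> answer counts
        self.resolved = {}                 # unknown word id -> accepted text

    def check(self, control_word, unknown_id, control_answer, unknown_answer):
        """Return True if the user passes; record their guess for the unknown word."""
        # The user is judged only on the word the server already knows.
        if control_answer.strip().lower() != control_word.lower():
            return False

        # A user who got the control word right is trusted to vote
        # on the word the OCR scan could not read.
        guess = unknown_answer.strip().lower()
        if guess and unknown_id not in self.resolved:
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= self.agreement_threshold:
                self.resolved[unknown_id] = guess
        return True


challenge = PairedWordChallenge(agreement_threshold=2)
challenge.check("morning", "scan-17", "morning", "overlooks")
challenge.check("morning", "scan-17", "mourning", "overlocks")  # fails control word, vote ignored
challenge.check("morning", "scan-17", "Morning", "Overlooks")
print(challenge.resolved)  # {'scan-17': 'overlooks'}
```

The point of the design is that the server never needs to know the unknown word in advance: the control word filters out bots and careless users, and agreement between the remaining answers does the transcription work.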
The reasons Google states publicly for the acquisition seem obvious, and perhaps they are a little too obvious, hiding the real reason behind the deal. reCAPTCHA holds no specific patents for the technology behind its text CAPTCHA algorithms (at least none that it discusses on its website or that can be found on the US Patent and Trademark Office site), and given that reCAPTCHA operates mostly on open source software, the case for Google buying the company gets thinner.
Given that Google could easily code its own reCAPTCHA equivalent, this business deal goes beyond the obvious. Google already has its own ‘CAPTCHA Killer’ that operates using images and video, and surely it could roll that out if it were really serious about security, so the technology involved with reCAPTCHA is not compelling from this perspective. reCAPTCHA’s Prof. von Ahn has already licensed his ESP Game image labelling program to Google (now known as Google Image Labeler), so buying reCAPTCHA might have been an attempt to grab this technology as a bundle with the company’s other assets.
Certainly, using reCAPTCHA to digitise and make searchable the vast collection of books in the Google Books archive project is valuable to Google; however, given the above, it could have designed a similar system itself. What is really key to this purchase is the existing distribution network of websites that already use reCAPTCHA, along with the API that allows new sites and new uses of the reCAPTCHA system. Given the 100,000 sites and 30 million CAPTCHAs served daily by reCAPTCHA, Google's decision to buy the company is about the text processing volume it gains immediately, not the technology behind it.