Friday, August 06, 2010

Inside Google Books: Books of the world, stand up and be counted! All 129,864,880 of you.

Google takes a stab at counting all the books in the world:
Our definition is very close to what ISBNs (International Standard Book Numbers) are supposed to represent, so why can’t we just count those? First, ISBNs (and their SBN precursors) have been around only since the mid-1960s, and were not widely adopted until the early-to-mid 1970s. They also remain a mostly Western phenomenon. So most books printed earlier, and those not intended for commercial distribution or printed in other regions of the world, have never been assigned an ISBN.

The other reason we can’t rely on ISBNs alone is that ever since they became an accepted standard, they have been used in non-standard ways. They have sometimes been assigned to multiple books: we’ve seen anywhere from two to 1,500 books assigned the same ISBN. They are also often assigned to things other than books. Even though they are intended to represent “books and book-like products,” unique ISBNs have been assigned to anything from CDs to bookmarks to t-shirts.
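To make the "one ISBN, many books" problem concrete, here is a minimal Python sketch of how such collisions could be spotted in a list of catalog records. It is only an illustration: the function names, the record layout (ISBN and title pairs), and the sample data are all made up, and nothing here is taken from Google's actual pipeline.

```python
from collections import defaultdict


def normalize_isbn(raw: str) -> str:
    """Strip hyphens and spaces and upper-case the ISBN-10 'X' check character."""
    return raw.replace("-", "").replace(" ", "").upper()


def find_shared_isbns(records):
    """Return {isbn: sorted titles} for ISBNs attached to more than one distinct title.

    `records` is an iterable of (isbn, title) pairs; this layout is an
    assumption made for the example, not a real catalog format.
    """
    titles_by_isbn = defaultdict(set)
    for isbn, title in records:
        # Compare titles case-insensitively so trivial variants do not count as collisions.
        titles_by_isbn[normalize_isbn(isbn)].add(title.strip().lower())
    return {
        isbn: sorted(titles)
        for isbn, titles in titles_by_isbn.items()
        if len(titles) > 1
    }


# Made-up records: the ISBNs and titles are placeholders, not real books.
records = [
    ("0-00-000000-0", "First Example Title"),
    ("0000000000", "Second Example Title"),
    ("1-11-111111-1", "Third Example Title"),
    ("1 11 111111 1", "third example title"),
]

shared = find_shared_isbns(records)
print(shared)
```

Only the first placeholder ISBN is reported, because the two records carrying the second one differ only in capitalization and spacing. A real catalog would need far more careful title matching than a lower-cased string comparison.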

What about other well-known identifiers, for example those assigned by the Library of Congress (Library of Congress Control Numbers) or OCLC (WorldCat accession numbers)? Rather than identifying books, these identify records that describe bibliographic entities. For example, the bibliographic record for Lecture Notes in Mathematics (a monographic series with thousands of volumes) is assigned a single OCLC number. This makes sense when organizing library catalogs, but does not help us to count individual volumes. This practice also causes duplication: a particular book can be assigned one number when cataloged as part of a series or a set and another when cataloged alone. The duplication is further exacerbated by the difficulty of aggregating multiple library catalogs that use different cataloging rules. For example, a single Italian edition of “Angels and Demons” has been assigned no fewer than five OCLC numbers.

So what does Google do? We collect metadata from many providers (more than 150 and counting) that include libraries, WorldCat, national union catalogs and commercial providers. At the moment we have close to a billion unique raw records. We then further analyze these records to reduce the level of duplication within each provider, bringing us down to close to 600 million records.
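The post does not say how the within-provider deduplication is done, so the following Python sketch is only a naive stand-in: it treats two records from the same provider as duplicates when their normalized title, author, and year match exactly. The field names (`provider`, `title`, `author`, `year`), the `match_key` and `dedupe_within_provider` helpers, and the sample records are all hypothetical.

```python
import re
import unicodedata


def match_key(record):
    """Build a crude comparison key from title, author, and year (an assumed heuristic)."""

    def norm(text):
        # Drop accents, lower-case, and collapse punctuation and whitespace.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

    return (norm(record["title"]), norm(record["author"]), record["year"])


def dedupe_within_provider(records):
    """Keep the first record seen for each (provider, match key) pair."""
    seen = set()
    kept = []
    for record in records:
        key = (record["provider"], match_key(record))
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Made-up raw records; provider names and bibliographic details are placeholders.
raw = [
    {"provider": "library_a", "title": "An Example Title", "author": "Doe, Jane", "year": 1999},
    {"provider": "library_a", "title": "An example title.", "author": "DOE, JANE", "year": 1999},
    {"provider": "library_b", "title": "An Example Title", "author": "Doe, Jane", "year": 1999},
]

deduped = dedupe_within_provider(raw)
print(len(raw), "raw records ->", len(deduped), "after per-provider deduplication")
```

The copy from `library_b` is kept on purpose: this step only removes duplicates inside a single provider, as the paragraph above describes, and leaves cross-provider matching to a later stage.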
