Friday, September 17, 2010

Repost: 'Qualified Metadata' - What Does it All Mean?

Originally posted on 2/22/2007. I was speaking with someone this afternoon about this topic, and the conversation reminded me of this post.


Earlier this month I spoke about how data providers may be able to carve out a place for themselves as the single source of catalog information for particular industries. This data, representing 'base level' descriptive information (in the book world we call it bibliographic data), would be widely disseminated across the Internet to facilitate trade in products, materials and services, and would be provided by one data supplier. Other data suppliers - one layer up, if you will - would also make use of this base level information but would add value-added data elements that are particularly important to segments of the supply chain. The most obvious example in books is subject and categorization data, which aids in discovery of the item described. Another set of data elements could reflect more descriptive information about a publisher, over and above basic address and contact details. In this second post of my series, I take a look at the library environment.
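
To make the layering concrete, here is a minimal sketch in Python of how a base level record from the single supplier might be overlaid with value-added elements from a second-tier provider. The field names (isbn, subjects, audience_level and so on) are hypothetical placeholders, not any supplier's actual schema.

    # A minimal sketch of the two-tier model described above. Field names
    # (isbn, subjects, audience_level, etc.) are hypothetical placeholders,
    # not any supplier's actual schema.
    base_record = {
        # 'base level' bibliographic data from the single supplier
        "isbn": "9780000000000",
        "title": "An Example Title",
        "publisher": "Example House",
        "contact": {"city": "New York", "email": "info@example.com"},
    }

    value_added_layer = {
        # a second-tier supplier adds discovery-oriented elements
        "subjects": ["Library science", "Metadata"],
        "categories": ["Reference / Cataloging"],
        "audience_level": "undergraduate",
    }

    def merge_layers(base, *layers):
        """Overlay value-added layers on the base record without altering the base feed."""
        record = dict(base)
        for layer in layers:
            record.update(layer)
        return record

    enriched = merge_layers(base_record, value_added_layer)
    print(enriched["title"], "-", enriched["audience_level"])

The point of the sketch is simply that the base feed stays untouched and commoditized, while the differentiating elements live in the layers on top of it.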

In a recent article in D-Lib (January 2007), Karen Markey of the University of Michigan looks at how the library online catalog experience needs to change in order for users to receive more relevant and authoritative sources of information to support their research needs. She goes on to quote Deanna Marcum of the Library of Congress: "the detailed attention that we have paid to descriptive cataloguing may no longer be justified...retooled catalogers could give more time to authority control, subject analysis, [and] resource identification and evaluation." Markey proposes redesigning the library catalog to embrace three things:
  1. post-Boolean probabilistic searching to ensure precision in online catalogs that contain full text
  2. subject cataloguing that takes advantage of a user's ability to recognize what they do and don't want
  3. qualification cataloguing to enable users to customize retrieval based on level of understanding or expertise
New search technologies such as MarkLogic, FAST and the search tool behind WorldCat offer some of these capabilities but are generally not accessible to the average user. For example, some of these tools allow flexibility in the relative importance given to elements within a record; manipulating the weight given to Audience level in a WorldCat search would 'skew' the result set toward higher- or lower-comprehension titles depending on which is favored.
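
As a sketch of the general idea only - this is not WorldCat's actual scoring, and the numbers are invented - here is how flipping the weight on an audience-level signal skews a ranked result set one way or the other:

    # Illustrative field-weighted scoring: not any real catalog's algorithm,
    # just a sketch of how biasing one element (audience level) skews results.
    records = [
        {"title": "Metadata for Beginners",   "subject_match": 0.9, "audience_level": 0.2},
        {"title": "Advanced Subject Analysis", "subject_match": 0.8, "audience_level": 0.9},
    ]

    def score(record, weights):
        """Weighted sum of per-field relevance signals (all values invented)."""
        return sum(weights[field] * record[field] for field in weights)

    # Bias toward lower-comprehension titles, then flip the audience-level
    # weight to bias toward more advanced ones.
    for weights in ({"subject_match": 1.0, "audience_level": -1.0},
                    {"subject_match": 1.0, "audience_level": 1.0}):
        ranked = sorted(records, key=lambda r: score(r, weights), reverse=True)
        print([r["title"] for r in ranked])

Search engines generally expose this kind of control as field weighting or boosting; the sketch just makes the arithmetic visible.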

Perhaps the most compelling point Markey raises in her article in support of increased attention to "qualification metadata" is the 30-to-1 'rule':
The evidence pertains to the 30-to-1 ratios that characterize access to stores of information (Dolby and Resnikoff, 1971). With respect to books, titles and subject headings are 1/30 the length of a table of contents, tables of contents are 1/30 the length of a back-of-the-book index, and the back-of-the-book index is 1/30 the length of a text. Similar 30 to 1 ratios are reported for the journal article, card catalog, and college class. "The persistence of these ratios suggests that they represent the end result of a shaking down process, in which, through experience, people became most comfortable when access to information is staged in 30-to-1 ratios" (Bates, 2003, 27). Recognizing the implications of the 30-to-1 rule, Atherton (1978) demonstrated the usefulness of an online catalog that filled the two 30-to-1 gaps between subject headings and full-length texts with tables of contents and back-of-the-book indexes.
Once I read this, it was obvious to me that we may not have thought through the implications of projects such as Google Print for retrieval. These initiatives will result in huge (big, big, big) increases in the amount of stuff researchers and students will have to wade through to find items that are even remotely relevant to what they are looking for. In the case of students, unless appropriate tools and descriptive data are made available, we will only compound the 'it's good enough' mentality, and they will never see anything but Google Search as useful.
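
A quick back-of-the-envelope calculation makes the point. The starting figure (a ten-word title and subject-heading entry) is my own assumption; only the 30x steps come from the article:

    # Back-of-the-envelope arithmetic on the 30-to-1 ratios quoted above.
    # The 10-word starting point is my own assumption; only the 30x steps
    # come from the article.
    RATIO = 30
    stages = ["title/subject headings", "table of contents",
              "back-of-the-book index", "full text"]

    words = 10
    for stage in stages:
        print(f"{stage:>24}: ~{words:,} words")
        words *= RATIO

Each step down adds another factor of thirty, and those are exactly the gaps that qualification metadata is meant to bridge once full text floods the pool.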

Markey's article is worth a read if you are interested in this type of stuff, but I think her viewpoint is a starting point for any bibliographic agency or catalog operation defining its strategy for the next ten years. Most bibliographers understand that base level data is a commodity. The only value a provider can supply here is consistency and one-stop shopping, and the barriers to entry are lowered every day. I am of the view (see my first article on this subject) that the agency that can demonstrably deliver consistent data should do so as a loss leader in order to corner the market on base level data, and then generate a (closed) market for value-added and descriptive (qualification) metadata. There are indications that markets may be heading in this direction (Global Data Synchronization - which I will address next), with incumbent data providers reluctantly following.

Providing relevancy in search is a holy grail of sorts, and descriptive data is key to it. In the library environment, if the current level of resources were reallocated to building the deeper bibliographic information we need, then the traffic in and out of library catalogs would be tremendous. If no one steps in to provide this needed descriptive data, the continuing explosion of resources will be of little benefit, because no one will be directed to the most relevant material. Serendipity would rule. The data would also prove valuable and important to the search providers (Google, etc.), because they too want to provide relevance; having libraries and the library community execute on this task would be somewhat ironic given the current decline in use of the online library catalog.
