Preprints of the Metadiversity Conference Proceedings

  Session 5: The Metadata Challenge for Libraries

Metadata Challenges for Libraries

CARL LAGOZE, Digital Library Scientist, Cornell University

ABSTRACT

Metadata creation, in the form of cataloging, has been a fundamental task of traditional libraries. The movement from physical to digital artifacts brings some fundamental challenges to this metadata creation process. These challenges include new types of documents, few controls over the quality of content, an unprecedented increase in the quantity of content, and distributed management of content and services. In such an environment, traditional metadata methods are neither appropriate nor sufficient. In this short talk, I will briefly review these challenges and describe some work in progress on creating new metadata practices and standards.

I am going to make some observations about metadata and how metadata has worked in traditional library environments. Then I will look at some of the challenges that we face in creating metadata on the Internet. My apologies to the librarians here if I make observations that violate your beliefs about libraries. I want to clarify that I am not a librarian. I try to give accurate observations based on my experiences with libraries.

The Purpose of Metadata

Earlier in this meeting, metadata was defined as being structured data about data. In this talk, I am going to focus on metadata as something that imposes structure. In fact, I am going to present the view that, overall, the purpose of metadata is to impose some order in a disordered information universe.

What we generally end up with when we search is a lot of "stuff," as many people call it these days. We may call the "stuff" documents or digital objects, but most seem to be comfortable with calling it "stuff." So, we have a lot of stuff out there, and as we access that stuff and as we try to find that stuff and use that stuff, we are more comfortable thinking of it in ordered categories and approaching it from an ordered point of view, rather than as an amorphous glob of stuff. One of the major roles of metadata in libraries is to help us impose some order, to make us think that the stuff out there all has an author or creator (as we like to call it in the Dublin Core world) and a title and a subject. Even if that is a myth, it at least helps us to think about it that way.

Order-Making as a Library Function

Many people think that all libraries are big warehouses of books and maps and things like that. But in fact, one of the basic functions of libraries is order-making. One of the things libraries do best is impose this order, and they do it in two ways. They do this with spatial ordering through shelving rules (e.g., the Dewey Decimal system), and they do this with a sort of semantic, logical ordering through cataloging rules (e.g., AACR2, MARC). These orders allow us to use libraries in lots of ways. We can enjoy walking up to the shelves and serendipitously browsing down the shelves or running into things in sort of familiar clumpings or going through the card catalog.

I have a rather elementary hypothesis that will be of no surprise to anyone: The ease of order-making is inversely proportional to the level of chaos. In fact, one of the reasons that libraries have been so successful–and this is not to demean their level of success or how hard the work actually is–is that the environment in which librarians work has a relatively low level of chaos. But as we step away from the physical artifacts and move beyond the walls, we are opening that chaos and making the chaos boundless. This makes the job very difficult.

Traditional Metadata-Creation Environments

Let's briefly look at the traditional cataloging metadata-creation environment within which libraries have worked.

Characteristics of the Environment

The metadata-creation environment has a number of characteristics. First is the notion of the stability of a physical artifact: I know that I have something and that I can describe it and it stays there and it doesn't disappear. It is something that has very stable characteristics.

The second characteristic of a traditional cataloging metadata-creation environment is that it has clearly established roles. Usually I know who is an author, who is a publisher, who is a consumer of the information, and all these roles sort of stay where they are–they don't shift.

Another characteristic is that there is a relatively small number of content producers. For example, the Cornell Library does not collect the works of everybody. Instead, it collects the works of an established set of content producers with which it has established quality and trusting relationships. This is very important to the way traditional libraries work. They know whom to trust, and therefore you can trust the libraries–you can assume when you go into the Cornell Library that you will find material that is accurate and acceptable.

Lastly, there is this notion of a defined "control zone." What is inside the walls is part of the library. What is outside the walls is, more or less, not part of the library. Of course, the walls are somewhat permeable, considering things like interlibrary loans, but they at least exist.

Effects on the Resulting Metadata

Together, these characteristics mean that library cataloging metadata is extremely high-quality. This is despite the fact that libraries are dealing with complex information. If you have ever actually read AACR2, for example, you know that it is not exactly the kind of thing you can go through in one pass. Cataloging requires an extremely high investment of time and money. I think that the general rule of thumb is $60 to $65 per record in original cataloging. That is a lot of money when you consider the acquisition rate of a place like the Cornell Library. Of course, there are a lot of shortcuts that go on. For example, there is copy cataloging, and some follow shortcut cataloging rules. But in the end, cataloging remains a very expensive process.

In addition, the metadata resulting from libraries is professionally produced. The people who work there are professionals. They go to school to learn to be catalogers, and they do good cataloging work.

Networked Information Is Different

Networked information is profoundly different. The kind of information that is appearing on the Web–the information space that we are creating now–exists in an extremely distinct environment.

Changing Relationships

One reason for this difference is that networked information changes and makes fluid the relationships among stakeholders. For example, Carl Lagoze is, at one point, an author. Then suddenly, he is a publisher. And just as suddenly, he is a user of information. He shifts back and forth. He does all these things, all over the place. In other words, we can't really define who these people are. In addition, there are all these peculiar information intermediaries, of the America Online genre.

So these things get very fluid. And because they are fluid, the roles of these stakeholders interact in extremely complex ways. We cross paths all the time, and we have all sorts of ways of talking to each other that really don't fit into the molds that traditional libraries have used for acquisitions and cataloging and so on.

Changing Content

In addition, content itself has changed. Those nice, packaged, physical artifacts with which we used to deal in the library are getting very fuzzy right before our eyes. One reason for this is that the amount of content with which we are dealing has vastly increased. The number of objects created is soaring, and those objects are pouring onto the Web at a phenomenal rate.

In addition, these objects are increasingly ephemeral. Things appear, they disappear, they change, they migrate to other places. There is just no stability in the environment.

The content also is changing with regard to quality. Because of the greatly reduced barriers to publication, we have extremely variable quality. And it is hard to differentiate the quality. On the Web, the works of Nobel prize-winners, for example, can sit next to a local first-grade class’s writing assignment. It is frightening to consider the number of public-school papers that are being written based on information taken from Websites. When a student brings in a paper talking about the Holocaust, for example, it may be based on material from sites most of us would not consider authorities on the subject. This is the kind of thing that should give us all pause.

New Content Forms

Most interesting from my research point of view is the whole notion of new content forms. I don't only mean images versus data versus this versus that–I mean the inter-mixing of these things in all sorts of odd ways, picking out piecemeal bits, putting them together, and calling them a document or a digital object or a piece of "stuff."

There also is the question of preservation: How do we preserve data, a component of which is a live-data feed from, for example, a meteorology satellite image? How do we solve the problems of preserving multi-media information? And what does it mean to preserve that data?

A Blurring of the Control Zones

We also have experienced a blurring of the control zones that used to be in place. This makes librarians really nervous. For example, they give access to a networked document X, which they say is part of their library. Well, that document has a link to another document, and that document goes here, there, there, and there, and finally the user arrives at, for example, a pornographic site. Who is responsible? Is the library responsible for providing a public service by allowing access to the original document? Are librarians responsible for cataloging the document? As you can see, the control zone gets very, very, very strange, due to distributed and ill-defined administration across individual objects and within individual objects.

Metadata for Networked Objects

As I said before, I like to think of objects as these packages of things that come from all over the place. But who is responsible for all the individual pieces? And without knowing that, how do we ensure the integrity of the pieces?

Let me lay out a couple of metadata goals for networked objects. The first is that there are so many ways that these digital objects are treated, accessed, searched, and administered that we need to accommodate in a metadata framework the multiple roles and responsibilities of the multiple stakeholders. We want to allow them to act in independent ways. For example, if I am a library cataloger, I don't want to have to complete the slot of the metadata record that deals with terms and conditions. Instead, I want to hand that off to my lawyer or my legal team or whoever has expertise in that area and say, "That is your domain. You administer that." I will take care of the thing I want to do. This is what Clifford Lynch and Ron Daniel and I wrote about in something called the "Warwick Framework." The Resource Description Framework (RDF) also addresses this issue.
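To make the hand-off idea concrete, here is a minimal sketch in Python, not the Warwick Framework specification itself. It borrows the framework's notions of a container that holds separate metadata packages, but the class names, the `administrator` field, and the permission check are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """One typed metadata set inside a container, with a single administrator."""
    kind: str            # e.g. "dublin-core" or "terms-and-conditions"
    administrator: str   # hypothetical: the party responsible for this package
    fields: dict = field(default_factory=dict)


@dataclass
class Container:
    """Groups the independently administered packages that describe one object."""
    object_id: str
    packages: dict = field(default_factory=dict)

    def add_package(self, package: Package) -> None:
        self.packages[package.kind] = package

    def update(self, kind: str, who: str, **changes) -> None:
        # Illustrative rule: only the package's own administrator may change it.
        package = self.packages[kind]
        if who != package.administrator:
            raise PermissionError(f"{who} does not administer {kind!r}")
        package.fields.update(changes)


report = Container("report-001")
report.add_package(Package("dublin-core", administrator="cataloger"))
report.add_package(Package("terms-and-conditions", administrator="legal-team"))

# Each party fills in only its own package.
report.update("dublin-core", "cataloger", title="Metadata Challenges for Libraries")
report.update("terms-and-conditions", "legal-team", access="campus-only")
```

The cataloger never touches the terms-and-conditions package, and an attempt to do so raises an error instead of silently overwriting the legal team's work.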

The other thing that we want very much to accommodate is layered solutions. It is very important to recognize that people want to operate across these multiple roles in layered ways. There are multiple forms of resource discovery, and there are multiple ways that we want to do resource discovery. Things like Dublin Core may be good for what we call basic, simple resource discovery, and there may be other, coexisting, resource-discovery metadata forms that are targeted for specific disciplines.

I am an extremely strong supporter of the Dublin Core. But we have seen efforts like the Dublin Core find themselves on a slippery slope as they say, "We can do this also, and we can do that also."

Other Issues to Consider

Let me close with some issues we have identified, given those goals on metadata for networked objects.

What to Trust?

First is the whole issue of quality and trust, which I brought up earlier: How do you know what metadata to trust? A lot of us make the assumption that the metadata is embedded in the object in sort of a traditional way–you have an HTML page, and the metadata is in it. But the way that we are thinking in the RDF and other, similar architectures is that metadata is somehow associated with the object through external means.

Which metadata do you know to trust at that point? There is an incredible amount of index "spamming" going on on the Web today. People load their HTML pages with all sorts of junk, hidden from the user, that makes their pages come up first in a search. As a result, as information becomes a consumer object and a consumer commodity, some people are led to do things that aren't always honest.
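One way to picture externally associated metadata and the trust question is the sketch below. It is an illustration under stated assumptions, not RDF: the `assertions` store, the function names, the source labels, and the example URL are all hypothetical. Each statement about a resource is kept outside the resource and records who made it, so a reader can filter by the sources they trust.

```python
from collections import defaultdict

# Hypothetical store of metadata statements kept outside the objects themselves.
# Each statement records which party made it.
assertions = defaultdict(list)


def assert_metadata(resource, source, element, value):
    """Record that `source` says `element` of `resource` is `value`."""
    assertions[resource].append({"source": source, "element": element, "value": value})


def trusted_metadata(resource, trusted_sources):
    """Return only the statements made by sources the reader chooses to trust."""
    return [a for a in assertions[resource] if a["source"] in trusted_sources]


page = "http://example.org/report.html"
assert_metadata(page, "university-library", "subject", "digital libraries")
assert_metadata(page, "page-author", "subject", "free prizes")  # possible index spam

# The reader decides whose descriptions count.
kept = trusted_metadata(page, trusted_sources={"university-library"})
```

The sketch does not settle how a reader decides whom to trust; it only shows that recording the source of each statement makes that decision possible.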

The Issue of Interoperability

Interoperability is another issue we must consider as we consider the issues of metadata. We must remember the dimensions both of semantic and of syntactic interoperability.
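The two dimensions can be separated in a small sketch. The local field names and the crosswalk below are hypothetical, and JSON stands in for whatever exchange syntax two systems agree on. The semantic step maps local fields onto shared Dublin Core elements; the syntactic step encodes the result in a form both sides can parse.

```python
import json

# Hypothetical crosswalk: local field name -> Dublin Core element.
CROSSWALK = {"author": "creator", "name": "title", "topic": "subject"}


def to_dublin_core(record):
    """Semantic step: rename local fields to the shared elements they mean."""
    return {CROSSWALK[k]: v for k, v in record.items() if k in CROSSWALK}


def serialize(record):
    """Syntactic step: encode the record in a format both sides can parse."""
    return json.dumps(record, sort_keys=True)


local = {"author": "Carl Lagoze", "name": "Metadata Challenges for Libraries", "shelf": "B2"}
shared = to_dublin_core(local)   # fields with no shared meaning ("shelf") are dropped
wire = serialize(shared)
```

Agreeing on the syntax alone would let two systems exchange records without understanding them; agreeing on the semantics alone would leave them unable to read each other's files.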

Simplicity or Complexity?

There also is continuing tension among simplicity, complexity, and extensibility. This has been apparent in the Dublin Core community from the beginning. Some say that we have to make this simple so that everybody will use it. But at the same time, others are insisting that it must be able to express everything we want to express down, down, down, down, a long hierarchy.

I urge all of us to consider the costs of complexity and whether we really gain anything from it in terms of resource discovery. My favorite example comes from the information retrieval community. This community has had a metric for measuring the effectiveness of information retrieval. In fact, we can bring in our latest research and measure its effectiveness, using a very strict metric and very fine-grained numbers. And every single year, the metric is raised by, I believe, 0.005%. Considering the huge amount of money that is put into this research, one has to wonder: At what point does increasing complexity at this level really improve what users get?

Tools and Practices

A question one has to ask is, are common people going to create metadata that is useful? I work on a project called NCSTRL (pronounced "ancestral"), which is a distributed digital library of computer-science technical reports. And I have found it very difficult to convince those award-winning computer-science researchers who are involved that it is important to spend five minutes creating decent metadata for their research reports. They just don't want to spend the time. So we have to ask ourselves, is it worthwhile to create these simple forms? Are people really going to use them?

Administration Issues

Lastly, how do we administer metadata? We must create tools for administering this "stuff" without overloading the architecture in some horrible way.


Copyright © 2003 NFAIS. All rights reserved. No part of this product or service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written consent.
