Preprints of the Metadiversity Conference Proceedings
Session 5: The Metadata Challenge for Libraries
Metadata Challenges for Libraries
CARL LAGOZE, Digital Library Scientist, Cornell University
ABSTRACT
Metadata creation, in
the form of cataloging, has been a fundamental task
of traditional libraries. The movement from
physical to digital artifacts brings some
fundamental challenges to this metadata creation
process. These challenges include new types of
documents, few controls over the quality of
content, an unprecedented increase in the quantity
of content, and distributed management of content
and services. In such an environment, traditional
metadata methods are neither appropriate nor
sufficient. In this short talk, I will briefly
review these challenges and describe some
work-in-progress for creating new metadata
practices and standards.
I am going to make some
observations about metadata and how metadata has worked in
traditional library environments. Then I will look at some
of the challenges that we face in creating metadata for
the Internet. My apologies to the librarians here if I make
observations that violate your beliefs about libraries. I
want to clarify that I am not a librarian. I try to give
accurate observations based on my experiences with
libraries.
The Purpose of Metadata
Earlier in this meeting,
metadata was defined as being structured data about data. In
this talk, I am going to focus on metadata as something that
imposes structure. In fact, I am going to present the view
that, overall, the purpose of metadata is to impose some
order in a disordered information universe.
What we generally end up with
when we search is a lot of "stuff," as many people call it
these days. We may call the "stuff" documents or digital
objects, but most seem to be comfortable with calling it
"stuff." So, we have a lot of stuff out there, and as we
access that stuff and as we try to find that stuff and use
that stuff, we are more comfortable thinking of it in
ordered categories and approaching it from an ordered point
of view, rather than as an amorphous glob of stuff. One of
the major roles of metadata in libraries is to help us
impose some order, to make us think that the stuff out there
all has an author or creator (as we like to call it in the
Dublin Core world) and a title and a subject. Even if that
is a myth, it at least helps us to think about it that way.
Order-Making as a Library Function
Many people think that all
libraries are big warehouses of books and maps and things
like that. But in fact, one of the basic functions of
libraries is order-making. One of the things libraries do
best is impose this order, and they do it in two ways. They
do this with spatial ordering through shelving rules (e.g.,
the Dewey Decimal system), and they do this with a sort of
semantic, logical ordering through cataloging rules (e.g.,
AACR2, MARC). These orders allow us to use libraries in lots
of ways. We can enjoy walking up to the shelves and
serendipitously browsing down the shelves or running into
things in sort of familiar clumpings or going through the
card catalog.
I have a rather elementary
hypothesis that will be of no surprise to anyone: The ease
of order-making is inversely proportional to the level of
chaos. In fact, one of the reasons that libraries have been
so successful–and this is not to demean their level of
success or how hard the work actually is–is that the
environment in which librarians work has a relatively low
level of chaos. But as we step away from the physical
artifacts and move beyond the walls, we are opening that
chaos and making the chaos boundless. This makes the job
very difficult.
Traditional Metadata-Creation Environments
Let's briefly look at the
traditional cataloging metadata-creation environment within
which libraries have worked.
Characteristics of the Environment
The metadata-creation
environment has a number of characteristics. First is the
notion of the stability of a physical artifact: I know that
I have something and that I can describe it and it stays
there and it doesn't disappear. It is something that has
very stable characteristics.
The second characteristic of a
traditional cataloging metadata-creation environment is that
it has clearly established roles. Usually I know who is an
author, who is a publisher, who is a consumer of the
information, and all these roles sort of stay where they
are–they don't shift.
Another characteristic is that
there is a relatively small number of content producers. For
example, the Cornell Library does not collect the works of
everybody. Instead, it collects the works of an established
set of content producers with which it has established
quality and trusting relationships. This is very important
to the way traditional libraries work. They know whom to
trust, and therefore you can trust the libraries–you can
assume when you go into the Cornell Library that you will
find material that is accurate and acceptable.
Lastly, there is this notion of
a defined "control zone." What is inside the walls is part
of the library. What is outside the walls is, more or less,
not part of the library. Of course, the walls are somewhat
permeable, considering things like interlibrary loans, but
they at least exist.
Effects on the Resulting Metadata
Together, these characteristics
mean that library cataloging metadata is extremely
high-quality. This is despite the fact that libraries are
dealing with complex information. If you have ever actually
read AACR2, for example, you know that it is not exactly the
kind of thing you can go through in one pass. Cataloging
requires an extremely high investment of time and money. I
think that the general rule of thumb is $60 to $65 per record
in original cataloging. That is a lot of money when you
consider the acquisition rate of a place like the Cornell
Library. Of course, there are a lot of shortcuts.
For example, there is copy cataloging, and some libraries
follow shortcut cataloging rules. But in the end, cataloging
remains a very expensive process.
In addition, the metadata
resulting from libraries is professionally produced. The
people who work there are professionals. They go to school
to learn to be catalogers, and they do good cataloging work.
Networked Information Is Different
Networked information is
profoundly different. The kind of information that is
appearing on the Web–the information space that we are
creating now–exists in a very different environment.
Changing Relationships
One reason for this difference
is that networked information changes and makes fluid the
relationships among stakeholders. For example, Carl Lagoze
is, at one point, an author. Then suddenly, he is a
publisher. And just as suddenly, he is a user of
information. He shifts back and forth. He does all these
things, all over the place. In other words, we can't really
define who these people are. In addition, there are all
these peculiar information intermediaries, of the America
Online genre.
So these things get very fluid.
And because they are fluid, the roles of these stakeholders
interact in extremely complex ways. We cross paths all the
time, and we have all sorts of ways of talking to each
other that really don't fit into the molds that traditional
libraries have used for acquisitions and cataloging and so
on.
Changing Content
In addition, content itself has
changed. Those nice, packaged, physical artifacts with which
we used to deal in the library are getting very fuzzy right
before our eyes. One reason for this is that the amount of
content with which we are dealing has vastly increased. The
number of objects created is soaring, and those objects are
pouring onto the Web at a phenomenal rate.
In addition, these objects are
increasingly ephemeral. Things appear, they disappear, they
change, they migrate to other places. There is just no
stability in the environment.
The content also is changing
with regard to quality. Because of the greatly reduced
barriers to publication, we have extremely variable quality.
And it is hard to differentiate the quality. On the Web, the
works of Nobel prize-winners, for example, can sit next to
the writing assignments of a local first-grade
class. It is frightening to consider the number of
public-school papers that are being written based on
information taken from Websites. When a student brings in a
paper talking about the Holocaust, for example, it may be
based on material from sites most of us would not consider
authorities on the subject. This is the kind of thing that
should give us all pause.
New Content Forms
Most interesting from my
research point of view is the whole notion of new content
forms. I don't only mean images versus data versus this
versus that–I mean the inter-mixing of these things in all
sorts of odd ways, picking out piecemeal bits, putting them
together, and calling them a document or a digital object or
a piece of "stuff."
There also is the question of
preservation: How do we preserve data, a component of which
is a live data feed from, for example, a meteorological
satellite? How do we solve the problems of preserving
multi-media information? And what does it mean to preserve
that data?
A Blurring of the Control Zones
We also have experienced a
blurring of the control zones that used to be in place. This
makes librarians really nervous. For example, they give
access to some networked document, X, which they say is part of
their library. Well, that document has a link to another
document, and that document goes here, there, there, and
there, and finally the user arrives at, for example, a
pornographic site. Who is responsible? Is the library
responsible for providing a public service by allowing
access to the original document? Are librarians responsible
for cataloging the document? As you can see, the control
zone gets very, very, very strange, due to distributed and
ill-defined administration across individual objects and
within individual objects.
Metadata for Networked Objects
As I said before, I like to
think of objects as these packages of things that come from
all over the place. But who is responsible for all the
individual pieces? And without knowing that, how do we
ensure the integrity of the pieces?
Let me lay out a couple of
metadata goals for networked objects. The first follows from
the many ways in which these digital objects are
treated, accessed, searched, and administered: a metadata
framework needs to accommodate the multiple roles
and responsibilities of the multiple stakeholders. We want
to allow them to act in independent ways. For example, if I
am a library cataloger, I don't want to have to complete the
slot of the metadata record that deals with terms and
conditions. Instead, I want to hand that off to my lawyer or
my legal team or whoever has expertise in that area and say,
"That is your domain. You administer that." I will take care
of the thing I want to do. This is what Clifford Lynch and
Ron Daniel and I wrote about in something called the
"Warwick Framework." The Resource Description Framework (RDF)
also addresses this issue.
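As a rough illustration of handing off parts of a metadata record to different stakeholders, here is a minimal Python sketch of a container that holds separately administered packages. The class names, the steward labels, the identifier, and the permission rule are all assumptions made for illustration; they are not taken from the Warwick Framework or RDF documents.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """One independently administered set of metadata (hypothetical shape)."""
    kind: str      # e.g. "dublin-core" or "terms-and-conditions"
    steward: str   # illustrative label for whoever maintains this package
    fields: dict = field(default_factory=dict)


@dataclass
class Container:
    """Groups the metadata packages that describe one networked object."""
    object_id: str
    packages: dict = field(default_factory=dict)

    def put(self, package: Package, acting_as: str) -> None:
        # Only the steward of a package kind may add or replace it, so the
        # cataloger never has to fill in the legal team's slot, and vice versa.
        current = self.packages.get(package.kind)
        owner = current.steward if current is not None else package.steward
        if owner != acting_as:
            raise PermissionError(f"{acting_as} does not administer {package.kind}")
        self.packages[package.kind] = package


def build_example() -> Container:
    report = Container("urn:example:report-42")  # made-up identifier
    report.put(Package("dublin-core", "cataloger",
                       {"title": "A Technical Report", "creator": "A. Author"}),
               acting_as="cataloger")
    report.put(Package("terms-and-conditions", "legal",
                       {"access": "campus only"}),
               acting_as="legal")
    return report
```

In this sketch the cataloger and the legal team each write only their own package, and an attempt by one to overwrite the other's package raises an error. A real framework would need a much richer notion of identity and authority than a string label.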
The other thing that we want
very much to accommodate is layered solutions. It is very
important to recognize that people want to operate across
these multiple roles in layered ways. There are
multiple forms of resource discovery, and there are multiple
ways that we want to do resource discovery. Things like
Dublin Core may be good for what we call basic, simple
resource discovery, and there may be other, coexisting,
resource-discovery metadata forms that are targeted for
specific disciplines.
I am an extremely strong
supporter of the Dublin Core. But we have seen efforts like
the Dublin Core find themselves on a slippery slope as they
say, "We can do this also, and we can do that also."
Other Issues to Consider
Let me close with some issues we
have identified, given those goals on metadata for networked
objects.
What to Trust?
First is the whole issue of
quality and trust, which I brought up earlier: How do you
know what metadata to trust? A lot of us make the assumption
that the metadata is embedded in the object in sort of a
traditional way–you have an HTML page, and the metadata is
in it. But the way that we are thinking in the RDF and
other, similar architectures is that metadata is somehow
associated with the object through external means.
Which metadata do you know to
trust at that point? There is an incredible amount of index
"spamming" taking place on the Web today. People load their HTML
pages with all sorts of junk, hidden from the user, that
makes their pages come up first in a search. As a result, as
information becomes a consumer object and a consumer
commodity, some people are led to do things that aren't
always honest.
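To make the contrast between embedded and externally associated metadata concrete, the following Python sketch reads metadata out of an HTML page's meta tags and, separately, looks up statements about the same page in an external store that records who made each statement. The store, its "source" field, the example URL, and the trusted-source filter are hypothetical; the "DC." prefix on meta names is used here only as an example naming convention.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collects <meta name=... content=...> pairs embedded in an HTML page."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content:
            self.found[name] = content


def embedded_metadata(html_text):
    """Return whatever the page says about itself; anyone can write this."""
    reader = MetaTagReader()
    reader.feed(html_text)
    return reader.found


# Hypothetical external store: statements about an object, keyed by its URL,
# each recording which party made the assertion.
EXTERNAL_STORE = {
    "http://example.org/report.html": [
        {"source": "library-catalog", "DC.title": "Annual Report"},
    ],
}


def external_metadata(url, trusted_sources):
    """Return only the external statements made by sources the caller trusts."""
    return [record for record in EXTERNAL_STORE.get(url, [])
            if record["source"] in trusted_sources]
```

The embedded route accepts whatever the page author wrote, which is what index spamming exploits. The external route at least lets a consumer ask who is speaking, although deciding which sources deserve trust remains the open question.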
The Issue of Interoperability
Interoperability is another
issue we must consider as we think about
metadata. We must keep in mind both of its dimensions:
semantic and syntactic interoperability.
Simplicity or Complexity?
There also is continuing tension
among simplicity, complexity, and extensibility. This has
been apparent in the Dublin Core community from the
beginning. Some say that we have to make this simple so that
everybody will use it. But at the same time, others are
insisting that it must be able to express everything we want
to express down, down, down, down, a long hierarchy.
I urge all of us to consider the
costs of complexity and whether we really gain anything from
it in terms of resource discovery. My favorite example comes
from the information retrieval community. This community has
had a metric for measuring the effectiveness of information
retrieval. In fact, we can bring in our latest research and
measure its effectiveness, using a very strict metric and
very fine granularity numbers. And every single year, the
metric is raised by, I believe, 0.005%. Considering the huge
amount of money that is put into this research, one has to
wonder: At what point does increasing complexity at this
level really improve what users get?
Tools and Practices
A question one has to ask is,
are ordinary people going to create metadata that is useful? I
work on a project called NCSTRL (pronounced "ancestral"), a
distributed digital library of computer-science technical reports. And I
have found it very difficult to convince those award-winning
computer-science researchers who are involved that it is
important to spend five minutes creating decent metadata for
their research reports. They just don't want to spend the
time. So we have to ask ourselves, is it worthwhile to
create these simple forms? Are people really going to use
them?
Administration Issues
Lastly, how do we administer
metadata? We must create tools for administering this
"stuff" without overloading the architecture in some
horrible way.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.