Preprints of the
Metadiversity
Conference
Proceedings
Session 5: The Metadata Challenge for Libraries
Building Digital
Libraries for Metadiversity: Federation across Disciplines
CLIFFORD LYNCH,
Director, the Coalition for Networked Information (CNI)
I am going to make some
fairly wide-ranging comments this morning, and I am afraid I
am going to ask more questions than I am going to answer.
But hopefully this will set the stage for some more detailed
discussion of metadata issues later in the panel. It also, I
think you will see, will connect very strongly with some of
the comments that Steve Griffin was making about the
evolution in thinking in the new phase of the NSF, ARPA et
al. Digital Libraries program.
The Vocabulary of Our
Industry
One of the things I want to
do is probe at some words that have been used a lot today.
We all feel good about these words, but the definitions can
get vague. Lead candidates are digital library (what on
earth is a digital library?), interoperability (something we
all know is a good thing, but we are not exactly sure what
it means), federation (another term that we use rather
glibly that is similar in meaning to interoperability), and
infrastructure (a very relative term, with one person’s
infrastructure being another’s application).
Digital Libraries
The idea of a digital library
really emerged in the late 1980s. I think it’s clear to
everyone that digital libraries are going to be key
components in the networked world and will play very
significant roles in efforts such as the biodiversity
program that is our focus here. But there is still a very
real conceptual debate about what a digital library is and
how it fits into the broader environment of networks,
network services, application support services
("middleware"), and the applications themselves; there’s
also debate about the roles that they play with respect to
organizations and individuals. For example, in an
expansionist sort of mode, one can view a digital library as
the manifestation of an extensive system (infrastructure?)
of digital storage repositories and the tools to organize,
search, and navigate them. It is the organizing interface
that puts all of these data at your disposal. Another more
limited way to view a digital library is as a storage
system; it is simply an infrastructure component that
houses data and that can be drawn upon to get work done. In
this latter view one thinks of digital libraries as
components that may be built on by applications that
actually serve users–particularly in very complex,
integrative, multidisciplinary environments.
One of the big, open issues
today is how passive or active the digital library should
be. There is one view that says digital libraries are mostly
about housing data and are mainly passive; they react to
user queries. There is another view that says that digital
libraries ultimately become work environments; they are
about making decisions, about doing analysis, about getting
work done–and in that sense, digital libraries may be far
less passive than some of their physical predecessors, such
as print collections. They may really start moving in the
direction of data-analysis environments, collaboration
environments, and computer-supported work environments. In
that sense, they may be the manifestation of an active kind
of infrastructure and an environment for doing work.
We must think about the
definition of digital libraries as we consider their roles
here.
You heard earlier that the
first round of NSF digital libraries was mainly about
information technology. I think the kinds of things that
researchers were calling for in the second round were much
more about operational systems and sustainable-content
bases. I also think there are questions going beyond the
social and institutional roles of digital libraries, and
those questions are going to really emerge over the next few
years as we gain more experience with operational systems.
Interoperability and
Federation
When we think about the
issues involved in biodiversity, we have as big a sweep and
diversity of content as any pursuit I can imagine. The range
and number of information sources and computational
resources that need to be woven in here is quite stunning.
It includes geospatial information and remote sensing. It
includes systematic and molecular biology resources. It
involves environmental data, like the EPA and pollution
data. It involves both the published and the grey literature
from a huge number of different disciplines.
The range of data includes
lots of multimedia–not just in the sense of numeric data
sets or of combinations of video, text, and images, but also
of spatially organized data as a basic data type. Spatially
organized data will be critical, particularly as we try to
understand the effects of climate change and the effects of
various building and population shift activities on the
biological world. We need to recognize that spatial data is
going to get more and more complex, because it is going to
move in the direction of time series of spatially organized
data. We are very interested in how things change over time,
and the quality and availability of data varies wildly
across time.
That is another component
here that is very, very different from, for example,
thinking about libraries that work off the traditional
publishing process, which in a very real sense has not
changed in several hundred years. Articles are articles,
text is text.
But now we see new
generations of sensors and other technological advances
improving our data, very steadily, decade by decade. We also
see huge blank spaces, gaps in the data, gaps that are very
significant. In addition, and unlike traditional libraries,
we find wild variation of conceptual organization as we move
from resource to resource.
One of the central challenges
we face here is how to link together–or interoperate or
federate–all of these very different resources. It’s not
clear what our expectations–at least our realistic
ones–really are in this area. We know that we can give some
very superficial coherence to much of this information by
making it available through a Web browser; we probably
cannot devise a single universal data model that covers all
of the resources that we’ll need to investigate and manage
biodiversity. This is an area that calls out for more
consideration.
Consider what we’ve learned
in the library community about the difficulties of
federation. On a conceptual level, library catalogs are
fairly consistent as we move from place to place. Certainly,
we all have enjoyed the experience of being able to walk
into various research libraries and do our work. After all,
one card catalog is pretty much like another.
But building a high-quality
federation of library catalogs using technologies like
Z39.50 has been a startlingly difficult problem–a problem
with which the research library community has been
struggling for over a decade now. Some of this has to do
with the multiplicity of vendor systems available, coupled
with the fact that the vendors are not necessarily eager to
interoperate elegantly. Some has to do with the insistence
by various institutions on the continuance of various local
idiosyncrasies. Regardless of the cause, the fact of the
matter is that construction of this federation has been
hard.
When we start talking about
the reach of data that I have been describing, the concept
of federation becomes remarkably complex. In fact, it is not
at all clear what sort of global data models
would allow us to accomplish that federation. The kinds of
work going on in metadata standards have given us the
beginnings of a definition–at least for discovery and
retrieval purposes. But we know, particularly in dealing
with geographic data, that we must go a lot farther than
simple discovery and retrieval–there are extensive
manipulation and presentation concerns that come to light.
If nothing else, some of the descriptive metadata approaches
let us treat the resources as a collection that can be
searched consistently and systematically, even if they may
have to subsequently be used sequentially or in smaller
groups of similar resources that are more amenable to
federation.
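As a rough illustration of that last point, here is a minimal sketch in Python. Every record, field name, and source name in it is a hypothetical stand-in and is not drawn from any real system. It reduces very different resources to a small shared discovery record, searches them as one collection, and then groups the hits by resource type so that each group could be handed to tools suited to it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryRecord:
    """Minimal shared description; the field names are illustrative assumptions."""
    source: str         # which repository holds the resource
    resource_type: str  # e.g. "geospatial", "literature", "specimen"
    title: str
    keywords: tuple


# Hypothetical records standing in for very different underlying resources.
RECORDS = [
    DiscoveryRecord("agency-gis", "geospatial",
                    "Wetland boundaries 1990", ("wetland", "land cover")),
    DiscoveryRecord("journal-index", "literature",
                    "Amphibian decline in wetland habitats", ("wetland", "amphibian")),
    DiscoveryRecord("museum-db", "specimen",
                    "Rana specimens, field series", ("amphibian",)),
]


def search(records, term):
    """Case-insensitive match against the title and keywords of every record."""
    term = term.lower()
    return [
        r for r in records
        if term in r.title.lower() or any(term in k.lower() for k in r.keywords)
    ]


def group_by_type(hits):
    """Group hits so that similar resources can be worked with together."""
    groups = defaultdict(list)
    for r in hits:
        groups[r.resource_type].append(r)
    return dict(groups)


hits = search(RECORDS, "wetland")
groups = group_by_type(hits)
```

The shared record only supports discovery. Anything beyond that, such as manipulating or presenting the geospatial data, would still depend on type-specific tools applied to each group.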
Infrastructure
The fourth candidate for my
list of conveniently ambiguous terms is infrastructure. We
have certainly heard about the construction of all manner of
infrastructures in the initiatives around spatial data
(think of the very phrase "National Spatial Data
Infrastructure," for example) and around biological
diversity data. I think that infrastructure is a sort of
wonderful catchall term, but we need to start thinking much
more about what we want from the overall "system"–be it
applications, infrastructure, or whatever–and what the
components in this system are and how they interact and
connect with each other. We also need to recognize, I think,
that infrastructure is a moving target. Once we understand a
network service, and it becomes stable, widely used, and
ubiquitously available, we tend to consign it to
infrastructure. In particular, people who are new to the
network regard it as infrastructure, because
for them it was always there and they can count on it.
Metadata Issues
We know metadata will be an
essential tool. Let me turn to some issues around metadata
at a high level.
First, I think it’s important
to say that it’s often unhelpful to talk about
metadata as a thing without context–it’s information that
exists and is being exploited to make something happen:
resource discovery, federation, management and presentation
of a digital object, electronic commerce, whatever. Our
decisions about metadata need to be guided not by metadata
"theology" but by the demands of the activities that we need
to support.
Metadata Use Across
Communities
One of the metadata issues we
must consider is how to correlate and use metadata across
multiple communities. We have content that is shared in
common by different communities–but the communities are
describing the content for different purposes. All of a
sudden we are faced with the very real challenge of using
these multiple, independent metadata sets together in the
kind of multi-disciplinary research about which we are
talking. There are some conceptual models for this, such as
the Warwick Framework and the Resource Description Framework
coming out of the World Wide Web Consortium. But right now a
lot of this is still architectural modeling. We have very
limited experience in marshalling this sort of thing for
actual retrieval across disciplinary communities.
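To make the idea of several independent metadata sets attached to one piece of content a little more concrete, here is a small Python sketch. It is only loosely in the spirit of the container-and-package idea associated with the Warwick Framework; it is not an implementation of that framework or of RDF, and the class names, scheme labels, and element names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """One community's description of a resource (hypothetical structure)."""
    scheme: str     # label for the metadata set, e.g. "library-description"
    elements: dict  # element name -> value, as that community defines them


@dataclass
class Container:
    """Holds every package that describes the same resource."""
    resource_id: str
    packages: list = field(default_factory=list)

    def add(self, package):
        self.packages.append(package)

    def values(self, element):
        """Return (scheme, value) pairs from each package that has the element."""
        return [
            (p.scheme, p.elements[element])
            for p in self.packages
            if element in p.elements
        ]


# Two communities describe the same hypothetical data set for different purposes.
container = Container("dataset-001")
container.add(Package("library-description",
                      {"title": "Wetland survey, 1990", "subject": "Wetlands"}))
container.add(Package("geospatial-description",
                      {"title": "wetland_survey_1990",
                       "bounding_box": (-76.0, 39.0, -75.0, 40.0)}))

titles = container.values("title")
subjects = container.values("subject")
```

Keeping the packages side by side, rather than merging them into one record, leaves each community's description intact; the hard part that the sketch does not address is deciding when elements from different packages actually mean the same thing.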
Metadata: More than Data
Elements
Another consideration about
metadata: Metadata is not just data elements. Those data
elements are populated by vocabularies. We have heard
earlier in this meeting about various efforts to develop
controlled vocabularies to populate those sorts of data
elements. In fact, this is not just a commonplace activity
in scientific nomenclature within disciplines. It is also,
for example, long-standing practice in organizing literature
from various disciplinary points of view. For example, think
about the work the National Library of Medicine has done
with its Medical Subject Headings scheme to organize the
literature of the health and life sciences. Think about the work
that organizations like the Institute of Electrical
Engineers and others have done in the vocabulary for the
broad engineering disciplines.
Therefore, one of the things
we face in this kind of multi-disciplinary work is
inter-linking–for the first time–these controlled
vocabularies. All of a sudden, these vocabularies are not
just books. They are databases that act as "traffic cops"
and interchange points and translators. We really have very
limited experience with these kinds of multilingual,
multi-terminology information resources, which, I believe,
are going to be a very key component of this sort of
multi-disciplinary information system. We need to explicitly
recognize that along with primary information and metadata,
there is going to be a need for a whole set of system
components that manage vocabularies, gazetteers, thesauri,
and similar tools and maintain mappings among them; these
are going to be services in their own right. And if these
are going to be general-purpose servers on the Net, then we
need to think about the protocols or other service
interfaces that applications will need in order to integrate
their use.
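A vocabulary service of this kind could be sketched, very roughly, as follows. This is a minimal Python illustration under stated assumptions: the class, its method names, the vocabulary labels, and the sample mappings are all hypothetical, and a real service would also need a network protocol, provenance for each mapping, and handling of partial or one-to-many equivalences.

```python
class VocabularyMapper:
    """Toy mapping service between controlled vocabularies (hypothetical design)."""

    def __init__(self):
        # (vocabulary, term) -> set of (vocabulary, term) recorded as equivalent
        self._links = {}

    def add_mapping(self, a, b):
        """Record that terms a and b are equivalent, in both directions."""
        self._links.setdefault(a, set()).add(b)
        self._links.setdefault(b, set()).add(a)

    def translate(self, term, target_vocab):
        """Return the terms in target_vocab that are directly mapped to term."""
        return sorted(
            t for (vocab, t) in self._links.get(term, ()) if vocab == target_vocab
        )


mapper = VocabularyMapper()
# Illustrative mappings only; the vocabulary names are made up for this sketch.
mapper.add_mapping(("common-names", "frogs"), ("taxonomy", "Anura"))
mapper.add_mapping(("common-names", "frogs"), ("subject-headings", "Amphibians"))

to_taxonomy = mapper.translate(("common-names", "frogs"), "taxonomy")
back = mapper.translate(("taxonomy", "Anura"), "common-names")
missing = mapper.translate(("taxonomy", "Anura"), "subject-headings")
```

Note that the sketch follows only direct mappings: the taxonomy term is not connected to the subject heading unless someone records that link explicitly, which mirrors the point that maintaining the mappings is itself ongoing work.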
Critical Evaluation of
Metadata Costs and Results
Historically, we have taken
this odd view of metadata, in which we first define the
right metadata and then we try to build the accompanying
systems. Automation of the library catalog offers a
wonderful lesson for us here. As you know, there was this
sort of proper description that was manifested in card
catalogs and later in the digital bibliographic records that
were used to print catalog cards when shared copy cataloging
became a reality. Then systems came into place that allowed
us to automate the card catalogs. We built these retrieval
systems, called on-line catalogs, and very rapidly
discovered that we didn't know what to do with some of the
data elements in the bibliographic record that were so
carefully and specifically tagged and differentiated one
from another and recorded. At least some of the information
there made absolutely no difference to the retrieval
process, at least as far as anyone could tell.
Metadata is a really big
investment. For example, the museum, archival, and
special-collections world is now facing the enormous
challenge of digitizing all of the treasures it has been
holding in physical form. It is not uncommon in those
projects to see budgets in which the generation of the
metadata to describe the things that are being digitized is
up around the same level of investment as the digitization
itself!
Metadata is really expensive,
both to create and to maintain. This cost must be considered
as we learn more about what does and doesn’t make a
difference in the use of metadata, and we need to reflect
this evaluation back in our ongoing processes and standards
for creating metadata.
Other Concerns
I want to end by raising a
few more questions about digital libraries, about scope, and
about federation.
The Issue of Published
Literature
There is an enormous
published literature out there. Some of this is so-called
grey literature–technical reports, policy statements, and so
on. A lot of it is the traditional journal literature that
has been coming out through traditional published journals.
This literature also needs to get inter-linked into this
giant system. The inter-linkage of published literature
suddenly gives this project a very different character.
Some of the numeric and
remote sensing databases, or the molecular biology databases
are public domain or quasi-public domain. These data are
fairly accessible and are often viewed as the "collective
property" of the scientific community. But when we start
looking at the whole body of published literature, we have a
much broader range of ownership, economic models, and
control concerns regarding those linking processes. It is
not just the journal literature. It is the abstracting and
indexing databases that organize that journal literature and
make it coherent. It is also various kinds of handbooks,
encyclopedias, and other reference materials that are likely
to evolve into very complex databases, which again are
linked to these vocabularies, which in turn are linked to
these sorts of data sets. These are going to be significant
parts of the system. And we should not overlook the fact
that some of the vocabularies themselves are commercial
properties.
The Economics of Data
All of these databases
operate under different economic and business models.
Therefore, we need to think about the experience of the user
who is wandering around in this system. For example, for a
given project, you discover some federal-government
geospatial data that you can download for free. You find
some journal literature. You rummage around in an
encyclopedia of plant types. Some of the material you have
used is commercial, while some might be supported by a
university or a scholarly society.
How will the economics of
this work? You can take the obvious easy out and say, "A
screen will pop up and ask you to enter your credit card
number." In fact, that is not the way things are working
right now. If you look particularly at the university
community as it moves to use literature that is increasingly
in electronic form, you find site licenses that entitle
members of university communities to utilize this material.
You also find very difficult problems regarding
authentication and access management and regarding how you
demonstrate you are a member of one of those communities. We
need to think rather carefully about the sort of economic
and business models that are going to infuse the user's
experience in traversing the infrastructure we are
building, particularly as we look at it in the broadest
sense–as a place to make decisions, to do analysis, to do
research, to do work, to capture an understanding.
Ownership Issues
Ownership and data sources
also are going to be extremely varied. There is federal
content; there is international and foreign content; there
is state, local, and regional content. And remember that
while there is a pretty well-established policy about how
federal data are managed, paid for, and distributed, there
are 50 different state policies. In addition, cities are
making up policies on their own, and some cities are looking
to these as significant revenue sources. So, we have a very
complex patchwork of policies about access and ownership
that we need to consider.
Many data sources are
non-governmental. Some of these are for-profit publishers or
data suppliers with fairly well-established commercial
motivations and business approaches, but we also need to
recognize that universities, state libraries, museums,
scholarly societies, and other groups are going to play a
role here. It’s much less clear what sort of ownership and
economic models are going to become prevalent here.
We also need to recognize
that many researchers are going to want to be able to
integrate private, unreleased data with the "public" (either
free or commercially accessible) information resources, but
they will need to do this in a way that protects their
private, unreleased data. And these arrangements will need
to be flexible enough to facilitate constantly changing
inter-organizational research collaborations.
The Issue of Scope
There is also a question of
whether digital libraries are not only places to passively
house, archive, and preserve data, but also places to use
data, to understand data, to reason about data. What role
will digital libraries play in the publishing process? Are
digital libraries within this infrastructure going to become
ways for people to publish databases, to publish research
reports? How will the system relate to the grey literature
as this grey literature changes character and moves into
electronic forms? And will people be able to contribute to
and comment on material that is already housed there? As
collaborative analysis activities take place more in the
electronic environment, will digital libraries include the
records of these activities? These also are important
questions to consider.
Legal Questions
There also is the significant
issue of legal publication of record. Much of the work in
the environmental, biodiversity, and spatial data
communities interpenetrates with a lot of legal issues about
land use, about environmental impact, and so on. In
addition, some of this material is going to be material of
record, so we need to deal with questions about source
integrity and authenticity. And, although this is properly a
matter for another talk, we have to work out the assignment
and funding of responsibility for archiving all of the data
for the long term.
When we talk about metadata,
there is a tendency to focus very heavily on description for
discovery, because we have been so challenged by these
problems of federation and interoperability. My view of
metadata is considerably broader and also encompasses
metadata for making assertions about source, provenance,
authenticity, and integrity–the whole range of things that
comprise terms and conditions of use. All of these are going
to be quite critical as we go on to build systems to support
biodiversity.
Conclusion
I hope the questions I have
raised will be useful in forming our discussion about the
sort of scope and direction of the biodiversity
infrastructure (or system, if you prefer to leave the
assignment of roles to infrastructure as an open question)
and about the role digital libraries play in
infrastructures. I have to stress that there is no right
answer here–it is not as if somebody had the definition of
digital library and we just have to look it up. It is a
question, to some extent, of what we choose to label as a
digital library and how we sort the cultural and economic
baggage and traditions associated with the labels we choose.
But I think that these sorts of questions–questions about
where work happens, where analysis happens, about our
expectations in terms of interoperability and federation,
and how those relate to data management and archiving–are
going to be very critical as we go forward.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.