Preprints of the Metadiversity Conference Proceedings

  Session 5: The Metadata Challenge for Libraries

Building Digital Libraries for Metadiversity: Federation across Disciplines

CLIFFORD LYNCH, Director, the Coalition for Networked Information (CNI)

I am going to make some fairly wide-ranging comments this morning, and I am afraid I am going to ask more questions than I am going to answer. But hopefully this will set the stage for some more detailed discussion of metadata issues later in the panel. It also, I think you will see, will connect very strongly with some of the comments that Steve Griffin was making about the evolution in thinking in the new phase of the NSF, ARPA et al. Digital Libraries program.

The Vocabulary of Our Industry

One of the things I want to do is probe at some words that have been used a lot today. We all feel good about these words, but the definitions can get vague. Lead candidates are digital library (what on earth is a digital library?), interoperability (something we all know is a good thing, but we are not exactly sure what it means), federation (another term that we use rather glibly that is similar in meaning to interoperability), and infrastructure (a very relative term, with one person’s infrastructure being another’s application).

Digital Libraries

The idea of a digital library really emerged in the late 1980s. I think it’s clear to everyone that digital libraries are going to be key components in the networked world and will play very significant roles in efforts such as the biodiversity program that is our focus here. But there is still a very real conceptual debate about what a digital library is and how it fits into the broader environment of networks, network services, application support services ("middleware"), and the applications themselves; there’s also debate about the roles that they play with respect to organizations and individuals. For example, in an expansionist sort of mode, one can view a digital library as the manifestation of an extensive system (infrastructure?) of digital storage repositories and the tools to organize, search, and navigate them. It is the organizing interface that puts all of these data at your disposal. Another, more limited way to view a digital library is as a storage system; it is simply an infrastructure component that houses data and that can be drawn upon to get work done. In this latter view one thinks of digital libraries as components that may be built on by applications that actually serve users–particularly in very complex, integrative, multidisciplinary environments.

One of the big, open issues today is how passive or active the digital library should be. There is one view that says digital libraries are mostly about housing data and are mainly passive; they react to user queries. There is another view that says that digital libraries ultimately become work environments; they are about making decisions, about doing analysis, about getting work done–and in that sense, digital libraries may be far less passive than some of their physical predecessors, such as print collections. They may really start moving in the direction of data-analysis environments, collaboration environments, and computer-supported work environments. In that sense, they may be the manifestation of an active kind of infrastructure and an environment for doing work.

We must think about the definition of digital libraries as we consider their roles here.

You heard earlier that the first round of NSF digital libraries was mainly about information technology. I think the kinds of things that researchers were calling for in the second round were much more about operational systems and sustainable-content bases. I also think there are questions going beyond the social and institutional roles of digital libraries, and those questions are going to really emerge over the next few years as we gain more experience with operational systems.

Interoperability and Federation

When we think about the issues involved in biodiversity, we have as big a sweep and diversity of content as any pursuit I can imagine. The range and number of information sources and computational resources that need to be woven in here is quite stunning. It includes geospatial information and remote sensing. It includes systematic and molecular biology resources. It involves environmental data, like the EPA and pollution data. It involves both the published and the gray literature from a huge number of different disciplines.

The range of data includes lots of multimedia–not just in the sense of numeric data sets or of combinations of video, text, and images, but also of spatially organized data as a basic data type. Spatially organized data will be critical, particularly as we try to understand the effects of climate change and the effects of construction and population shifts on the biological world. We need to recognize that spatial data is going to get more and more complex, because it is going to move in the direction of time series of spatially organized data. We are very interested in how things change over time, and the quality and availability of data vary wildly across time.

That is another component here that is very, very different from, for example, thinking about libraries that work off the traditional publishing process, which in a very real sense has not changed in several hundred years. Articles are articles, text is text.

But now we see new generations of sensors and other technological advances improving our data, very steadily, decade by decade. We also see huge blank spaces, gaps in the data, gaps that are very significant. In addition, and unlike traditional libraries, we find wild variation of conceptual organization as we move from resource to resource.

One of the central challenges we face here is how to link together–or interoperate or federate–all of these very different resources. It’s not clear what our expectations–at least our realistic ones–really are in this area. We know that we can give some very superficial coherence to much of this information by making it available through a Web browser; we probably cannot devise a single universal data model that covers all of the resources that we’ll need to investigate and manage biodiversity. This is an area that calls out for more consideration.

Consider what we’ve learned in the library community about the difficulties of federation. On a conceptual level, library catalogs are fairly consistent as we move from place to place. Certainly, we all have enjoyed the experience of being able to walk into various research libraries and do our work. After all, one card catalog is pretty much like another.

But building a high-quality federation of library catalogs using technologies like Z39.50 has been a startlingly difficult problem–a problem with which the research library community has been struggling for over a decade now. Some of this has to do with the multiplicity of vendor systems available, coupled with the fact that the vendors are not necessarily eager to interoperate elegantly. Some has to do with the insistence by various institutions on the continuance of various local idiosyncrasies. Regardless of the cause, the fact of the matter is that construction of this federation has been hard.

When we start talking about the reach of data that I have been describing, the concept of federation becomes remarkably complex. In fact, it is not at all clear what sort of global data models would allow us to accomplish that federation. The kinds of work going on in metadata standards have given us the beginnings of a definition–at least for discovery and retrieval purposes. But we know, particularly in dealing with geographic data, that we must go a lot farther than simple discovery and retrieval–there are extensive manipulation and presentation concerns that come to light. If nothing else, some of the descriptive metadata approaches let us treat the resources as a collection that can be searched consistently and systematically, even if they may subsequently have to be used sequentially or in smaller groups of similar resources that are more amenable to federation.
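To make the last point concrete, here is a minimal sketch in Python of searching differently organized resources through a shared descriptive layer. Everything in it is an assumption made for illustration: the `DiscoveryRecord` fields, the two source layouts, and the sample records are hypothetical and do not correspond to any particular metadata standard or system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveryRecord:
    """A small shared description used only for discovery (hypothetical fields)."""
    source: str
    identifier: str
    title: str
    subjects: tuple


def from_catalog(rec: dict) -> DiscoveryRecord:
    # Assumed layout of a library-catalog record; not a real schema.
    return DiscoveryRecord(
        source="catalog",
        identifier=rec["id"],
        title=rec["title"],
        subjects=tuple(s.lower() for s in rec["subject_headings"]),
    )


def from_geodata(rec: dict) -> DiscoveryRecord:
    # Assumed layout of a geospatial-dataset description; not a real schema.
    return DiscoveryRecord(
        source="geodata",
        identifier=rec["dataset_id"],
        title=rec["name"],
        subjects=tuple(k.lower() for k in rec["keywords"]),
    )


def search(records, term: str):
    """Return records whose title contains the term or whose subjects include it."""
    term = term.lower()
    return [r for r in records if term in r.title.lower() or term in r.subjects]


# Illustrative records only.
records = [
    from_catalog({"id": "c1", "title": "Wetland Plants of the Northeast",
                  "subject_headings": ["Wetlands", "Botany"]}),
    from_geodata({"dataset_id": "g7", "name": "Land cover, 1990",
                  "keywords": ["wetlands", "remote sensing"]}),
]
hits = search(records, "wetlands")
```

The sketch only finds the resources; once found, the catalog entry and the land-cover data set would still have to be used through their own, quite different, native interfaces.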

Infrastructure

The fourth candidate for my list of conveniently ambiguous terms is infrastructure. We have certainly heard about the construction of all manner of infrastructures in the initiatives around spatial data (think of the very phrase "National Spatial Data Infrastructure," for example) and around biological diversity data. I think that infrastructure is a sort of wonderful catchall term, but we need to start thinking much more about what we want from the overall "system"–be it applications, infrastructure, or whatever–and what the components in this system are and how they interact and connect with each other. We also need to recognize, I think, that infrastructure is a moving target. Once we understand a network service, and it becomes stable, ubiquitously available, and widely used, we tend to consign it to infrastructure. In particular, new people who are learning the network regard it as infrastructure, because for them it was always there and they can count on it.

Metadata Issues

We know metadata will be an essential tool. Let me turn to some issues around metadata at a high level.

First, I think it’s important to say that it’s often unhelpful to talk about metadata as a thing without context. It’s information that exists and is being exploited to make something happen: resource discovery, federation, management and presentation of a digital object, electronic commerce, whatever. Our decisions about metadata need to be guided not by metadata "theology" but by the demands of the activities that we need to support.

Metadata Use Across Communities

One of the metadata issues we must consider is the correlating and using of metadata for multiple communities. We have content that is shared in common by different communities–but the communities are describing the content for different purposes. All of a sudden we are faced with the very real challenge of using these multiple, independent metadata sets together in the kind of multi-disciplinary research about which we are talking. There are some conceptual models for this, such as the Warwick Framework and the Resource Description Framework coming out of the World Wide Web Consortium. But right now a lot of this is still architectural modeling. We have very limited experience in marshalling this sort of thing for actual retrieval across disciplinary communities.
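As a rough illustration of the container idea behind architectures such as the Warwick Framework, the following Python sketch keeps each community's description of one resource as a separate package and lets an application draw on them together. The class names, the element names, and the sample values are all hypothetical; this is a sketch of the general idea, not an implementation of the Warwick Framework or of the Resource Description Framework.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """One community's description of a resource (hypothetical structure)."""
    scheme: str      # purpose of the description, e.g. "discovery"
    community: str   # who produced it
    elements: dict   # element name -> value


@dataclass
class Container:
    """Holds independent metadata packages about a single resource."""
    resource_id: str
    packages: list = field(default_factory=list)

    def add(self, package: Package) -> None:
        self.packages.append(package)

    def by_scheme(self, scheme: str) -> list:
        """All packages written for a given purpose."""
        return [p for p in self.packages if p.scheme == scheme]

    def values(self, element: str) -> dict:
        """Each community's value for one element name, keyed by community."""
        return {p.community: p.elements[element]
                for p in self.packages if element in p.elements}


# Illustrative data only.
c = Container("dataset-42")
c.add(Package("discovery", "librarians",
              {"title": "River survey", "subject": "Hydrology"}))
c.add(Package("discovery", "ecologists",
              {"title": "River survey", "subject": "Riparian habitat"}))
c.add(Package("terms-of-use", "publisher", {"license": "site license"}))
subjects = c.values("subject")
```

Keeping the packages separate means neither community has to adopt the other's scheme; the hard part, which the sketch does not address, is deciding how to reconcile the differing values at retrieval time.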

Metadata: More than Data Elements

Another consideration about metadata: Metadata is not just data elements. Those data elements are populated by vocabularies. We have heard earlier in this meeting about various efforts to develop controlled vocabularies to populate those sorts of data elements. In fact, this is not just a commonplace activity in scientific nomenclature within disciplines. It is also, for example, long-standing practice in organizing literature from various disciplinary points of view. For example, think about the work the National Library of Medicine has done with its Medical Subject Headings (MeSH) to organize the literature of the health and life sciences. Think about the work that organizations like the Institute of Electrical Engineers and others have done in the vocabulary for the broad engineering disciplines.

Therefore, one of the things we face in this kind of multi-disciplinary work is inter-linking–for the first time–these controlled vocabularies. All of a sudden, these vocabularies are not just books. They are databases that act as "traffic cops" and interchange points and translators. We really have very limited experience with these kinds of multilingual, multi-terminology information resources, which, I believe, are going to be a very key component of this sort of multi-disciplinary information system. We need to explicitly recognize that along with primary information and metadata, there is going to be a need for a whole set of system components that manage vocabularies, gazetteers, thesauri, and similar tools and maintain mappings among them; these are going to be services in their own right. And if these are going to be general-purpose servers on the Net, then we need to think about the protocols or other service interfaces that applications will need in order to integrate their use.
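A vocabulary service of this kind might, at its very simplest, look like the Python sketch below: a component that records mappings between terms in different controlled vocabularies and translates on request. The class, its method names, and the sample vocabularies are assumptions made for illustration; a real service would also need to handle partial matches, hierarchy, and versioning, and would sit behind some network protocol.

```python
class VocabularyMapper:
    """Toy crosswalk between controlled vocabularies (hypothetical interface)."""

    def __init__(self):
        # (vocabulary, lowercased term) -> {target vocabulary: set of terms}
        self._map = {}

    def _link(self, src_vocab, src_term, dst_vocab, dst_term):
        key = (src_vocab, src_term.lower())
        self._map.setdefault(key, {}).setdefault(dst_vocab, set()).add(dst_term)

    def add_mapping(self, vocab_a, term_a, vocab_b, term_b):
        """Record that two terms are equivalent, in both directions."""
        self._link(vocab_a, term_a, vocab_b, term_b)
        self._link(vocab_b, term_b, vocab_a, term_a)

    def translate(self, src_vocab, term, dst_vocab):
        """Return the known equivalents of a term, sorted; empty if none."""
        targets = self._map.get((src_vocab, term.lower()), {})
        return sorted(targets.get(dst_vocab, set()))


# Illustrative mappings only.
mapper = VocabularyMapper()
mapper.add_mapping("scientific", "Quercus alba", "common", "white oak")
mapper.add_mapping("scientific", "Quercus alba", "common", "stave oak")
common_names = mapper.translate("scientific", "quercus alba", "common")
```

An application could call `translate` before sending a query to a resource indexed under a different vocabulary, which is the "interchange point" role described above.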

Critical Evaluation of Metadata Costs and Results

Historically, we have taken this odd view of metadata, in which we first define the right metadata and then we try to build the accompanying systems. Automation of the library catalog offers a wonderful lesson for us here. As you know, there was this sort of proper description that was manifested in card catalogs and later in the digital bibliographic records that were used to print catalog cards when shared copy cataloging became a reality. Then systems came into place that allowed us to automate the card catalogs. We built these retrieval systems, called on-line catalogs, and very rapidly discovered that we didn't know what to do with some of the data elements in the bibliographic record that were so carefully and specifically tagged and differentiated one from another and recorded. At least some of the information there made absolutely no difference to the retrieval process, at least as far as anyone could tell.

Metadata is a really big investment. For example, the museum, archival, and special-collections world is now facing the enormous challenge of digitizing all of the treasures it has been holding in physical form. It is not uncommon in those projects to see budgets in which the generation of the metadata to describe the things that are being digitized is up around the same level of investment as the digitization itself!

Metadata is really expensive, both to create and to maintain. This cost must be considered as we learn more about what does and doesn’t make a difference in the use of metadata, and we need to reflect this evaluation back in our ongoing processes and standards for creating metadata.

Other Concerns

I want to end by raising a few more questions about digital libraries, about scope, and about federation.

The Issue of Published Literature

There is an enormous published literature out there. Some of this is so-called grey literature–technical reports, policy statements, and so on. A lot of it is the traditional journal literature that has been coming out through traditional published journals. This literature also needs to get inter-linked into this giant system. The inter-linkage of published literature suddenly gives this project a very different character.

Some of the numeric and remote sensing databases, or the molecular biology databases are public domain or quasi-public domain. These data are fairly accessible and are often viewed as the "collective property" of the scientific community. But when we start looking at the whole body of published literature, we have a much broader range of ownership, economic models, and control concerns regarding those linking processes. It is not just the journal literature. It is the abstracting and indexing databases that organize that journal literature and make it coherent. It is also various kinds of handbooks, encyclopedias, and other reference materials that are likely to evolve into very complex databases, which again are linked to these vocabularies, which in turn are linked to these sorts of data sets. These are going to be significant parts of the system. And we should not overlook the fact that some of the vocabularies themselves are commercial properties.

The Economics of Data

All of these databases operate under different economic and business models. Therefore, we need to think about the experience of the user who is wandering around in this system. For example, for a given project, you discover some federal-government geospatial data that you can download for free. You find some journal literature. You rummage around in an encyclopedia of plant types. Some of the material you have used is commercial, while some might be supported by a university or a scholarly society.

How will the economics of this work? You can take the obvious easy way out and say, "A screen will pop up and ask you to enter your credit card number." In fact, that is not the way things are working right now. If you look particularly at the university community as it moves to use literature that is increasingly in electronic form, you find site licenses that entitle members of university communities to utilize this material. You also find very difficult problems regarding authentication and access management and regarding how you demonstrate you are a member of one of those communities. We need to think rather carefully about the sort of economic and business models that are going to infuse the user's experience in traversing the infrastructure we are building, particularly as we look at it in the broadest sense–as a place to make decisions, to do analysis, to do research, to do work, to capture an understanding.

Ownership Issues

Ownership and data sources also are going to be extremely varied. There is federal content; there is international and foreign content; there is state, local, and regional content. And remember that while there is a pretty well-established policy about how federal data are managed, paid for, and distributed, there are 50 different state policies. In addition, cities are making up policies on their own, and some cities are looking to these as significant revenue sources. So, we have a very complex patchwork of policies about access and ownership that we need to consider.

Many data sources are non-governmental. Some of these are for-profit publishers or data suppliers with fairly well-established commercial motivations and business approaches, but we also need to recognize that universities, state libraries, museums, scholarly societies, and other groups are going to play a role here. It’s much less clear what sort of ownership and economic models are going to become prevalent here.

We also need to recognize that many researchers are going to want to be able to integrate private, unreleased data with the "public" (either free or commercially accessible) information resources, but they will need to do this in a way that protects their private, unreleased data. And these arrangements will need to be flexible enough to facilitate constantly changing inter-organizational research collaborations.

The Issue of Scope

There is also a question of whether digital libraries are not only places to passively house, archive, and preserve data, but also places to use data, to understand data, to reason about data. What role will digital libraries play in the publishing process? Are digital libraries within this infrastructure going to become ways for people to publish databases, to publish research reports? How will the system relate to the grey literature as this grey literature changes character and moves into electronic forms? And will people be able to contribute to and comment on material that is already housed there? As collaborative analysis activities take place more in the electronic environment, will digital libraries include the records of these activities? These also are important questions to consider.

Legal Questions

There also is the significant issue of legal publication of record. Much of the work in the environmental, biodiversity, and spatial data communities interpenetrates with a lot of legal issues about land use, about environmental impact, and so on. In addition, some of this material is going to be material of record, so we need to deal with questions about source integrity and authenticity. And, although this is properly a matter for another talk, we have to work out the assignment and funding of responsibility for archiving all of the data for the long term.

When we talk about metadata, there is a tendency to focus very heavily on description for discovery, because we have been so challenged by these problems of federation and interoperability. My view of metadata is considerably broader and also encompasses metadata for making assertions about source, provenance, authenticity, and integrity–the whole range of things that comprise terms and conditions of use. All of these are going to be quite critical as we go on to build systems to support biodiversity.

Conclusion

I hope the questions I have raised will be useful in informing our discussion about the sort of scope and direction of the biodiversity infrastructure (or system, if you prefer to leave the assignment of roles to infrastructure as an open question) and about the role digital libraries play in infrastructures. I have to stress that there is no right answer here–it is not as if somebody had the definition of digital library and we just have to look it up. It is a question, to some extent, of what we choose to label as a digital library and how we sort the cultural and economic baggage and traditions associated with the labels we choose. But I think that these sorts of questions–questions about where work happens, where analysis happens, about our expectations in terms of interoperability and federation, and how those relate to data management and archiving–are going to be very critical as we go forward.

Copyright © 2003 NFAIS. All rights reserved. No part of this product or service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written consent.
