Preprints of the Metadiversity Conference Proceedings

  Session 5: The Metadata Challenge for Libraries

Gazetteer and Collection-Level Metadata Developments

LINDA HILL, Research Specialist, Alexandria Digital Library, University of California, Santa Barbara http://www.alexandria.ucsb.edu

ABSTRACT

The Alexandria Digital Library (ADL) includes both collections and digital library services that focus on georeferenced information. Georeferencing means that a key attribute of the metadata for the collection objects (COs) is geographic location as represented spatially with latitude and longitude coordinates - that is, the spatial representation of the locations that the COs are about. Collection objects include maps, remote sensing images, aerial photographs, and other obviously geospatial information; they also extend to texts, art objects, music, etc. That is, the ADL approach is designed to represent, find, evaluate, and retrieve any information that can be georeferenced. Metadata development and implementation are major components of the ADL Project. This presentation will give an overview of the ADL system architecture and then focus on two metadata areas where we have made major contributions and where action within the biodiversity community is recommended: (1) development and implementation of a Gazetteer Content Standard and accompanying Thesaurus of Feature Types and (2) implementation of Collection Metadata that allows us to serve diverse collections through description of inherent and contextual collection data for the purposes of client-middleware communication and user documentation. We define gazetteers as indexes of geographic names (i.e., place-names or feature names) that contain at least three attributes per entry: name, coordinate location, and category. With such a digital tool, links can be made between direct geospatial reference (i.e., coordinates) and indirect reference (i.e., geographic names), and types of places can be identified in a specified geographic location.

The Alexandria Digital Library (ADL) is a georeferenced digital library, one of the six Digital Library Initiatives funded by the NSF, DARPA, and NASA. The funding period has now ended, and ADL is in the process of becoming an operational component of the UCSB library system and the California Digital Library that will have its public presence at the end of 1998 or the beginning of 1999. Currently, access to ADL is limited to the University of California campuses. We have limited computer equipment and help staff and cannot yet open it up to everyone.

All kinds of information can be georeferenced with latitude and longitude coordinates. You might think primarily about maps, aerial photographs, and remote sensing images, but text, specimen collections, music, people, and gazetteer entries can also be georeferenced to a place that they are about.

A description of the ADL client interface will illustrate a user’s view of ADL. First, the user is presented with a Map Window where a query area can be drawn to indicate that area of the world in which the user is interested. In the Search Window, the user is given a choice of collections that can be searched and search options such as Type of Item (e.g., aerial photograph), Format (e.g., online image), and Topical Text (for freetext searching). The search results are returned to the Workspace Window. Each item is briefly described and its footprint (area covered on the map) is shown in the Map Window. If a thumbnail image of the item is available, it is also shown. By highlighting listings in the Workspace Window or clicking on footprints in the Map Window, the user can selectively review interesting items and move the most interesting ones to user-created folders. Online data can be directly accessed and downloaded. The user can save the status of the Workspace and reload it at a later time.

The ADL system architecture supports the user interface. The system consists of three levels: the database level, the middleware level, and the client level. The middleware level–including access control, translation, DB connection, and logging–is the heart of the system. Multiple clients can be developed for it. On the other side, the middleware interfaces with the database level. There can be multiple collections, held locally or remotely, and the collections can have their own structures. The databases present query and retrieval views of the collection objects to the middleware. The middleware dynamically discovers which collections are available for searching and how they can be searched, and presents these to the user through the client. User-created queries are fanned out by the middleware to the queryable parameters of the collections through appropriate retrieval software. The results are merged and presented back to the user.
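The fan-out and merge behavior described above can be sketched roughly as follows. This is a minimal illustration, not ADL's actual middleware: the class names, the `searchable` set, and the `search` callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical; the text does not show ADL's real interfaces.

@dataclass
class Collection:
    name: str
    searchable: set      # query parameters this collection can be searched on
    search: Callable     # takes a dict of parameters, returns item identifiers

class Middleware:
    def __init__(self):
        self.collections = []

    def register(self, collection):
        self.collections.append(collection)

    def query(self, params):
        """Fan the query out, passing each collection only the parameters it supports."""
        merged = []
        for c in self.collections:
            usable = {k: v for k, v in params.items() if k in c.searchable}
            if not usable:
                continue  # nothing in this query applies to the collection
            for item in c.search(usable):
                merged.append((c.name, item))
        return merged

mw = Middleware()
mw.register(Collection("maps", {"location", "type"}, lambda q: ["map-1"]))
mw.register(Collection("texts", {"topic"}, lambda q: ["text-9"]))

# Only the "maps" collection is searchable by location, so only it is queried.
results = mw.query({"location": (-120.0, 34.0, -119.0, 35.0)})
```

Keeping the per-collection search logic behind a single callable is one way to let collections keep their own structures while the middleware stays generic.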

Another view of ADL structure is the metadata view. Here we start with a data set that represents its objects through full object metadata. ADL does not specify what the metadata at this level needs to be; it can be whatever is suitable for the collection objects. In practice, creating metadata for the data is often a labor-intensive step. Object level metadata is mapped to ADL in three ways: (1) to the middleware search buckets and some additional scan attributes; (2) to the access report; and (3) to the full metadata report.

The search buckets provide a few high-level search parameters designed to search across diverse collections. This is somewhat like the Dublin Core approach of identifying core elements for description but the ADL search buckets are designed for searching. The ADL search buckets are:

Location (latitude and longitude coordinates)
Date (date of coverage, date of publication)
Type (controlled domain)
Format (controlled domain)
Topical–Freetext
Topical–Assigned Text (derived from controlled vocabularies)
Originator (author, publisher, etc.)
Identifier (ISBN, scene ID, etc.)

Buckets with controlled domains have a limited set of values that are shown to the user for selection.
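As a rough illustration of how buckets with controlled domains might constrain a query, the sketch below checks a query against the bucket list above. The identifier spellings and the domain values are assumptions chosen for the example; only "aerial photograph" and "online image" come from the text.

```python
# Bucket identifiers mirror the list above; the spellings are hypothetical.
BUCKETS = [
    "location", "date", "type", "format",
    "topical_freetext", "topical_assigned", "originator", "identifier",
]

# Illustrative controlled domains, not ADL's actual value lists.
CONTROLLED_DOMAINS = {
    "type": {"aerial photograph", "map", "remote sensing image"},
    "format": {"online image"},
}

def validate_query(query):
    """Return a list of problems found in a bucket-based query (empty if none)."""
    problems = []
    for bucket, value in query.items():
        if bucket not in BUCKETS:
            problems.append(f"unknown bucket: {bucket}")
        elif bucket in CONTROLLED_DOMAINS and value not in CONTROLLED_DOMAINS[bucket]:
            problems.append(f"{value!r} is not in the controlled domain for {bucket}")
    return problems

ok = validate_query({"type": "aerial photograph", "topical_freetext": "coastline"})
bad = validate_query({"type": "novel", "color": "blue"})
```

A client would normally avoid the second case altogether by offering the controlled values for selection rather than accepting free input.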

The access report provides the links to the actual data set if it is online, or to the point of contact if it is offline. It also provides information about any constraints for accessing or using the data and sometimes links to related information.

The full metadata report is a report containing attribute labels and values from the object-level metadata. ADL has created a style for these that provides a common look for the metadata from the various underlying collections.

Metadata Developments: Collection-Level Metadata

The key to accommodating multiple collections, and multiple types of collections, in this search environment is collection-level metadata, which describes the collections and how they can be searched. The collection-level metadata gives the title and the ID for the collection, the search buckets populated, and the controlled lists and controlled vocabularies associated with the collections. It also gives a collection description of two kinds:

Inherent data
Contextual description

Inherent-collection metadata can be obtained from the collection itself and includes such information as the number of items in the collection and the types and formats of items in the collections. This information can be visualized by geographic and temporal coverage to show to the user.
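Because inherent metadata is derived from the collection itself, it can in principle be computed by summarizing the item records. The sketch below assumes a hypothetical record layout (`type`, `format`, `bbox`, `year`) and made-up sample records; it is not ADL's procedure.

```python
from collections import Counter

def inherent_metadata(items):
    """Summarize a non-empty collection from its own records (hypothetical layout)."""
    boxes = [i["bbox"] for i in items]   # each box is (west, south, east, north)
    years = [i["year"] for i in items]
    return {
        "item_count": len(items),
        "types": dict(Counter(i["type"] for i in items)),
        "formats": dict(Counter(i["format"] for i in items)),
        # Overall extent; this simple min/max ignores boxes crossing the 180° meridian.
        "spatial_extent": (
            min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes),
        ),
        "temporal_extent": (min(years), max(years)),
    }

# Made-up sample records for illustration only.
items = [
    {"type": "map", "format": "online image",
     "bbox": (-120.0, 34.0, -119.0, 35.0), "year": 1965},
    {"type": "aerial photograph", "format": "online image",
     "bbox": (-119.5, 33.5, -118.0, 34.5), "year": 1992},
]
summary = inherent_metadata(items)
```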

Contextual metadata is provided by the collection owner and includes such information as the purpose and description of the collection, its frequency of update, any constraints to its use, and contact information for the responsible person.

ADL uses collection metadata for two purposes: collection registration and user documentation. Collections are made known to the ADL middleware through an XML version of the collection metadata. The middleware dynamically discovers which collections are available for presentation to the user through this method. This metadata also tells the middleware which search buckets are active for any particular collection and what the controlled domains are for the associated buckets. All mappings that are necessary for accessing the collection are contained in this XML registration version. User documentation is an HTML version of the collection metadata. This is displayed to the user on request.
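To make the registration idea concrete, the sketch below parses a small XML record into the information the middleware needs: the collection ID and title, the active search buckets, and the controlled domain for each bucket. The element and attribute names, and the sample collection, are invented for this example; the text does not give ADL's actual registration schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical registration record; not ADL's real schema.
REGISTRATION = """
<collection id="example-catalog">
  <title>Example Catalog</title>
  <bucket name="location"/>
  <bucket name="type">
    <value>aerial photograph</value>
    <value>map</value>
  </bucket>
</collection>
"""

def parse_registration(xml_text):
    """Extract ID, title, active buckets, and any controlled domain values."""
    root = ET.fromstring(xml_text.strip())
    buckets = {}
    for b in root.findall("bucket"):
        # An empty list means the bucket is active but has no controlled domain.
        buckets[b.get("name")] = [v.text for v in b.findall("value")]
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "buckets": buckets,
    }

registered = parse_registration(REGISTRATION)
```

The same parsed structure could also drive generation of the HTML documentation view, so that both versions stay in step with one source record.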

Since the variety of collections is wide–indeed it is difficult to agree on the definition of "collection" in the first place–ADL developed and implemented collection metadata to describe and register whatever collections come along. It has been very successful and provides us with a way to accommodate many more collections in the future. We therefore recommend the collection metadata approach to the biodiversity community. It is the key to accessing a variety of collections, where a "collection" is whatever someone decides to call a collection. It is a way to capture both inherent and contextual metadata for registration and user documentation purposes.

Metadata Developments: Gazetteer Metadata

The next ADL metadata development I will present is the work we have done with gazetteers. The word "gazetteers" is not familiar to everyone. They can be described as dictionaries of named geographic places. ADL further defines gazetteers to require three minimum descriptive elements for each place: (1) a name; (2) a location in latitude and longitude coordinates; and (3) a type or category. One example is:

Name: Goleta
Type: populated place
Location: -119.83,34.44 (decimal degrees for longitude and latitude)

The following example illustrates the value of such a gazetteer in a digital library. A user has a "where is" type question: "Where is Philadelphia?" The system returns a footprint for Philadelphia and displays it on the map (this is a simplified example, ignoring for the moment that there is more than one Philadelphia in the world). Next the user asks, "What rivers are in the Philadelphia area?" The system knows the footprint of Philadelphia and it knows the footprints of entries in the gazetteer of the type "rivers." It can make a match of these footprints and return a list of the rivers "in" the Philadelphia area. Next the user might ask a question like, "What remote sensing images are there of the Philadelphia area?" This is a search of the catalog rather than of the gazetteer. The system can compare the footprint of Philadelphia to the footprints of items of the type "remote sensing images" in the catalog and return a list of those whose footprints overlap the Philadelphia area. This retrieval is possible not because the images are labeled with "Philadelphia" but because the match can be made on the basis of footprints. This use of footprints is known as indirect georeferencing.
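The footprint matching in this example can be sketched with a simple bounding-box overlap test. The function names and the record layout are assumptions, and the coordinates below are rough, made-up boxes for illustration only, not authoritative footprints.

```python
def overlaps(a, b):
    """True if two (west, south, east, north) boxes in decimal degrees intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def find_by_footprint(entries, area, feature_type):
    """Names of entries of the given type whose footprint overlaps the query area."""
    return [
        e["name"] for e in entries
        if e["type"] == feature_type and overlaps(e["bbox"], area)
    ]

# Rough illustrative boxes, not real gazetteer data.
philadelphia = (-75.28, 39.87, -74.96, 40.14)
gazetteer = [
    {"name": "Schuylkill River", "type": "rivers", "bbox": (-76.2, 39.88, -75.18, 40.8)},
    {"name": "Delaware River", "type": "rivers", "bbox": (-75.6, 38.8, -74.7, 42.1)},
    {"name": "Santa Ynez River", "type": "rivers", "bbox": (-120.6, 34.5, -119.5, 34.8)},
    {"name": "Goleta", "type": "populated place", "bbox": (-119.83, 34.44, -119.83, 34.44)},
]

rivers_nearby = find_by_footprint(gazetteer, philadelphia, "rivers")
```

The same `overlaps` test applies unchanged to catalog items such as remote sensing images, which is what lets a place name retrieve items that never mention that name.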

In building ADL, we developed a 6-million-entry gazetteer by combining the two large U.S. federal gazetteers from the U.S. Geological Survey (USGS) and the National Imagery and Mapping Agency (NIMA). In the process, we found out firsthand the difficulties of combining gazetteer data from different sources. We found out that there is no shared concept of how gazetteer information is represented. We therefore developed a Gazetteer Content Standard (GCS) and a Feature Type Thesaurus (FTT) to provide type categories. We are in the process of implementing them.

GCS provides for the representation of names and variant names for places and information about these names: the source or authority of the name, the language, etymology, pronunciation, dates when the name was/is used, and more. Each name is assigned one or more type categories. If the place has a feature code (e.g., a FIPS code), it can be included. The location of the place can be given by a point, bounding box, or polygonal coordinate description. Features can be related to one another–e.g., one place "IsPartOf" another. Data such as elevation or population can be given for a place, and links can be made to other sources of information, such as a city’s homepage. Temporal ranges can be given for the names themselves, the footprints, the data, and the relationships. Each entry, and each part of each entry, can be attributed to a contributor and to a source.
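One way to picture an entry with these parts is as a small record type. The sketch below is only an illustration under assumed field names; the real element names and structure are defined in the Gazetteer Content Standard itself, and many elements (etymology, temporal ranges, contributors) are omitted here.

```python
from dataclasses import dataclass, field

# Field names are hypothetical, not the element names used by the GCS.

@dataclass
class PlaceName:
    text: str
    primary: bool = True     # False for a variant name
    language: str = ""
    source: str = ""

@dataclass
class GazetteerEntry:
    names: list
    feature_types: list
    footprint: tuple         # a (longitude, latitude) point here; boxes and polygons are also possible
    feature_codes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)   # (relation, target name) pairs
    links: list = field(default_factory=list)

    def is_minimal(self):
        """Check the three minimum elements: a name, a location, and a type."""
        return bool(self.names) and bool(self.feature_types) and len(self.footprint) >= 2

goleta = GazetteerEntry(
    names=[PlaceName("Goleta")],
    feature_types=["populated place"],
    footprint=(-119.83, 34.44),
    relationships=[("IsPartOf", "Santa Barbara County")],
)
```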

There is no common set of feature types for gazetteers, and making different categorization schemes work together is one of the most difficult parts of combining data from various sources into a new gazetteer. We have developed a thesaurus of feature types that we are applying to our gazetteers and that we hope will be adopted by others. It is based on the ANSI/NISO Z39.19 standard for hierarchical thesauri designed for information retrieval. It includes broader term/narrower term relationships, synonymous terms, and related terms.
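The value of hierarchical and synonym relationships for retrieval can be shown with a small sketch: a search term is first resolved to its preferred term and then expanded to all narrower terms. The terms and relationships below are invented for illustration and are not taken from the Feature Type Thesaurus.

```python
# Hypothetical thesaurus fragment; not the actual Feature Type Thesaurus.
NARROWER = {
    "hydrographic features": ["lakes", "streams"],
    "streams": ["rivers"],
}
USE = {                      # non-preferred (synonymous) term -> preferred term
    "creeks": "streams",
    "brooks": "streams",
}

def expand(term):
    """Resolve a term to its preferred form, then collect it and all narrower terms."""
    preferred = USE.get(term, term)
    result = [preferred]
    for child in NARROWER.get(preferred, []):
        result.extend(expand(child))
    return result

all_hydro = expand("hydrographic features")
from_synonym = expand("creeks")
```

With this kind of expansion, a search on a broad category also retrieves entries typed with more specific terms, and entries from sources that use different words for the same category can be found together.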

Both the Gazetteer Content Standard and the Feature Type Thesaurus are available through my homepage: http://www.alexandria.ucsb.edu/~lhill.

We are converting a current version of the NIMA gazetteer to the new Content Standard using the terms from the Feature Type Thesaurus. We already have sets of bounding boxes for countries and U.S. counties loaded as well as a set of volcano sites. We have various other sets waiting for conversion, including the GNIS from USGS. We are looking for sets of gazetteer information that include polygon or bounding box footprints to load. We are also working on extracting polygon footprints for places from digital map products.

Georeferencing is an identification key that can be applied to all types of information–not all information, but to all types of information including place names. Georeferencing is a "natural bridge" across information types because latitude and longitude referencing is universally understood. A spatially referenced gazetteer is a powerful component of a georeference system because it adds the dimension of indirect spatial referencing through the use of place names.

We therefore recommend to the biodiversity community that standard practices for gazetteer development and use be adopted so that geographical site descriptions developed by one subgroup of the community can be shared and used by other subgroups and with other information operations as well.


Copyright © 2003 NFAIS. All rights reserved. No part of this product or service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written consent.
