Preprints of the
Metadiversity
Conference
Proceedings
Session 5: The Metadata Challenge for Libraries
Gazetteer and
Collection-Level Metadata Developments
LINDA HILL,
Research Specialist, Alexandria Digital Library, University
of California, Santa Barbara
http://www.alexandria.ucsb.edu
ABSTRACT
The Alexandria Digital
Library (ADL) includes both collections and digital
library services that focus on georeferenced
information. Georeferencing means that a key
attribute of the metadata for the collection
objects (COs) is geographic location as represented
spatially with latitude and longitude coordinates -
that is, the spatial representation of the
locations that the COs are about. Collection
objects include maps, remote sensing images, aerial
photographs, and other obviously geospatial
information; they also extend to texts, art
objects, music, etc. That is, the ADL approach is
designed to represent, find, evaluate, and retrieve
any information that can be georeferenced. Metadata
development and implementation are major components
of the ADL Project. This presentation will give an
overview of the ADL system architecture and then
focus on two metadata areas where we have made
major contributions and where action within the
biodiversity community is recommended: (1)
development and implementation of a Gazetteer
Content Standard and accompanying Thesaurus of
Feature Types and (2) implementation of
collection-level metadata that allows us to serve diverse
collections through description of inherent and
contextual collection data for the purposes of
client-middleware communication and user
documentation. Our definition of gazetteers is that
they are indexes of geographic names (i.e.,
place-names or feature names) that contain at least
three attributes per entry: name, coordinate
location, and category. With such a digital tool,
links can be made between direct geospatial
reference (i.e., coordinates) and indirect
reference (i.e., geographic names), and types of
places can be identified in a specified geographic
location.
The Alexandria Digital Library (ADL)
is a georeferenced digital library, one of the six Digital
Library Initiatives funded by the NSF, DARPA, and NASA. The
funding period has now ended, and ADL is in the process of
becoming an operational component of the UCSB library system
and the California Digital Library that will have its public
presence at the end of 1998 or the beginning of 1999.
Currently, access to ADL is limited to the University of
California campuses: we have limited computer equipment and
help staff and cannot open it up to everyone.
All kinds of information can be georeferenced with latitude
and longitude coordinates. You might think primarily about
maps, aerial photographs, and remote sensing images, but
text, specimen collections, music, people, and gazetteer
entries can also be georeferenced to a place that they are
about.
A description of the ADL client
interface will illustrate a user’s view of ADL. First, the
user is presented with a Map Window where a query area can
be drawn to indicate that area of the world in which the
user is interested. In the Search Window, the user is given
a choice of collections that can be searched and search
options such as Type of Item (e.g., aerial photograph),
Format (e.g., online image), and Topical Text (for freetext
searching). The search results are returned to the Workspace
Window. Each item is briefly described and its footprint
(area covered on the map) is shown in the Map Window. If a
thumbnail image of the item is available, it is also shown.
By highlighting listings in the Workspace Window or clicking
on footprints in the Map Window, the user can selectively
review interesting items and move the most interesting ones
to user-created folders. Online data can be directly
accessed and downloaded. The user can save the status of the
Workspace and reload it at a later time.
The ADL system architecture
supports the user interface. The system consists of three
levels: the database level, the middleware level, and the
client level. The middleware level, which includes access
control, translation, DB connection, and logging, is the heart
of the system. Multiple clients can be developed for it. On the
other side, the middleware interfaces with the database
level. There can be multiple collections, held locally or
remotely, and the collections can have their own structures.
The databases present query and retrieval views of the
collection objects to the middleware. The middleware
dynamically discovers which collections are available for
searching and how they can be searched, and presents these
to the user through the client. User-created queries are
fanned out by the middleware to the queryable parameters of
the collections through appropriate retrieval software. The
results are merged and presented back to the user.
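The fan-out and merge step described above can be sketched as
follows. This is a minimal illustration, not the ADL middleware
itself: the names Hit, fan_out, and the "supported" table are
hypothetical, and dropping query parameters that a collection
cannot answer is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Hit:
    collection_id: str
    title: str


# Hypothetical model: a collection is a function from a query to hits.
SearchFn = Callable[[Dict[str, str]], List[Hit]]


def fan_out(query: Dict[str, str],
            collections: Dict[str, SearchFn],
            supported: Dict[str, Set[str]]) -> List[Hit]:
    """Send the query to each collection and merge the results."""
    merged: List[Hit] = []
    for cid, search in collections.items():
        # Assumption: pass on only the parameters this collection can query.
        sub_query = {k: v for k, v in query.items() if k in supported[cid]}
        if not sub_query:
            continue  # nothing in the query applies to this collection
        merged.extend(search(sub_query))
    # Merge into one list with a stable, predictable order.
    return sorted(merged, key=lambda h: (h.collection_id, h.title))
```

A real middleware would also handle access control, translation
to each database's query language, and logging, which this sketch
leaves out.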
Another view of ADL structure is
the metadata view. Here we start with a data set that
represents its objects through full object metadata. ADL
does not specify what the metadata at this level needs to
be; it can be whatever is suitable for the collection
objects. In practice, creating metadata for the data is
often a labor-intensive step. Object level metadata is
mapped to ADL in three ways: (1) to the middleware search
buckets and some additional scan attributes; (2) to the
access report; and (3) to the full metadata report.
The search buckets provide a few
high-level search parameters designed to search across
diverse collections. This is somewhat like the Dublin Core
approach of identifying core elements for description but
the ADL search buckets are designed for searching. The ADL
search buckets are:
Location (latitude and longitude
coordinates)
Date (date of coverage, date of publication)
Type (controlled domain)
Format (controlled domain)
Topical–Freetext
Topical–Assigned Text (derived from controlled vocabularies)
Originator (author, publisher, etc.)
Identifier (ISBN, scene ID, etc.)
Buckets with controlled domains
have a limited set of values that are shown to the user for
selection.
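As an illustration of how object-level metadata might be mapped
to the search buckets listed above, the sketch below copies
values from a native record into bucket lists. The bucket
identifiers, the function name map_to_buckets, and the idea of a
per-collection field map are assumptions made for the example;
ADL's actual mapping mechanism is not described here.

```python
from typing import Dict, List

# Bucket names follow the list above; the identifiers are hypothetical.
BUCKETS = (
    "location", "date", "type", "format",
    "topical_freetext", "topical_assigned",
    "originator", "identifier",
)


def map_to_buckets(record: Dict[str, str],
                   field_map: Dict[str, str]) -> Dict[str, List[str]]:
    """Copy values from a native record into the search buckets.

    field_map maps a native field name to the bucket it feeds.
    """
    buckets: Dict[str, List[str]] = {name: [] for name in BUCKETS}
    for field, bucket in field_map.items():
        if bucket not in buckets:
            raise ValueError(f"unknown bucket: {bucket}")
        value = record.get(field)
        if value is not None:
            buckets[bucket].append(value)
    return buckets
```

Native fields that are not mentioned in the field map are simply
not searchable through the buckets, although they can still
appear in the full metadata report.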
The access report provides the
links to the actual data set if it is online, or to the
point of contact if it is offline. It also provides
information about any constraints for accessing or using the
data and sometimes links to related information.
The full metadata report is a
report containing attribute labels and values from the
object-level metadata. ADL has created a style for these
that provides a common look for the metadata from the
various underlying collections.
Metadata Developments:
Collection-Level Metadata
The key to accommodating
multiple collections, and multiple types of collections, in
this search environment is collection-level metadata, which
describes the collections and how they can be searched. The
collection-level metadata gives the title and the ID for the
collection, the search buckets populated, and the controlled
lists and controlled vocabularies associated with the
collections. It also gives a collection description of two
kinds:
Inherent data
Contextual description
Inherent-collection metadata can
be obtained from the collection itself and includes such
information as the number of items in the collection and the
types and formats of items in the collection. This
information can be visualized for the user by geographic and
temporal coverage.
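Because inherent metadata comes from the collection itself, it
can be computed rather than written by hand. The sketch below
assumes each item is a dictionary with hypothetical "type",
"format", and "year" keys; it is an illustration only.

```python
from collections import Counter
from typing import Any, Dict, List


def inherent_metadata(items: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a collection from its own items."""
    years = [item["year"] for item in items]
    return {
        "item_count": len(items),
        # How many items there are of each type and format.
        "types": dict(Counter(item["type"] for item in items)),
        "formats": dict(Counter(item["format"] for item in items)),
        # Temporal coverage; None for an empty collection.
        "year_range": (min(years), max(years)) if years else None,
    }
```

Geographic coverage could be summarized in the same way, for
example by counting items per map grid cell.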
Contextual metadata is provided
by the collection owner and includes such information as the
purpose and description of the collection, its frequency of
update, any constraints to its use, and contact information
for the responsible person.
ADL uses collection metadata for
two purposes: collection registration and user
documentation. Collections are made known to the ADL
middleware through an XML version of the collection
metadata. The middleware dynamically discovers which
collections are available for presentation to the user
through this method. This metadata also tells the middleware
which search buckets are active for any particular
collection and what the controlled domains are for the
associated buckets. All mappings that are necessary for
accessing the collection are contained in this XML
registration version. User documentation is an HTML version
of the collection metadata. This is displayed to the user on
request.
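To make the registration idea concrete, the sketch below reads a
small XML document and extracts the collection ID, title, active
search buckets, and controlled domains. The element and
attribute names are invented for the example and are not the ADL
registration schema.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Hypothetical registration document; not the actual ADL format.
REGISTRATION = """<collection id="example_collection">
  <title>Example Catalog</title>
  <bucket name="type">
    <value>aerial photograph</value>
    <value>map</value>
  </bucket>
  <bucket name="location"/>
</collection>"""


def read_registration(xml_text: str) -> Dict[str, Any]:
    """Return the ID, title, and active buckets of a collection."""
    root = ET.fromstring(xml_text)
    buckets = {}
    for bucket in root.findall("bucket"):
        # A bucket without values has no controlled domain.
        buckets[bucket.get("name")] = [v.text for v in bucket.findall("value")]
    return {
        "id": root.get("id"),
        "title": root.findtext("title"),
        "buckets": buckets,
    }
```

A client could use the returned bucket names to decide which
search options to show, and the value lists to fill selection
menus for the controlled domains.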
Because the variety of collections
is wide (indeed, it is difficult to agree on a definition of
"collection" in the first place), ADL developed and
implemented collection metadata to describe and register
whatever collections come along. It has been very successful
and provides us with a way to accommodate many more
collections in the future. We therefore recommend the
collection metadata approach to the biodiversity community.
It is the key to accessing a variety of collections, where a
"collection" is whatever someone decides to call a
collection. It is a way to capture both inherent and
contextual metadata for registration and user documentation
purposes.
Metadata Developments:
Gazetteer Metadata
The next ADL metadata
development I will present is the work we have done with
gazetteers. The word "gazetteers" is not familiar to
everyone. They can be described as dictionaries of named
geographic places. ADL further defines gazetteers to require
three minimum descriptive elements for each place: (1) a
name; (2) a location in latitude and longitude coordinates;
and (3) a type or category. One example is:
Name: Goleta
Type: populated place
Location: -119.83,34.44 (decimal degrees for longitude and
latitude)
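The three required elements can be expressed as a small record
type. This is a sketch only: the class and field names are
hypothetical, and the range check is added to catch the common
mistake of swapping longitude and latitude, since the example
above lists longitude first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    name: str
    feature_type: str
    lon: float  # decimal degrees, longitude first as in the example
    lat: float  # decimal degrees

    def __post_init__(self) -> None:
        # Reject coordinates outside the valid ranges.
        if not -180.0 <= self.lon <= 180.0:
            raise ValueError(f"longitude out of range: {self.lon}")
        if not -90.0 <= self.lat <= 90.0:
            raise ValueError(f"latitude out of range: {self.lat}")


goleta = GazetteerEntry("Goleta", "populated place", -119.83, 34.44)
```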
The following example
illustrates the value of such a gazetteer in a digital
library. A user has a "where is" type question: "Where is
Philadelphia?" The system returns a footprint for
Philadelphia and displays it on the map (this is a
simplified example, ignoring for the moment that there is
more than one Philadelphia in the world). Next the user
asks, "What rivers are in the Philadelphia area?" The system
knows the footprint of Philadelphia and it knows the
footprints of entries in the gazetteer of the type "rivers."
It can make a match of these footprints and return a list of
the rivers "in" the Philadelphia area. Next the user might
ask a question like, "What remote sensing images are there
of the Philadelphia area?" This is a search of the catalog
rather than of the gazetteer. The system can compare the
footprint of Philadelphia to the footprints of items of the
type "remote sensing images" in the catalog and return a
list of those whose footprints overlap the Philadelphia
area. This retrieval works not because the images are labeled
"Philadelphia" but because the match is made on the basis of
footprints. Using a place name to reach footprints in this way
is known as indirect georeferencing.
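The footprint matching in this example can be sketched with
bounding boxes. The code below is an illustration under
simplifying assumptions: footprints are axis-aligned boxes in
decimal degrees, boxes do not cross the 180-degree meridian, and
the names Box, overlaps, and find_by_footprint are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


def find_by_footprint(query: Box,
                      entries: List[Tuple[str, str, Box]],
                      feature_type: str) -> List[str]:
    """Names of entries of the given type whose footprint overlaps the query."""
    return [name for name, ftype, box in entries
            if ftype == feature_type and overlaps(query, box)]
```

The same function serves both questions in the example: run it
against gazetteer entries of type "rivers", or against catalog
items of type "remote sensing images", using the footprint
returned for the place name as the query box.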
In building ADL, we developed a
6-million-entry gazetteer by combining the two large U.S.
federal gazetteers from the U.S. Geological Survey (USGS)
and the National Imagery and Mapping Agency (NIMA). In the
process, we found out firsthand the difficulties of
combining gazetteer data from different sources. We found
out that there is no shared concept of how gazetteer
information is represented. We therefore developed a
Gazetteer Content Standard (GCS) and a Feature Type
Thesaurus (FTT) to provide type categories. We are in the
process of implementing both.
GCS provides for the
representation of names and variant names for places and
information about these names: the source or authority of
the name, the language, etymology, pronunciation, dates when
the name was/is used, and more. Each name is assigned one or
more type categories. If the place has a feature code (e.g.,
a FIPS code), it can be included. The location of the place
can be given by a point, bounding box, or polygonal
coordinate description. Features can be related to one
another; for example, one place "IsPartOf" another. Data such as
elevation or population can be given for a place, and links
can be made to other sources of information, such as a
city’s homepage. Temporal ranges can be given for the names
themselves, the footprints, the data, and the relationships.
Each entry, and each part of each entry, can be attributed
to a contributor and to a source.
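Relationships such as "IsPartOf" let a system walk from a place
to the larger places that contain it. The sketch below assumes,
for simplicity, that each place has at most one "IsPartOf"
parent stored in a dictionary; the function name and this
storage layout are hypothetical and not part of the GCS.

```python
from typing import Dict, List


def containing_places(place: str, is_part_of: Dict[str, str]) -> List[str]:
    """Follow IsPartOf links upward, nearest container first."""
    chain: List[str] = []
    seen = {place}
    current = place
    while current in is_part_of:
        current = is_part_of[current]
        if current in seen:
            # Guard against inconsistent data that loops back on itself.
            raise ValueError(f"cycle in IsPartOf links at {current}")
        seen.add(current)
        chain.append(current)
    return chain
```

For example, with links from Goleta to Santa Barbara County and
from Santa Barbara County to California, the function returns
the county followed by the state.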
There is no common set of
feature types for gazetteers and making different
categorization schemes work together is one of the most
difficult parts of combining data from various sources into
a new gazetteer. We have developed a thesaurus of feature
types that we are applying to our gazetteers and that we
hope will be adopted by others. It is based on the ANSI/NISO
Z39.19 standard for hierarchical thesauri designed for
information retrieval. It includes broader term/narrower term
relationships, synonymous terms, and related terms.
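One practical use of such a thesaurus is query expansion: a
search on a broad type also retrieves entries assigned to its
narrower types, and a synonym is first replaced by its preferred
term. The sketch below illustrates this with hypothetical data
structures; the terms used in any real run would come from the
thesaurus itself.

```python
from typing import Dict, List


def expand_term(term: str,
                narrower: Dict[str, List[str]],
                synonyms: Dict[str, str]) -> List[str]:
    """Return the preferred term plus all of its narrower terms."""
    preferred = synonyms.get(term, term)  # map a synonym to its preferred term
    found = set()
    stack = [preferred]
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited; also protects against loops
        found.add(current)
        stack.extend(narrower.get(current, []))
    return sorted(found)
```

A gazetteer search for a feature type would then match any entry
whose type is in the expanded list.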
Both the Gazetteer Content
Standard and the Feature Type Thesaurus are available
through my homepage:
http://www.alexandria.ucsb.edu/~lhill.
We are in the process
of converting a current version of the NIMA gazetteer to the
new Content Standard using the terms from the Feature Type
Thesaurus. We already have sets of bounding boxes for
countries and U.S. counties loaded as well as a set of
volcano sites. We have various other sets waiting for
conversion, including the GNIS from USGS. We are looking for
sets of gazetteer information that include polygon or
bounding box footprints to load. We are working on
extracting polygon footprints for places from digital map
products.
Georeferencing is an
identification key that can be applied to all types of
information (not to all information, but to all types of
information), including place names. Georeferencing is a
"natural bridge" across information types because latitude
and longitude referencing is universally understood. A
spatially referenced gazetteer is a powerful component of a
georeference system because it adds the dimension of
indirect spatial referencing through the use of place names.
We therefore recommend to the
biodiversity community that standard practices for gazetteer
development and use be adopted so that geographical site
descriptions developed by one subgroup of the community can
be shared and used by other subgroups and with other
information operations as well.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.