Preprints of the
Metadiversity
Conference
Proceedings
Session 1: The Nation’s
Call to Action
The Metadata Landscape:
Conventions for Semantics, Syntax, and Structure in the
Internet Commons
STUART L. WEIBEL,
Senior Research Scientist, OCLC
ABSTRACT
The Internet has
brought previously distinct communities into closer
contact in the Internet Commons. Effective resource
discovery in this global information environment
requires international, cross-disciplinary
conventions for the creation, management, and
exchange of resource description information. The
Dublin Core and the Resource Description Framework
provide two of the foundation building blocks
necessary to support a resource description
infrastructure of sufficient power and
extensibility to meet the needs of the digital
information age.
This is a daunting task, because
I am an interloper here. I am not part of your community,
but I used to be. I actually got my undergraduate degree in
biology and, in a previous life, was a pharmacologist. I
actually taught biology for a while. That is about the
extent of my qualifications to talk to you this morning
except to say that yes, indeed, I have been working in the
area of metadata for the past two years. What I would like
to talk to you about today are three aspects of metadata and
why metadata might be relevant to you in your community.
First I will discuss the motivation for developing new
conventions for resource description and, in particular,
resource descriptions about electronic resources. Secondly,
I would like to tell you a little bit about the Dublin Core
metadata initiative, which is the development of semantics
for resource description on the Net. And finally I would
like to tell you why you really don't need to care about the
Dublin Core, because–whether you like it or not–there are in
fact some other things that can help you, irrespective of
your particular choice of a metadata standard.
What Is Metadata?
I bet that everybody in this
room already knows the definition of metadata, or you all
probably would not be here now. It is such a popular topic
in so many venues that it is really hard to avoid the
definition, but the standard definition is "data about
data." I would like to modify that definition and say that,
in fact, metadata is "structured data about data." Now,
before I get off this topic, I want to point out that there
is a strong temptation for people to say, we might need
metadata about other metadata–does that make it "metametadata"?
In fact, I think that if you use the term "metametadata,"
you are barking up the wrong tree. It is a failure to
understand what metadata is really all about. Metadata is a
relationship. You know, one person's metadata is just
another person's data, so if you are talking about it as the
object, you don't have to say "metametadata"–it is just
"data." So don't be confused by that issue any more than I
have managed to confuse you.
Resource Description
Communities
One of the phrases important to
discussions of this kind is the phrase "resource description
community." I would like to suggest that a resource
description community is any group of institutions and
people that is characterized by a common understanding of
three things: semantics, structure, and syntax. The
community I come from–the library community–has had these
common understandings for really 30 or more years now. We
call them MARC and AACR2: a format for our element sets and
the rules for populating them. That is really all they are, a way
to fill in records and pass them around. So, if all libraries want to
do is talk to themselves, we know how to do that, and we
have known how for a long time. We use MARC cataloging, and
the rules that we use to fill those MARC fields
are the Anglo-American Cataloguing Rules.
Living in the Internet
Commons
But we live now in what I like
to refer to as the Internet Commons. This is one of the
important metaphors I will bring to you today. The notion of
the Internet Commons means that we, in fact, all live in
that little box that is now sitting on top of our desks. If
we do not live there, then our users live there, and they
want to get to us and to our information through the screens
on their desks. Period. They don't want to walk to your
library. They don't want to walk to your data repository.
They don't want to go to the nearest store. They want it to
be available on their desks. That is what I mean by the
Internet Commons.
So now geospatial repositories
and museums and libraries are all forced into this common
box that I call the Internet Commons. And now our own little
communities have to learn how to speak with one another,
because we do not have those shared conventions about
syntax, structure, and semantics–we have to develop them all
anew.
In addition to that, there are
lots of scientific databases that you have in your
laboratories and that you might want to make available and
visible to other communities. There also are 14-year-old
boys and girls doing work on the Internet. And you know,
some of those 14-year-olds are going to eventually win Nobel
prizes, and we would like to be able to find metadata about
what they did back when they were 14 years old. This kind of
information also is going to be important.
Also, people are doing commerce
on the Internet, so we want to be able to find information
about commerce as well. You name it and there will be people
who will want to provide metadata for it. I am, in fact,
suggesting to you that it will be useful to be able to have
these common conventions so that you can find even those
kinds of information.
Even when you are doing
scientific data searches, one often hears this
complaint about our community: We don't talk to each other.
I had a lunch conversation yesterday at which someone
remarked that the ornithologists don't talk to the
ichthyologists and the ichthyologists don't talk to the
herpetologists. Well, we need to be able to talk to one
another more effectively. We need to be able to share
information. So, developing those conventions is, I think,
important for a lot of us.
Three Levels of
Interoperability
Semantic Interoperability.
Semantic interoperability is achieved through agreements
about content description standards. The Dublin Core is an
example of one of those. It is a new one. AACR2 is an old
one. TEI, very popular in the humanities, is a relatively
new one. The Federal Geographic Data Committee (FGDC) standard is one
that is popular among many of the communities represented
here. In other words, you name it and there is somebody who
has a content description standard for it. There are a
zillion of these things.
Structural Interoperability.
Structural interoperability is another level of
interoperability that we need to be able to define. I am
going to be talking about mechanisms for supporting
structural interoperability later. But basically it is going
to be about the Resource Description Framework (RDF), which
is a data model for specifying semantic schemas in a way
that they can be shared.
Syntactic Interoperability.
Finally, we need syntactic interoperability. This is the
easiest interoperability to understand. It just means that
we want to be able to mark up our data in a similar fashion
so we can share the data and so that our machines can
understand and take the data apart in sensible ways. The
kind of syntactic interoperability that I will be talking
about later on is supported by something called eXtensible
Markup Language (XML)–a markup idiom for structured data on
the web. If you are unsure as to whether this is going to be
an important standard in the future because it is relatively
new, I will tell you just one salient fact: Microsoft has
publicly announced that its future versions of Word are
going to be marked up in XML. And if the future versions of
Word and also the future versions of Excel and Access and
all of these other Microsoft products are marked up in XML,
it is because they need to be able to exchange data. I think
that is enough said about whether XML is going to be
important to us in the future. Whether you like it or not,
Microsoft makes a lot of the decisions for us.
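To make the idea of shared markup concrete, here is a minimal sketch in Python of a machine taking a small XML record apart. The record and its element names (record, title, creator, date) are invented for illustration and do not come from any published schema; the only point is that agreed markup lets a program find each field without guessing.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these element names are illustrative assumptions,
# not part of any published metadata schema.
RECORD = """
<record>
  <title>Birds of Ohio</title>
  <creator>A. Example</creator>
  <date>1998-11-10</date>
</record>
"""

def read_record(xml_text):
    """Return a dict mapping each child element name to its text."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}

fields = read_record(RECORD)
print(fields["title"])
```

Any two parties that agree on this markup can exchange such records, whatever software produced them.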
The Dublin Core Metadata
Workshop Series
I would like to talk a little
bit about the metadata workshop series that I have been
involved in for the last several years. It is called the
Dublin Core because it began at a workshop in Dublin, Ohio,
where I worked three and a half years ago. These workshops
were initially called simply to answer the question of how
we could improve resource discovery on the Web. We were
looking for simple resource description semantics.
(Remember, at the time when we originated the series there
were all of 500,000 individually addressable items on the
web, compared to today’s 500 million and growing.) The goal
then was to set up an interdisciplinary consensus about a
core element set for resource discovery. It was very
important that it be interdisciplinary–not just librarians,
not just archivists, not just museum people, but a broad
range of content experts and disciplines for resource
discovery for electronic information. This was our starting
point, so we wanted it to be simple and intuitive. We wanted
it to be cross-disciplinary. We certainly wanted it to be
international–after all, we were not talking about the Ohio
Wide Web or the U.S. Wide Web, but the World Wide Web. We
also wanted it to be flexible enough that it could be
applied to a broad diversity of problems and a broad
diversity of complexity as well.
Characteristics of the Dublin
Core Metadata Element Set. The central characteristics of
what was developed at that workshop have been elaborated
since.
There are 15 elements. They are
descriptive metadata for resource discovery. Half of these
elements are the kinds of things you would expect to see in
a catalog card, so this is the kind of simple metaphor for
the Dublin Core. It is a catalog card for resource
description. But just as the catalog card does not hold all
of the information that libraries keep about resources,
there are additional elements that provide you with the
ability to add richness. The 15 elements in the Dublin Core
Metadata Element Set are:
1. Title
2. Creator
3. Keywords
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights
All elements are optional. You
see something you don't like, don't use it.
All elements are repeatable.
The element set should be
extensible, a starting place for richer description. Fifteen
elements will not provide all the richness that all of us
want in ultimate metadata element sets. So we want to be
able to extend it, to enrich it in a variety of ways.
The element set should be
interdisciplinary.
The element set should be
international. We now have translations of the Dublin Core
into 20 different languages and there are new ones appearing
on a regular basis.
Let's talk about extensibility
for a moment. How can you take this catalogue card and, in
fact, provide within it enough flexibility to build in the
richness that you need to support much more sophisticated
metadata applications? There are a couple of kinds of
extensibility that I would like to talk about. The first
metaphor that I will offer you is the Ukrainian doll model.
You take the top off one doll and there is another doll
inside, and another doll inside that one, and so on. In
other words, there is a substructure. Without a
substructure, interoperability is not supported very well.
The idea here is that if you take a basic element called
Creator, you could just plop some stuff in there. For
example, you could plop in an unstructured name. But that is
not going to support interoperability very well. So, you
want to add some additional substructure to that
element–perhaps a given name and a surname. You might also
want to have some information about affiliation there
because one of the ways you find people and resources is to
know with what organizations they are connected. In
addition, you might want a telephone number or e-mail
address or something like that. So, you are basically
unpacking this structure and finding that within it there
are some well-defined additional structures that support the
needs that you have in your particular database. That is one
kind of extensibility.
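The nesting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (given_name, surname, affiliation, email) are hypothetical and are not defined by the Dublin Core. The contrast is between an unstructured string and a Creator that has been unpacked into parts a machine can use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Creator:
    # Hypothetical substructure; these field names are assumptions.
    given_name: str
    surname: str
    affiliation: Optional[str] = None
    email: Optional[str] = None

    def sort_key(self) -> str:
        """Surname-first form, which an unstructured string cannot reliably give."""
        return f"{self.surname}, {self.given_name}"

# The "just plop some stuff in there" version: one opaque string.
unstructured = "Stuart L. Weibel"

# The unpacked version: same person, but each part is addressable.
structured = Creator(given_name="Stuart L.", surname="Weibel", affiliation="OCLC")

print(structured.sort_key())
```

With the structured form, a program can sort by surname or search by affiliation without having to guess where one part of the name ends and the next begins.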
The second, and I think perhaps
the more important, kind of extensibility is what I refer to
as the Lego™ metaphor–modular extensibility. Let’s say you
want additional elements to support local or disciplinary
specific requirements. In addition, you want them to be
complementary–that is to say, you want them to be able to
fit together. So, you might have a block of metadata that we
call description metadata, such as the Dublin Core, but you
also want species distribution metadata. This morning, it
was stated that there are something like 750 million
specimens in natural history museums around the world.
Wouldn't it be great to have unified databases that would
allow those sorts of things to snap together, Lego™-like,
with description metadata. I like the Lego™ metaphor,
because it really has a lot of richness to it. Legos™ are
child's play–except that they are not child's play at all.
It’s true our kids play with them, but there are some
interesting things about them. One is that there are lots of
different kinds of Legos™. There are the Jacques Cousteau
Undersea Legos™. And there are the Astronaut Legos™. And
there are the Medieval Knight Legos™. The amazing thing is
that they all snap together and work together–they
interoperate.
Now I don’t exactly know the
semantics of mixing Jacques Cousteau and medieval castles,
but my 12-year-old does. And one of the things we want about
the metadata environment is not to have to anticipate what
the semantics are going to be in the future. We want people
to be able to invent new semantics, the things that we
haven't thought of or maybe don't even think
are a good idea, but that might be important in the
environments and for the tasks and problems in the future.
That is one point about Legos™. Another is that your kids’
Legos™ can interoperate with the Legos™ that you played with
30 years ago. That is a very impressive degree of
interoperability, and it comes at the cost of very highly
engineered products. Legos™ are manufactured to tolerances
that approach those of the internal combustion engine. So yes, they
are child's play. But they are child's play only because
they are easy to snap together from simple components into
much more elaborate components. They interoperate because
somebody has put a lot of thought into engineering those
things so that they fit together right (and will continue to
fit together right). This is the kind of interoperable
metadata architecture we are trying to develop for the
future.
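One way to picture modular extensibility is as separate blocks of metadata, each kept under its own name space so that they can be combined without colliding. The Python sketch below is an assumption-laden illustration: the prefixes dc and dist, the function snap_together, and all the field names and values are hypothetical.

```python
def snap_together(**blocks):
    """Combine metadata blocks, prefixing each key with its block's name space."""
    record = {}
    for namespace, block in blocks.items():
        for key, value in block.items():
            record[f"{namespace}:{key}"] = value
    return record

# Two independently designed blocks (hypothetical fields and values).
description = {"title": "Specimen 42", "date": "1998-11-10"}
distribution = {"region": "Lake Erie basin", "date": "1997-06"}

record = snap_together(dc=description, dist=distribution)

# Both blocks define "date", but the prefixes keep the two meanings apart.
print(record["dc:date"], record["dist:date"])
```

Neither block had to know about the other in advance, which is the property the Lego™ metaphor is after.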
What does this sort of
extensibility mean to you, to the scientific data
communities? You can think of the Dublin Core as a semantic
framework–a set of top-level descriptors. These high-level
descriptors can be used to describe data sets in ways that
allow you to find things in a relatively straightforward
manner. In addition to that you can use domain-specific
schemes for further precision, so that a collection of four
billion objects might be aggregated according to a
particular set of standards and specifications regarding how
they are encoded. You can have your own schemes that refine
the semantics of subjects, of formats, of relations, and of
coverage. You can use controlled vocabularies and thesauri,
name spaces, and coding rules to make your metadata very
specific.
Someone was also telling me
yesterday about a database that was developed in Europe that
has the same set of descriptors in 10 different languages.
That is a very important and valuable thing to have in an
international community. And you can apply such a database
in something like the Dublin Core and Resource Description
Framework (RDF).
The Resource Description
Framework (RDF)
Let me talk to you a little bit
about the Resource Description Framework (RDF), because RDF
can provide you with the kind of architecture that allows
you to snap different components together in a modular way.
RDF is a World Wide Web Consortium (W3C) initiative. The RDF
group is a formal working group under the W3C, and it is
intended to develop conventions to support interoperability
among applications that exchange metadata, not only among
people, but among machines as well. The syntax is expressed
in XML, which I mentioned a little bit earlier. RDF provides
the kind of architecture that will allow stakeholders to
define the semantics, whether it is Dublin Core or Global
Information Locator Service (GILS) or AACR2 or FGDC–you name
it, the stakeholders get to do it. That is a very important
aspect of this. RDF has been proposed, and it is in "Last
Call" now. If it passes through Last Call, my understanding
is that it will become an officially proposed recommendation
in the W3C.
The reason for RDF’s importance
is that it provides a data model–a data model that is very
flexible and will allow us to do lots of very good things
with it (see previous page). In its very simplest form, the
RDF data model is nodes connected by named properties or
"arcs," so it is an arc and node model. You have a little
thing called R1 on one side. P1 is a property pointing to
R2. That is really all it is. It's very simple–but, of
course, it is not that simple because you can do lots of
very flexible things with it. The simplest thing you might
want to do is a terminal node string. So you can say that R1
has a named property and the value of that named property is
foo. You can hook these things together in graphs of
arbitrary complexity that become RDF assertions or RDF
statements that are really statements about that resource.
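The arc and node model just described can be sketched as a list of three-part statements. This Python sketch is not an RDF library and makes simplifying assumptions: the graph has no cycles, each resource has at most one value per property, and the names P2, P3, and bar are invented alongside the R1, P1, R2, and foo used above.

```python
from collections import namedtuple

# One statement = one arc: a resource, a named property, and a value.
Statement = namedtuple("Statement", ["resource", "prop", "value"])

graph = [
    Statement("R1", "P1", "R2"),   # arc from node R1 to node R2
    Statement("R1", "P2", "foo"),  # arc ending in a terminal string
    Statement("R2", "P3", "bar"),  # R2 has a property of its own
]

def describe(graph, resource):
    """Return {property: value}, expanding values that are themselves nodes.

    Assumes an acyclic graph and one value per property.
    """
    nodes = {s.resource for s in graph}
    result = {}
    for s in graph:
        if s.resource == resource:
            if s.value in nodes:
                result[s.prop] = describe(graph, s.value)
            else:
                result[s.prop] = s.value
    return result

print(describe(graph, "R1"))
```

Adding a statement to the list extends the description without changing anything that is already there, which is where the flexibility comes from.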
The Dublin Core community, which
has had a substantial influence on this model, has its own
application of the RDF data model, and it is structured like
this: You have the resource on the left side. You have one
of the 15 elements that is the named property pointing to an
empty node in the middle (we called it a structural or
intermediate node–it is just a little piece of structure
that allows us to do some more interesting things). On the
far right you have an RDF value, which is the value of the
property that can either be a string or another node, and
which can be further expanded. So, it is a very flexible
model.
What do we actually do with the
node in the middle? We hang some other properties on it. So,
for example, we have something called type qualifier, and we
also have something called scheme qualifiers. These two
things are what we find in what we call a qualifier name
space for the Dublin Core. That is what I mean by DCQ. A DC
element is in the DC name space; types and schemes are in
the DC qualifier name space.
What do these things actually
do? A type qualifier gives us further information about the
characteristics of the element itself, and a scheme
qualifier gives us further information about the value that
we are trying to assign to that element. One point: We
really don't care about the element. The purpose of
qualifying the element is to further qualify the value to
give more information about the nature and the context of
that value. That is what we are trying to be more specific
about.
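As a rough sketch of that intermediate node, the Python below hangs an optional type and scheme next to the value. The key names dcq:type, dcq:scheme, and rdf:value, and the helper names, are assumptions chosen for illustration; they are not taken from a published qualifier vocabulary. ISO 639 is the standard for language codes, in which "fr" stands for French.

```python
def qualified(value, type_=None, scheme=None):
    """Build the intermediate node: a value plus optional qualifiers."""
    node = {"rdf:value": value}
    if type_ is not None:
        node["dcq:type"] = type_      # says more about the element
    if scheme is not None:
        node["dcq:scheme"] = scheme   # says how to read the value
    return node

# Hypothetical description of one resource.
resource = {
    "dc:Title": qualified("Birds of Ohio"),
    "dc:Language": qualified("fr", scheme="ISO 639"),
}

def plain_value(resource, element):
    """A consumer that ignores qualifiers can still read the bare value."""
    return resource[element]["rdf:value"]

def scheme_of(resource, element):
    """A consumer that understands schemes can ask how to interpret the value."""
    return resource[element].get("dcq:scheme")

print(plain_value(resource, "dc:Language"), scheme_of(resource, "dc:Language"))
```

The design choice worth noticing is that qualifiers are additive: an application that knows nothing about them loses precision but not the value itself.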
This is our basic Dublin Core
version of the RDF data model (see next two pages). Let me
give you some examples of that. I can tell you one of the
top-level elements that has given us a lot of difficulty is
the date element. As it turns out there are lots of
different kinds of dates, and it is very important to be
able to use the date element precisely. The DC date itself
is, frankly, kind of a brain-dead element, so we want to be
able to provide qualifiers to make much more specific
exactly what we are saying. One of the things we might want
to say about a DC date is that, okay, the date type is
created. It is a date of creation, but that still is not
enough. Consider this date: 11/10/98. Depending on whether
you are in Europe or in the United States, that date means
two different things: 11/10/98 either represents the tenth
day of November or the eleventh day of October. You won’t
know which date is meant unless you have a standard that
tells you how to take that date apart in an authoritative
manner. In this particular case, that standard is ISO 8601.
ISO 8601 gives us a standard that says how the dates are
arranged, so we can unambiguously take that date apart. We
have qualifiers: I have an encoding scheme, ISO 8601, and I
have a specified date, which is the date of creation. In
this way, I have given you a date that is very precise and
unambiguous and can be taken apart by people or machines in
an algorithmic and reliable way.
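The ambiguity, and the way a declared scheme removes it, can be shown with a short Python sketch. The dictionary keys (type, scheme, value) and the function parse_qualified_date are hypothetical names for illustration; the date handling itself uses only the standard library.

```python
from datetime import date, datetime

AMBIGUOUS = "11/10/98"

# The same string read under two conventions gives two different days.
us_reading = datetime.strptime(AMBIGUOUS, "%m/%d/%y").date()
european_reading = datetime.strptime(AMBIGUOUS, "%d/%m/%y").date()

def parse_qualified_date(node):
    """Parse a date only when its encoding scheme is declared and known."""
    if node.get("scheme") != "ISO 8601":
        raise ValueError("unknown or missing encoding scheme")
    # ISO 8601 calendar dates are written year-month-day.
    return date.fromisoformat(node["value"])

# A qualified date: what kind of date it is, and how it is encoded.
created = {"type": "created", "scheme": "ISO 8601", "value": "1998-11-10"}

print(us_reading, european_reading, parse_qualified_date(created))
```

Refusing to parse a value whose scheme is missing is a deliberate choice in this sketch: it is better for a machine to say that it does not know than to guess between November and October.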
Another example is DC Relation.
Relation turns out to be a tricky field to
qualify. In this particular case, I have taken a very simple
relation type. All I am saying is that the resource that we
are talking about, R1, is part of a resource specified in
the box at the right, the <http://parent>, which is just a
dummy URL for something of which R1 is a part. So, in this
way I can very precisely specify the relationship between
two resources. Such terms have been translated into 10
different languages in Europe, and multilingual metadata is
really an important concept that we have to be able to
support. The RDF data model and the DC version of that give
us a very good way to do that.
Then I have an RDF value. But
that value, in fact, is a compound value. It is not a simple
string. It is a set of nodes and arcs that have structure of
their own. In this particular variety of structure we have
three different alternate versions (although it can be 10 or
it can be however many you need) of the value–one of which
is in English, one of which is in French, and one of which
is in German. If I had all the details filled in, then a
machine could come in and take this apart and say: I am
interested in the value of the subject here, but I am only
interested in French, or I am only interested in English, or
I am interested in all of them. You can do that, and the
machine can take it apart in a way that is unambiguous.
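A compound value with language alternatives might be sketched in Python as follows. The structure (an alternatives list with lang and value keys), the function values_in, and the sample subject terms are all assumptions made for illustration.

```python
# Hypothetical compound value: one subject expressed in three languages.
subject = {
    "alternatives": [
        {"lang": "en", "value": "birds"},
        {"lang": "fr", "value": "oiseaux"},
        {"lang": "de", "value": "Vögel"},
    ]
}

def values_in(compound, languages=None):
    """Return the values for the requested languages, or all if none are given."""
    return [
        alt["value"]
        for alt in compound["alternatives"]
        if languages is None or alt["lang"] in languages
    ]

print(values_in(subject, {"fr"}))
print(values_in(subject))
```

Because each alternative carries its own language tag, adding a fourth or a tenth language means appending to the list rather than redesigning the record.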
Here is another example:
Creator. Names are problematic. Lots of people are already
talking at this meeting about authorities. We all need
authorities, and every community has different authorities.
So if I have just sort of an unwashed, very simple DC
application, I just might have undifferentiated strings in
there. Though there is some structure to that, unless you
know what that structure is, it is hard to use. Therefore,
that is really not the way to do it. Instead, I might want
to have something called an LCNA–a Library of Congress Name
Authority entry for this particular item. So, I can point to
that. I can point to it with the ((??URL??)) that tells me
exactly what I am pointing to.
I might also want to have
something not used by libraries but beginning to be used by
the business community. It is called vCard. vCard is a sort
of Internet business card, but it is a structured name
whose structure I can go into to take out
an authoritative version of a name. I am sure that your
community has other versions of authoritative names. You can
use your own scheme in that data so that you can take apart
names in a way that your community understands, but with
which other communities will still have a chance of
extracting a useful name.
In Summary
Let me finish up here with
a summary of DC and RDF. First of all, DC semantics are
fairly widely accepted on the Net at this point. There are
several hundred different projects using this in at least 20
different countries, and it has developed a fair amount of
momentum. The mechanisms for qualifying it so that it can be
used in a much more precise fashion are under development.
The infrastructure is evolving rapidly. HTML has been used
to encode metadata, but only in kind of a clumsy way. RDF
provides us for the first time with a really flexible way to
do a variety of different types of metadata. The tools to
support these kinds of activity are beginning to appear.
There are lots of tools that people are developing on their
own, and the browser manufacturers are also starting to
provide support for RDF in browsers. Interoperability
testbeds are underway. One that some of you might be aware
of is the CIMI testbed, involving a dozen or more museums
and including some natural history museums.
If you want additional
information on Dublin Core, we have a brand new homepage.
You can get to it in Netscape just by typing purl/dc or by
typing <http://purl.org/dc>. (The URL that is on that NFAIS
handout still points to the old Dublin Core Home Page, but
it probably is already linked to the brand-new homepage.) To
learn more about RDF, see the RDF homepage at
http://www.w3c.org/RDF.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.