Preprints of the Metadiversity Conference Proceedings

  Session 1: The Nation’s Call to Action

The Metadata Landscape: Conventions for Semantics, Syntax, and Structure in the Internet Commons

STUART L. WEIBEL, Senior Research Scientist, OCLC

ABSTRACT

The Internet has brought previously distinct communities into closer contact in the Internet Commons. Effective resource discovery in this global information environment requires international, cross-disciplinary conventions for the creation, management, and exchange of resource description information. The Dublin Core and the Resource Description Framework provide two of the foundation building blocks necessary to support a resource description infrastructure of sufficient power and extensibility to meet the needs of the digital information age.

This is a daunting task, because I am an interloper here. I am not part of your community, but I used to be. I actually got my undergraduate degree in biology and, in a previous life, was a pharmacologist. I actually taught biology for a while. That is about the extent of my qualifications to talk to you this morning except to say that yes, indeed, I have been working in the area of metadata for the past two years. What I would like to talk to you about today are three aspects of metadata and why metadata might be relevant to you in your community. First I will discuss the motivation for developing new conventions for resource description and, in particular, resource descriptions about electronic resources. Secondly, I would like to tell you a little bit about the Dublin Core metadata initiative, which is the development of semantics for resource description on the Net. And finally I would like to tell you why you really don't need to care about the Dublin Core, because–whether you like it or not–there are in fact some other things that can help you, irrespective of your particular choice of a metadata standard.

What Is Metadata?

I bet that everybody in this room already knows the definition of metadata, or you all probably would not be here now. It is such a popular topic in so many venues that it is really hard to avoid the definition, but the standard definition is "data about data." I would like to modify that definition and say that, in fact, metadata is "structured data about data." Now, before I get off this topic, I want to point out that there is a strong temptation for people to say, we might need metadata about other metadata–does that make it "metametadata"? In fact, I think that if you use the term "metametadata," you are barking up the wrong tree. It is a failure to understand what metadata is really all about. Metadata is a relationship. You know, one person's metadata is just another person's data, so if you are talking about it as the object, you don't have to say "metametadata"–it is just "data." So don't be confused by that issue any more than I have managed to confuse you.

Resource Description Communities

One of the phrases important to discussions of this kind is the phrase "resource description community." I would like to suggest that a resource description community is any group of institutions and people that is characterized by a common understanding of three things: semantics, structure, and syntax. The community I come from–the library community–has had these common understandings for really 30 or more years now. We call them MARC and AACR2–our names for the element sets and the rules for populating them. That is really all they are–a way that we can pass records around. So, if all libraries want to do is talk to themselves, we know how to do that, and we have known how for a long time. We use MARC cataloging. And the rules that we use to fill those MARC cataloging fields are the Anglo-American Cataloging Rules.

Living in the Internet Commons

But we live now in what I like to refer to as the Internet Commons. This is one of the important metaphors I will bring to you today. The notion of the Internet Commons means that we, in fact, all live in that little box that is now sitting on top of our desks. If we do not live there, then our users live there, and they want to get to us and to our information through the screens on their desks. Period. They don't want to walk to your library. They don't want to walk to your data repository. They don't want to go to the nearest store. They want it to be available on their desks. That is what I mean by the Internet Commons.

So now geospatial repositories and museums and libraries are all forced into this common box that I call the Internet Commons. And now our own little communities have to learn how to speak with one another, because we do not have those shared conventions about syntax, structure, and semantics–we have to develop them all anew.

In addition to that, there are lots of scientific databases that you have in your laboratories and that you might want to make available and visible to other communities. There also are 14-year-old boys and girls doing work on the Internet. And you know, some of those 14-year-olds are going to eventually win Nobel prizes, and we would like to be able to find metadata about what they did back when they were 14 years old. This kind of information also is going to be important.

Also, people are doing commerce on the Internet, so we want to be able to find information about commerce as well. You name it and there will be people who will want to provide metadata for it. I am, in fact, suggesting to you that it will be useful to have these common conventions so that you can find even those kinds of information.

Even when you are doing scientific data searches, there is often heard this complaint about our community: We don't talk to each other. I had a lunch conversation yesterday at which someone remarked that the ornithologists don't talk to the ichthyologists and the ichthyologists don't talk to the herpetologists. Well, we need to be able to talk to one another more effectively. We need to be able to share information. So, developing those conventions is, I think, important for a lot of us.

Three Levels of Interoperability

Semantic Interoperability. Semantic interoperability is achieved through agreements about content description standards. The Dublin Core is an example of one of those. It is a new one. AACR2 is an old one. TEI, very popular in the humanities, is a relatively new one. The Federal Geographic Data Committee (FGDC) is one that is popular among many of the communities represented here. In other words, you name it and there is somebody who has a content description standard for it. There are a zillion of these things.

Structural Interoperability. Structural interoperability is another level of interoperability that we need to be able to define. I am going to be talking about mechanisms for supporting structural interoperability later. But basically it is going to be about the Resource Description Framework (RDF), which is a data model for specifying semantic schemas in a way that they can be shared.

Syntactic Interoperability. Finally, we need syntactic interoperability. This is the easiest interoperability to understand. It just means that we want to be able to mark up our data in a similar fashion so we can share the data and so that our machines can understand and take the data apart in sensible ways. The kind of syntactic interoperability that I will be talking about later on is supported by something called eXtensible Markup Language (XML)–a markup idiom for structured data on the web. If you are unsure as to whether this is going to be an important standard in the future because it is relatively new, I will tell you just one salient fact: Microsoft has publicly announced that its future versions of Word are going to be marked up in XML. And if the future versions of Word and also the future versions of Excel and Access and all of these other Microsoft products are marked up in XML, it is because they need to be able to exchange data. I think that is enough said about whether XML is going to be important to us in the future. Whether you like it or not, Microsoft makes a lot of the decisions for us.
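To make the point about syntax concrete, here is a minimal sketch in Python. The record, its element names, and its values are invented for illustration and do not come from any published schema. It only shows that once data is marked up as XML, a generic parser can take it apart without knowing anything about its meaning.

```python
import xml.etree.ElementTree as ET

# A made-up record; the tag names and values are illustrative only.
xml_text = """
<record>
  <title>Birds of Ohio</title>
  <creator>A. Author</creator>
  <creator>B. Author</creator>
</record>
"""

# Any XML parser can split the record into (tag, text) pairs.
root = ET.fromstring(xml_text)
fields = [(child.tag, child.text) for child in root]

for tag, text in fields:
    print(tag, "=", text)
```

The parser supplies the syntax only. What "creator" means is still a matter of semantic agreement, which is the separate level of interoperability described above.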

The Dublin Core Metadata Workshop Series

I would like to talk a little bit about the metadata workshop series that I have been involved in for the last several years. It is called the Dublin Core because it began at a workshop in Dublin, Ohio, where I worked three and a half years ago. These workshops were initially called simply to answer the question of how we could improve resource discovery on the Web. We were looking for simple resource description semantics. (Remember, at the time when we originated the series there were all of 500,000 individually addressable items on the web, compared to today’s 500 million and growing.) The goal then was to set up an interdisciplinary consensus about a core element set for resource discovery. It was very important that it be interdisciplinary–not just librarians, not just archivists, not just museum people, but a broad range of content experts and disciplines for resource discovery for electronic information. This was our starting point, so we wanted it to be simple and intuitive. We wanted it to be cross-disciplinary. We certainly wanted it to be international–after all, we were not talking about the Ohio Wide Web or the U.S. Wide Web, but the World Wide Web. We also wanted it to be flexible enough that it could be applied to a broad diversity of problems and a broad diversity of complexity as well.

Characteristics of the Dublin Core Metadata Element Set. The central characteristics of what was developed at that workshop have been elaborated since.

There are 15 elements. They are descriptive metadata for resource discovery. Half of these elements are the kinds of things you would expect to see in a catalog card, so this is the kind of simple metaphor for the Dublin Core. It is a catalog card for resource description. But just as the catalog card does not hold all of the information that libraries keep about resources, there are additional elements that provide you with the ability to add richness. The 15 elements in the Dublin Core Metadata Element Set are:
 

1. Title
2. Creator
3. Keywords
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights

All elements are optional. You see something you don't like, don't use it.

All elements are repeatable.

The element set should be extensible, a starting place for richer description. Fifteen elements will not provide all the richness that all of us want in ultimate metadata element sets. So we want to be able to extend it, to enrich it in a variety of ways.

The element set should be interdisciplinary.

The element set should be international. We now have translations in the Dublin Core in 20 different languages and there are new ones appearing on a regular basis.
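One way to picture "all elements are optional, all elements are repeatable" is a record in which every element maps to a list of values. The sketch below is an illustration only: the function name make_record and the sample values are hypothetical, and the element names are simply the fifteen listed above in lowercase.

```python
# The 15 element names as listed in this talk, lowercased for use as keys.
DC_ELEMENTS = (
    "title", "creator", "keywords", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**elements):
    """Build a record where every element is optional and repeatable.

    Each keyword argument may be a single string or a list of strings.
    Values are always stored as lists, so repetition needs no special case.
    """
    record = {}
    for name, value in elements.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a core element: {name}")
        record[name] = [value] if isinstance(value, str) else list(value)
    return record


# Hypothetical example: two creators, and most elements simply left out.
record = make_record(
    title="Birds of Ohio",
    creator=["A. Author", "B. Author"],
)
print(record)
```

Elements that are not supplied are simply absent from the record, and nothing in the structure limits how many values an element may carry.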

Let's talk about extensibility for a moment. How can you take this catalogue card and, in fact, provide within it enough flexibility to build in the richness that you need to support much more sophisticated metadata applications? There are a couple of kinds of extensibility that I would like to talk about. The first metaphor that I will offer you is the Ukrainian doll model. You take the top off one doll and there is another doll inside, and another doll inside that one, and so on. In other words, there is a substructure. Without a substructure, interoperability is not supported very well. The idea here is that if you take a basic element called Creator, you could just plop some stuff in there. For example, you could plop in an unstructured name. But that is not going to support interoperability very well. So, you want to add some additional substructure to that element–perhaps a given name and a surname. You might also want to have some information about affiliation there because one of the ways you find people and resources is to know with what organizations they are connected. In addition, you might want a telephone number or e-mail address or something like that. So, you are basically unpacking this structure and finding that within it there are some well-defined additional structures that support the needs that you have in your particular database. That is one kind of extensibility.
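The nested-doll idea can be sketched as a small structured type. The field names below (surname, given_name, affiliation, email) are assumptions chosen to match the examples in the paragraph above. They are not part of any Dublin Core specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Creator:
    """Hypothetical substructure for the Creator element."""
    surname: str
    given_name: str
    affiliation: Optional[str] = None
    email: Optional[str] = None

    def as_plain_string(self) -> str:
        # Collapse back to the unstructured form a simple application expects.
        return f"{self.surname}, {self.given_name}"


creator = Creator(surname="Weibel", given_name="Stuart L.", affiliation="OCLC")
print(creator.as_plain_string())  # prints "Weibel, Stuart L."
```

An application that understands the substructure can search by surname or affiliation. One that does not can still fall back on the plain string, which is the outermost doll.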

The second, and I think perhaps the more important, kind of extensibility is what I refer to as the Lego™ metaphor–modular extensibility. Let’s say you want additional elements to support local or disciplinary specific requirements. In addition, you want them to be complementary–that is to say, you want them to be able to fit together. So, you might have a block of metadata that we call description metadata, such as the Dublin Core, but you also want species distribution metadata. This morning, it was stated that there are something like 750 million specimens in natural history museums around the world. Wouldn't it be great to have unified databases that would allow those sorts of things to snap together, Lego™-like, with description metadata. I like the Lego™ metaphor, because it really has a lot of richness to it. Legos™ are child's play–except that they are not child's play at all. It’s true our kids play with them, but there are some interesting things about them. One is that there are lots of different kinds of Legos™. There are the Jacques Cousteau Undersea Legos™. And there are the Astronaut Legos™. And there are the Medieval Knight Legos™. The amazing thing is that they all snap together and work together–they interoperate.

Now I don’t exactly know the semantics of mixing Jacques Cousteau and medieval castles, but my 12-year-old does. And one of the things we want about the metadata environment is not to have to anticipate what the semantics are going to be in the future. We want people to be able to invent new semantics, the things that we don't think about or haven't thought of or maybe don't even think are a good idea, but that might be important in the environments and for the tasks and problems in the future. That is one point about Legos™. Another is that your kids’ Legos™ can interoperate with the Legos™ that you played with 30 years ago. That is a very impressive degree of interoperability, and it comes at the cost of very highly engineered products. Legos™ are manufactured to tolerances that approach those of the internal combustion engine. So yes, they are child's play. But they are child's play only because they are easy to snap together from simple components into much more elaborate components. They interoperate because somebody has put a lot of thought into engineering those things so that they fit together right (and will continue to fit together right). This is the kind of interoperable metadata architecture we are trying to develop for the future.
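As a rough sketch of modular extensibility, independent blocks of metadata can be combined into one record as long as each block keeps its own name prefix. The prefixes (dc, dist), the element names in the distribution block, and the sample values are all hypothetical.

```python
def snap_together(**blocks):
    """Combine independent metadata blocks into one record.

    Each block is a dict of element -> value. Keys are prefixed with the
    block's name so elements from different vocabularies cannot collide.
    """
    record = {}
    for prefix, block in blocks.items():
        for element, value in block.items():
            record[f"{prefix}:{element}"] = value
    return record


# Two blocks defined by different communities; both happen to use "title".
description = {"title": "Specimen 42", "type": "physical object"}
distribution = {"title": "Range map", "region": "Lake Erie basin"}

record = snap_together(dc=description, dist=distribution)
for key, value in record.items():
    print(key, "=", value)
```

Neither block had to anticipate the other. The prefix is what lets them fit together, which is the role name spaces play later in the talk.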

What does this sort of extensibility mean to you, to the scientific data communities? You can think of the Dublin Core as a semantic framework–a set of top-level descriptors. These high-level descriptors can be used to describe data sets in ways that allow you to find things in a relatively straightforward manner. In addition to that you can use domain-specific schemes for further precision, so that a collection of four billion objects might be aggregated according to a particular set of standards and specifications regarding how they are encoded. You can have your own schemes that refine the semantics of subjects, of formats, of relations, and of coverage. You can use controlled vocabularies and thesauri, name spaces, and coding rules to make your metadata very specific.

Someone was also telling me yesterday about a database that was developed in Europe that has the same set of descriptors in 10 different languages. That is a very important and valuable thing to have in an international community. And you can apply such a database in something like the Dublin Core and Resource Description Framework (RDF).

The Resource Description Framework (RDF)

Let me talk to you a little bit about the Resource Description Framework (RDF), because RDF can provide you with the kind of architecture that allows you to snap different components together in a modular way. RDF is a World Wide Web Consortium (W3C) initiative. The RDF group is a formal working group under the W3C, and it is intended to develop conventions to support interoperability among applications that exchange metadata, not only among people, but among machines as well. The syntax is expressed in XML, which I mentioned a little bit earlier. RDF provides the kind of architecture that will allow stakeholders to define the semantics, whether it is Dublin Core or Global Information Locator Service (GILS) or AACR2 or FGDC–you name it, the stakeholders get to do it. That is a very important aspect of this. RDF has been proposed, and it is in "Last Call" now. If it passes through Last Call, my understanding is that it will become an officially proposed recommendation in the W3C.

The reason for RDF’s importance is that it provides a data model–a data model that is very flexible and will allow us to do lots of very good things with it (see previous page). In its very simplest form, the RDF data model is nodes connected by named properties or "arcs," so it is an arc and node model. You have a little thing called R1 on one side. P1 is a property pointing to R2. That is really all it is. It's very simple–but, of course, it is not that simple because you can do lots of very flexible things with it. The simplest thing you might want to do is a terminal node string. So you can say that R1 has a named property and the value of that named property is foo. You can hook these things together in graphs of arbitrary complexity that become RDF assertions or RDF statements that are really statements about that resource.
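The arc-and-node model can be sketched as a list of three-part statements. This is an informal illustration in Python rather than RDF syntax. R1, R2, P1, and foo are the names used above, while P2, P3, and bar are added here as placeholders.

```python
from collections import namedtuple

# One arc: a resource, a named property, and the value the arc points to.
Statement = namedtuple("Statement", "resource prop value")

graph = [
    Statement("R1", "P1", "R2"),   # arc from one node to another node
    Statement("R1", "P2", "foo"),  # arc ending in a terminal string
    Statement("R2", "P3", "bar"),  # R2 can carry properties of its own
]


def properties_of(resource, statements):
    """Return (property, value) pairs for every arc leaving a resource."""
    return [(s.prop, s.value) for s in statements if s.resource == resource]


print(properties_of("R1", graph))
```

Because a value can itself be a resource with arcs of its own, the same three-part shape is enough to build structures of any depth.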

The Dublin Core community, which has had a substantial influence on this model, has its own application of the RDF data model, and it is structured like this: You have the resource on the left side. You have one of the 15 elements that is the named property pointing to an empty node in the middle (we called it a structural or intermediate node–it is just a little piece of structure that allows us to do some more interesting things). On the far right you have an RDF value, which is the value of the property that can either be a string or another node, and which can be further expanded. So, it is a very flexible model.

What do we actually do with the node in the middle? We hang some other properties on it. So, for example, we have something called a type qualifier, and we also have something called a scheme qualifier. These two things are what we find in what we call a qualifier name space for the Dublin Core. That is what I mean by DCQ. A DC element is in the DC name space; types and schemes are in the DC qualifier name space.

What do these things actually do? A type qualifier gives us further information about the characteristics of the element itself, and a scheme qualifier gives us further information about the value that we are trying to assign to that element. One point: We really don't care about the element. The purpose of qualifying the element is to further qualify the value to give more information about the nature and the context of that value. That is what we are trying to be more specific about.
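The structure described here, with an element arc pointing to an intermediate node that carries the qualifiers and the value, might be sketched as follows. The property labels dcq:type and dcq:scheme are stand-ins for the type and scheme qualifiers, not names taken from a published specification, and the node identifiers are generated only for this illustration.

```python
import itertools

_node_ids = itertools.count(1)


def qualified_statement(resource, element, value, type_=None, scheme=None):
    """Return triples for one qualified element.

    The element arc points at an intermediate node. The qualifiers and the
    value hang off that node rather than off the resource itself.
    """
    node = f"_:n{next(_node_ids)}"  # intermediate (structural) node
    triples = [(resource, element, node), (node, "rdf:value", value)]
    if type_ is not None:
        triples.append((node, "dcq:type", type_))
    if scheme is not None:
        triples.append((node, "dcq:scheme", scheme))
    return triples


triples = qualified_statement(
    "R1", "dc:date", "1998-11-10", type_="created", scheme="ISO8601"
)
for triple in triples:
    print(triple)
```

An application that ignores qualifiers can follow the element arc and read only the value. One that understands them gets the extra context from the same node.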

This is our basic Dublin Core version of the RDF data model (see next two pages). Let me give you some examples of that. I can tell you one of the top-level elements that has given us a lot of difficulty is the date element. As it turns out there are lots of different kinds of dates, and it is very important to be able to use the date element precisely. The DC date itself is, frankly, kind of a brain-dead element, so we want to be able to provide qualifiers to make much more specific exactly what we are saying. One of the things we might want to say about a DC date is that, okay, the date type is created. It is a date of creation, but that still is not enough. Consider this date: 11/10/98. Depending on whether you are in Europe or in the United States, that date means two different things: 11/10/98 either represents the tenth day of November or the eleventh day of October. You won’t know which date is meant unless you have a standard that tells you how to take that date apart in an authoritative manner. In this particular case, that standard is ISO-8601. ISO-8601 gives us a standard that says how the dates are arranged, so we can unambiguously take that date apart. We have qualifiers: I have an encoding scheme, ISO-8601, and I have a specified date, which is the date of creation. In this way, I have given you a date that is very precise and unambiguous and can be taken apart by people or machines in an algorithmic and reliable way.
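The ambiguity is easy to reproduce. The short Python sketch below reads the same string under the two conventions and then reads an ISO 8601 date, which has only one interpretation. The variable names are illustrative.

```python
from datetime import date, datetime

raw = "11/10/98"

# The same characters, read under two regional conventions.
us_reading = datetime.strptime(raw, "%m/%d/%y").date()        # month first
european_reading = datetime.strptime(raw, "%d/%m/%y").date()  # day first

print(us_reading.isoformat())        # 1998-11-10
print(european_reading.isoformat())  # 1998-10-11

# An ISO 8601 date is always year-month-day, so no guess is needed.
iso_reading = date.fromisoformat("1998-11-10")
print(iso_reading == us_reading)     # True
```

Without a declared scheme, a program has no basis for choosing between the first two readings. With the scheme stated, the value can be parsed mechanically.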

Another example is DC relation: Relation turns out to be a tricky field to modify, to qualify. In this particular case, I have taken a very simple relation type. All I am saying is that the resource that we are talking about, R1, is part of a resource specified in the box at the right, the <http://parent>, which is just a dummy URL for something of which R1 is a part. So, in this way I can very precisely specify the relationship between two resources. These terms had been translated in 10 different languages in Europe, and multilingual metadata is really an important concept that we have to be able to support. The RDF data model and the DC version of that give us a very good way to do that.

Then I have an RDF value. But that value, in fact, is a compound value. It is not a simple string. It is a set of nodes and arcs that have structure of their own. In this particular variety of structure we have three different alternate versions (although it can be 10 or it can be however many you need) of the value–one of which is in English, one of which is in French, and one of which is in German. If I had all the details filled in, then a machine could come in and take this apart and say: I am interested in the value of the subject here, but I am only interested in French, or I am only interested in English, or I am interested in all of them. You can do that, and the machine can take it apart in a way that is unambiguous.
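A compound value with alternate language versions can be sketched as a mapping from language code to string. The subject terms and the two-letter codes below are made up for illustration.

```python
# One subject value with three alternate versions, keyed by language code.
subject_value = {"en": "birds", "fr": "oiseaux", "de": "Vögel"}


def select(value, languages=None):
    """Return the alternates for the requested languages.

    With no languages given, every alternate is returned.
    """
    if languages is None:
        return dict(value)
    return {lang: value[lang] for lang in languages if lang in value}


print(select(subject_value, ["fr"]))        # only French
print(select(subject_value, ["en", "de"]))  # English and German
print(select(subject_value))                # all of them
```

A request for a language that is not present simply returns nothing for that language, so the caller can decide whether to fall back to another one.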

Here is another example: Creator. Names are problematic. Lots of people are already talking at this meeting about authorities. We all need authorities, and every community has different authorities. So if I have just sort of an unwashed, very simple DC application, I just might have undifferentiated strings in there. Though there is some structure to that, unless you know what that structure is, it is hard to use. Therefore, that is really not the way to do it. Instead, I might want to have something called an LCNA–a Library of Congress Name Authority entry for this particular item. So, I can point to that. I can point to it with the URL that tells me exactly what I am pointing to.

I might also want to have something not used by libraries but beginning to be used by the business community. It is called vCard. vCard is a sort of an Internet business card, but it is a structured name that has useful structure that I can go into and take out an authoritative version of a name. I am sure that your community has other versions of authoritative names. You can use your own scheme in that data so that you can take apart names in a way that your community understands, but with which other communities will still have a chance of extracting a useful name.
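As a sketch of how a declared scheme lets a program take a name apart, the function below handles one case: a value in the form of a vCard N property, whose components are separated by semicolons with the family name first and the given name second. The dictionary layout, the scheme label "vCard", and the function name are assumptions made for this illustration.

```python
def surname_and_given(creator):
    """Extract (surname, given name) when the scheme says how.

    `creator` is a dict with a "value" and an optional "scheme".
    Returns None when the scheme is missing or not understood.
    """
    if creator.get("scheme") == "vCard":
        # vCard N value: family;given;additional;prefixes;suffixes
        family, given = creator["value"].split(";")[:2]
        return family, given
    # An undifferentiated string gives no reliable way to split the name.
    return None


structured = {"value": "Weibel;Stuart;L.;;", "scheme": "vCard"}
plain = {"value": "Stuart L. Weibel"}

print(surname_and_given(structured))  # ('Weibel', 'Stuart')
print(surname_and_given(plain))       # None
```

Another community could add a branch for its own scheme without disturbing this one. A reader who knows none of the schemes still has the raw value to display.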

In Summary

Let me finish up here with a summary about DC and RDF. First of all, DC semantics are fairly widely accepted on the Net at this point. There are several hundred different projects using this in at least 20 different countries, and it has developed a fair amount of momentum. The mechanisms for qualifying it so that it can be used in a much more precise fashion are under development. The infrastructure is evolving rapidly. HTML has been used to encode metadata, but only in kind of a clumsy way. RDF provides us for the first time with a really flexible way to do a variety of different types of metadata. The tools to support these kinds of activity are beginning to appear. There are lots of tools that people are developing on their own, and the browser manufacturers are also starting to provide support for RDF in browsers. Interoperability testbeds are underway. One that some of you might be aware of is the CIMI testbed, involving a dozen or more museums and including some natural history museums.

If you want additional information on Dublin Core, we have a brand new homepage. You can get to it in Netscape just by typing purl/dc or by typing <http://purl.org/dc>. (The URL that is on that NFAIS handout still points to the old Dublin Core Home Page, but it probably is already linked to the brand-new homepage.) To learn more about RDF, see the RDF homepage at http://www.w3c.org/RDF.


Copyright © 2003 NFAIS. All rights reserved. No part of this product or service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written consent.