Preprints of the
Metadiversity
Conference
Proceedings
Session 2: The Challenge in
Species Discovery and Taxonomic Information
Doing the Impossible:
Creating a Stable Species Index and Operating a Common
Access System on the Internet
FRANK BISBY, Director,
Centre for Plant Diversity and Systematics,
University of Reading, Species 2000
ABSTRACT
The Species 2000
Project is working with some novel techniques in
its ambitious mission to create an index of the
world’s known species. One is the creation of
stable taxonomic indexes for individual groups of
organisms by the member organizations: the Global
Species Databases. How may this be done? How may
the taxonomy be stabilized yet fluid enough to
accommodate change? Another is the creation of a
common access system to address an array of such
databases so that they can operate as a single
virtual index covering all groups. If existing
Global Species Databases are to be used, this
becomes a demanding specification at the computer
science level, quite apart from the challenge of
forming a seamless index from the components that
are compiled independently. The task may indeed be
severe, but it is not impossible. Species 2000 can
report progress in both areas.
I must admit, I was not
initially delighted to receive this request to speak on
"Doing the Impossible–Creating a Stable Species Index and
Operating a Common Access System on the Internet." Apart
from being the longest title I have ever had for a paper
given in a symposium, I also felt that this might be a
poisoned chalice. But I decided that, in fact, it provided a
nice challenge, and I shall try to face that challenge.
Creating Stable Taxonomic
Indexes
The first thing you need to know
is that I have been working with a team of people around the
world who call themselves Species 2000. Our motive is to
create an index to the world's species–not by creating one
database with a list of all the species in it, but, in fact,
by setting up an interoperable system that, using a common
access system, will address a central array of taxonomic
databases. The second thing you need to know is that we have
already made some structural plans regarding how to do that.
We are trying to make a stable index–or what we call a
Global Species Database (GSD)–for each taxon of organisms.
Clearly, there are many
taxonomic databases around the world that have species
checklists and taxonomic opinions at their call. Many of
those databases are very good databases with excellent data
in them. But if you think about taking those databases and
using them as the source of information to make a universal
list of all organisms, you run into two problems. One is
that the data sets overlap. But worse than that, each of
them has been internally optimized for one region of the
world but not globalized among the different systems. So, if
you were to put them together you not only would have to
deal with the overlapping species, which might be classified
differently in the different databases, but also with the
fact that the species may be categorized in the taxonomic
structure of families and orders and so on differently.
Therefore, you cannot put these databases together end to
end. You have to look inside them and rework them.
If you can persuade different
communities around the world to create a global index, a
global checklist of species for each taxon, then the
organization that tries to compile a universal list does not
have to understand how each database is structured. They can
be put together end to end. Provided there are no overlaps,
no demarcation disputes, then that is satisfactory from the
point of view of a global list. So, this is the reason why
we set ourselves the two tasks that are addressed in the
title for this talk. First, how can different groups of
specialists create a stable taxonomic index for each group
of organisms? And second, how can we create a common access
system on the Internet that will allow us to address those
components?
My title begins with the words
"Doing the impossible." The question to answer is, is the
taxonomist capable of producing a stable species list? If
the answer is "yes," the next question is, would that list
be a useful thing to have? I am going to respond to this
with two examples. The first is one that is very close to
home. Those of you who know me know that I am a botanist and
I work on legumes. For 11 years now, I have led an
enterprise worldwide in which we have been creating a Global
Species Database for legumes. It is on the Web as Legume
Web. It has many faults, this project. It is far from being
an ideal, but I can use it as a vehicle for explaining to
you how it is that I believe we can make stable species
indexes. I will then move away from being egocentric by
discussing some of the other models, which will indicate how
this is, in fact, a reasonably achievable goal throughout
the community.
Legumes: A Global Species
Database
With the legume database, we are
talking about creating a list of species recognized to exist
by specialists. That means that we must include not only
taxa and the Latin names of taxa as some people accept them,
but also synonyms and taxonomic opinion.
How have we tried to do this for
the 19,000 legumes around the world? We have taken it as a
two-stage process. For the first stage, we organized
regional centers that have been compiling species lists of
legumes for their parts of the world, and those lists are
the starting material. In most cases the centers use the
same software, and in most cases we had extreme
difficulty–and still have extreme difficulty–in merging
those databases together into one file.
The second stage is to get
panels of experts for different groups of legumes. Legumes
are normally thought of as falling into 32 tribes of plants.
For each of the small tribes or for each of the large
genera, we have anywhere from one to four monographic
specialists around the world, thus creating a network
approaching 100 people whom we contact to try and bring the
taxonomic checklist into a responsible opinion.
One job of this network of
people is to globalize, to establish a system of genera of
the species that will function on all the continents. Of
course, the regional data sets sent to us include some local
features in the taxonomy. But we have to make sure that the
features of acacias, for example, are treated the same way
for African plants as they are for American plants as they
are for Australian plants. For example, one of the Australian
acacia experts has decided that acacias should be divided
into three genera. That may cause a problem if he has not
stated where the African species or the American species
would fit in those three genera. We get group panel
specialists to say which system will work for the whole
world.
Now, of course, the result is an
opinion. And there are alternative schemes. The alternative
schemes must be cross-indexed in the system through the
synonyms, so you can see the data from another scheme if you
go in using the preferred or default scheme that we have
adopted.
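A minimal sketch of this kind of cross-indexing, in Python,
may make the idea concrete. The dictionary and the function
name are assumptions made for illustration, and the two
entries are example data, not records taken from the ILDIS
database. Each name, whether from the default scheme or an
alternative one, points to the name accepted under the
default scheme.

```python
# Illustrative data only; not taken from the ILDIS database.
# Every name in the index maps to the accepted name under the default scheme.
ACCEPTED = {
    "Acacia mearnsii": "Acacia mearnsii",      # accepted under the default scheme
    "Racosperma mearnsii": "Acacia mearnsii",  # same species under an alternative scheme
}


def resolve(name):
    """Return (accepted_name, synonyms) for any name held in the index."""
    accepted = ACCEPTED[name]
    synonyms = sorted(
        other for other, target in ACCEPTED.items()
        if target == accepted and other != accepted
    )
    return accepted, synonyms


print(resolve("Racosperma mearnsii"))
```

Whichever name the user enters, the answer shows both the
default name and the alternatives linked to it.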
So while we are going for a
preferred or usable system, we are also cross-linking the
alternatives. This is achieved through panels of experts,
who subdivide the various tribes of plants. For example, a
friend at the Royal Botanic Gardens is one of four people
who work on the Caesalpinia tribe. He, in fact, is the one
to whom others defer. He recently did his thesis on
Caesalpinia and sorts that particular genus. So that
database has been through two processes. It reflects the
local expertise of which species are where, and it captures
global taxonomic expertise to bring them together into a
coherent system. It is available on the Web, at our Legume
Web Service at <www.ILDIS.org>. That is just one example,
then, of a team of people from around the world deeply
embedded in the taxonomic profession making a Global Species
Database for one group of plants.
Going Beyond Legumes
How can that be accomplished for
other groups of organisms? Well, there is not a single route
to that destination. In fact, there are various routes. For
instance, some organizations appear to me to be working in a
region-by-region system. For example, we are talking to the
producers of a mollusk database in the U.S.A. and a mollusk
database in Paris. In some taxon-by-taxon systems, many of
the families have been provided by special family experts
putting them into the larger system. The International
Legume Database and Information Service (ILDIS) used a
combination of these two techniques. Some people start with
the names from an index or from the Zoological Record. So
Marshall Crosby, making the moss list for the world,
started from the names–and, of course, some databases didn't
inherit the taxonomy from specialists–and then worked with
them to create the database. In the Philippines, experts are
working on FishBase, but the baseline taxonomy comes from
experts in California. Similarly, a database on bacteria
takes its base list of species from the International
Journal of Systematic Bacteriology (IJSB). So these are different
routes by which existing databases or data systems can
approach this ideal of becoming part of a world species
list.
Now the ideals here are very
demanding. Once we thought we would not find one database in
the world that met all the demands. We are now talking to 65
such database organizations around the world covering many
more than 65 groups. And I have to say that my own project–ILDIS–comes
fairly close, but it is certainly not completely there. The
one that comes closest in my mind is the world's list of
mammals based in the Smithsonian. The only question about it
is whether or not the taxonomic expertise put into it is
fully global or whether it, in fact, is rather restricted by
the set of 20 Americans who developed it. But apart from
that, it meets all the demands that I know of for such a
system.
So, stable species lists do
exist and I would contend that they can be produced and they
can be maintained through time. They must be embedded deeply
in the taxonomic community so that they can move forward and
be fluid. Nobody is talking about their being frozen.
Rather, we are talking about their being decoupled. We are
talking about their together forming a responsible taxonomic
consensus–a practical system decoupled from some of the
minutiae of the day-to-day taxonomic debates that move to
and fro.
Creating a Dynamic Access
System
The second part of my talk is
about how we are going to organize these different systems
to be available on the Internet through a dynamic access
system and what challenges are faced in creating that
system. The key term here is federated systems. But
federating systems poses challenges at completely different
levels of endeavor.
Let us look quickly at the
different challenges that make what we were doing seem
impossible and that were, to some of us, seemingly
insurmountable at the start. I would like to report to you
that we are making at least some progress with them.
At the top level, there is great
complexity. We–the taxonomic community–are not a
multinational organization telling its offices around the
world how to use identical software and how to proceed. We
are a moving, seething mass of heterogeneous, different
databases around the world operating on various platforms
and using different database management systems. Of course,
this is the classic heterogeneity problem. We have to have
interoperability by cross-mapping onto a very simple model
at the center to ensure that we can get minimum data to and
from those systems.
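As a rough sketch of what cross-mapping onto a very simple
central model could look like, consider the Python below. The
database names, the local field names, and the three-field
common record are all hypothetical; the real Species 2000
model is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonRecord:
    """The minimum data exchanged with every member database (assumed fields)."""
    scientific_name: str
    status: str
    source: str


# Hypothetical mappings: local field name -> field name in the common model.
FIELD_MAPS = {
    "db_a": {"taxon": "scientific_name", "name_status": "status"},
    "db_b": {"LATIN_NAME": "scientific_name", "STAT": "status"},
}


def to_common(source, row):
    """Translate one local row into the common model, ignoring extra fields."""
    mapping = FIELD_MAPS[source]
    fields = {common: row[local] for local, common in mapping.items()}
    return CommonRecord(source=source, **fields)


print(to_common("db_b", {"LATIN_NAME": "Vicia faba", "STAT": "accepted", "ID": 7}))
```

Each database keeps its own structure; only the small mapping
table has to be agreed with the center.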
The Problem of Scalability
We have prototypes working for
five or ten databases. Will this extend or operate nicely
with 100 or more other systems?
The Problem of Autonomy
Autonomy is another issue. We
need a model that makes it possible and desirable for
participation by these different projects. We must accept
their heterogeneity and learn to live with their autonomous
behavior.
The Problem of Stability
Another issue is stability. You
might think this is just a matter of there being an ice
storm in Montreal or a tornado in the Philippines that puts
all the systems out of commission for a day or two.
Actually, the most frequent reason for the databases going
down is because of internal management problems within
multilayered institutions. So, I am at the University. I go
away for a week to a conference and I get back and find that
our server is down or our server is disconnected. Why?
Because bureaucrats in the Computer Service Department
changed the allocation of machines and a little piece of
paper went around telling us about it six months before and
that paper went to the head of the department and not to me.
So, I get back and find that the system is disconnected, and
it takes me two days to get it back up. Now, this happens in
all multilayered institutions around the world. If you are
confident that this never happens in your institution, then
that is great–but just watch out. That is computer science.
There are other issues to
consider with regard to stability, including the issue of
interoperability and the question of which standards to use.
One of our prototypes provided by the Japanese uses CORBA to
link the different databases. It also is necessary to decide
whether or not everything goes by a server hub and out to
the databases. And this is where the question of stability
comes in.
Clearly, we can replicate the
servers by having mirror sites. But what about the actual
taxonomic databases? At present if you go to the American
site or the Japanese site and ask about legumes, you still
go to the same server that holds the legume database. If
that server is down, you will not get your reply about
legumes. We have two ways of dealing with this. We are going
to have a backup to the so-called "annual checklist." So, if
you cannot get a live version for any sector, you will fall
back to a static version. Of course we could also just
duplicate each of the databases at the peripheral sites by
having a second site holding each database.
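The first of these two approaches, falling back from the live
database to the static annual checklist, can be sketched in
Python as follows. The function names and the sample record
are assumptions made for illustration, and a server outage is
modelled simply as a ConnectionError.

```python
def query_sector(name, live_lookup, annual_checklist):
    """Return (record, origin), preferring the live database."""
    try:
        return live_lookup(name), "live"
    except ConnectionError:
        # The live server is unreachable: use the static snapshot instead.
        return annual_checklist.get(name), "annual checklist"


def unreachable(name):
    """Stand-in for a live database whose server is down."""
    raise ConnectionError("legume server is not responding")


# Hypothetical snapshot of one sector, keyed by scientific name.
snapshot = {"Vicia faba": "accepted name"}

print(query_sector("Vicia faba", unreachable, snapshot))
```

The caller is told which origin answered, so a user can see
when a reply comes from the static version rather than the
live one.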
The Problem of Taxonomic
Knowledge
How are we going to create a
seamless catalogue produced from these bits and pieces from
different people’s databases? The answer is that we know a
great deal about how different databases vary. We do have a
model we are working on, for which the main challenge is
getting the name base and the taxon base interfaces to give
a harmonious appearance.
The Problem of Demarcation
and Overlaps
Of course, some of the databases
have duplication. And it is not true that each species that
is covered by a Global Species Database is covered only
once. So, there are overlaps, and it becomes a question of
shading in. For example, we may want, as a Global Species
Database, to go to the Missouri Botanical Garden just for
mosses and at that particular point in time have the
flowering plants shaded out.
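One way to picture shading in and shading out is a small
routing table, sketched here in Python. The database names
and group labels are hypothetical. Each database declares
everything it covers, but the federation consults it only for
the groups that are shaded in.

```python
# Hypothetical coverage: every group each database actually holds.
COVERAGE = {
    "garden_db": {"mosses", "flowering plants"},
    "legume_db": {"legumes"},
}

# The groups for which the federation consults each database ("shaded in").
SHADED_IN = {
    "garden_db": {"mosses"},  # its flowering plants are shaded out
    "legume_db": {"legumes"},
}


def route(group):
    """Return the single database consulted for a group, or None if none is."""
    sources = [db for db, groups in SHADED_IN.items() if group in groups]
    if len(sources) > 1:
        raise ValueError(f"demarcation dispute over {group}: {sources}")
    return sources[0] if sources else None


def shading_is_valid():
    """A database can only be shaded in for groups it really covers."""
    return all(groups <= COVERAGE[db] for db, groups in SHADED_IN.items())


print(route("mosses"), route("flowering plants"), shading_is_valid())
```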
Pluralism, of course, worries
people a great deal. They are a little bit afraid that our
Species 2000 project is going to somehow impose on them, and
that everybody will have to conduct research on a fish
according to a certain person’s system, or legumes according
to somebody else’s system. What we need is at least one good
Global Species Database for each group of organisms. If we
have two or three, then is that not a wonderful excess, for
then we can choose among them.
There is more than one world
system for mammals, for fishes, and for bacteria. The list
does not go a lot further than that for groups that are
duplicated. Where we have duplicate groups, then there are
at least two different user attitudes: Some people really
know which system they want to use, and others do not care.
The people who do not care often ask the question: Will you
tell us the taxonomy that we can use just to name these
organisms (which must be the same as the one you tell the
other people down the street)? They need this question
answered because if they use the same names, their data will
match. So, in areas of taxonomy where there is pluralism,
there is pressure for us to get some organization–maybe
BIOSIS, which can monitor the uses around the world in the
literature–to tell us which taxonomy to use for the default
and for the people who do not care. But clearly you want to
offer a choice to those people who do care, who want to
follow a particular system for fishes or whatever.
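The default-plus-choice arrangement described here can be
sketched in a few lines of Python. The system names and the
choice of default are placeholders, not a statement about
which taxonomy any organization has selected.

```python
# Hypothetical registry: two world systems exist for one group.
SYSTEMS = {"fishes": ["system_a", "system_b"]}

# The default nominated for users who do not care which system is used.
DEFAULT = {"fishes": "system_a"}


def choose_system(group, preferred=None):
    """Use the caller's preferred system if given, otherwise the default."""
    if preferred is None:
        return DEFAULT[group]
    if preferred not in SYSTEMS[group]:
        raise ValueError(f"{preferred} is not a registered system for {group}")
    return preferred


print(choose_system("fishes"), choose_system("fishes", preferred="system_b"))
```

Users who share the default get matching names without having
to decide anything; those who care can still ask for the
system they follow.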
The Problem of Missing
Sectors
Another problem is missing
sectors. The databases that we are working with, if they
were full, would cover only 40 percent of the world's known
organisms. So there is a remaining 60 percent to be done. We
are working very carefully with the Organization for
Economic Cooperation and Development (OECD) and with the
Global Environment Facility (GEF) to try to make proposals
as to how new projects might be started or existing projects
might be diverted to achieve Global Species Database status.
The Problem of the Human
Element
The human element–the
sociology–is enormously important. The great
institutions–the Smithsonian, the Royal Botanic Gardens,
the Missouri Botanical Garden–must come alongside network
projects like the ILDIS project I described to you earlier,
alongside smaller institutions, and alongside individual
people whose whole careers have gone into making one
database that is, in effect, their personal property. All of these
different databases have to be used, and we must figure out
how to bring these people alongside each other in a
federation.
Then there is the question of
nationalism and regionalism. Our plan is a global plan, but
there are almost no global resources. So, we have to set up
Species 2000 Japan. We have to work with the Integrated
Taxonomic Information System (ITIS) program here in the
States. We have to work with the European Union to try and
mobilize parts of the project with regional or nationalistic
names on them even though they are part of the global
program.
The Problem of Money
Lastly, of course, there is the
question of whether to make things available for free or
whether there has to be some cost recovery on the usage of
systems. Desires and attitudes vary enormously around the
world, and this is very troublesome to all global
organizations. We are trying to live with heterogeneity
there as well.
So, these are the challenges
that we continue to face. On the computer science side,
scalability and autonomy are, in my opinion, more likely to
trip us up than system heterogeneity or stability, which are
our priorities. With taxonomic knowledge, we do
know how to handle the heterogeneity. On demarcation, we
have some ideas. On pluralism, I think we know how to handle
it, but it is a very sticky issue with taxonomists. The
question is whether the taxonomists in a particular group of
organisms will allow one system to be used or whether they
will insist on slugging it out with alternatives.
Lastly, we are working very hard
to draw many institutions together. We are also working very
hard in Australia, in Europe, and in the U.S. and North
America to make sure that nationalism does not pull us
apart. We need to use national and regional funds, but we
must aspire to a global program.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.