Preprints of the
Metadiversity
Conference
Proceedings
Special Presentation
Perspectives on
Information Management on the Internet
WAYNE MOORE, Senior
Scientific Software Designer, Flow Cytometry Instrumentation
and Software Development Group, Stanford University
ABSTRACT
We have recently
developed an Internet-based system for acquiring,
storing, retrieving, and working with complex flow
cytometry data. This system, which incorporates
radically different methods for serving Internet
data, is applicable across a wide variety of
scientific disciplines, ranging from genomics to
astronomy. In addition, it offers rapid and more
efficient access to published information and
provides more efficient routes for sharing and
combining information from disparate sources. This
technology may be particularly useful for
information storage and exchange in the biological
and medical sciences and in other areas that
similarly deal with very large distributed
information collections that are difficult to serve
with current approaches. Our motivation for this
project derives from a necessity to maintain and
serve data from Flow Cytometry (FACS) instruments,
which are used worldwide in basic science and
medicine. These instruments, perhaps best-known for
their use in monitoring CD4 T-cell counts to
evaluate HIV disease progression, are used to
characterize and determine the functions of cells
from organisms as different as drosophila and man.
We currently maintain an archive of over 200GB of
FACS data, collected mainly in basic science
studies at Stanford over the last 15 years. We will
use our new technology to organize and serve these
and newly acquired data, which will be collected
locally at Stanford or elsewhere, stored at the
MSIA Management Sciences Associates, Pittsburgh,
PA, and made available via the Internet to FACS
users and other interested parties. This system,
which is built with readily available components,
is extensible and broadly applicable. It offers
innovative tools for serving the information
located in genomics and other large databases, and
for combining scientific information from these
disparate sources. Thus, it provides a general
model for facilitating the electronic interchange
of scientific data and the publication of
scientific findings. This work was supported by
grants from the National Institutes of Health,
LM04836 and CA42509. In this presentation, I will
describe the technology we plan to use and discuss
its advantages over traditional relational database
approaches to serving scientific information. These
advantages include global information service,
fine-grained access control, federated servers that
need not be located within a single organization,
and compatible client software that is widely
available and runs on "lightweight clients" (e.g.,
PCs and Macs). I will illustrate this discussion
with examples from our laboratory’s work in
lymphocyte biology and flow cytometry and from a
variety of other areas, including genetics,
genomics, taxonomy, museum and scholarly
collections, electronic publication, and scientific
literature index services.
Earlier at this meeting, someone
said that the nice thing about standards is that there are
so many of them. Therefore, I’m sure it won't come as a
shock to anybody that there actually is a standard for
directories.
Directory Service
A Directory Service is fundamentally
a database. The standardization effort started in the
late 1980s with the International Organization for
Standardization (ISO) and a series of standards starting
with X.500. It is a very complete, painstaking definition of
all the fields of the directory, and it is based on the ISO
Open Systems Interconnection (OSI) protocols, which are very
heavyweight. Adopting them requires a very large buy-in, and
they are not generally available on smaller PCs or for
Macintosh-type clients.
In the early '90s, a proposal
was made to the Internet Engineering Task Force for what was
called Lightweight Directory Access Protocol (LDAP), and
this year it was adopted as a proposed standard for use on
the Internet. In fact, some of you may be familiar with
Yellow Pages, which is based on LDAP. So it is really there,
it exists, and it is out on the Web.
Strategic Advantages
What are the advantages of a
Directory Service over a more traditional database? One of
the main advantages is that it provides a global naming
system. ISO moderates the top level in the name space and
parcels it out into countries and other organizations, which
can then define lower-level standards. From the beginning it
has dealt with one of the issues we have discussed at this
meeting: synonymy, or having multiple names for things. The
designers knew in advance that they were not going to be
able to give a unique name to every single thing in the
world. Therefore, we could have multiple common names. We
could, for example, have a person who has multiple roles in
different organizations, and those different organizations
can have different entries, and those different entries can
reference one another.
It was also realized that we
weren't going to get the entire world's database in any
given service. So it is designed from the outset to be
interoperable and federated. It has very flexible searching
mechanisms, including a syntax-specific searching mechanism,
so that different ways of matching can be used for different
types of data that are entered into the database.
Another aspect was the very
fine-grained access control. It was known that we couldn't
make every piece of every directory available to everyone,
so there are very good controls on who can see what data
elements in the directories. These are the strategic
advantages of Directory Service.
Tactical Advantages
One of the tactical advantages
of using Directory Service over other databases is that
Directory Service is based on well-defined Internet
standards. As I say, it is out there now. The United States
government is using it. Stanford is using it. There are
publicly available LDAP servers, with e-mail addresses,
telephone numbers, and so forth: the 411, Big Foot, and
Yellow Pages kinds of directories.
Directory Service is supported
by several big vendors, including Novell, Netscape, Sun, and
IBM. In addition, I believe Microsoft has announced that, in
its next version, Directory Service is going to become part
of the operating system. There are client support packages
in the C language and the Java language, and the packages
are widely (and freely) available for all the major PC,
Unix, and Macintosh platforms.
Directory Service Entries
The Directory is defined in
terms of entries we like to call "cards," because each entry
is the electronic equivalent of a Rolodex card or a catalog
card: it is a defined bit of storage on which we can scribble
down a wide variety of information attributes about some
object, some person, some database, or some document.
The entries follow the so-called
object model. This essentially means that we define a
hierarchy of objects. For example, the base standard
defining a "person" has fields that describe the personal
name, the given name, the surname, the common name, and
contact information. That definition is inherited by the
"organizational person," which gives that person's
relationship to some organization, and, in Netscape servers,
by the "inetOrgPerson," which gives the person's e-mail
address, home directory, and Web page URL.
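The inheritance chain described above can be pictured with ordinary class inheritance. The following is only a rough sketch: the Python class and field names are illustrative choices rather than the standard's own attribute names, and the e-mail address is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Base definition: naming and contact fields.
    common_name: str
    surname: str
    given_name: Optional[str] = None
    telephone_numbers: List[str] = field(default_factory=list)


@dataclass
class OrganizationalPerson(Person):
    # Adds the person's relationship to an organization.
    organization: Optional[str] = None
    title: Optional[str] = None


@dataclass
class InetOrgPerson(OrganizationalPerson):
    # Adds Internet-related attributes.
    mail: Optional[str] = None
    home_directory: Optional[str] = None
    web_page: Optional[str] = None


# Hypothetical entry; every inherited field is available on the subclass.
entry = InetOrgPerson(
    common_name="Wayne Moore",
    surname="Moore",
    organization="Stanford University",
    mail="user@example.org",
)
```

Because each class only adds fields, anything written to search or display a plain person also works on the more specialized entries.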
As I mentioned, every record has
the potential to have individual-access control information
on it, and every record is identified by a distinguished
name. But a distinguished name need not be globally unique.
The components of the distinguished name are themselves
attributes in the records, so they can always be used in a
search.
Finally, if that weren't flexible
enough, there is the so-called extensible-object class,
which is allowed to have any attribute whatsoever. So, we
can define these things in a way that lets us start
collecting the data before we define, or even identify, all
the data we need. This is a very important ability in my
field, which is basic research. In basic research, we often
don't know, when we start a set of experiments, what all the
relevant criteria are going to be.
Each attribute has a name, and
each attribute has a syntax. The syntaxes are defined in
what is called Abstract Syntax Notation One (ASN.1). The
standard defines, for example, the case-exact string and the
case-insensitive string. Different matching rules are used
depending on the syntax, so that the case-insensitive string
compares strings without considering the case. Distinguished
names themselves are defined as a syntax. Telephone numbers
are another syntax, where the matching rule matches numerals
but ignores punctuation. In the Department of Genetics, we
are interested in extending this to include DNA sequences.
There the searching rules are very different, because we
want matching to take into account single-point mutations,
deletions, and other sources of variation.
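The idea of syntax-specific matching can be sketched as a table that maps an attribute to a comparison function. This is only an illustration, not how any particular server implements it: the attribute names in the table and the toy DNA rule (equal-length sequences that differ in at most one position) are assumptions made for the example.

```python
import re


def case_ignore_match(stored: str, asserted: str) -> bool:
    # Case-insensitive string: compare without regard to case.
    return stored.casefold() == asserted.casefold()


def telephone_match(stored: str, asserted: str) -> bool:
    # Telephone number: compare the digits, ignore punctuation and spaces.
    def digits(s: str) -> str:
        return re.sub(r"\D", "", s)
    return digits(stored) == digits(asserted)


def dna_match(stored: str, asserted: str, max_mismatches: int = 1) -> bool:
    # Hypothetical rule: tolerate a single-point substitution.
    # Deletions and insertions are NOT handled in this sketch.
    a, b = stored.upper(), asserted.upper()
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatches


# Illustrative attribute-to-rule table; unknown attributes fall back to
# exact (case-sensitive) comparison.
MATCHING_RULES = {
    "commonName": case_ignore_match,
    "telephoneNumber": telephone_match,
    "dnaSequence": dna_match,  # hypothetical attribute
}


def matches(attribute: str, stored: str, asserted: str) -> bool:
    rule = MATCHING_RULES.get(attribute, lambda a, b: a == b)
    return rule(stored, asserted)
```

A new syntax is added by registering one more function in the table, which mirrors the point that new matching rules need not disturb the rest of the protocol.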
The Standard
The standard is defined in such
a way that new syntaxes can be defined and new matching
rules implemented in the server without breaking all the
other levels of the protocol. An attribute can have one or
more values. This makes it easy to have people with multiple
telephone numbers, multiple addresses, or different common
names. Attributes can be defined as optional or required,
so it is also very easy to define a record that has a large
number of attributes that are mostly absent, but
occasionally present, with very little overhead in the
database.
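One way to picture multi-valued and optional attributes is an entry stored as a mapping from attribute name to a list of values, where absent attributes simply take up no space. The sketch below is an assumption-laden illustration: the tiny schema table and the phone numbers are invented for the example, and a real directory schema is considerably richer.

```python
from typing import Dict, List, Tuple

# Simplified, illustrative schema: which attributes an object class
# requires and which it merely allows.
SCHEMA = {
    "person": {
        "required": {"cn", "sn"},
        "optional": {"telephoneNumber", "postalAddress", "description"},
    }
}


def validate(object_class: str, entry: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Return (missing required attributes, attributes the class does not allow)."""
    spec = SCHEMA[object_class]
    present = set(entry)
    missing = spec["required"] - present
    unknown = present - spec["required"] - spec["optional"]
    return sorted(missing), sorted(unknown)


# Each attribute holds a list, so several values are as easy as one.
# Optional attributes that are absent are just not stored.
entry = {
    "cn": ["Wayne Moore", "W. Moore"],                 # two common names
    "sn": ["Moore"],
    "telephoneNumber": ["555-0100", "555-0101"],       # made-up numbers
}

missing, unknown = validate("person", entry)
```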
Names Defined
Names that identify the entries
of the database are composed of attribute=value pairs.
The syntax is comma-separated and is defined as reading from
right to left. If a value contains special characters such
as commas or equals signs, we can put quotes around it so
that we can include those in the directories. The standard
itself defines three varieties of names: geographic names,
which are named relative to geographic or governmental
entities; organizational names, which are made with respect
to organizations; and domain names, which essentially
subsume the Internet domain-name service.
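To make the name syntax concrete, here is a minimal sketch of splitting such a name into its attribute=value pairs. It assumes only the quoting convention mentioned above (double quotes around values that contain commas or equals signs) and ignores the other escaping mechanisms a full implementation would need. The example name is hypothetical.

```python
from typing import List, Tuple


def split_dn(dn: str) -> List[Tuple[str, str]]:
    """Split a comma-separated name into (attribute, value) pairs.

    Simplified: handles double-quoted values only.
    """
    parts, current, in_quotes = [], [], False
    for ch in dn:
        if ch == '"':
            in_quotes = not in_quotes       # toggle quoted state, drop the quote
        elif ch == "," and not in_quotes:
            parts.append("".join(current))  # an unquoted comma ends a component
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))

    pairs = []
    for part in parts:
        # The first "=" separates the attribute from its value.
        attr, _, value = part.strip().partition("=")
        pairs.append((attr.strip(), value.strip()))
    return pairs


# Hypothetical name: most specific component on the left, most general
# (the country) on the right.
example = 'cn=Wayne Moore, ou="Genetics, Department of", o=Stanford University, c=US'
components = split_dn(example)
```

Reading `components` from the last pair to the first walks down the tree from the country to the individual entry.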
Flow Cytometry
This naming scheme can be
extended to include other kinds of objects, for example, a
data-collection session, which is when a person comes to the
instrument and collects data from one or more samples. It is
qualified by the cytometer, which is a kind of instrument;
by a unique identifier of the protocol; and then by the
organization or organizational unit. It is expressed in
terms of a data collection with a particular protocol
coordinate that identifies which sample in the protocol it
is. In this case, we would expect the intermediate levels to
be optional, and we would define the standard so that, for
example, a small organization could define all its protocols
uniquely within the organization. We could put the
responsibility on the user to give unique identifiers. Or,
if we have an intelligent instrument, the instrument itself
could assign its own unique identifier. Some combination of
the above is also possible, and all of these forms could be
put in the same directory or a federated directory.
Monoclonal antibodies are
biotechnology tools that are very important in Flow
Cytometry and several other fields. For example, I can take
my distinguished name and further qualify it with a clone
name to indicate some monoclonal antibody that I had
produced. If this were a commercially produced antibody, we
could also name it relative to the manufacturer.
On the other hand, when we are
discovering genes, we frequently work with them for a long
time before it becomes clear what they are and before
everyone agrees on what they are. So, I have a second form
of the name that we can use to uniquely name a gene
early on in the discovery process, relative to a specific
investigator, with the assumption that, when and if this
does become a globally recognized gene, we will then replace
it with a reference to the new standard name.
Directory Searching
Searching in directories is
somewhat different from searching classical databases. We
can search by object scope, which means that we are looking
for some specific object. In other words, we give the
qualifier that we want and it should come back with one
object. We can also search for all the elements that are in
one level in a tree. For example, we can find all the
samples in a particular protocol, or we can search for a
subtree and find all samples taken by a particular
investigator.
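The three kinds of search described here (a single object, one level of the tree, a whole subtree) can be sketched over a small in-memory directory. This is a toy model under stated assumptions: the attribute names `sample`, `protocol`, and `uid` are hypothetical, names are split naively on commas (no quoted values), and, as in LDAP, the subtree search includes the base entry itself.

```python
from typing import Dict, List


def rdns(dn: str) -> List[str]:
    # Naive split; assumes no quoted commas in this toy directory.
    return [part.strip() for part in dn.split(",")]


def scoped_search(directory: Dict[str, dict], base: str, scope: str) -> List[str]:
    """Return names under `base` for scope 'base', 'one', or 'sub'."""
    base_parts = rdns(base)
    hits = []
    for dn in directory:
        parts = rdns(dn)
        extra = len(parts) - len(base_parts)   # levels below the base
        if extra < 0 or parts[extra:] != base_parts:
            continue                            # not under the base at all
        if (scope == "base" and extra == 0) or \
           (scope == "one" and extra == 1) or \
           (scope == "sub"):
            hits.append(dn)
    return hits


# Hypothetical tree: organization > investigator > protocol > sample.
directory = {
    "o=Lab": {},
    "uid=moore, o=Lab": {},
    "protocol=P1, uid=moore, o=Lab": {},
    "sample=1, protocol=P1, uid=moore, o=Lab": {},
    "sample=2, protocol=P1, uid=moore, o=Lab": {},
    "protocol=P2, uid=moore, o=Lab": {},
    "sample=1, protocol=P2, uid=moore, o=Lab": {},
}

# All samples in one protocol: a one-level search below the protocol.
in_protocol = scoped_search(directory, "protocol=P1, uid=moore, o=Lab", "one")

# All samples taken by one investigator: a subtree search, then keep samples.
by_investigator = [
    dn for dn in scoped_search(directory, "uid=moore, o=Lab", "sub")
    if dn.startswith("sample=")
]
```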
Limiting a Search
Search filters are defined in
terms of the usual Boolean operators: ANDs, ORs, and NOTs.
And, as I discussed, we can use various exact and
approximate matching patterns to search. We can also put
limits on the search. We can say, for example, "I am
expecting to get about 100 results here, so if you are
finding 5,000 results, stop and tell me that before we grind
our way through all the entries." Or we could say, "Don't
spend more than 10 minutes looking for this. If you do, then
I need to say something more specific."
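A Boolean filter with size and time limits might be modeled as follows. This is a sketch, not a real LDAP client: filters are written as nested tuples rather than in LDAP's string form, only equality tests are supported, and the entries and attribute names are invented for the example.

```python
import time
from typing import Dict, List, Optional


class LimitExceeded(Exception):
    """Raised when a search runs past its size or time limit."""


def evaluate(flt: tuple, entry: Dict[str, List[str]]) -> bool:
    op = flt[0]
    if op == "and":
        return all(evaluate(f, entry) for f in flt[1:])
    if op == "or":
        return any(evaluate(f, entry) for f in flt[1:])
    if op == "not":
        return not evaluate(flt[1], entry)
    if op == "eq":
        _, attribute, value = flt
        return value in entry.get(attribute, [])
    raise ValueError(f"unknown filter operator: {op!r}")


def filtered_search(entries, flt, size_limit: Optional[int] = None,
                    time_limit: Optional[float] = None) -> List[dict]:
    start = time.monotonic()
    results = []
    for entry in entries:
        if time_limit is not None and time.monotonic() - start > time_limit:
            raise LimitExceeded("time limit exceeded; refine the query")
        if evaluate(flt, entry):
            if size_limit is not None and len(results) >= size_limit:
                raise LimitExceeded("size limit exceeded; refine the query")
            results.append(entry)
    return results


# Hypothetical entries and filter: samples by one investigator that were
# not stained for CD19.
ENTRIES = [
    {"objectClass": ["sample"], "investigator": ["moore"], "stain": ["CD4"]},
    {"objectClass": ["sample"], "investigator": ["moore"], "stain": ["CD8"]},
    {"objectClass": ["sample"], "investigator": ["other"], "stain": ["CD4"]},
    {"objectClass": ["protocol"], "investigator": ["moore"]},
]
FLT = ("and",
       ("eq", "objectClass", "sample"),
       ("eq", "investigator", "moore"),
       ("not", ("eq", "stain", "CD19")))

found = filtered_search(ENTRIES, FLT, size_limit=100, time_limit=600.0)
```

Stopping with an error rather than silently truncating matches the behavior described in the talk: the caller is told the query was too broad and can say something more specific.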
Referrals and Federation
As I said, the Directory was
designed from the ground up to have referrals and
federation. What this means is, if we start with our local
server or some home server, it will return to us the results
of everything it found that matched our search criteria. In
addition, we can decorate those entries with referrals to
other LDAP servers so that it provides a list of other
servers that it thinks might have information that would
fulfill our query. Then we can tell the client either to
follow referrals automatically or to put up a list of the
sites to which it was referred and let us choose which ones
to follow.
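The two referral modes can be sketched with a pair of in-memory "servers". Everything here is hypothetical (the URLs, the entries, and the data layout); a real client would contact the referred servers over the network, but the control flow is similar.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical federation: each server holds some entries and knows of
# other servers that might also hold matches.
SERVERS: Dict[str, dict] = {
    "ldap://local.example": {
        "entries": [{"cn": "Anti-CD4"}, {"cn": "Anti-CD8"}],
        "referrals": ["ldap://vendor.example"],
    },
    "ldap://vendor.example": {
        "entries": [{"cn": "Anti-CD4"}, {"cn": "Anti-CD19"}],
        "referrals": ["ldap://local.example"],   # points back; must not loop
    },
}


def referral_search(url: str, predicate: Callable[[dict], bool],
                    follow_referrals: bool = False,
                    _seen: Optional[Set[str]] = None) -> Tuple[List[dict], List[str]]:
    """Return (matching entries, referrals left for the user to choose from)."""
    seen = set() if _seen is None else _seen
    seen.add(url)
    server = SERVERS[url]
    hits = [e for e in server["entries"] if predicate(e)]
    pending: List[str] = []
    for ref in server["referrals"]:
        if ref in seen:
            continue                       # already visited; avoid cycles
        if follow_referrals:
            more, more_pending = referral_search(ref, predicate, True, seen)
            hits.extend(more)
            pending.extend(more_pending)
        else:
            pending.append(ref)            # hand the choice back to the user
    return hits, pending


def is_cd4(entry: dict) -> bool:
    return entry["cn"] == "Anti-CD4"


local_hits, to_choose = referral_search("ldap://local.example", is_cd4)
all_hits, remaining = referral_search("ldap://local.example", is_cd4,
                                      follow_referrals=True)
```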
Attributes of Metadata
Two measures of metadata quality
have been discussed. One is the
20-year rule: If somebody goes to the directory and finds
these 20-year-old data, can a person who was not the primary
investigator find out enough information about the data to
make intelligent use of them? In regard to that concern,
biotechnology, instrument technology, and computer
technology have changed so fast that 20-year-old data really
are not relevant anymore.
There are attributes or aspects
of the metadata that we feel need to be there and need to be
well-defined in order for assessors to make use of the data.
These attributes include information about who collected the
data and on what instrument; references to what data-collection
protocol was used; and information about what reagents and
what antibodies were used. Again, we want to maintain a
directory of the reagents that we use in our facility so
that, instead of just having a common name for the reagent,
we can actually put a distinguished name for the reagent in
the metadata. That distinguished name would then refer
back to the card that really documents that particular
reagent. We feel this has a much wider applicability than
just to Flow Cytometry or any other particular field.
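Storing a reagent's distinguished name in the metadata, rather than its common name, might look like the sketch below. All names, attributes, and values are invented for illustration; the point is only that the metadata holds a pointer that can be resolved to the card documenting the reagent.

```python
from typing import Dict, List

# Hypothetical reagent directory: one card per reagent, keyed by its
# distinguished name.
reagent_directory: Dict[str, Dict[str, List[str]]] = {
    "cn=Anti-CD4 lot 17, ou=Reagents, o=Example Lab": {
        "cn": ["Anti-CD4 lot 17"],
        "description": ["monoclonal antibody, example lot"],
    },
}

# Hypothetical sample metadata: the reagent attribute holds a
# distinguished name, not just a common name.
sample_metadata: Dict[str, List[str]] = {
    "operator": ["uid=moore, o=Example Lab"],
    "reagent": ["cn=Anti-CD4 lot 17, ou=Reagents, o=Example Lab"],
}


def resolve(metadata: Dict[str, List[str]], attribute: str,
            directory: Dict[str, Dict[str, List[str]]]) -> List[Dict[str, List[str]]]:
    """Follow each distinguished name in `attribute` back to its card."""
    return [directory[dn] for dn in metadata.get(attribute, []) if dn in directory]


reagent_cards = resolve(sample_metadata, "reagent", reagent_directory)
```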
The Future of Directory Service
I mentioned that Stanford has a
directory now that is, in fact, using LDAP. I have also
talked about using these names as standard nomenclature. For
example, taxonomic nomenclature could be subsumed into an
X.500 scheme just by adopting a set of coding conventions,
and then existing nomenclatures can be mapped directly into
this scheme.
Reagents
I have talked a bit about
reagents. One of the things that we have been discussing
with the commercial producers of these reagents is putting
their catalogs online, essentially as directories. We would
have our individual directory, which would list the reagents
that we actually had in our freezer. But if we were using
some sort of experiment-planning tool and decided we wanted
to stain for something for which we didn't have a reagent,
the tool could potentially go out and look in the vendor's
catalog to determine whether some reagent exists or could be
ordered to fulfill that requirement.
Accessibility of Primary Data
Another thing we would like to
do using this directory structure is to make the primary
data accessible to later users. Flow data are much like
genetic-sequence data in that they are voluminous and very
complicated. They are difficult to analyze in one pass, but
unlike genetic data, they are not currently available or
databased after they are published. So, by establishing
standards for naming and retrieval of these data, we would
like to advance Flow Cytometry to the level at which
publications would reference the primary data, so that if
somebody wants to come back and use their favorite
visualization tool or analysis tool or compare the data with
their own data, they could have access.
Tag Sites
Eventually if we start making
entries for all the samples in our experiments and all the
reagents that we use, these subtrees or subcatalogs
essentially become equivalent to a notebook that is online
and searchable.
We think there are a lot of
applications for Directory Service beyond those we have
mentioned. For example, I am talking with the Human Genome
Project about what are called sequence-tagged sites, which are
useful in mapping the genome. We can see how there are many
sorts of collections that could be organized, or at least
cataloged, this way online.
For example, in radiation hybrid
mapping, a panel of clones is an entry that has an object
class and a common name. One example might have the
distinguished name of Panel G3. Another would be The Next
Generation (TNG) hybrid panel, which has a finer resolution
than the older G3 panel. A card
would represent a specific clone within that panel. There
are, I believe, almost 200 in the G3 panel, and that is a
pretty typical number. Then when we actually do the mapping,
we find that the tag site is about a 300-base sequence,
which is not polymorphic in the species and is present in
only one copy. So it would be a useful location map, showing
positional location along the chromosome. We might also have
a multiple-valued attribute for every clone in which this
tag site is present in that database. Raw data are used to
calculate what is called a map distance, which is
essentially the probability that two genes will associate
together into certain different fragments.
Finally there is a tag-site
card, which is defined in terms of a particular chromosome
relative to a particular species. It includes the tag
sequence itself, so that we can find it at a map location.
For example, we can ask for things that are near this
location on Chromosome 12, or for things that actually have
this sequence in them. The card also has a reference back to
a distinguished name and back to the radiation hybrid panel
that was used to make the map.
Taxonomic Names
Another application about which
I spoke is just subsuming existing biological nomenclatures
in a way that can be parsed and manipulated correctly and
automatically by machines. We can also have synonyms, and we
can have references.
Role in the Literature
Finally, we think that this
could also be used for literature searching and references.
For example, further inheritance from the person could
include attributes that are relevant, such as the
professional name (the name that a person uses to cite in
the literature), professional specialties, professional
affiliations, and so forth. There probably ought to be a
journal article card that gives us the title, the particular
issue or volume of some journal, and the distinguished names
of the authors; the distinguished names of the papers that
it references; potentially, citations of other papers that
reference it; and, for example, the abstract. Most of the
so-called Dublin Core elements are already defined within
the ISO standard: titles, dates, creators, abstracts,
descriptions, and so on.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.