Preprints of the Metadiversity Conference Proceedings

  Special Presentation

Perspectives on Information Management on the Internet

WAYNE MOORE, Senior Scientific Software Designer, Flow Cytometry Instrumentation and Software Development Group, Stanford University

ABSTRACT

We have recently developed an Internet-based system for acquiring, storing, retrieving, and working with complex flow cytometry data. This system, which incorporates radically different methods for serving Internet data, is applicable across a wide variety of scientific disciplines, ranging from genomics to astronomy. In addition, it offers more rapid and efficient access to published information and provides more efficient routes for sharing and combining information from disparate sources. This technology may be particularly useful for information storage and exchange in the biological and medical sciences and in other areas that similarly deal with very large distributed information collections that are difficult to serve with current approaches.

Our motivation for this project derives from the need to maintain and serve data from Flow Cytometry (FACS) instruments, which are used worldwide in basic science and medicine. These instruments, perhaps best known for their use in monitoring CD4 T-cell counts to evaluate HIV disease progression, are used to characterize and determine the functions of cells from organisms as different as Drosophila and man. We currently maintain an archive of over 200 GB of FACS data, collected mainly in basic science studies at Stanford over the last 15 years. We will use our new technology to organize and serve these data, along with newly acquired data, which will be collected locally at Stanford or elsewhere, stored at MSIA Management Sciences Associates, Pittsburgh, PA, and made available via the Internet to FACS users and other interested parties.

This system, which is built with readily available components, is extensible and broadly applicable. It offers innovative tools for serving the information located in genomics and other large databases, and for combining scientific information from these disparate sources. Thus, it provides a general model for facilitating the electronic interchange of scientific data and the publication of scientific findings. This work was supported by grants from the National Institutes of Health, LM04836 and CA42509.

In this presentation, I will describe the technology we plan to use and discuss its advantages over traditional relational database approaches to serving scientific information. These advantages include global information service, fine-grained access control, federated servers that need not be located within a single organization, and compatible client software that is widely available and runs on "lightweight clients" (e.g., PCs and Macs). I will illustrate this discussion with examples from our laboratory’s work in lymphocyte biology and flow cytometry and from a variety of other areas, including genetics, genomics, taxonomy, museum and scholarly collections, electronic publication, and scientific literature index services.

Earlier at this meeting, someone said that the nice thing about standards is that there are so many of them. Therefore, I’m sure it won't come as a shock to anybody that there actually is a standard for directories.

Directory Service

Directory Service is fundamentally a database. The standardization effort started in the late 1980s with the International Organization for Standardization (ISO) and a series of standards starting with X.500. It is a very complete, painstaking definition of all the fields of the directory, and it is based on the ISO Open Systems Interconnection (OSI) protocols, which are very heavyweight. Using them requires a very large buy-in, and they are not generally available on smaller PCs or for Macintosh-type clients.

In the early '90s, a proposal was made to the Internet Engineering Task Force for what was called the Lightweight Directory Access Protocol (LDAP), and this year it was adopted as a proposed standard for use on the Internet. In fact, some of you may be familiar with Yellow Pages, which is based on LDAP. So it is really there, it exists, and it is out on the Web.

Strategic Advantages

What are the advantages of a Directory Service over a more traditional database? One of the main advantages is that it provides a global naming system. ISO moderates the top level in the name space and parcels it out into countries and other organizations, which can then define lower-level standards. From the beginning it has dealt with one of the issues we have discussed at this meeting: synonymy, or having multiple names for things. The designers knew in advance that they were not going to be able to give a unique name to every single thing in the world. Therefore, we could have multiple common names. We could, for example, have a person who has multiple roles in different organizations, and those different organizations can have different entries, and those different entries can reference one another.

It was also realized that we weren't going to get the entire world's database in any given service. So it is designed from the outset to be interoperable and federated. It has very flexible searching mechanisms, including a syntax-specific searching mechanism, so that different ways of matching can be used for different types of data that are entered into the database.

Another aspect was the very fine-grained access control. It was known that we couldn't make every piece of every directory available to everyone, so there are very good controls on who can see what data elements in the directories. These are the strategic advantages of Directory Service.

Tactical Advantages

One of the tactical advantages of using Directory Service over other databases is that Directory Service is based on well-defined Internet standards. As I say, it is out there now. The United States government is using it. Stanford is using it. There are publicly available LDAP servers with e-mail addresses, telephone numbers, and so forth: the 411, Big Foot, and Yellow Pages kinds of directories.

Directory Service is supported by several big vendors, including Novell, Netscape, Sun, and IBM. In addition, I believe Microsoft has announced that Directory Service is going to become part of the next version of its operating system. There are client support packages in the C language and the Java language, and the packages are widely (and freely) available for all the major PC, Unix, and Macintosh platforms.

Directory Service Entries

The Directory is defined in terms of entries we like to call "cards," because each entry is the electronic equivalent of a Rolodex card or a catalog card: it is a defined bit of storage on which we can scribble down a wide variety of information attributes about some object, some person, some database, or some document.

The entries follow the so-called object model. This essentially means that we define a hierarchy of object classes. For example, the base standard defining a "person" has fields that describe the personal name, the given name, the surname, the common name, and contact information. That definition is then inherited by the "organizational person," which adds the person's relationship to some organization, and, in Netscape servers, by the "Inet Org Person" (inetOrgPerson), which adds the person’s e-mail address, home directory, and Web page URL.
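To make the inheritance chain concrete, here is a minimal Python sketch. It assumes abbreviated, illustrative attribute lists rather than the full schema definitions, and the helper names `allowed_attributes` and `validate` are hypothetical, not part of any directory product.

```python
# Illustrative only: the attribute lists are abbreviated, not the full schema.
OBJECT_CLASSES = {
    "person": {
        "parent": None,
        "required": {"cn", "sn"},
        "optional": {"telephoneNumber"},
    },
    "organizationalPerson": {
        "parent": "person",
        "required": set(),
        "optional": {"ou", "title"},
    },
    "inetOrgPerson": {
        "parent": "organizationalPerson",
        "required": set(),
        "optional": {"givenName", "mail", "labeledURI"},
    },
}


def allowed_attributes(object_class):
    """Collect required and optional attributes up the inheritance chain."""
    required, optional = set(), set()
    while object_class is not None:
        definition = OBJECT_CLASSES[object_class]
        required |= definition["required"]
        optional |= definition["optional"]
        object_class = definition["parent"]
    return required, optional


def validate(entry, object_class):
    """Return a list of problems: missing required or disallowed attributes."""
    required, optional = allowed_attributes(object_class)
    present = set(entry)
    problems = [f"missing required attribute: {a}" for a in sorted(required - present)]
    problems += [
        f"attribute not allowed: {a}" for a in sorted(present - required - optional)
    ]
    return problems
```

An entry validated as `inetOrgPerson` may carry `mail` because the subclass adds it, while the same entry checked against plain `person` would be rejected for that attribute.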

As I mentioned, every record has the potential to have individual-access control information on it, and every record is identified by a distinguished name. But a distinguished name need not be globally unique. The components of the distinguished name are themselves attributes in the records, so they can always be used in a search.

Finally, if that weren't flexible enough, there is the so-called extensible object class, which is allowed to have any attribute whatsoever. So we can define these things in a way that lets us start collecting the data before we define, or even identify, all the data we need. This is a very important ability in my field, which is basic research. In basic research, we often don't know, when we start a set of experiments, what all the relevant criteria are going to be.

Each attribute has a name, and each attribute has a syntax. The syntaxes are defined in what is called Abstract Syntax Notation One (ASN.1). The standard defines, for example, the case-exact string and the case-insensitive string. Different matching rules are used depending on the syntax, so that case-insensitive strings can be compared without considering case. Distinguished names themselves are defined as a syntax. Telephone numbers, for example, are a syntax in which the matching rule matches numerals but ignores punctuation. In the Department of Genetics, we are interested in extending this to include DNA sequences. But the searching rules are very different, because we want to take into account single-point mutations, deletions, or other sources of difference.
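The idea of syntax-specific matching can be sketched in a few lines of Python. This is a simplified illustration, not the standard's exact rules: the syntax labels in `MATCHING_RULES` are informal, and `dna_match` is a hypothetical rule that tolerates point substitutions only (it does not handle deletions).

```python
import re


def case_exact_match(stored, asserted):
    """Case-exact string: the values must be identical."""
    return stored == asserted


def case_ignore_match(stored, asserted):
    """Case-insensitive string: ignore case and runs of extra spaces."""
    def normalize(text):
        return " ".join(text.split()).casefold()
    return normalize(stored) == normalize(asserted)


def telephone_match(stored, asserted):
    """Telephone number: compare the numerals only, ignoring punctuation."""
    def digits(text):
        return re.sub(r"\D", "", text)
    return digits(stored) == digits(asserted)


def dna_match(stored, asserted, max_mismatches=1):
    """Hypothetical DNA rule: equal length, at most max_mismatches substitutions."""
    if len(stored) != len(asserted):
        return False
    mismatches = sum(1 for a, b in zip(stored.upper(), asserted.upper()) if a != b)
    return mismatches <= max_mismatches


# Informal syntax labels mapped to the rule used when comparing values.
MATCHING_RULES = {
    "caseExactString": case_exact_match,
    "caseIgnoreString": case_ignore_match,
    "telephoneNumber": telephone_match,
    "dnaSequence": dna_match,  # hypothetical extension
}


def matches(syntax, stored, asserted):
    """Compare two values using the matching rule registered for the syntax."""
    return MATCHING_RULES[syntax](stored, asserted)
```

Because the rule is looked up by syntax, adding a new syntax means registering one more function; the callers that compare values do not change.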

The Standard

The standard is defined in such a way that new syntaxes can be defined and new matching rules implemented in the server without breaking all the other levels of the protocol. An attribute can have one or more values. This makes it easy to have people with multiple telephone numbers, multiple addresses, or different common names. Attributes can be defined as optional or required, so it is also very easy to define a record that has a large number of attributes that are mostly not present, but occasionally are present, with very little overhead in the database.

Names Defined

Names that identify the entries of the database are composed of attribute=value pairs. The pairs are comma-separated, and the hierarchy is defined as reading from right to left. If a value contains special characters such as commas or equals signs, we can put quotes around it so that those characters can be included in the directory. The standard itself defines three varieties of names: geographic names, which are named relative to geographic or governmental entities; organizational names, which are made with respect to organizations; and domain names, which essentially subsume the Internet domain-name service.
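As a rough illustration of this naming syntax, the following Python sketch splits a name into its attribute=value pairs. It assumes only the quoting convention described here (it does not handle backslash escapes), and the example name in the usage note is invented for illustration.

```python
def parse_dn(dn):
    """Split a distinguished name into (attribute, value) pairs, leftmost first.

    Double quotes protect commas and equals signs inside a value.
    Simplified sketch: backslash escapes are not handled.
    """
    parts, current, in_quotes = [], [], False
    for ch in dn:
        if ch == '"':
            in_quotes = not in_quotes  # toggle quoting; drop the quote itself
        elif ch == "," and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))

    pairs = []
    for part in parts:
        # The first equals sign separates the attribute from its value.
        attribute, _, value = part.partition("=")
        pairs.append((attribute.strip().lower(), value.strip()))
    return pairs


def root_first(dn):
    """Read the name right to left: from the top of the tree down to the entry."""
    return list(reversed(parse_dn(dn)))
```

For a made-up name such as `cn="Moore, Wayne", ou=Genetics, o=Stanford University, c=US`, `parse_dn` keeps the quoted comma inside the common name, and `root_first` starts at the country and ends at the person.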

Flow Cytometry

This naming scheme can be extended to include other kinds of objects, for example a data-collection session, which is when a person comes to the instrument and collects data from one or more samples. It is qualified by the cytometer, which is a kind of instrument; by a unique identifier of the protocol; and then by the organization or organizational unit. It is expressed in terms of a data collection with a particular protocol coordinate that identifies which sample in the protocol it is. In this case, we would expect the intermediate levels to be optional, but we define the standard so that, for example, a small organization could define all its protocols uniquely within the organization. We could put the responsibility on the user to give unique identifiers. Or, if we have an intelligent instrument, the instrument itself could assign its own unique identifier, or we could use some combination of the above, and all of these forms could be put in the same directory or a federated directory.

Monoclonal antibodies are biotechnology tools that are very important in Flow Cytometry and several other fields. For example, I can take my distinguished name and further qualify it with a clone name to indicate some monoclonal antibody that I had produced. If this were a commercially produced antibody, we could also name it relative to the manufacturer.

On the other hand, when we are discovering genes, we frequently work with them for a long time before it becomes clear what they are and before everyone agrees on what they are. So I have a second form of the name that we can use to uniquely name a gene early on in the discovery process, relative to a specific investigator, with the assumption that when and if this does become a globally recognized gene, we will then exchange that name for a reference to the new standard name.

Directory Searching

Searching in directories is somewhat different from searching classical databases. We can search by object scope, which means that we are looking for some specific object. In other words, we give the qualifier that we want and it should come back with one object. We can also search for all the elements that are in one level in a tree. For example, we can find all the samples in a particular protocol, or we can search for a subtree and find all samples taken by a particular investigator.
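The three search scopes can be illustrated with a small in-memory tree. This Python sketch assumes a made-up directory in which each entry is identified by its path from the root; the attribute names `protocol` and `sample` and the function `search` are invented for the example.

```python
# Hypothetical directory: each entry is a path from the root down.
ENTRIES = [
    ("o=Example Lab",),
    ("o=Example Lab", "cn=Investigator A"),
    ("o=Example Lab", "cn=Investigator A", "protocol=P1"),
    ("o=Example Lab", "cn=Investigator A", "protocol=P1", "sample=1"),
    ("o=Example Lab", "cn=Investigator A", "protocol=P1", "sample=2"),
    ("o=Example Lab", "cn=Investigator A", "protocol=P2"),
    ("o=Example Lab", "cn=Investigator A", "protocol=P2", "sample=1"),
    ("o=Example Lab", "cn=Investigator B"),
]


def search(entries, base, scope):
    """Return the entries under `base` selected by `scope`.

    scope is "base" (the object itself), "onelevel" (its immediate
    children), or "subtree" (the object and everything beneath it).
    """
    if scope not in ("base", "onelevel", "subtree"):
        raise ValueError(f"unknown scope: {scope}")
    base = tuple(base)
    results = []
    for path in entries:
        if path[: len(base)] != base:
            continue  # not under the search base at all
        depth = len(path) - len(base)
        if scope == "subtree" or (scope == "base" and depth == 0) or (
            scope == "onelevel" and depth == 1
        ):
            results.append(path)
    return sorted(results)
```

A one-level search under protocol P1 returns its two samples, while a subtree search under Investigator A returns everything that investigator collected, which can then be narrowed to entries whose last component starts with `sample=`.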

Limiting a Search

Search filters are defined in terms of the usual Boolean operators: ANDs, ORs, and NOTs. And, as I discussed, we can use various exact and approximate matching patterns to search. We can also put limits on the search. We can say, for example, "I am expecting to get about 100 results here, so if you are finding 5,000 results, stop and tell me that before we grind our way through all the entries." Or we could say, "Don't spend more than 10 minutes looking for this. If you do, then I need to say something more specific."
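A Boolean filter with size and time limits can be sketched as follows. This Python example assumes filters written as nested tuples (a notation invented here, not the protocol's own filter syntax) and entries stored as dictionaries of multi-valued attributes; `LimitExceeded`, `evaluate`, and `filtered_search` are hypothetical names.

```python
import time


class LimitExceeded(Exception):
    """Raised when a size or time limit stops the search early."""

    def __init__(self, reason, partial):
        super().__init__(reason)
        self.partial = partial  # results found before the limit was hit


def evaluate(filter_, entry):
    """Evaluate a nested-tuple filter against one entry (dict of value lists)."""
    op = filter_[0]
    if op == "and":
        return all(evaluate(f, entry) for f in filter_[1:])
    if op == "or":
        return any(evaluate(f, entry) for f in filter_[1:])
    if op == "not":
        return not evaluate(filter_[1], entry)
    attribute, asserted = filter_[1], filter_[2].casefold()
    values = [v.casefold() for v in entry.get(attribute, [])]
    if op == "eq":
        return asserted in values
    if op == "substr":
        return any(asserted in v for v in values)
    raise ValueError(f"unknown filter operator: {op}")


def filtered_search(entries, filter_, size_limit=None, time_limit=None):
    """Return matching entries, stopping early if a limit is exceeded."""
    started = time.monotonic()
    results = []
    for entry in entries:
        if time_limit is not None and time.monotonic() - started > time_limit:
            raise LimitExceeded("time limit exceeded", results)
        if evaluate(filter_, entry):
            if size_limit is not None and len(results) >= size_limit:
                raise LimitExceeded("size limit exceeded", results)
            results.append(entry)
    return results
```

Raising an exception that carries the partial results lets the caller decide whether to show what was found so far or to ask for a more specific query.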

Referrals and Federation

As I said, the Directory was designed from the ground up to have referrals and federation. What this means is that, if we start with our local server or some home server, it will return to us the results of everything it found that matched our search criteria. In addition, it can decorate those results with referrals to other LDAP servers, so that it provides a list of other servers that it thinks might have information that would fulfill our query. Then we can tell the client either to follow referrals automatically or to put up a list of the sites to which it was referred and let us choose which ones to follow.
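The referral behavior can be simulated without any network access. In this Python sketch the servers, their URLs, and their contents are all invented, and `federated_search` is a hypothetical helper; it only illustrates the choice between following referrals automatically and reporting them back to the user.

```python
# Hypothetical federation: each "server" holds entries and referrals to others.
SERVERS = {
    "ldap://local.example": {
        "entries": [{"cn": "anti-CD4 (freezer 2)"}],
        "referrals": ["ldap://vendor.example"],
    },
    "ldap://vendor.example": {
        "entries": [{"cn": "anti-CD4 (catalog item)"}, {"cn": "anti-CD8 (catalog item)"}],
        "referrals": ["ldap://local.example"],  # a cycle, to show the guard
    },
}


def search_server(url, predicate):
    """Search one simulated server; return (matches, referrals)."""
    server = SERVERS[url]
    matches = [entry for entry in server["entries"] if predicate(entry)]
    return matches, list(server["referrals"])


def federated_search(start_url, predicate, follow=True):
    """Search the starting server and optionally chase its referrals.

    Returns (results, unfollowed): with follow=False, `unfollowed` lists
    the referred servers so the user can choose which ones to query.
    """
    results, unfollowed = [], []
    pending, visited = [start_url], set()
    while pending:
        url = pending.pop(0)
        if url in visited:
            continue  # never query the same server twice
        visited.add(url)
        matches, referrals = search_server(url, predicate)
        results.extend(matches)
        for referral in referrals:
            if follow:
                pending.append(referral)
            elif referral not in unfollowed:
                unfollowed.append(referral)
    return results, unfollowed
```

The visited set matters because federated servers may refer to one another; without it, two servers that point at each other would keep the client looping.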

Attributes of Metadata

Two measures of metadata quality have been discussed at this meeting. One is the 20-year rule: if somebody goes to the directory and finds 20-year-old data, can a person who was not the primary investigator find out enough about the data to make intelligent use of them? In regard to that concern, biotechnology, instrument technology, and computer technology have changed so fast that 20-year-old data really are not relevant anymore.

There are attributes or aspects of the metadata that we feel need to be there and need to be well defined in order for anyone accessing the data to make use of them. These attributes include information about who collected the data and on what instrument; references to what data-collection protocol was used; and information about what reagents and what antibodies were used. Again, we want to maintain a directory of the reagents that we use in our facility so that, instead of just having a common name for the reagent, we can actually put a distinguished name for the reagent in the metadata. That distinguished name would then refer back to the card that really documents that particular reagent. We feel this has a much wider applicability than just to Flow Cytometry or any other particular field.

The Future of Directory Service

I mentioned that Stanford has a directory now that is, in fact, using LDAP. I have also talked about using these names as standard nomenclature. For example, taxonomic nomenclature could be subsumed into an X.500 scheme just by adopting a set of coding conventions, and then existing nomenclatures could be mapped directly into this scheme.

Reagents

I have talked a bit about reagents. One of the things that we have been discussing with the commercial producers of these reagents is producing what are essentially online directories of their catalogs. So we would have our individual directory, which would list the reagents that we actually had in our freezer. But if we were using some sort of experiment-planning tool and decided we wanted to stain for something for which we don't have a reagent, the tool could potentially go out and look in the vendor's catalog to decide whether some reagent exists or could be ordered to fulfill that requirement.

Accessibility of Primary Data

Another thing we would like to do using this directory structure is to make the primary data accessible to later users. Flow data are much like genetic-sequence data in that they are voluminous and very complicated. They are difficult to analyze in one pass, but unlike genetic data, they are not currently available or databased after they are published. So, by establishing standards for naming and retrieval of these data, we would like to advance Flow Cytometry to the level at which publications would reference the primary data, so that if somebody wants to come back and use their favorite visualization tool or analysis tool or compare the data with their own data, they could have access.

Tag Sites

Eventually if we start making entries for all the samples in our experiments and all the reagents that we use, these subtrees or subcatalogs essentially become equivalent to a notebook that is online and searchable.

We think there are a lot of applications for Directory Service beyond those we have mentioned. For example, I am talking with the Human Genome Project about what are called sequence-tagged sites, which are useful in mapping the genome. We can see how there are many sorts of collections that could be organized, or at least cataloged, this way online.

For example, a radiation hybrid mapping panel is a panel of clones, and it has an object class and a common name. One example might have the distinguished name of Panel G3. Another would be The Next Generation (TNG) hybrid panel, which has a finer resolution than the older G3 panel. A card would represent a specific clone within that panel. There are, I believe, almost 200 in the G3 panel, and that is a pretty typical number. Then when we actually do the mapping, we find that the tag site is a sequence of about 300 bases that is not polymorphic in the species and is present in only one copy. So it makes a useful map location, showing position along the chromosome. We might also have a multiple-valued attribute listing every clone in that panel in which this tag site is present. These raw data are used to calculate what is called a map distance, which is essentially the probability that two genes will associate together into certain different fragments.

Finally, there is a tag-site card, which is defined in terms of a particular chromosome relative to a particular species. It includes the tag sequence itself and its map location, so that we can search either way: for example, "I want things that are near this location on Chromosome 12," or "I want things that actually have this sequence in them." The card also has a reference back to the distinguished name of the radiation hybrid panel that was used to make the map.

Taxonomic Names

Another application about which I spoke is just subsuming existing biological nomenclatures in a way that can be parsed and manipulated correctly and automatically by machines. We can also have synonyms, and we can have references.

Role in the Literature

Finally, we think that this could also be used for literature searching and references. For example, further inheritance from the person could include attributes that are relevant, such as the professional name (the name that a person uses to cite in the literature), professional specialties, professional affiliations, and so forth. There probably ought to be a journal article card that gives us the title of the article and the particular issue or volume of the journal; the distinguished names of the authors; the distinguished names of the papers that it references; potentially, citations of other papers that reference it; and, for example, the abstract. Most of the so-called Dublin Core elements are already defined within the ISO standard in terms of titles, creators, dates, abstracts, descriptions, and so on.


Copyright © 2003 NFAIS. All rights reserved. No part of this product or service may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written consent.
