Preprints of the
Metadiversity
Conference
Proceedings
Session 3: The Challenge in Earth Observation, Ecosystem
Monitoring, and Environmental Information
Beyond Metadata: Scientific Information Management
Approaches Supporting Ecosystem Monitoring and Assessment
Activities
JEFFREY FRITHSEN,
National Center for Environmental Assessment (NCEA) of the
U.S. Environmental Protection Agency’s Office of Research
and Development
ROBERT F. SHEPANEK,
Senior Scientist and Director of the Information Resources
Development Staff (IRDS) in the National Center for
Environmental Assessment (NCEA)
ABSTRACT
We present an
integrated vision for scientific information
management approaches supporting long-term
monitoring and assessment activities within the
USEPA’s Office of Research and Development (ORD).
This vision was developed based upon lessons
learned from the implementation of several
scientific information management systems and from
development of the ORD’s strategic and
implementation plans for scientific information
management. The vision reflects that effective
management of scientific information must address
technical, cultural, and management challenges.
Technical challenges include management and
integration of metadata, data, and the modeling,
analysis, and visualization tools used as part of
assessment activities. Cultural challenges relate
mainly to the protection of intellectual capital
produced by individual investigators. Management
issues include commitment of adequate resources for
systems development and operation, support for
related policies and procedures, and appropriate
incentives for involvement by staff and project
participants. Past experience with EPA and other
organizations has shown that the management issues
are frequently the most limiting to successful
implementation of integrated information management
solutions. USEPA ORD’s vision for information
management addresses the following technical
challenges: developing directories of environmental
resources collected and maintained by multiple
organizations; providing access to descriptive
information (metadata) sufficient to support
secondary use of those resources; integrating data
collected at multiple spatial and temporal scales;
and integrating data resources with analytical
tools and models. Metadata efforts have focused
initially on the development of environmental
resource directories enabling users to find data of
potential interest, and development of detailed
catalogs of descriptive information that enable
users to evaluate the use of data as part of some
assessment activity. In ORD’s strategy, the concept
of a data directory has been extended to include
analysis tools, models, documents, and multimedia
products to better reflect the complexity of
environmental inventory and monitoring activities.
Additionally, the strategic vision expands the
focus of technical efforts such that various levels
of metadata can support integration of data and
data systems and integration of data with modeling,
analysis, and visualization tools. This type of
integration becomes useful for integrated
assessments of biodiversity and is exemplified by
integration of project-specific systems with a
common data dictionary, or a common reference
database for taxonomy, such as the Integrated
Taxonomic Information System (ITIS). Effective
information management approaches supporting
monitoring and assessment activities must also
recognize that there exist significant cultural
challenges that must be met to ensure success of a
long-term monitoring project. The cultural
challenges relate to the sharing of data, loss of
control of the use of the data, and realizing
credit for collecting data, or adding value to
data. ORD’s vision for information management
addresses these challenges by leveraging technology
to restrict access to data and information as
assessment products are developed, and proposes an
incentive-based approach to catalyze sharing of
data.
The title of today’s talk is,
"Beyond Metadata: Scientific Information Management
Approaches Supporting Ecosystem Monitoring and Assessment
Activities." What we will be talking about is the need to
leverage information technologies as much as possible to
support all aspects of the environmental assessment
process.
Information Diversity
In order to set the stage and
put this topic in context, I would like to describe ever so
briefly the scientific information management environment
and what we are dealing with when we are talking about
scientific assessments. First of all, we are dealing with
information diversity. We are dealing with a lot of
different types of information–not just biodiversity. But
even if you are just looking at biodiversity, we still have
to bring in a lot of other types of data in order to deal
with the subject.
Environmental assessments are
becoming much more multidisciplinary. In many of the
government agencies, we have to consider environmental
assessments in the context of combining ecology with human
health. And all of a sudden, we have a whole mess of data
that we have to pull together. The big challenge in terms of
information management here is to manage many small pieces
of information and a few very large pieces of information.
We also have the scale problem
when we do an environmental assessment. For example, we can
start off with large remote sensing data sets. These are
large-scale data sets that may need to be combined with
regional monitoring studies in order to conclude something
about status and trends in the environment.
But even if we stop there, we
still don't have the full picture, because we haven’t yet
considered the ecological processes. Therefore we have to go
down to some site-specific intensive studies. It is the
combination of these three types of studies–large-scale,
regional, and site–that makes a complete environmental
assessment. This is not just a message from the EPA–this is
the Committee on the Environment and Natural Resources
Monitoring Framework that came out in 1996. The Framework is
a federal monitoring strategy to combine these three levels
of data in order to do environmental assessments. It is
actually pretty complex.
Systems Diversity
In addition to information
diversity, the other thing that we have in the scientific
realm is systems diversity. I am not talking about
ecological systems here–I am talking about data systems
where we have multiple information management systems. These
systems are all individually developed for individual
organizations and they, by-and-large, don't talk to each
other. So the challenge here is to develop and provide
interoperability between systems and with reference
databases.
If I am developing a database
for Project A here and another one for Project B there, then
one of the things that I want to bring is some consistency
in terms of the way I name data elements. If I refer to
water temperature in one way for one database, for example,
then I should refer to it in the same way in another
database. At the least, we must have some sort of translator
in-between the two databases that can interpret what has
been stored in each.
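The translator idea can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the
project names, the local column names, and the shared element
names below are all hypothetical, and a real common data
dictionary would carry units and definitions as well.

```python
# Hypothetical common data dictionary: maps each project's local
# column name to one shared element name. All names are illustrative.
COMMON_DICTIONARY = {
    "project_a": {"wtemp_c": "water_temperature_celsius", "sta": "station_id"},
    "project_b": {"WaterTemp": "water_temperature_celsius", "site": "station_id"},
}


def translate_record(project, record):
    """Rename a record's local field names to the shared element names.

    Fields with no dictionary entry are kept under their local name so
    nothing is silently dropped.
    """
    mapping = COMMON_DICTIONARY[project]
    return {mapping.get(name, name): value for name, value in record.items()}


def merge_records(records_by_project):
    """Translate records from several projects into one combined list."""
    merged = []
    for project, records in records_by_project.items():
        for record in records:
            translated = translate_record(project, record)
            # Remember where each record came from after merging.
            translated["source_project"] = project
            merged.append(translated)
    return merged
```

With a mapping like this, records from Project A and Project B
can be combined without anyone editing spreadsheet headings by
hand; the cost is that someone has to maintain the dictionary.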
We have heard before about the
Integrated Taxonomic Information System (ITIS). One of the
uses of ITIS is to promote a common way of naming the same
taxa or taxon. Well, our databases in Project A and Project
B, therefore, ought to refer to this reference database of
taxonomy in the same manner so that Project A and Project B
are calling the same species the same thing. Similarly, we
have the same problem with chemical names.
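As a rough sketch of how two project databases might refer to
one taxonomic reference, the Python below resolves locally used
names to a shared identifier. The in-memory table stands in for
a reference database such as ITIS; it does not use the actual
ITIS interface, and the identifier shown is made up.

```python
# Stand-in for a shared taxonomic reference database. The identifiers
# are made up for illustration and are not real ITIS serial numbers.
REFERENCE_TAXA = {
    "REF-0001": {
        "accepted_name": "Oncorhynchus mykiss",
        "synonyms": {"Salmo gairdneri"},
    },
}


def build_name_index(reference):
    """Index every accepted name and synonym (lowercased) by taxon id."""
    index = {}
    for taxon_id, entry in reference.items():
        index[entry["accepted_name"].lower()] = taxon_id
        for synonym in entry["synonyms"]:
            index[synonym.lower()] = taxon_id
    return index


def resolve(name, index):
    """Return the shared taxon id for a locally used name, or None."""
    return index.get(name.strip().lower())
```

If Project A stores an older synonym and Project B stores the
accepted name, both resolve to the same identifier, so the two
databases end up calling the same species the same thing. The
same pattern would apply to a reference list of chemical names.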
The final complexity here is we
have a very distributed workforce. Gone are the days of the
individual investigator in academia coming up with some
grand discovery and publishing it. No. Now we are forming
research teams that transcend organizational and geographic
boundaries. And the participants in those teams bring to the
ball game their own information technology, their own
information management environment. This means that there is
by necessity another level of integration required. The
challenge here is to link heterogeneous environments.
Three Challenges for
Scientific Information Management
This situation brings up various
challenges for scientific information management, and we
categorize them into three big categories. We have the
technical challenges–those that are related to the
management of metadata and the tools needed to complete
assessments. We have the management challenges–those that
have to do with providing adequate resources–we are always
asking for more money, right?–and also the support for
policies and procedures to make the information management
systems work. (Remember, a system is comprised of people,
software, and hardware. If management is not enforcing the
procedures, then people are not part of the equation there.)
And, we have the cultural challenges. The cultural
challenges relating to scientific information management
have to do with the protection of the intellectual property
rights of authors. If we don't acknowledge that, then we are
going to be developing systems that don't work.
Cultural Challenges
Let me start with the cultural
challenges. The cultural challenges basically are to provide
protection for the intellectual property rights of authors. If I as
an investigator have collected a chunk of data, I usually
want first publication rights to those data, because my
career depends upon getting the results of my work published
in a journal. If we don't acknowledge that, then we are not
going to have buy-in at the principal investigator level. At
the same time we are going to have to promote data-sharing
and, to a certain extent, change the thinking of the
scientific community. What we need to do is achieve
recognition that the publication of metadata and data is as
important as the publication of a journal article. One way
to achieve this is to work with the scientific societies,
the professional organizations, peer review panels, and so
on, to reinforce the fact that there ought to be "brownie
points" given out for someone who publishes metadata as well
as data. Because until they get that credit, until the
principal investigator can say, "Hey! I got something for
that," they are not going to do it. Earlier someone
mentioned a publication that came out a few months ago that
said exactly those things. And to reinforce that, one of
NASA's campaigns came up with a few "commandments" for their
working group. I will share just a couple of them:
- Thou shalt make thy data
available even unto thine enemies. (Now that is promoting
data-sharing!)
- Thou shalt release thy
data from bondage. (How many times have we heard about a
guy still sitting on the data two years after the research
is completed? Just hasn't published yet–and that doesn't
help the community.)
- Thou shalt not covet thy
neighbor's data until they have had a crack at them.
You may laugh and it may sound trite, but we do have
impediments like these that keep some scientists from using the
information management systems that we develop.
Management Challenges
Some of the management
challenges involve pleas for more money, that is, commitment of
adequate resources. Various publications advocate that
10 to 20 percent of the research budget be
allocated to information-management activities. Management
challenges are probably seen more in the beginning of a
program and less as time goes on. We need support for
related policies and procedures, and we need appropriate
incentives for the involvement of staff and project
participants. Again, this aspect relates to the need for
management to acknowledge that you published your metadata.
Technical Challenges
I classify the technical
challenges into two different types of needs. First, we need
tools to help users find relevant data and information in a
distributed environment. We need to provide adequate
descriptions of data so that a user can judge whether he or
she can use those particular data for some particular use
(often a use that was not considered by the guys who
originally collected the data). And secondly we need to
provide access to that metadata and the other resources.
Most of our scientific
information management efforts so far have focused on those
two needs, but there are some additional technical
challenges. We need to develop approaches and standards that
facilitate data integration. This will allow us to pull
together data from multiple data sets and have information
technology help with that process, instead of having to
change the headings in your spreadsheet, for example. We
need to enhance the interoperability of data systems. We
must develop and use some sort of intelligent agents that
can bring together information from multiple databases so
that data integration is not a lot of laborious work on the
part of individual investigators.
We are obviously providing some
modeling, analysis, and visualization tools now. However,
there has to be an integration of those tools with the data
themselves. In other words, choose your data set, choose
your tool, and information technology can bring them
together. I am not saying that we have all this developed,
and I am not saying we have all the answers. But this is the
vision of where we want to go. And I think information
technologies can be used to support more of these kinds of
activities, which are part of the assessment process.
EPA Efforts in Scientific
Information Management
Within the EPA we have recently
developed an implementation plan that spells out a vision
for information management within the Office of Research and
Development. The plan encompasses the next three to five
years, so not everything is in place yet. But the
crux of it is to leverage information management
technology to support all aspects of the assessment process.
Part of that is to adopt or develop (but hopefully adopt as
much as possible) approaches, standards, and procedures to
maximize the integration of data, data systems, models, and
other analysis tools. We are using information technology to
make this assessment activity more efficient.
We also are trying to integrate
as much as possible our efforts with ongoing national and
international efforts, because the EPA as an agency realizes
that we can't do it all, we certainly haven't done it all,
and–to some extent–we are behind organizations like NASA and
NOAA in having effective data-management policies and
systems in place.
This vision of the EPA’s Office
of Research and Development (ORD) attempts to address the
technical, management, and cultural challenges that I have
already discussed. This vision is developed and guided by
the newly formed ORD Science Information Management
Coordination Board, so that there is actually an
organizational entity within our shop that is trying to pay
attention to what information resources management should
provide to support the types of activities that EPA has to
conduct. If you wish, you can download the strategic plan
from the Web page (http://www.epa.gov/ord).
What we are trying to achieve in
terms of scientific information management systems is an end
result that combines these five elements:
- a metadata directory (how do I find something and describe
it?);
- a data format wizard (how do I bring together various types
of data that are in a distributed environment?);
- a geographic module (how do I deal with data that have some
sort of spatial context in terms of management and
reorganization?);
- a statistical module (how can I pull statistical routines
and combine them with data?); and
- a modeling module (how do I pull together all those various
models of atmospheric deposition, ground water infiltration,
agricultural runoff, and so on, with the data that I have?).
In application, what we envision is that at the
start of a project the principal investigator would come
along, enter their project description, and then begin to
discover the background material needed to start the project
using the metadata directory. As they pull together data
they would use something like the data format wizard for the
collection and integration of data. As they got into the
analysis they would use the other modules, such as the
geographic module, the statistical module, and the modeling
module, to analyze and add value to the data they pull
together. Finally they would produce the report, putting another
entry back into the metadata directory that essentially
tracks the project from there. Thus, the metadata directory
as we conceive it is fairly robust, representing various
types of metadata objects: data sets, databases, projects,
models, documents, and even multimedia material.
Recommendations
I would like to close with a few
lessons we have learned from going through the process of
trying to understand the scientific information management
environment. I will present these in the form of
recommendations. First, I would put forward three general
recommendations: 1) view information management as more than
just storing or capturing data sets and distributing them;
2) use an incremental type of process–start with the
metadata, go to the data, add on the tools, and so on; 3)
use the best practical technology. (Using state-of-the-art
technology usually means someone gets caught on the bleeding
edge and it is tough to be there, so opt for practicality.)
The 20-Year-Rule
Recommendation
Data are a resource that needs
to be protected. A lot of money goes into collecting data.
Some experts speak of the idea of "data entropy," where the
value of data is very high as the principal investigator
collects it. Gradually the value tapers off and goes off
into nothingness as distance and time get put between
the data collection effort and later steps. Data entropy
doesn't have to happen if there are adequate metadata. We
think in terms of a 20-year-rule. The 20-year-rule simply
asks: Will someone 20 years from now, not familiar with your
data, be able to use and understand the data solely with the
metadata that you provided?
I submit that there are not a
lot of records out there that could pass the 20-year-rule.
But if we could create such records now, we would avoid data
entropy in the future. We need robust directories of
environmental data, information, and tools, the types of things
that represent more than just data sets. And the metadata
standards that we use need to be developed based upon the
needs of science, which may mean they are not built from the
top down. With metadata, we need to start with the basics: a
basic entry with title, abstract, contact, description
of themes, spatial and temporal extent, and so on.
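A basic entry of this kind could be represented as a small
record type. The sketch below is an assumption about how such
an entry might look in Python: the field names follow the list
above, but the types and the completeness check are
illustrative only. A filled-in record is a precondition for the
20-year-rule, not a substitute for the judgment it asks for.

```python
from dataclasses import dataclass, fields


@dataclass
class BasicMetadataEntry:
    """Basic directory entry using the fields named in the talk.

    Field names and types are assumptions made for illustration.
    """
    title: str = ""
    abstract: str = ""
    contact: str = ""
    themes: tuple = ()
    spatial_extent: str = ""
    temporal_extent: str = ""


def missing_fields(entry):
    """List the basic fields that are still empty."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]
```

A data manager could run `missing_fields` before accepting an
entry into a directory, and then build outward from the basic
entry toward more detailed, discipline-specific metadata.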
Network Architecture
Recommendations
A few words about network
architecture: We have all come to the conclusion in the
field that we need to implement some sort of hybrid of
centralized and distributed approaches. A purely
centralized approach does not work, and a purely
distributed approach does not work either. We probably, at
a directory level, need to restrict network nodes to
summary-level metadata. Detailed metadata and the data
themselves are probably best stored close to the originating
sources. So there is some need for data archiving
facilities.
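One way to picture the hybrid approach is a central directory
that holds only summary records, each pointing back to the node
that keeps the detailed metadata. The Python below is a minimal
in-memory sketch; the class names, method names, and record
fields are hypothetical, and a real system would talk to the
nodes over a network rather than through object references.

```python
class SourceNode:
    """A node near the data originator holding detailed metadata."""

    def __init__(self, name):
        self.name = name
        self._detailed = {}

    def publish(self, dataset_id, summary, detailed):
        """Store detailed metadata locally and return the summary
        record that would be sent to the central directory."""
        self._detailed[dataset_id] = detailed
        return {"dataset_id": dataset_id, "node": self.name, **summary}

    def detail(self, dataset_id):
        """Return the detailed metadata kept at this node."""
        return self._detailed[dataset_id]


class CentralDirectory:
    """Central directory restricted to summary-level metadata."""

    def __init__(self):
        self._summaries = []
        self._nodes = {}

    def register(self, node, summary_record):
        """Add a summary record and remember its originating node."""
        self._nodes[node.name] = node
        self._summaries.append(summary_record)

    def search(self, keyword):
        """Case-insensitive keyword match on summary titles."""
        keyword = keyword.lower()
        return [s for s in self._summaries if keyword in s["title"].lower()]

    def fetch_detail(self, summary_record):
        """Follow a summary record back to its originating node."""
        node = self._nodes[summary_record["node"]]
        return node.detail(summary_record["dataset_id"])
```

Searching happens centrally against the small summary records,
while the detailed metadata (and, by extension, the data) stay
with the originating source until someone asks for them.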
Management Recommendations
Some suggestions for management:
Plan for and provide adequate resources. Again, it sounds
like I have my hand out, but we have all been involved with
projects that were at some point in time inadequately
funded. In addition, management needs to provide incentives
for data-sharing and publication. It also must get people to
use the systems that we develop, share the vision for an
integrated information management environment, and promote
collaborative efforts.
Within the EPA, for example, we
have an ultraviolet-B (UV-B) monitoring program, as well as
other atmospheric monitoring programs. Wouldn't it be neat
if we were developing an interoperable type of data
system for them? Well, before we weren’t, but now we are. We need to
link administrative management and scientific systems to
reduce the burden of preparing data documentation. For it is
burdensome. It does take time. If you describe the project
once for the budget people because you are about to go out
and spend extra dollars, for example, can't you use that
description as part of your description in your metadata
system?
Cultural Recommendations
Perceived threats of losing
intellectual property can impede the use of information
management (IM) systems. I think those threats are mostly
overstated and overemphasized, but they are real. They keep
people from using IM systems. Data-sharing as an approach needs
to be promoted, because data-sharing can lead to mutual career
advancement. I am reminded that the most influential or the most
interesting scientific advancements are often those that
result from merging two fields.
In addition, publication of
metadata and data should be recognized as a worthwhile
effort by peers. That idea is currently supported by several
journals, including those published by The Ecological
Society of America, The American Geophysical Union, and The
Geological Society of America.
Publishing Good Metadata
Finally, publication of good
metadata minimizes inappropriate use, another concern that
scientists have about giving up their data into a system.
The highest priority, though, in terms of doing
environmental assessments, is to develop good directories of
environmental data to help us find the information that is
already out there.
Copyright © 2003 NFAIS. All rights
reserved. No part of this product or service may be
reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without prior written consent.