
Introduction

The emergence of computer-enabling technologies in the late 20th century has
transformed the way science is practiced in the U.S. and around the world. Science
researchers and educators now take advantage of electronic literature databases and virtual
classrooms, and have benefited from sustained growth in all areas of low-, mid-, and
high-performance computing hardware. So profound has been the change that the use of
information, computational, and communication technologies is now an accepted part of
the research infrastructure we expect to maintain and grow in the 21st century.
Cyberinfrastructure needs differ from one scientific discipline to another. The chemistry
community encompasses researchers in a broad range of subdisciplines, spanning the core
chemical sciences, interdisciplinary activities at the interfaces with biology, geosciences,
materials, physics, and engineering, and the interplay of computational modeling and
prediction with experimental chemistry. It is
probably not surprising that, given the unifying chemistry theme, the subdisciplines share
common hardware needs, experience similar algorithmic bottlenecks, and face similar
information-technology issues. At the same time, each subdiscipline of chemistry has its
own unique cyberinfrastructure problems that are inhibiting scientific advances or
education and training in that area. Finally, the next generation of grand-challenge
chemistries could help to define and anticipate cyberinfrastructure bottlenecks in
advancing those emerging areas in the chemical sciences.
The NSF-sponsored workshop was organized around chemical sciences drivers,
with the primary purpose of determining how cyberinfrastructure solutions can enable
chemical science research and education, and how best to educate and train our future
workforce to use and benefit from cyberinfrastructure advances. Having identified
common cyberinfrastructure solutions for the chemistry community as a whole, and having
discerned the distinct needs of the chemical sciences drivers, we offer the following
recommendations based on the discussions held during the two days of the workshop.

Workshop organization
A scientifically diverse set of approximately 40 participants from academia,
government laboratories, and the private sector, along with observers from various
agencies as well as international representatives, was invited to attend the 2-day
workshop. The workshop participants gathered on the first night with a reception and
opening remarks from the co-organizers and representatives from the National Science
Foundation that defined the charge to workshop participants. The remainder of the evening
was devoted to talks by four plenary speakers who presented a broad vision regarding the
issues of cyberinfrastructure and its impact on the chemistry community. The plenary
speakers addressed the various levels of complexity, detail, and control desired by different
groups within the computational chemistry community, which ranges from algorithm
developers to point-and-click users. Their vision centered on a set of highly modular and
extensible elementary and composite workflow components that researchers can combine
to explore new uses of the codes. In practice this meant an
unprecedented level of integration of a variety of computational codes and tools including
computational-chemistry codes; preparation, analysis, and visualization software; and
database systems. Examples of development of such an infrastructure include the
“Resurgence” project presented by Kim Baldridge, and Thanh Truong’s ITR project,
“Computational Science and Engineering Online.”
The following day was devoted to breakout sessions organized around four
chemical sciences drivers: Core Computational Chemistries, Computational Chemistry at
the Interface, Computational and Experimental Chemistry Interaction, and Grand
Challenge Chemistries. These discussions were meant to first identify outstanding
scientific problems in those areas, and then focus on what cyberinfrastructure solutions
could advance those areas. Participants in the chemical sciences driver breakout sessions
were reshuffled so that all of the first session’s identified needs would be represented in
each of a second set of breakout sessions held later in the day addressing core
infrastructure solutions: Software and Algorithms, Hardware Infrastructure, Databases and
ChemInformatics, Education and Training, and Remote Chemistries.
On the final day, the facilitators for each session provided brief oral reports, which
they submitted later as written reports summarizing those discussions. These reports have
been assembled by the co-organizers to draw out the main findings and recommendations
that follow. We also gratefully acknowledge several previous workshop reports [1, 2] and a
blue-ribbon panel report [3], which have identified several important areas of
cyberinfrastructure development for the chemical sciences and engineering and which
served as background for the participants of the Cyber-enabled Chemistry workshop.

Chemical Sciences Drivers

1. Core Chemistry
Any endeavor called “cyber-enabled chemistry” should share at least two
prominent characteristics. First, cyber-enabled chemistry (CEC) relies on the presence of
ubiquitous and substantial network, information, and computing resources. Second, CEC is
problem-centered, and is directed in particular toward problems so complex as to resist
solution by any single method or approach. Giving some examples of problems in
chemistry that stand to benefit from better cyberinfrastructure may help bring the term
“cyber-enabled chemistry” into tighter focus. A key feature of the following illustrative
problems is that none of them can be solved using methods from any single subdiscipline
alone; a complete solution must combine experiment, theory, and simulation.
• Molecular design of new drugs, molecular materials, and catalysts
• Electrochemistry problems such as molecular structure of double layers, and active
chemical species at an electrode surface
• Influence of complex environments on chemical reactions, understanding and
predicting environmental contaminant flows
• Chemistry of electronically excited states and harnessing light energy
• Tribology, for example molecular origins of macroscopic behavior such as friction
and lubrication
• Rules for self-assembly at the nanoscale, with emphasis on non-covalent
interactions
• Combustion, which involves integrating electronic-structure, experiment, and
reaction-kinetics models

Cyber-enabled Challenges. Tools for CEC should allow researchers to focus on the
problems themselves by freeing them from an enforced attention to simulation details. While
these details should remain open to examination on demand, keeping them unobtrusive lets
researchers concentrate on the chemical problem being simulated. Cyberinfrastructure could
promote progress in the core chemistries by improving software interoperability, supporting
modeling to a prescribed accuracy against well-defined benchmark experiments and
simulation data, archiving and warehousing experimental and simulation
benchmarks, providing education in computational science and model chemistries, and
enabling collaborations between experimentalists and theorists.
Software Interoperability. The subdisciplines in theoretical chemistry generally
have their own associated public and private software; in many cases, no single code
suffices even to solve parts of the problem residing entirely within a single subdiscipline.
Therefore, interoperability between codes both within and across these subdisciplines is
required. There is a clear opportunity here for the NSF to play a strong role in encouraging
the establishment of standards such as file formats to enable interoperability. However,
such standards should be imposed only where the problems involved are sufficiently
mature, and these standards should be extensible.
A problem-oriented simulation language – a crucial missing ingredient in achieving
interoperability across subdisciplines – could be realized through the definition of large-
scale tasks with standardized input and output that can be threaded together in a way that
avoids detailed specification of what happens inside each task. Along with extensibility,
such a language needs to provide rudimentary type-checking to ensure that all the
information needed for any task will be available from prior tasks.
Benchmarks for Model Accuracy. Simulation has become increasingly accepted as
an essential tool in chemistry: for instance, many experimentalists use quantum-chemistry
packages to help interpret results, often with little or no consultation with
theoretical and/or computational chemists. One advance for furthering this acceptance was
the introduction of “model chemistries”: levels of theory that were empirically determined
to provide a certain level of accuracy for a restricted set of questions. This “prototyping
with prescribed accuracy” (PWPA) can be identified as a goal that needs to be achieved in
the broader context of chemical simulation. Ideally, each such PWPA scheme would be a
hierarchical approach with guaranteed accuracy (given sufficient computational resources)
based on standardized protocols that the community can benchmark empirically in order to
determine the level of accuracy expected.
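A minimal sketch of the idea (the method names, error bars, and costs below are hypothetical placeholders, not benchmarked values): given an empirically calibrated hierarchy, a tool could select the cheapest level of theory expected to meet a requested accuracy.

    # (name, benchmarked error estimate in kcal/mol, relative cost) -- all hypothetical
    MODEL_CHEMISTRIES = [
        ("composite-low",  3.0,   1),
        ("composite-mid",  1.0,  10),
        ("composite-high", 0.3, 100),
    ]

    def select_model(target_accuracy_kcal):
        """Return the cheapest benchmarked model expected to meet the target accuracy."""
        candidates = [m for m in MODEL_CHEMISTRIES if m[1] <= target_accuracy_kcal]
        if not candidates:
            raise ValueError("no benchmarked model meets the requested accuracy")
        return min(candidates, key=lambda m: m[2])

    print(select_model(1.0))   # -> ('composite-mid', 1.0, 10)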
Databases and Data Warehousing. Cyberinfrastructure solutions that can enable
PWPA include an extensive set of databases containing readily accessible experimental
and theoretical results. Such data would allow testing of a specific proposed modular
subsystem protocol (say, for use with a new database or new hardware) without carrying
out all of the component simulations comprising the protocol. Furthermore, theoretical and
experimental results must be stored with an associated “pedigree,” allowing database users
to assess the data’s reliability with estimated error bars and sufficient information to allow
researchers to revise these estimates if necessary. The NIST WebBook database of gas-
phase bond energies is a notable example of a current effort directed along these lines.
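One minimal sketch of what such a "pedigreed" database entry might contain (the field names and values are illustrative assumptions, not an existing schema):

    from dataclasses import dataclass

    @dataclass
    class PedigreedValue:
        quantity: str          # e.g., "gas-phase bond dissociation energy of X-H"
        value: float           # numerical value
        uncertainty: float     # estimated error bar, same units as value
        units: str             # e.g., "kcal/mol"
        method: str            # level of theory or experimental technique
        source: str            # citation, DOI, or archive identifier
        notes: str = ""        # assumptions a later user would need to revise the estimate

    entry = PedigreedValue(
        quantity="bond dissociation energy (hypothetical example)",
        value=104.2, uncertainty=0.5, units="kcal/mol",
        method="experiment, third-law analysis", source="doi:placeholder")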
Ideally, automated implementation of these protocols should actually proceed from
specification not of the protocols themselves but rather of a desired accuracy. The
simulation language would allow for automatic tests of the sensitivity of specified final
results (for example, the space-time profile of the concentration of some species in a
flame) to accuracy in component sub-problems (for example, computed reaction energies,
diffusion constants, and fluid properties, to name but a few).
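A bare-bones sketch of such an automated sensitivity test (the "flame_profile" model below is a hypothetical stand-in for an expensive composite simulation): perturb each component quantity by its error bar and record how much the final result moves.

    def flame_profile(reaction_energy, diffusion_constant):
        # stand-in for an expensive composite simulation returning a single figure of merit
        return 2.0 * reaction_energy - 0.5 * diffusion_constant

    def sensitivity(model, baseline, errors):
        """Change in the model output when each input is shifted by its error bar."""
        base = model(**baseline)
        return {name: model(**dict(baseline, **{name: baseline[name] + err})) - base
                for name, err in errors.items()}

    print(sensitivity(flame_profile,
                      baseline={"reaction_energy": 30.0, "diffusion_constant": 1.2},
                      errors={"reaction_energy": 1.0, "diffusion_constant": 0.1}))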
There are simulation cases where a compelling reason exists to archive relatively
vast amounts of simulation data. Such “community simulations” have value outside their
original stated intent because, for example, they provide initial conditions that can be
leveraged to go beyond the original questions asked, because they can be used in
benchmarking and creating standardized protocols needed for PWPA, or because they
provide useful model-consistent boundary conditions or averages for multiscale methods.
In the first category are simulations of activated events such as the microsecond folding
simulation of the Kollman group or the simulation of water freezing from the Ohmine
group. These simulations were so challenging because of the long time scales involved,
which are in turn directly related to the presence of free-energy barriers. Most methods
developed to model rare events without brute-force simulation techniques rely on the
availability of some representative samples where the event occurs. These “heroic”
simulations can be put to use in a more statistically significant context. Simulations of
water at different pressures and temperatures would fall in the second category – providing
benchmarks for empirical water-potential models and aiding development of standardized
protocols for simulations involving water as a solvent. In the last category are simulations
of biological membranes, which could be used as boundary conditions for embedding
membrane-bound proteins in studies of active site chemistry.
Education in Computational Science and Engineering. The development of an
improved cyberinfrastructure could go a long way towards achieving an equal partnership
between simulation and experiment in solving chemical problems. But such an equal
partnership implies a drastic change in the culture of chemistry. Thus, it is particularly
important that such a change be cultivated at the level of the undergraduate curriculum. In
some disciplines, such as physics, it is natural for students to turn to modeling as an aid in
understanding the ramifications of a complex problem, but this is rarely the case in
chemistry. Improvements in cyberinfrastructure will enable earlier and more aggressive
introduction of simulation techniques into the classroom. However, cultivation of a “model
it first” attitude among undergraduates and/or chemists as a whole is more useful if the
simulation data come with a trust factor, i.e., error bars. Development of PWPA techniques
is therefore critical. Moreover, the concept of error in simulation needs to be emphasized in
order that simulation tools are not misused.
Collaborations between Experiment and Theory. What would be the practical
outcome of “simulation and experiment as equal partners”? Computing and modeling at
the lab bench would become routine, both suggesting new experiments and, just as
valuably, helping avoid experiments with little or no hope of success. Numerous
cyberinfrastructure tools are explicitly designed to enhance or facilitate collaborations.
Thus, there is a double effect whereby cyberinfrastructure promotes collaborations, and
these collaborative efforts in turn increase the demand for improved cyberinfrastructure.
This is welcome, since increased collaboration between experiment and theory is a must
for progress on complex chemical problems and also for the validation of protocols needed
for PWPA.

2. Chemistry at the Interface


Many of the important potential advances for chemistry in the 21st century involve
crossing an interface of one type or another. Significant intersections of chemistry and
other disciplines (in parentheses) include:
• Understanding the chemistry of living systems in detail, including the development
of medicines and therapies (biology, biochemistry, mathematical biology,
bioinformatics, medicinal and pharmaceutical chemistry)
• Understanding the complex chemistry of the earth (geology, environmental science,
atmospheric science)
• Designing and producing new materials that can be predicted, tailored, and tuned
before production, including investigating self-assembly as a useful approach to
synthesis and manufacturing (physics, electrical engineering, materials science,
biotechnology)

Developing cyberinfrastructure to make these interfaces as seamless as possible
will help address the challenges that arise. It is important to acknowledge multiple types of
interfaces in the specification of needed infrastructure. One scientific theme underlying
many of the areas described above is the requirement to cross multiple time and length
scales. Examples range from representations or models for the breaking of bonds (quantum
chemistry) to descriptions of molecular ensembles (force fields, molecular dynamics,
Monte Carlo) to modeling of chemistry of complex environments (e.g., stochastic
methods) to entire systems. Today, computational scientists are generally trained in depth
in one sub-area, but are not expert in the models used for other time and length scales.
Herein lies a challenge, since frequently data from a shorter time/length scale is used as
input for the next model. Developing the interfaces between theory, computation, and
experiment is also required to understand a new area of science. But again, because of
specialization, no seamless interface exists between theorists, computational scientists, and
experimentalists.
Other interfaces deserve consideration, as well: Interfacing across institutions –
academic, industrial labs, government labs, funding agencies – is needed to disseminate
advances among the different institutions conducting research in the U.S., as well as across
geographical locations, to take advantage of research already done around the globe. Better
coordination between research and education is required to introduce new research topics
into the undergraduate and K-12 curriculum, as well as for explaining significant new
chemistry solutions that impact public policy such as stem-cell research or genetically
modified foods.

Cyber-enabled Challenges. Tools for cyber-enabled chemistry should allow for
clear communication across the interfaces, broadly defined. The science interfaces listed
above, for example, require scientific research involving complexity of representation, of
analysis, of models, and of experimental design and measurement, even within individual
sub-areas. How can all of the relevant information from one sub-area be conveyed (with
advice about its reliability) to scientists in other sub-areas who use different terminology?
How do we educate students at scientific boundaries, and also promote collaboration
across disciplines?
Chemistry Research and Education Communication. How do we present problems
to the broader research and education community in the most engaging manner? Different
disciplines may use different terminology and concepts to describe similar chemistry. A
science search-engine – such as the recently introduced Google Scholar – would be highly
desirable. Research, development, and use of ontologies, thesauri, knowledge
representations, and standards for data representation are all ways to tackle these issues,
and are computer science research efforts in their own right.
In some cases, a problem is best expressed at a higher level of abstraction, e.g., in a
way that a mathematical scientist can understand it and apply generic techniques to solve it.
Alternatively, a problem may need to be expressed in the language of a different sub-area
in order to encourage experts in that area to generate data needed as input to a model at a
different scale. It is therefore important to present theories and algorithms in a context that
can be understood by those in related fields. Along another dimension, free access to the
scientific literature carries particular importance for projects that span interfaces, because
much of the literature required for these projects is not in core chemistry areas, but rather
in a multitude of other disciplines. Today, in these cases, cost is a significant inhibitor to
learning from the literature. As a result, advancement of the science is slower than it could
be.
Members of a newly formed multidisciplinary research team must initially learn
about one another’s disciplines, and the results of their research must later be conveyed to
the wider academic community. Web pages, links to related literature, and online courses
may all be useful. In addition, the academic curriculum should be updated to teach the
basic concepts of multiple traditional disciplines (rather than just chemistry) to the next
generation of students so that they may more easily understand and contribute to new
areas.
Interfacing Data and Software across Disciplines. Potential inhibitors to clear
communication across interfaces and to the success of multidisciplinary projects include,
first, some sociological issues around data sharing. For example, traditional scientists, who
have been encouraged to become deep experts in a particular field, may not feel the need,
or have the time, to make their data available to others. To combat this and to encourage
widespread deposition of protein structural data, the Protein Data Bank (PDB) has
successfully allied itself with influential journals that frown on submission of articles
without deposition of data in the databank. However, it is not clear that this model can be
extended to all areas of chemistry. This raises another issue, namely support for centralized
data sources that can be centrally curated, as is the PDB, versus distributed autonomous
sources where the maintenance and support is managed more cost-effectively by multiple
groups, but where the data models and curation protocols are likely to differ, thus
hampering integration of the data.
Finding and understanding relevant data requires reliability and accuracy. Users of
data need to understand its accuracy and the assumptions used in its derivation in order to
use it wisely. Algorithm developers need to know what degree of accuracy is required (i.e.,
when their algorithm is good enough) for different uses of the data. Cyberinfrastructure
“validation services” could provide information on what to compare, how to compare, and
what protocols to follow in the comparison. Curation is essential for ensuring improved
reliability, although, in general, expert curation (unlike automated curation) is not scalable.
Data provenance and annotation are also important.
Standards are essential for interoperability among application programming interfaces
(APIs) as well as among data models, although the difficulties of standards adoption
should not be underestimated. Adoption of standards addresses only part of the problem;
there is also a significant need for robust software-engineering practices, so that software
developed for one subdiscipline can be easily transferred to codes for other
subdisciplines. This facilitates building on proven technology where appropriate, and argues
for sustained funding of software maintenance and support.
Development of Collaboration Tools. Cyberinfrastructure can help existing,
geographically dispersed teams communicate more effectively. Examples of useful
collaboration tools are those that would improve point-to-point communication with usable
remote-whiteboard technology, or would better enable viable international
videoconferencing, such as VRVS (Virtual Rooms Videoconferencing System). In order to
make many multidisciplinary projects successful, budgets have to cover technical people
who are conversant in several of the relevant disciplines, even if not with as much depth as
an expert in any one area. New information technology with sophisticated
knowledge representation may someday fill this gap, but in the foreseeable future such
people will be vital to the success of a multidisciplinary project. Large projects also need
expert project managers to facilitate collaborations and supervise design, development,
testing, and deployment of robust software. Funding should be available for
multidisciplinary scientific projects (as long as such funding does not negatively impinge
upon individual PI funding) that focus on novel science and novel computer science as
well as for projects that focus on novel science enabled by design and deployment of
infrastructure based on current technology. (In fact, any one project may involve both of
these aspects.)

3. Computational and Experimental Chemistry Interactions


Increasingly, experimentalists and computational chemists are teaming up to tackle
the challenging problems presented by complex chemical and biological systems.
However, many of the most challenging and critical problems are pushing the limits of the
capabilities of current simulation methods. A key aspect of the computational/experimental
interface is validation. This is a two-way process: High-quality experimental data are
necessary for validating computational models, and results from highly accurate
computational methods can frequently play an important role in validating experimental
data, or provide qualitative insight that permits the development of new experimental
directions.
Improved Computational Models that Connect with Experiment. A defining area of
intersection between computation and experiment – the use of efficient computational
models to drive experiments (for example, to predict optimal experimental conditions for
deriving the highest-quality experimental data, or to design "smart experiments") – could
lower cost of discovery and process design. Several key prospective infrastructural
advances will help researchers bridge the computational/experimental interface. High-
bandwidth networks will allow large amounts of data to rapidly move among researchers.
Robust visualization and analysis tools will give researchers better chemical insight from
data exchanges. Better data access and database querying tools are also needed. Continued
development of new and improved methods for modeling systems with non-bonded
interactions, hard materials (e.g., ceramics), and interfacial processes should be given high
priority. For many of these problems, there is a need to develop computational methods
that truly bridge different time and length scales. To validate such approaches, it is
important to generate and maintain databases with data from experiments and simulations,
which in turn means developing mechanisms for certifying data, establishing standardized
formats for data from different sources, and developing new tools (expert systems and
visualization software) for querying a database and analyzing data.
Promoting Experimental/Theoretical Collaboration. Real-time interactions
between experiments and simulations are needed if collaborating groups are to interact
effectively. Although groups are beginning to explore opportunities to enhance these
collaborations, the current lack of real-time interaction limits their full exploitation.
One problem is the time required to carry out experiments or simulations. Faster
algorithms and peak-performance models/methods should help facilitate cross-
computational/experimental interactions. Better software and analyses of experimental
data/databases, perhaps using expert systems, should help computational modelers access
experimental data/results. The computational/experimental interface also has an
educational dimension. The Internet era has made it easier than ever for experimental
and theoretical groups in different locations to interact. Modern cybertechnology also
makes it possible for these interactions to involve students and not just faculty members.

4. Grand Challenges in Chemistry


Some very broadly defined areas of chemistry may yield only to next-generation
technologies and innovations, which will in turn rely heavily on the development and
application of novel cyberinfrastructures enhancing both computational power and
collaborative efforts. In particular, three key grand challenges were discussed:
• Development of modeling protocols that can represent chemical systems at levels of
detail spanning several orders of magnitude in length scale, time scale, or both. For
example, modeling the materials properties of nanostructured composites might
involve detail from the molecular level up to the mesoscopic scale. The bulk
modulus of a solid, for example, is a property associated with the large scale. But
developing parameters to describe that scale in a tractable coarse-grained fashion
requires understanding the details of unit cell structures at the molecular scale.
• Development of modeling protocols that can represent very large sections of
potential-energy surfaces of very high dimensionality to chemical accuracy, typically
defined as within 1 kcal mol^-1 of experiment. This level of accuracy will be critical
to the successful modeling of such multiscale problems as protein folding,
aggregation, self-assembly, phase separation, and phase changes such as those
involved in conversion between crystal polymorphs. With respect to the latter
problem, simply predicting the most stable crystal structure for an arbitrary molecule
remains an outstanding grand challenge.
• Development of algorithms and data-handling protocols capable of providing real-
time feedback to control a reacting system actively monitored by sensor technology
(for example, controlling combustion of a reactive gas in a flow chamber).

Solving such multiscale problems requires transferring data among adjacent scales,
so that smaller-scale results can be the foundation for larger-scale model parameters and, at
the same time, the larger-scale results can feed back to the smaller-scale model for
refinement (e.g., improved accuracy). Addressing this grand challenge means developing
algorithms for propagating deterministic or probabilistic system evolution of fine and
coarse scales. In addition, most systems of interest are expected to be multiphase in nature,
e.g., solids in contact with gases, or high polymers in solution, or a substance that is poorly
characterized with respect to phase, as is a glass. Characterization of any of these systems
will require considering significant ranges of system variables such as temperature and
pressure. An added level of difficulty may arise when the system is not limited to its
ground electronic state.
Another key point is that it is not really the potential energy surface, but the free
energy surface, that needs to be modeled accurately. This requires an accurate modeling of
entropy. It is unlikely that ideal-gas molecular partition functions will be sufficiently
robust for this task. Improved algorithms for estimating entropy and other thermodynamic
parameters will be critical to better modeling in this area.
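For reference, the standard statistical-mechanical relations behind this point (textbook results, not specific to any method discussed at the workshop) can be written as

    F(x) \;=\; -\,k_B T \,\ln P(x), \qquad  F \;=\; \langle U \rangle \;-\; T S ,

where P(x) is the equilibrium probability density along a coordinate x; any error in the entropy S enters the free energy in direct proportion to the temperature, which is why crude (e.g., ideal-gas) partition-function estimates of S limit the attainable accuracy.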
For problem areas such as combustion and sensor control, attaining the speeds
needed for controlling combustion of a reactive gas in a flow chamber may require
development of specialized hardware optimized to the algorithms involved. In addition,
methods for handling very large data flows arriving from the sensors (and possibly being
passed to control mechanisms) will need to be developed.
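The feedback pattern being described is, at its core, a sense-compare-actuate loop run within a fixed time budget; a minimal sketch (with hypothetical sensor and actuator interfaces and an arbitrary proportional gain) is:

    import time

    def read_temperature():
        # stand-in for reading a sensor in the flow chamber
        return 1500.0

    def adjust_fuel_flow(correction):
        # stand-in for sending a command to an actuator
        pass

    SETPOINT = 1600.0   # target temperature, K (hypothetical)
    GAIN = 0.01         # proportional gain (hypothetical)
    PERIOD = 0.01       # control period, seconds

    for _ in range(5):  # a real controller would loop indefinitely
        start = time.perf_counter()
        adjust_fuel_flow(GAIN * (SETPOINT - read_temperature()))   # proportional feedback
        time.sleep(max(0.0, PERIOD - (time.perf_counter() - start)))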
All three grand challenges share several common features that will place an onus
on cyberinfrastructure development. First, model quality cannot be evaluated in the
absence of experimental data against which to conduct validation studies. Useful data are
not always available, and support for further measurement should not be ignored.
Centralization of validation data into convenient databases – ideally with quality review of
individual entries and standardization of formats – would contribute to more-efficient
development efforts. Second, model/algorithm development at all but the very smallest
scale inevitably involves some parameterization. Support for cyberinfrastructure tools that
might speed parameter optimization (e.g., via grid computing across multiple sites) and
simplify analysis of parameter sensitivity should also be a priority. Third, approaches to
grand-challenge problems will benefit from improvements in processor speeds, memory
usage, parallel-algorithm development, and grid-management technology. Finally, to
ensure the maximum utility of tools developed as cyberinfrastructure, developers need to
be multidisciplinary, either individually or as teams, so that the tools themselves will be
characterized both by good chemistry and physics and by good software engineering.

Cyberinfrastructure Solution Areas


1. Models, Algorithms & Software
Several models, algorithms, and software developments are needed to carry out
computational and theoretical chemistry that will enhance or be enhanced by developments
in cyberinfrastructure. Nearly all problems at the forefront of the chemical sciences require
bridging across multiple length and time scales for their solution. Techniques are needed to
reversibly map quantum-mechanical scales to atomic scales, atomic scales to mesoscales,
and mesoscales to macroscales. These mapping techniques may not generalize across all
chemical systems, materials, and processes. Consequently, development of coarse-grained
models and methods for bridging models across length and time scales is a high priority.
Models and Algorithms. At the level of quantum-mechanical (QM) methods,
accurate order-N methods are needed to enable the determination of ground and excited
states of systems containing on the order of several thousand atoms, and to calculate
system dynamics, kinetics, and transition states. Another approach worth pursuing is the
application of ensemble and sampling methods to QM problems with greater sophistication
and quality than is now possible. In the electronic-structure calculation of condensed
phases, there is a need to include relativistic effects for heavy metals and to develop
methods beyond density functional theory (DFT). It is not clear at this point how to
systematically improve DFT and related methods.
At the classical, atomistic level, more-sophisticated force fields, including reactive
force fields and empirical potentials that can realistically model heterogeneous, disparate
materials and complex (e.g., intermolecular) systems, will be beneficial. Development of
these approaches would benefit from access to systematic databases of QM and
experimental data and from new tools to automate force-field optimization. Databases
should be provided in standard data formats with tags, and it should be easy to use tools to
access data for validation and parameterization of models. Similarly, improved mesoscale
models and methods for the consideration of heterogeneous, multiphase, multi-component,
and multi-material systems are needed. Time-scale meshing is an issue, for example, in
combustion and environmental flow (e.g., through soil) problems. Additionally, methods
are needed to model both chemical and physical processes at the cellular level and such
basic processes as solvation.
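As a toy illustration of the kind of automated force-field optimization mentioned above (the quantum-chemistry data points and fitted functional form are hypothetical; a real tool would fit many coupled terms simultaneously), a single harmonic bond term can be fit to tabulated energies by least squares:

    # hypothetical (bond length in Angstrom, relative energy in Hartree) pairs from a QM database
    qm_points = [(0.90, 0.031), (0.96, 0.000), (1.00, 0.009), (1.10, 0.107)]

    def fit_harmonic(points):
        """Fit E(r) = 0.5*k*(r - r0)**2: coarse scan over r0, closed-form least squares for k."""
        best = None
        for i in range(800, 1101):
            r0 = i / 1000.0
            num = sum(e * 0.5 * (r - r0) ** 2 for r, e in points)
            den = sum((0.5 * (r - r0) ** 2) ** 2 for r, e in points)
            k = num / den
            resid = sum((0.5 * k * (r - r0) ** 2 - e) ** 2 for r, e in points)
            if best is None or resid < best[2]:
                best = (k, r0, resid)
        return best

    k, r0, resid = fit_harmonic(qm_points)
    print(round(k, 2), r0, round(resid, 6))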
Better algorithms – for instance, multiscale integrators, coarse-sampling methods,
and ergodic sampling methods, such as Wang-Landau and hyperparallel tempering – are
needed for global optimization of structures and, e.g., wave functions, as well as for the
efficient exploration of time or efficient generation of equilibrium or metastable states.
This would include better rare-event and transition-state methods, and, importantly, robust
methods for validating these methods. Additionally, “bridging methods” that hand off parts
of a simulation to different models and methods adaptively would permit seamless
bridging across scales without the need to fully integrate these methods. To accomplish
these goals, community-endorsed standard data formats and I/O protocols are needed, as
are modular programming environments.
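To make the Wang-Landau flat-histogram idea mentioned above concrete, the following is a minimal sketch for a deliberately trivial system (N non-interacting two-state spins, where the "energy" is simply the number of up spins, so the exact density of states is a binomial coefficient); it is meant only to illustrate the accept/update/flatness-check structure of the algorithm, not production practice:

    import math
    import random

    N = 10                        # number of independent two-state spins
    lng = [0.0] * (N + 1)         # running estimate of ln g(E)
    hist = [0] * (N + 1)          # visit histogram for the flatness check
    spins = [0] * N               # start with all spins "down"
    E = 0                         # current "energy" = number of up spins
    f = 1.0                       # modification factor on the ln scale

    while f > 1e-4:
        for _ in range(20000):
            i = random.randrange(N)
            E_new = E + (1 if spins[i] == 0 else -1)
            # accept a flip with probability min(1, g(E_old)/g(E_new))
            if random.random() < math.exp(lng[E] - lng[E_new]):
                spins[i] ^= 1
                E = E_new
            lng[E] += f           # always update the state we end up in
            hist[E] += 1
        # when the histogram is roughly flat, reset it and refine the update factor
        if min(hist) > 0.8 * (sum(hist) / len(hist)):
            hist = [0] * (N + 1)
            f *= 0.5

    # the estimated ln g(E) should roughly match ln C(N, E) up to an additive constant
    exact = [math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
             for k in range(N + 1)]
    print([round((lng[k] - lng[0]) - (exact[k] - exact[0]), 2) for k in range(N + 1)])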
Software Development Tools. Software tools enabling massively parallel
computation, such as scalable algorithms, improved networking and communication
protocols, and the ability to adapt to different architectures are needed. Improved
performance tools, such as profilers and debuggers, load balancing, checkpointing, and
rapid approaches to assess fault tolerance provide the necessary support for
parallelizing good, older serial codes and for generating large new
community codes. Tools to benchmark software accuracy and speed in a standard way
would be helpful, as would metrics for quality assurance and comparison. One way to
achieve this goal would be with community benchmarking challenges, similar to those for
predicting protein structures or the NIST fluids challenge. Inclusion of benchmarking as a
standard feature of larger projects should be encouraged. While visualization at the
electronic-structure and atomic levels is generally adequate, there is a need for easy-to-use,
affordable software to visualize unique shapes (e.g., nanoparticles, colloidal particles) and
macromolecular objects as well as to visualize very large data sets or multiple levels and
dimensions. Interactive handling of data streams would allow for computational steering
of simulations.
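One simple form such a standard benchmark could take (the workload, reference value, and tolerance below are arbitrary stand-ins): time a routine and compare its result to a reference value within a stated tolerance, reporting both in a uniform format.

    import time

    def benchmark(name, func, reference, tol):
        start = time.perf_counter()
        result = func()
        elapsed = time.perf_counter() - start
        passed = abs(result - reference) <= tol
        print(f"{name}: {elapsed:.4f} s, value={result:.6f}, "
              f"reference={reference:.6f}, within tolerance: {passed}")
        return passed

    # toy workload standing in for, e.g., an energy evaluation
    benchmark("toy-sum", lambda: sum(i * i for i in range(10**5)) * 1e-15,
              reference=0.333328, tol=1e-3)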
Finally, models, algorithms and software must be carefully integrated. Common
component architectures, common interfaces, and inter-language “translators,” for
example, will facilitate software integration and interoperability, while better methods of
integrating I/O, standardized formats for input (such as Babel and CML), and
error/accuracy assessments within a framework will facilitate integration. We encourage
the support, development, and integration of public-domain community codes and, in
particular, the solving of associated problems such as long-term maintenance, intellectual
property, and tutorial development (with recognition of the varying expertise of likely
users).
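To illustrate the flavor of such standardized-format translation (a minimal sketch: the attribute names follow common CML usage, but this is not a validated CML writer, and the geometry is a made-up example):

    import xml.etree.ElementTree as ET

    xyz = """3
    water (hypothetical coordinates)
    O  0.000  0.000  0.117
    H  0.000  0.757 -0.469
    H  0.000 -0.757 -0.469"""

    lines = [line.strip() for line in xyz.splitlines()]
    molecule = ET.Element("molecule", id="m1")
    atom_array = ET.SubElement(molecule, "atomArray")
    for i, line in enumerate(lines[2:]):                 # skip the count and comment lines
        element, x, y, z = line.split()
        ET.SubElement(atom_array, "atom", id=f"a{i+1}", elementType=element,
                      x3=x, y3=y, z3=z)

    print(ET.tostring(molecule, encoding="unicode"))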
Grid computing should be supported for model parameterization and to enable the
coupling of databases to computations, access pre-computed and pre-measured values, and
avoid duplication. Collaborations between chemical scientists and computer scientists for
continued development of tools, languages, and middleware to facilitate grid computing
should be encouraged. As grid computing increases, so will the need for reliable data-
mining tools. Data repositories – which should include negative computational results
where applicable – would also facilitate benchmarking.

2. Hardware Cyberinfrastructure
Hardware resources that can be brought to bear on a particular scientific challenge
are not hard to identify. Yet these resources are the underpinnings of all the layers of the
hierarchy of cyber-enabled chemistry (CEC). The areas discussed in this report – data and
storage, networking, and computing resources – do not cover the full gamut of what
constitutes the entire base infrastructure, but it is in these three areas that the need for
sustained funding and opportunistic possibilities is most acute.
Data and Storage. A low-latency, high-bandwidth, transparent, hierarchical storage
system (or systems) will be needed for facilitating federated-database access and archival
storage. The system interfaces must account for the mapping of data, information, and
knowledge at all levels. Significant infrastructure research will be required to federate and
use the multitude of databases that are and will be both available and required for CEC.
The fidelity and pedigree or provenance of that data must be maintained. With the near-
exponential growth of data now occurring, the situation will only get worse. The hardware
should be tuned to allow for ubiquitous access to the literature, and to existing chemistry-
specific databases such as the PDB, the NIST thermochemistry collections, and other
chemical-sciences and engineering databases. The hardware system functions should include the collection of
user-based and automatically generated metadata both from site-specific systems – e.g., the
NSF-funded supercomputer centers – and from applications run at those centers. Site
specifics from experimental facilities and their associated instruments need to be included,
as well. Finally, access to data must be carefully controlled at the user, group, facility, and
public levels. Users must be able to determine what data gets pushed out to the various
levels within the user community and how.
Networking. The network components of cyberinfrastructure are essential to the
success of any cyber-enabled science. Without a high-performance, seamless, robust
network, none of the components of cyber-enabled science will work. The network has to
identify and mitigate failures and performance issues in a productive way. Timely
integration of new technology into the network infrastructure is critical, and evaluation of
next- and future-generation innovations must be carried out on a continuing basis and in
cooperation with others evaluating the same or similar technology. The network
infrastructure’s critical nature, and the expected demands placed on it by CEC, demand
that the current network backbone be immediately upgraded to the latest production-
quality technology and that regional and network-wide links be upgraded in a timely
fashion. Furthermore, appropriate testbeds for evaluating new technology must be put into
place or strengthened where needed. Network research testbeds are available within other
federal agencies (e.g., DOE Office of Science) and should be leveraged where appropriate.
Computing Resources. The desktop, or principal-investigator, level of resources at
the medium-to-high end of the current market – used for individual work and for accessing
other cyber-enabled facilities on the network – is frequently adequate for users within the
CEC community. However, desktop resources must keep pace with changes in
cyberinfrastructure as base computing capability evolves. Periodic technological
refreshment of these resources must be supported on a continuing basis.
The next levels in the hierarchy are the department and larger regional capacity
centers. These are not high-end computing (HEC) resources with unique capabilities but,
rather, facilities with the capacity to stage jobs to the high-end resources and serve the
computational needs of scientists that do not require high-end resources. Currently these
are underrepresented resources in the NSF landscape. Many people use high-end resources
because those resources have significant capacity in addition to their capability. This
obviously results in reduced access to the capabilities of the high-end resources for the
capability user, while the capacity user often has throughput issues on his/her work. NSF’s
Mathematical and Physical Sciences Directorate, which includes the Chemistry Division,
and CISE should cooperate in developing new regional capacity centers, possibly in
partnership with existing ones, in addition to augmenting the resources at the HEC
facilities and targeting applications to appropriate resources. Funds for the staffing
necessary for maintaining, appropriating, and operating these expanded resources must be
provided.
The third level of the hierarchy is high-end computing (HEC) resources. NSF
national facilities such as NCSA, SDSC, and PSC, provided to meet the programmatic
expectations of the various NSF divisions, are critical for expediting grand-challenge
science, and they are the computational workhorses of cyber-enabled systems. Technology
refresh is extremely important for the HEC centers. Mechanisms to identify
experimental and future technologies, to take advantage of high-risk opportunistic
technologies, and to determine these technologies’ suitability for enhancing production-
quality capabilities for the computational-science community and cyber-enabled science –
always in a cost-effective manner, with engagement from the NSF computing community
and possibly in cooperation with other agencies – should be put in place.
At the highest level of the current hierarchy there is the TeraGrid, a large,
comprehensive, distributed infrastructure for open scientific research. CEC will make use
of the TeraGrid at some level. The NSF Chemistry Division must understand the
objectives and goals of the TeraGrid program and determine a path forward to exploit these
resources and any new cyber-enabled resources. It is likely that the TeraGrid will be one
component of cyber-enabled science.
The micro-architecture of all these resources must feature balanced characteristics
among the many subsystems available (e.g., memory bandwidth and latency must match
the computational horsepower, and cache efficiency is critical to achieve high
performance). Parallel-subsystem characteristics (e.g., inter-node communication systems
and secondary storage) must also be balanced. Deployed systems’ characteristics must be
appropriate for chemistry and chemical-engineering applications. In addition,
programming models and compiler technology must be advanced to increase programmer
efficiency. This is an obvious area for cross-collaborative development with other NSF
directorates and divisions.
Visualization systems will become even more critical to the insights and
engagement of experts and non-experts alike. It is easy to visualize a handful of atoms, but
simulations of protein systems with chemically reactive sites, reactive chemical flows,
microscopic systems, etc., will demand improved visualization capabilities, which the
chemistry community, in turn, must learn how to develop and use. Visualization tools such
as immersive caves, power walls, high-end flat-panel displays, and, most importantly,
appropriate software tools will lead the practicing cyber-chemist to new chemical
discoveries. Remote visualization of data via distributed data sources is a difficult
challenge, but one that must be met.

3. Databases and ChemInformatics


Shared cyberinfrastructure and data resources will be needed to solve grand-
challenge issues identified in recent NAS reports [1, 2], as well as improve baseline
productivity and enable day-to-day progress in scientific advancement. Current databases
are not the universal answer. For example, the current protein databases are very useful,
but we can’t mine them to learn about protein-protein interactions. However, several
concrete examples of forward thinking on database organization and querying exist:
• Protein Data Bank (PDB): The initial chaos of data access was exhilarating, and
many discoveries were made by virtue of having the data in one place. The
hierarchy and rules developed later.
• The Cambridge Structural Database is successful because it created a community of
users with a common need on a focused problem.
• Thermodynamics Resource Center at NIST is developing a program in dynamic
data evaluation whereby literature data is searched and a crude evaluation of
uncertainty is performed by an expert system.
• The JPL/NASA atmospheric chemistry and kinetics data-evaluation panel is an
example where standards have been agreed upon to evaluate data, but the rest of
the community does not embrace these standards.

However, a number of long-standing problems remain wholly or
partly unsolved with regard to databases, their management, and their use in the chemical
sciences. Data can exist in database form, in the more amorphous literature, or on
the Web. Cyberinfrastructure will be needed to provide tools to access data, organize it,
and convert it into a form that enables chemical insight and better technical decision
making, as well as to facilitate communication to and from non-experts to bridge the gap
between scientists and public perceptions. On one level, data can be defined as a
disorganized collection of facts. Information can be defined as organized data that enables
interpretation and insight. Knowledge is understanding what happens. Wisdom is the
ability to reliably predict what will happen. Cyberinfrastructure is essential to move
between these levels.
The activity of validation and consistency is extremely labor intensive, but
essential. Tools should be developed that can cross-check data as much as possible.
Experimental and computational results can be used for mutual screening. One example is
the automatic evaluation of consistency for thermodynamic-property data performed by the
NIST Thermodynamics Resource Center. Stored data and information should reference
details of how the data was acquired (the “metadata” or “pedigree”) to enable experts to
evaluate the data. It would be beneficial to have authors assign uncertainty to published
data, or to have sufficient information available for an expert or expert system to quantify
uncertainty in a measurement or predictive model. If data is very crude with a high
uncertainty (e.g., more than an order of magnitude), this is important for a non-expert to
know, as the person may have to engineer a system with greater allowances. Converting
non-evaluated data in paper legacy systems into a validated, refereed database is an
extremely time-consuming activity that is currently done by experts only in their spare
time. Is this a valuable use of an expert’s time? How else can we do it? Are there some
aspects of evaluation that could be performed by an expert system? How do we ensure that
data published from now on can be readily evaluated and captured? Developing new
approaches to these problems is an activity that should be supported.
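One example of a cross-check that an expert system could automate (a minimal sketch; the enthalpy values and uncertainties below are hypothetical): enthalpies around a closed thermochemical cycle should sum to zero within their combined uncertainties.

    import math

    def cycle_is_consistent(steps, k=2.0):
        """steps: list of (delta_H, uncertainty) around a closed thermochemical cycle."""
        total = sum(dh for dh, _ in steps)
        combined = math.sqrt(sum(u * u for _, u in steps))
        return abs(total) <= k * combined, total, combined

    steps = [(-57.8, 0.5), (41.2, 0.8), (16.4, 0.6)]   # kcal/mol, hypothetical
    ok, total, sigma = cycle_is_consistent(steps)
    print(ok, round(total, 2), round(sigma, 2))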
Standards will be very important for interoperability and communication, but how
to get people to adopt and use these standards? Should journals require them? The demand
on experts who evaluate data after it has been published could be relieved by having
journals require the entry of raw as well as evaluated data, uncertainty estimates, and
similar metadata. Standardization can enable automated data capture. Standardization
should be tiered, perhaps consistent with the maturity of the data or medium being
captured. Standardization would be very effective in capturing current literature, for
example. Raw data has the longest-lasting value and should be archived. Interpretations
may change over time as science and understanding progress. Long-term archiving of
legacy databases and other collections of knowledge, especially non-electronic ones, requires
experts to extract information. How will we need to access information in the future?
Visualization is critical. Data needs to be visualized in a manner consistent with
how a person thinks. This may vary to some degree depending upon the field of expertise:
i.e., visualization of the same process for a chemist may differ from what is most effective
for a physicist, materials scientist, biologist, or lay person. Creative visualization at a
fundamental level may be an effective way to bridge vocabulary and conceptual paradigm
differences across disciplines.
Other issues include new database paradigms for collection, archiving, data-mining
tools, validation, and retrieval needed to facilitate interdisciplinary collaborations.
Interoperability between databases at different levels, as well as improved user interfaces, will
facilitate data mining. Automated dictionaries and translators would greatly facilitate
communication between scientific disciplines and between scientists and the lay public;
this will require new software tools.
There is no substitute for critical thinking and the human element. Hence for
discovery, the emphasis in the near term should be on tool development to archive, gather,
extract, and present data and information in a manner that will enable creative thinking and
insight. The development of artificial intelligence to draw conclusions from data and
information at this stage may be best suited for collection and evaluation of objective,
quantitative information such as property data. This will certainly evolve in the future.

4. Education Infrastructure
Educational activities represent an important component of virtually all topics
associated with any emerging cyber-enabled chemistry (CEC) project or initiative. Because
of CEC’s multidisciplinary focus and the nature of new science potentially facilitated by it,
even individual investigators in single, well-defined subdisciplines of computational
chemistry are likely to benefit from educational components that might be developed.
Defining the possible scope of these educational endeavors, as broad as they might be, is a
helpful first step in this vision-generating process. The prototype audiences for educational
efforts can be divided into four groups, though they undoubtedly share some information
needs.
The first group is composed of research scientists, both within chemistry and from
allied fields. While disparities certainly exist, by and large the individuals in this group can
be described as problem-solving experts with strong motivation and capabilities for
learning. Thus, non-chemists may need to become more proficient in molecular sciences,
while chemists might require education in computational methodologies and limitations –
but all members of this category are probably capable of self-directed instruction given
appropriate materials.
A second category of professionals who will require education comprises established and
emerging educators themselves. If students are to be reached, particularly at early points in
their studies, those who teach them will need both greater depth and greater breadth of
information about the nature of CEC. High-school teachers in particular may face barriers
to learning that are associated with missing background (in physical chemistry or
mathematics) or concomitant fear of materials that appear to require extensive background
in the areas to be understood.
Educators, either at the high-school or introductory college levels of the
curriculum, serve as the conduit for the next vital group of people, students (in the
traditionally defined sense). Computationally-related science in general, and CEC in
particular, should be infused into the undergraduate curriculum. Inclusion of computational
philosophy in the undergraduate chemistry curriculum is important both from an
educational perspective, and from a pragmatic one: Future scientists who populate the
world of CEC will need longer and more-complete training in scientific computing and
modeling methodologies. Computational approaches must be introduced carefully into
classroom and laboratories, so as to avoid fueling student misconceptions about chemistry
– a possibility probably necessitating continuing educational research as new materials,
methodologies, and curricula are developed.
The fourth identifiable group consists of the general public and legislative
leadership within political bodies, particularly at the federal level. CEC is likely to provide
the ability to construct complex models whose accomplishments and limitations must be
communicated in an intelligible way to the general public. Because there is some tendency
of non-scientists to view science as a body of factual information, it will be necessary to
address and, if possible, forestall confusion associated with findings from CEC-related
models that, while important in advancing understanding, do not represent the “final” step
in the treatment of a particular topic. The intricacies of complex modeling may not need to
be expounded upon in this forum, but the process of model building, with an eye to both its
probabilities for success and its potential limitations, should be presented.
Cyberinfrastructure Challenges. The educational demands of training for a
multidisciplinary CEC environment pose specific challenges and opportunities. It is
possible to specify certain educational activities likely to support large-scale development
and deployment of CEC in the future, such as instructional materials and software and
middleware.
Instructional materials. Self-training or tutorial materials are an important
component of the educational strategies for several of the four groups delineated above.
These materials should be developed with several factors in mind. First, materials for
learning computational chemistry, visualization tools, and reaction animations already
exist; we need to benefit fully from lessons gained in the creation of those materials.
Second, in the same way that workflow models for the CEC enterprise require human-
factors research, educational materials must be tailored to the cognitive demands made on
the target audiences, the different cognitive capabilities of different audiences, and the
fundamental constraints associated with any asynchronous learning environment. Because,
for example, online educational materials tend to stress certain innate cognitive skills in the
learner, educational research regarding differences in learning styles is likely to be
valuable as part of the development process. Ultimately, materials development in CEC-
oriented education may best proceed as multiple small projects, with users competitively
“rating” the materials offered. A self-assessment component to calibrate users’ learning
gains might also provide important systematic data for educational assessment and overall
project evaluation.
Software and middleware. Educational software and middleware constitute another
important area for future development. It will be important to develop interdisciplinary
educational materials and programs that include both computational chemistry and
more-fundamental computational sciences. As CEC works to help practitioners maximize
the efficiency of their modeling efforts, the same lessons learned in improving research
productivity will probably also enhance new learning materials’ effectiveness. Interfaces to
state-of-the-art computational resources for novices should be both intelligent and multi-
level. As novice users gain proficiency in specific modeling technologies, the educational
interfaces they are using should automatically allow more flexibility for those users and
reveal new, more complex modes of the CEC environment. Some students in this category
will be expert learners already. The CEC modeling environment’s power as well as its
limitations should be emphasized in ways that scientists who are not computer modelers
will find useful.
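One concrete way to picture such a multi-level interface is sketched below in Python. This is a hypothetical illustration only: the class, option tiers, and promotion rule are invented for this example and do not describe any existing CEC software. The idea is simply that the front end records a user's successful use of the current tier and reveals additional options only as proficiency grows.

# Hypothetical sketch of a proficiency-aware interface to a modeling code.
# All names (ModelingFrontEnd, option tiers, promotion rule) are invented for
# illustration; they do not refer to any existing CEC software.

NOVICE, INTERMEDIATE, EXPERT = 0, 1, 2

# Options revealed at each proficiency tier.
OPTION_TIERS = {
    NOVICE:       ["molecule", "method=default", "run"],
    INTERMEDIATE: ["basis_set", "convergence_threshold"],
    EXPERT:       ["custom_workflow", "solver_internals"],
}

class ModelingFrontEnd:
    def __init__(self):
        self.level = NOVICE
        self.completed_runs = 0

    def record_successful_run(self):
        """Promote the user after repeated successful use of the current tier."""
        self.completed_runs += 1
        if self.completed_runs >= 5 and self.level < EXPERT:
            self.level += 1
            self.completed_runs = 0

    def visible_options(self):
        """Expose only the options appropriate to the user's current level."""
        options = []
        for tier in range(self.level + 1):
            options.extend(OPTION_TIERS[tier])
        return options

# Example: a novice sees only the basic controls at first.
ui = ModelingFrontEnd()
print(ui.visible_options())          # ['molecule', 'method=default', 'run']
for _ in range(5):
    ui.record_successful_run()
print(ui.visible_options())          # basic plus intermediate options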
Specific issues associated with cyberinfrastructure are common to educational
efforts as well. Cybersecurity is important across the entire spectrum of CEC deployment,
including educational developments. Networking-infrastructure disparities (including those
that exist on local scales within most or all educational institutions) may play a bigger role
in the educational components of CEC than in its research components, where
infrastructure is more nearly uniform. Finally, changes in hardware availability in the
educational environment may make issues such as interoperability of educational
materials even more pressing within education than within CEC as a whole.

5. Remote Chemistry
The emerging national cyberinfrastructure is already enabling new scientific
activities through, for example, the use of remote computers, the development and use of
community databases, virtual laboratories, electronic support for geographically dispersed
collaborations, and numerous capabilities hosted as web-accessible services. New virtual
organizations are being established (such as virtual centers of excellence) that assemble
distributed expertise and resources to target research and educational grand challenges.
Cyberinfrastructure research currently being driven by other scientific domains – in areas
such as scientific portals; workflow management; computational modeling; and data
analysis, visualization, and management – is clearly relevant to chemistry as well.
However, certain characteristics of the chemistry community – specifically, the
broad range of computational techniques and data types in use and the large number of
independent data producers – pose unique challenges for remote chemistry. Distributed-
database federation, sample and data provenance tracking (e.g., as in laboratory
information management systems, or LIMS), and mechanisms to support data fusion and
community curation of data are particularly relevant to chemistry and thus are areas where
this community may drive cyberinfrastructure requirements. In addition, environmental
chemistry – which may soon involve experiments drawing data from thousands to millions
of sensors – and high-throughput chemistry will be leading-edge drivers for new
cyberinfrastructure capabilities.
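As a concrete, if simplified, picture of what provenance tracking entails, the following Python sketch shows how each derived data set might carry forward a record of the samples, instruments, and processing steps from which it was produced. The field names and identifiers are invented for illustration and do not correspond to any real LIMS schema or community standard.

# Minimal, hypothetical sketch of a provenance record for a derived data set.
# Field names are illustrative only, not an existing LIMS or community standard.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProvenanceRecord:
    dataset_id: str
    sample_ids: List[str]              # physical samples the data derive from
    instrument: str                    # instrument (or code) that produced the raw data
    processing_steps: List[str] = field(default_factory=list)
    parents: List[str] = field(default_factory=list)   # ids of upstream data sets

    def derive(self, new_id: str, step: str) -> "ProvenanceRecord":
        """Create a child record that remembers its parent and the step applied."""
        return ProvenanceRecord(
            dataset_id=new_id,
            sample_ids=list(self.sample_ids),
            instrument=self.instrument,
            processing_steps=self.processing_steps + [step],
            parents=self.parents + [self.dataset_id],
        )

raw = ProvenanceRecord("nmr-0001-raw", ["sample-42"], "600 MHz NMR")
fitted = raw.derive("nmr-0001-fit", "peak fitting with baseline correction")
print(fitted.parents)            # ['nmr-0001-raw']
print(fitted.processing_steps)   # ['peak fitting with baseline correction']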
While the term “remote chemistry” suggests an emphasis on bridging physical
distances, the more challenging gulfs to bridge are, in fact, differences in the cultures,
levels of expertise, organizational practices and policies, and scientific vocabularies of
distributed collaborations and organizations. Close interaction between practicing
chemists and information technology developers, iterative approaches to development and
deployment, and mechanisms to share best practices will all be critical in developing new
remote-chemistry capabilities to meet the needs of a diverse chemistry community.
Remote communities and practitioners may also be confronted with currently
poorly understood social/cultural constraints. For example, members of certain
constituencies (e.g., based on ethnicity, race, culture, nationality, age, and/or gender) may
adapt to the remote-community concept far more readily after initial strong personal or
even face-to-face contact with the other members of the community. Expecting remote
communities to develop spontaneously and rapidly in a manner that reflects the current
population of interest groups may or may not be realistic. Social research may be necessary
to understand how to expand remote communities to accurately reflect national and
international demographics.
Access to and Use of Remote Instruments. Advances in information technologies
have made it possible to access and control scientific instruments in real-time from
computers anywhere on the Internet. Technologies such as Web-controlled laboratory
cameras, electronic notebooks, and videoconferencing provide a sense of virtual presence
in a laboratory that partially duplicates the experience of being there. More than a decade
of R&D and technological evolution has greatly reduced the time and effort required to
offer secure remote-instrument access and proved the viability of remote-instrument
services. Facilities such as Pacific Northwest National Laboratory’s Virtual NMR
Facility have migrated from research projects to ongoing operations, and setting up
new instruments for remote operation can now be as simple as running screen-sharing
software or enabling remote options in control software (e.g., in National Instruments’
LabVIEW).
The numerous benefits provided by access to remote instruments include sharing
the acquisition, maintenance, and operating costs of expensive, cutting-edge instruments;
broadening the range of capabilities available to local researchers and students; more-
effective utilization of instruments; and easing the adoption of new techniques in research
projects and student laboratory courses. While there can be drawbacks to remote facilities
– for example, conflicts between the service and research missions of a facility, loss of
“bragging rights” and control of instruments, and loss of contact with colleagues at an
instrument site – the potential benefits far outweigh the drawbacks.
Enhanced access to remote instruments would benefit the chemistry community.
Remote access to expensive, high-end, state-of-the-art instruments will maximize their
scientific impact, serve broader audiences, and allow more widespread use of current-
generation technologies in both research and education. Technical support for planning and
operating facilities will be a key enabler. On the other hand, adoption of this new research
and education mode is limited by potential users’ and facility operators’ unfamiliarity with
state-of-the-art networking and distributed-computing technologies and with best practices
developed by current remote facilities, as well as by the learning curve associated with the
software tools themselves.
Cyberinfrastructure research in support of remote facilities will be needed in
several areas, including the continuing improvements in ease of use and support for
multiple levels of instrument access (e.g., simplified interfaces for novice users or the
ability to allow data collection while prohibiting instrument recalibration), mechanisms for
coordinating across experiments (e.g., experiments guided by simulation results or by other
experiments, or creating large-scale shared community data resources that aggregate
individual remote experiments), and managing distributed facilities (e.g., with instruments
and experts in various techniques at multiple facilities).
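The following Python sketch illustrates, under purely hypothetical role and operation names, what "multiple levels of instrument access" might look like in a remote-control service: a student account may collect data, while recalibration remains reserved for facility operators.

# Hypothetical sketch of tiered permissions for a remotely operated instrument.
# Roles and operation names are illustrative, not those of any real facility.

PERMISSIONS = {
    "observer": {"view_status", "download_data"},
    "student":  {"view_status", "download_data", "collect_data"},
    "operator": {"view_status", "download_data", "collect_data",
                 "change_sample", "recalibrate"},
}

def authorize(role: str, operation: str) -> bool:
    """Return True if the given role may perform the requested operation."""
    return operation in PERMISSIONS.get(role, set())

def remote_request(role: str, operation: str):
    if not authorize(role, operation):
        raise PermissionError(f"role '{role}' may not perform '{operation}'")
    print(f"{operation} accepted for role '{role}'")

remote_request("student", "collect_data")     # allowed
try:
    remote_request("student", "recalibrate")  # blocked: reserved for operators
except PermissionError as err:
    print(err)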
Access to and Use of Advanced Computational Modeling Capabilities.
Computational chemistry, in all of its forms, has made enormous advances. It is now
possible to predict the properties of small molecules to an accuracy comparable to that of
all but the most sophisticated experiments. Computational studies of complex molecules
(e.g., proteins) have provided insights into their behavior that cannot be obtained from
experiment alone. Investments are still required to continue to advance the core areas of
computational chemistry. But high-bandwidth networking, remote computing, and
distributed data and information storage, along with resource discovery and wide-area
scheduling, promise to spark the development of new computational studies and
approaches, providing opportunities to solve large, complex research problems and open
new scientific horizons.
Of particular interest here are portals, workflow management, and distributed
computing and data storage, especially as envisioned in the notion of the “grid,” whose
goal is to couple geographically distributed clusters, supercomputers, workstations, and
data stores into a seamless set of services and resources. Grids have the potential to
increase not only the efficiency with which computational studies may be performed but
also the broader community’s access to computational approaches. In this regard, an
important target for the chemistry community will be to develop tools that allow the
scientist to couple computational codes together to build complex, flexible, and reusable
workflows for innovative studies of molecular behavior.
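The following Python sketch suggests what such a coupling tool might look like from the chemist's perspective. The step functions are hypothetical stand-ins for wrappers around real codes (for example, a geometry optimizer feeding a spectrum calculation) and are not tied to any particular grid middleware.

# Hypothetical sketch of chaining two computational-chemistry steps into a
# reusable workflow.  The step functions stand in for wrappers around real
# codes; all names here are invented for illustration.

def optimize_geometry(molecule: str) -> dict:
    # In practice this might submit a job to a remote cluster and wait for results.
    return {"molecule": molecule, "geometry": "optimized-coordinates"}

def compute_spectrum(result: dict) -> dict:
    # Consumes the output of the previous stage as its input.
    return {**result, "spectrum": "simulated-IR-spectrum"}

def make_workflow(*steps):
    """Compose independent steps into a single reusable pipeline."""
    def run(initial_input):
        data = initial_input
        for step in steps:
            data = step(data)
        return data
    return run

ir_workflow = make_workflow(optimize_geometry, compute_spectrum)
print(ir_workflow("caffeine"))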
Collaboratories. Collaboratories enable researchers and educators to work across
geographic and organizational as well as disciplinary boundaries to solve complex
scientific and engineering problems. They enable researchers and educators to share
computing and data resources as well as ideas and concepts, to collaboratively execute and
analyze computations, to compare the resulting output with experimental results, and to
collectively document their work. While early collaboratories tended to focus on rich
interactions in small groups or lightweight coordination within a community, next-
generation collaboratories will be able to operate far more effectively, allowing large
groups to organize to tackle grand-challenge problems, form subgroups as needed to
accomplish tasks, and publish results that are then made available to the larger community.
Examples of such activities that are relevant to the chemistry community include
the Collaboratory for Multiscale Chemical Science (http://cmcs.org/), which is being used
by groups of quantum chemists, thermodynamicists, kineticists, reaction model developers,
and reactive-flow modelers to coordinate combustion research, as well as the National
Biomedical Informatics Research Network (http://www.nbirn.net), which supports
researchers studying neurobiology across a wide range of length and time scales. These
projects and other emerging frameworks allow scientists and engineers to access securely
distributed data and computational services, share their work in small groups or across the
community, and collaborate directly via conferencing, desktop sharing, etc. Over the next
few years, collaboratories will provide increasingly powerful capabilities for community
data curation (tracking data provenance across disciplines, assessing data quality,
annotating shared information), automating analysis workflows, and translating across
formats and models used in different subdomains.
Collaboratories have great potential in chemical research and education,
particularly in bringing together researchers from multiple subdisciplines and multiple
cultures. In particular, the solutions of many grand-challenge problems in chemistry, e.g.,
the design of new catalysts and more-efficient photoconversion systems or the integration
of computation into the chemistry curriculum, would benefit greatly from the services
provided by collaboratories.

Workshop Summary
Crossing Bridges: Toward an Unbounded Chemical Sciences Landscape. As
illustrated by recurring themes in the breakout-session reports, workshop participants
shared a consensus vision of cyber-enabled chemistry: that of an unbounded chemical
sciences landscape in which different disciplines, computational and experimental
methods, institutions, geographical areas, levels of user sophistication and specialization,
and subdisciplines within chemistry itself are bridged by seamless, high-bandwidth
telecommunications networks, computing resources, and disparate databases of curated
data of known pedigree that can be conveniently accessed, retrieved, and processed by
modular algorithms with compatible codes. Realizing this vision will require significant
enhancements and expansions of the existing cyberinfrastructure, as well as the
development and deployment of innovative models and technologies. With regard to
accelerating progress in this direction, the consistency with which certain themes were
voiced by workshop participants suggests a consensus across the chemical sciences
community concerning recommended courses of action. In particular, breakout group
participants:
• noted the community’s increasing acceptance of and reliance upon simulation as a
tool in chemistry. Participants concurred that advances in the field will be achieved
through an “equal partnership” between simulation and experiment, whereby
simulations first corroborate and interpret existing experiments and, subsequently,
suggest new experiments.
• observed that the chemical sciences are characterized by the use of a broad range of
computational techniques and data types and by a large number of independent data
producers. Certain areas of the field, such as environmental and high-throughput
chemistry, are hugely data-intensive, and accumulated data must be available to be
shared and processed in a distributed fashion by collaboratories.
• agreed on the importance both of using cyberinfrastructure to educate audiences at
all levels from K-12 through college-level to the broad public sector about science
topics (e.g., genetically engineered crops, stem cells), and of introducing
cyberscience techniques to undergraduate chemistry curricula.
• noted that major advances in networking and distributed-computing technologies
can make possible new modes of activity for chemistry researchers and educators by
allowing them to perform their work without regard to geographical location –
interacting with colleagues and accessing instruments and sensors and computational
and data resources scattered all over the country (indeed, the world). These new
modes have great potential to enlarge the community of scientists engaged in
advancing the frontiers of chemical knowledge and in developing new approaches to
chemical education. However, this potential will not be realized unless chemical
researchers and educators actively participate in guiding cyberinfrastructure research
and development, obtain the needed assistance and support as these new
technologies are deployed, and take advantage of the new forms of organization that
these technologies make possible.

Recommendations
(1) It is suggested that NSF Chemistry research funding criteria for cyberinfrastructure
focus on the chemical science drivers themselves, encouraging investigators to define the
amount and type of bridging activities and mechanisms that would best enable them to
focus on their next-generation grand-challenge problems.

Development of cyberinfrastructure that maximizes the impact of new chemistry in the
grand-challenge arena is strongly recommended. A focus area of Multiscale Modeling in
the Chemical Sciences would provide an example of an immediate effort that would drive
both the underlying chemistry problems of bridging length and time scales and the
cyberinfrastructure investments described below. As long as individual PI funding is not
negatively impacted, the NSF should sponsor multidisciplinary, collaborative projects such
as Multiscale Modeling, and ensure funding is specifically allocated for development of
cyberinfrastructure as a partner to scientific research.
The early-year projects would allow the cyberinfrastructure to be guided (and
critiqued) as it is developed, and give the software developers both real data and scientists
with whom to work. Consideration should be given to the funding of joint
academic/industry projects wherein both existing and new information technologies can be
exploited. A collaborative environment should be established bridging the
cyberinfrastructure portions of all of the projects, with a funded technical person managing
synergies across the cyberinfrastructure backbone, developing standard data models and
API representation, acquiring knowledge – and, potentially, software – that could be
passed on to future projects, and gathering requirements and use cases for future rounds of
cyberinfrastructure for chemistry. The chemical research community’s needs, going
forward, include:
• Modeling protocols that bridge different subdisciplines and that can collectively
represent chemical systems over many orders of magnitude of time and space scales,
with output from one computational stage (along with accompanying information
about precision) becoming input at the next stage (a minimal sketch of such a hand-off
follows this list)
• Modular algorithms that can be used in diverse combinations in varying
applications and/or automated, freeing researchers to concentrate on chemistry issues
rather than computational “busywork”
• Improved tools for visualizing complex macromolecules and their interactions as
well as large data sets
• Access to shared, heterogeneous community databases with accompanying
information about data provenance and with standardized formats that allow cross-
talk among, and data fusion from, different database types across interfaces of
different disciplines
• Person-to-person communication technologies (e.g., screen sharing with
audio/video) to render collaborations more effective
• Grid-enabling technologies – speedy, interoperable hardware and software and
remote-instrumentation capabilities to enable both more-efficient collaboration in
computational studies and more access in general
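The following Python sketch illustrates the first item in the list above: a hypothetical two-stage protocol in which an electronic-structure stage hands a computed barrier, together with its estimated uncertainty, to a kinetics stage that propagates that uncertainty forward. Stage names, numerical values, and the simple error-propagation rule are illustrative only.

# Hypothetical sketch of passing a quantity *and* its estimated precision from
# one modeling stage to the next (e.g., quantum chemistry to kinetics).
# Stage names, values, and the simple error propagation are illustrative only.
import math

def electronic_structure_stage(molecule: str):
    # Pretend result: a reaction barrier in kJ/mol with an estimated uncertainty.
    barrier, uncertainty = 85.0, 4.0
    return {"molecule": molecule, "barrier_kJ_mol": barrier,
            "barrier_uncertainty": uncertainty}

def kinetics_stage(upstream: dict, temperature_K: float = 298.15):
    """Consume the upstream barrier and carry its uncertainty forward."""
    R = 8.314e-3  # gas constant in kJ/(mol K)
    k_rel = math.exp(-upstream["barrier_kJ_mol"] / (R * temperature_K))
    # First-order propagation: |d(ln k)| = dE / (R T)
    rel_err = upstream["barrier_uncertainty"] / (R * temperature_K)
    return {**upstream, "rate_constant_rel": k_rel, "rate_rel_error": rel_err}

result = kinetics_stage(electronic_structure_stage("H2 + OH"))
print(result["rate_constant_rel"], result["rate_rel_error"])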

(2) It is suggested that NSF Chemistry research funding criteria for cyberinfrastructure
support for the educational community promote interdisciplinary educational-materials
centers to accelerate development and deployment of useful educational materials from
ideas generated by individuals from any of the chemistry subdisciplines.

This is because it is unlikely that single individuals will be in a position to develop an
entire array of educational materials. It may, therefore, be prudent to build centers (perhaps
cyber-centers) conducive to collaborations for developing new educational materials. A
gathering of experts – in scientific-computing content, software engineering, education and
education research, and human factors – might provide an environment that could speed
the development and deployment of useful educational materials from ideas generated by
individuals from any number of the subdisciplines of the fundamentally multidisciplinary
CEC community. Thus, a computational chemist who has an idea about how to present a
topic (such as, say, sampling in statistical-mechanical models) could bring that idea to an
established center and receive assistance in developing and deploying the envisioned
teaching module. This interdisciplinary educational-materials center is only one type of
structure that could be devised. For example, the ability of local campuses to build
specifically interdisciplinary programs – minors in scientific computing, perhaps, or
interdisciplinary components in established curricula – would help sustain CEC, once it is
established, by preparing students who could readily function within this environment.
The development of an environment using computational resources to solve
research problems opens immediate avenues for enhanced research in chemistry education
and, more generally, science education. The capture of data about usage of both
educational materials and research problem-solving workspaces might yield insight into the
cognitive factors at play in the development of problem-solving proficiency. In this manner,
the development and deployment of CEC and its accompanying educational resources can
help to build our understanding of the learning process in a more general sense, providing
an important spin-off benefit.
Specific recommendations include:
• Development of “tunable” algorithms that can be configured for several different
levels of user proficiency
• Development of improved visualization tools to educate the broad public and
novice chemists
• Promotion of the blending of experiment and simulation in the education system
• Funding of cognitive research on learning-style differences in order to tailor and
optimize remote or asynchronous learning approaches

(3) It is suggested that NSF Chemistry seek partnerships with appropriate divisions within
NSF, such as CISE, or explore other partnering mechanisms to help fund and support the
development and deployment of hardware, databases, and remote-collaboration
technologies.

Hardware Infrastructure. The chemical sciences will require computer performance well
beyond what is currently available, but computational speed alone will not ensure that the
computer is useful for any specific application area. Several other factors are critical,
including the size of primary system memory (RAM), the size and speed of secondary
storage, and the overall architecture of the computer. Grid computing, visualization, and
remote instrumentation are areas that continue to evolve toward becoming standard
cyberinfrastructure needs. Hardware design, affordability, and procurement of desktop,
midrange, and high-performance computing solutions are needed. Access to disk farms,
midrange computing facilities, and sustained support for high-performance computing
solutions that are available for “grand challenge” problems are important long-term
investments. Specific recommendations include:
• To realize cyber-enabled chemistry or science in general, the NSF and, in
particular, the Chemistry Division must track and cooperate with other agency
programs such as the DARPA high-productivity computing program, the DOE
Leadership Class System Development program, the INCITE program at DOE’s
NERSC facility, NASA’s Leadership Class systems, the NIH NCBC program, and
the DOD computational infrastructure program. The last of these has significant
experience in midrange systems such as the regional capacity systems outlined
above. NSF should explore lessons learned from these facilities.
• The NSF and the Chemistry Division should also explore stronger industry
partnerships to develop hardware and software, where appropriate.

Databases and Cheminformatics. Chemistry is becoming an information science,
especially in industry, and the field is poised to put into practice new information science
and data management technologies directly. New techniques and solutions from computer
science will focus on developing new databases, integration and interoperability of
databases, distributed-networking data exchange, and the use of ultra-high-speed networks
to interconnect students, experimentalists, and computational chemists with publicly
funded data repositories. Data warehousing and data management will be critical issues.
Database federation should be encouraged by establishing community standards for data
curation and cybersecurity, and by establishing standards that promote interoperability of
algorithms and hardware. It is important to create incentives (and, of equal importance, overcome
disincentives) for the sharing of individual researchers’ data, and to encourage free access
to cross-disciplinary literature. Specific recommendations include:
• Bridging model and experiment for mutual validation and quality screening should
be a top priority. Predictive models and simulations are among the most effective and
sustainable tools to capture and disseminate knowledge, particularly for use by non-
experts. They can also be used to perform data quality checks for experimental
results. The models, however, must be validated. This requires investment in both
experimental data and standard reference simulations and model development
(validated models with quantified uncertainty). This is not glamorous, but it is an
essential foundation. Models based on first principles are evergreen and provide
insight into mechanisms. (A minimal sketch of such a model-experiment consistency
check follows this list.)
• Flexible cyberinfrastructure is needed to answer questions of the future. We will be
asking different questions tomorrow from the ones we are asking today. It is not
possible to anticipate every potential query or application need and develop an
enduring solution. There is an underlying foundation of factual data that we know we
need: i.e., standardized ways to enter, archive, and extract published chemical and
physical experimental data from legacy systems and to actively capture and
disseminate new data as it is published. We need to go beyond this foundation and
develop methods whereby the cyberinfrastructure learns and grows as needs change.
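The following Python sketch illustrates the kind of model-experiment consistency check referred to in the first item above. The function name, property values, and coverage factor are illustrative only and do not represent an established community standard.

# Hypothetical sketch of a model-vs-experiment consistency check with
# quantified uncertainty; names, values, and tolerances are illustrative only.
import math

def consistent(predicted: float, pred_unc: float,
               measured: float, meas_unc: float, k: float = 2.0) -> bool:
    """Flag agreement if |difference| <= k * combined standard uncertainty."""
    combined = math.sqrt(pred_unc**2 + meas_unc**2)
    return abs(predicted - measured) <= k * combined

# Example: predicted vs. measured enthalpy of formation (kJ/mol), invented numbers.
print(consistent(predicted=-74.9, pred_unc=1.5, measured=-74.6, meas_unc=0.3))  # True
print(consistent(predicted=-70.0, pred_unc=1.0, measured=-74.6, meas_unc=0.3))  # False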

Remote Instrumentation. Access to remote instruments will become increasingly important
as these instruments become increasingly sophisticated and expensive. It is important to
examine what categories of instrumentation are most important for chemists to access
remotely, and what the most pressing issues are in developing cyberinfrastructure
(computing, data, networking) for virtual laboratories (remote control of instrumentation).
