September 2007
STATEMENT OF ORIGINALITY
I declare that the work presented in this thesis is, to the best of my knowledge and
belief, original and my own work, except as acknowledged in the text. The material (presented
as my own) has not been submitted previously, either in whole or in part, for a degree at this
or any other institution.
In those cases in which the work presented in this thesis was the product of collaborative efforts, I declare that my contribution was substantial and prominent, involving the development of original ideas as well as the definition and implementation of subsequent work. Detailed information about the participation of other researchers in parts of this thesis is provided in the section "Author's contributions" at the beginning of each chapter.
UNIVERSITY OF QUEENSLAND
ABSTRACT
The development of "omic" technologies and their application to the biological sciences have increased the need for an integrated view of bio-related information. The flood of information, as well as the availability of the technology, has made it necessary for researchers to share resources and join efforts more than ever in order to understand the function of genes, proteins and biological systems in general. Integrating biological information has been addressed mainly from a syntactic perspective. However, as we enter the post-genomic era, integration has acquired a meaning more related to the capacity for inference (finding hidden information) and shareability in large web-based information systems. Ontologies play a central role when addressing both syntactic and semantic aspects of information integration.
The purpose of this research has been to investigate how the biological community could develop those highly needed ontologies in a way that ensures both maintainability and usability. Although the need for ontologies, as well as the benefits of having them, is obvious, it has proven difficult for the biological community not only to develop ontologies but also to use them effectively. Why? How should they be developed so that they are maintainable and usable by existing and novel information systems? A feasible methodology, elucidated from careful study while developing biological ontologies, is proposed. Methodological extensions gathered from the acquired experience are also presented. Throughout the chapters of this thesis, diverse integrative approaches have also been analysed from different perspectives; a workflow-based solution to the integration of analytical tools was consequently proposed. This made it possible to better understand the need for well-defined semantics in biological information systems, as well as the importance of a thoughtful
understanding of the relationship between the semantic structure and the syntactic scaffold
that should ultimately host the former.
From this investigation several conclusions have been drawn; one of particular significance is the relevance of collaboration between two asymmetric, yet not antagonistic, communities: computer scientists and biologists may work and achieve results in different ways, but both communities hold valuable information that could be of mutual benefit. Within the context of biological ontologies, "Romeo and Juliet" proved to be an apt metaphor that illustrates not only the importance of this collaboration, but also how we may avoid heading towards "A hundred years of solitude".
TABLE OF CONTENTS
ACKNOWLEDGMENTS......................................................................................................... XI
5.1 THE USE OF CONCEPT MAPS AND AUTOMATIC TERMINOLOGY EXTRACTION DURING THE
DEVELOPMENT OF A DOMAIN ONTOLOGY. LESSONS LEARNT. ........................................ 130
5.1.1 Introduction......................................................................................................... 130
5.1.2 Survey of methodologies....................................................................................... 131
5.1.3 General view of our methodology. ........................................................................ 133
5.1.4 Our scenario and development process ................................................................. 136
5.1.5 Results: GMS baseline ontology............................................................................ 137
5.1.6 Discussion and conclusions .................................................................................. 140
5.1.7 References ........................................................................................................... 142
5.2 A PROPOSED SEMANTIC FRAMEWORK FOR REPORTING OMICS INVESTIGATIONS. ............. 145
5.2.1 Introduction......................................................................................................... 145
5.2.2 Methodology........................................................................................................ 147
5.2.3 The RSBI Semantic Framework ............................................................................ 148
5.2.4 Conclusions and Future Directions ....................................................................... 149
5.2.5 References ........................................................................................................... 150
9.1 BIO-ONTOLOGIES: THE MONTAGUES AND THE CAPULETS, ACT TWO, SCENE TWO: FROM
VERONA TO MACONDO VIA LA MANCHA. ................................................................... 228
LIST OF FIGURES
Chapter 2 - Figure 2. Life cycle, processes, activities, and view of the methodology..................60
Chapter 4 - Figure 1. The major concepts of the argumentation ontology and their relations.121
Chapter 5 - Figure 3. Narrative, as seen from those concept maps and ontology models
domain experts were building. ...........................................................................................139
Appendix 3 - Figure 1. A portion of the first version of the GMS ontology, Germplasm. .....258
Appendix 3 - Figure 2. The Germplasm Method section of the first version of the GMS
ontology. ...............................................................................................................................259
Appendix 3 - Figure 3. The Germplasm Identifier section of the first version of the GMS
ontology. ...............................................................................................................................260
Appendix 4 - Figure 3. Germplasm Breeding Stock, a portion of the second version of the
GMS ontology......................................................................................................................263
Appendix 4 - Figure 4. Naming convention according to the second version of the GMS
ontology. ...............................................................................................................................263
Appendix 4 - Figure 5. Plant Breeding Method according to the second version of the GMS
ontology. ...............................................................................................................................264
LIST OF TABLES
Chapter 6 - Table 2. Some of the most commonly used Graphical User Interfaces (GUIs) for
EMBOSS and GCG® ........................................................................................................176
ACKNOWLEDGMENTS
“Acknowledgments” is usually the part of the thesis in which the author mentions those who have participated in the development and evolution of the research work. Expressing gratitude to all those who had any part in the development of this work is, in my opinion, mandatory. I certainly thank all of them for their understanding, consideration, patience, and constant support throughout these almost four years. However, it is usually the case that some people acquired a more prominent role, and I am reserving this section to express my gratitude for their actions in a special way.
Firstly, I thank my mother, without whose example and constant support I would never have found the courage to go through the whole doctoral process. I thank my sister, who taught me an important lesson that helped me to understand the value of family at times when I may not have fully appreciated it. My deepest gratitude goes to my entire family, for the obvious things, but most of all for their unconditional love.
Robert Stevens, Mark Wilkinson, Limsoong Wong, Vladimir Brusic and Kaye Basford are people to whom I feel deep gratitude for having understood the importance of my work, but most of all for having trusted in me.
Finally, I am reserving my words to say “thanks”; not so much for the knowledge that we all shared throughout these years, but for the humanity that allowed us to relate to each other as human beings. Fortunately, this research work proved to have a direct impact not only on my understanding of the domain of knowledge, but also, and more importantly, on my appreciation of those human factors within all of us.
INTRODUCTION
OVERVIEW
This introductory portion of the thesis is organised as follows. Initially, a brief overview is given. The main components and concepts of this thesis (ontologies and communities) are presented on pages XVII and XIX; in these sections the broad problem-space within which this research is situated is illustrated. The next section presents the thesis outline; then, page XXIII presents the hypotheses and research questions addressed by this investigation. A list of the publications, as well as the software products, that have arisen from this thesis is given in the last section of this introductory chapter.
WHAT IS AN ONTOLOGY?
Definitions of the word ontology vary depending on the field; computer scientists tend to understand the term in a more utilitarian way, whereas philosophers tend to have a more holistic understanding of it. The term “ontology” (Greek on = being, logos = to reason) has its roots in philosophy; it has traditionally been defined as the philosophical study of “what exists”: the study of the kinds of entities in reality, and the relationships that these entities bear to one another [2, 3]. Guarino [4] beautifully summarises the meaning of ontology as being “a branch of metaphysics which deals with the nature and organization of reality”. The meaning of the word ontology in philosophy is “the metaphysical study of the nature of being and existence” [5]. While within the philosophy community there is consensus on the definition of ontology, there is still some dispute amongst members of the artificial intelligence (AI) community. This is partly due to their goal, which is not always to study the nature of “what exists” but how to classify, manage and organise information.
For those within the AI community the context in which the ontology is going to be
used largely influences the definition of the term. At a glance, an ontology represents a view
of the world with the set of concepts and relations amongst them, all of these defined with
respect to the domain of interest. For instance, John F. Sowa [6] defines the term as:
“The subject of ontology is the study of the categories of things that exist or may exist in some
domain. The product of such a study, called an ontology, is a catalog of the types of things that are
assumed to exist in a domain of interest D from the perspective of a person who uses a language L
for the purpose of talking about D.”
Computer scientists tend to view ontologies as being terminologies with associated axioms and definitions, structured so as to support software applications [7]; Gruber explains this in more detail as:
“vocabularies of representational terms, classes, relations, functions and object constants with
agreed-upon definitions in the form of human-readable text and machine-enforceable, declarative constraints on
their well-formed use.” [8]
In order to understand this definition, Gruber et al. as well as Studer et al. agree on the following terminology [9, 10]:
“The definition of the basic terms and relations comprising the vocabulary of a topic area, as well as
the rules for combining terms and relations to define extensions to the vocabulary.”
Depending on the understanding of conceptualisation and context, there are different interpretations of the term “ontology”. Independently of the understanding of these terms, it could be said that every ontology model for knowledge representation is either explicitly or implicitly committed to some conceptualisation. As the context of this thesis is that of an information system, the definition of an ontology that best serves our purpose is:
Controlled vocabularies (CVs) are taxonomies of words built upon an is-a hierarchy; as such they are not meant to support any reasoning process. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes. CVs may become part of ontologies when they instantiate classes. As the process of developing ontologies moves forward, the hierarchy is formalised not only by means of is-a and part-of, but also through other relations, logical operators and description-logic constructs. Figure 1 illustrates the important role that CVs play within the process of developing ontologies.
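The distinction drawn above can be sketched in a few lines of code. The following is only an illustration, not part of any ontology discussed in this thesis; the class names (e.g. "hepatocyte") and the two relations are assumptions chosen for the example. It shows a bare is-a taxonomy extended with a part-of relation, and the kind of transitive inference (ancestor retrieval) that a flat controlled vocabulary alone cannot support:

```python
# Illustrative sketch: an is-a taxonomy with an added part-of relation.
# Term names and relations here are hypothetical examples only.
from collections import defaultdict

class Taxonomy:
    def __init__(self):
        self.is_a = defaultdict(set)     # term -> its direct parents
        self.part_of = defaultdict(set)  # part -> the wholes it belongs to

    def add_is_a(self, child, parent):
        self.is_a[child].add(parent)

    def add_part_of(self, part, whole):
        self.part_of[part].add(whole)

    def ancestors(self, term):
        """Transitive closure over is-a: a simple inference that a
        bare list of controlled terms cannot provide."""
        seen, stack = set(), [term]
        while stack:
            for parent in self.is_a[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

t = Taxonomy()
t.add_is_a("hepatocyte", "epithelial cell")
t.add_is_a("epithelial cell", "cell")
t.add_part_of("hepatocyte", "liver")
```

With only the is-a links this structure is still essentially a CV; it is the additional relations (here part-of, and in real ontologies many more, plus logical constraints) that open the door to inference.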
For instance, some methodologies suggest the use of lists of words from the beginning as a means of facilitating the identification of classes [15]. Others, such as Good et al. [16], use lists of words to frame the knowledge elicitation process when developing the ontology. A more in-depth analysis of methodologies for developing ontologies is presented in chapter one.
WHY ONTOLOGIES?
Several authors have extensively discussed the “whys” for ontologies. Within the
computer science community these reasons have been summarised [17, 18].
WHY COMMUNITIES?
A particularly recurrent and important term throughout this thesis is “community”, and more broadly “community of practice”. Wenger defines communities of practice as follows:
“Communities of practice are the basic building blocks of a social learning system because they
are the social ‘containers’ of the competences that make up such a system… Communities of practice
define competence by combining three elements. First, members are bound together by their
collectively developed understanding of what their community is about and they hold each other
accountable to this sense of joint enterprise. To be competent is to understand the enterprise well
enough to be able to contribute to it. Second, members build their community through mutual
engagement. They interact with one another, establishing norms and relationships of mutuality that
reflect these interactions. To be competent is to be able to engage with the community and be trusted
as a partner in these interactions. Third, communities of practice have produced a shared repertoire of
communal resources—language, routines, sensibilities, artefacts, tools, stories, styles, etc. To be
competent is to have access to this repertoire and be able to use it appropriately.” [22-24].
“Knowledge is a mix of framed experience, values, contextual information, expert insight and
grounded intuition that provides an environment and framework for evaluating and incorporating new
experiences and information. It originates and is applied in the minds of knowers. In organisations, it often
becomes embedded not only in documents or repositories but also in organisational routines, processes, practices
and norms.”[25]
Davenport and Prusak place emphasis on “organisational routines, processes, practices and norms”. The shared repertoires in these two definitions make it clear that communities of practice are brought together by their intersecting knowledge. Ontologies in bioinformatics have been developed by communities of practice for which there is a common need. For instance, MO is developed and maintained by the MGED society, an international organisation of biologists, computer scientists, and data analysts that aims to facilitate the sharing of microarray data generated by functional genomics and proteomics experiments [14]. They have initially focused on establishing standards for microarray data annotation and exchange, facilitating the creation of microarray databases and related software implementing these standards [14]. This does not mean that other omics technologies are not currently being considered.
“Virtual communities of practice are communities of practice (and the social ‘places’ that they
collectively create) that rely primarily (though not necessarily exclusively) on networked
communication media to communicate, connect, and carry out community activities”
Biological communities are indeed communities of practice; not only do they have their own ways of interacting (e.g. papers, conferences) but also, and more importantly, no matter how fragmented they are, they share a common vocabulary. Electronic means have facilitated not only the
interaction but also the fragmentation of this community; interestingly, they have also facilitated standardisation across the entire domain of knowledge by making explicit the need for a holistic approach. For instance, data entries in GenBank [27] may encode a human Mendelian disease for which there are both metabolic pathways and reported single nucleotide polymorphisms (SNPs) of interest. Such information may be scattered across GeneCards [28], BioCyc [29] and possibly other databases. Despite the divisions and specialisations of the field, the systems studied by biological sub-communities interact in reality, and it is precisely because of this that this community needs ontologies. As the knowledge is not owned by any particular group, this knowledge should be captured and represented from and by the community [15, 30].
Communities have been developing ontologies in order to describe the entities that we study (genes, proteins, DNA-binding factors), as well as biomaterials and technology-dependent artefacts. The Gene Ontology [31] is an example of an ontology that aims to describe the things biologists study, whereas the MGED ontology [32] may be seen as one that aims to describe the process by which we study those “things”. It is by using descriptors provided by both kinds of ontology that accurate representations of research endeavours become possible. A toxicological study may, for example, use part of the liver of a rat in order to profile the response of genes to a certain perturbation. In order to describe such an effort, different ontologies are needed: some to describe the biology of the research endeavour (e.g. cells, cellular compartments, animal, organism) and some to describe the techniques used (e.g. microarrays, proteomics, PCR, chromatography).
Some of the required ontologies exist; however they are not always sufficient, nor are
they used for annotation in all biological investigations. Different views on the same issue
may mean that those mechanisms for involving the biological community in the development
of their own ontologies should be improved. The lack of methodologies and software tools
supporting these methodologies is a bottleneck in the development of biological ontologies.
This research has analysed the biological community as well as the intended use of some of the ontologies currently under development. Different scenarios arise from three ontology developments in which the author took part. These cases allowed a careful and exhaustive study of the dynamics and features of these kinds of developments. The nutrigenomics community permitted the author to understand the behaviour of communities when developing ontologies, as well as the significance of groupware technology for developing loosely centralised ontologies. From this initial experience it was also possible to identify and illustrate how concept maps could be used to support knowledge elicitation during the development of ontologies. A methodology describing how biological ontologies could be better developed was consequently proposed.
Two other scenarios were studied. The Reporting Structure for Biological Investigations (RSBI) case aimed to define the structure and semantics for reporting a biological investigation. Influenced by MIAME, the RSBI working group addressed the issue of investigations in a broader sense; the context was not limited to describing a microarray experiment, but extended to any biological experiment. This experience was interesting not only because of the involvement of three different communities (toxicogenomics, environmental genomics, and nutrigenomics) but, most importantly, because it made clear how difficult it is to describe an investigation. How could technology be classified in a way that makes inference possible within any given Laboratory Information Management System? For a high-level container such as investigation, what minimal descriptors should accompany it in order to provide an insightful, useful and comprehensive view of the whole investigation?
Finally, another ontology was also supported during its development: the Genealogy Management System (GMS) Ontology. The GMS Ontology provided us with fertile ground on which to extend the methodology proposed from the nutrigenomics case. Concept maps facilitate knowledge elicitation and sharing, but it is not easy to frame the view of the domain experts: sometimes they tended to be quite specific, and at other times quite general. By combining terminology extraction and concept mapping it became
possible to constrain the elicitation exercises with domain experts; thus making it possible to
capture classes and differentiate them from instances at early stages of the elicitation process.
The three ontology developments mentioned above permitted the study of existing software from the perspective of both users and knowledge engineers in decentralised settings. From these experiences it was also possible to identify the argumentative structure that emerges when developing ontologies.
RESEARCH PROBLEM
This thesis has laid down a series of questions not previously considered when studying
biological ontologies, how to develop them, and their uses when integrating information.
Throughout this thesis, integration of information in bioinformatics is studied mainly from
the semantic perspective, placing particular attention on the actual process by which the
ontology is being developed. Despite this emphasis, other aspects related to integration of
information have also been considered. The overall research problem in this thesis is:
To address this problem, the author presents a series of hypotheses and questions. These seek to explore and analyse the methodological and practical challenges of developing ontologies within the bioscience community.
By answering these questions, this doctoral work addresses the problems of ontology development, information integration and study description in modern biology, and proposes different methods to facilitate information integration across various information systems.
The initial chapters focus on providing answers to the main research question of this thesis. By investigating existing methodologies and analysing them within the context of biological communities of practice, it was possible to propose a methodology and understand the life cycle of these ontologies. The experiences that allowed the author to gather information, and to test and improve the methodology, are presented in chapters 3, 4, and 5. As it was important for the successful conclusion of this thesis to constantly test the research outcomes, a simple, yet quite illustrative, scenario was laid down. This scenario (chapter 7) allowed the author to study not just the development of ontologies, but also the use of ontologies by software layers within an integrative environment that also had a community of users.
additional reason to publish our work; this active communication, via different means, enabled us to receive relatively rapid feedback on our work. The combination of collaborations and publications enriched this work, and more importantly permitted the rapid use and testing of its intermediary products. A list of the research outcomes and outputs of this thesis is given below.
This thesis begins by addressing the problem of information integration, and examines the syntactic and semantic factors that should be taken into account when describing experiments. One particular task was always considered crucial throughout the development of this thesis: intelligent information retrieval.
Attention was focused on the semantic issues associated with information integration: How could the reproducibility of biological experiments be ensured? How could experiments be effectively shared? How could ontologies be built while ensuring the participation of a wider community? Special attention was given to the involvement of the community when developing ontologies. Such agreement is critical, as it assures, to some extent, the use and, in some way, the correctness of the ontology.
This thesis is organised into a series of chapters that address aspects related to semantic issues in the integration of information in bioinformatics. Chapter I, “Communities at the melting point when building ontologies”, is a critical analysis of existing methodologies for developing ontologies; not only are existing methodologies presented, but it is also analysed how these methodologies could be used within the biological domain, and which issues should be considered in order to propose a new methodology. Chapter II, “The melting point, a methodology for developing ontologies within decentralised settings”, presents a novel methodology that has been engineered upon cases extracted from real scenarios. In principle this methodology may be used not only within the bio domain but also in other contexts. Chapter III, “The use of concept maps during knowledge elicitation in ontology development processes”, presents the development of biological ontologies, the factors associated with this process, as well as a
process that was followed. Chapter IV presents how cognitive support may be provided by means of concept maps during the argumentative process that takes place when developing ontologies. For this particular task we used two unrelated scenarios: the Reporting Structure for Biological Investigations (RSBI) and Genealogy Management Systems (GMS). It is important to note that both Chapters III and IV proved to be a fertile playground on which the methodology presented in Chapter II was engineered; for the development of this thesis, these two experiences served as experiments from which valuable information was gathered.
Chapter V presents a literature review in which different approaches to the integration of molecular data, as well as of analytical tools, are analysed; this chapter aims to facilitate the transition into Chapter VI, in which a different scenario, extracted mostly from in silico biology, is studied from both syntactic and semantic perspectives. Interestingly, during this part of the doctoral work it became possible to better understand how syntactically based solutions, despite being workable tools, still lack some important features that only the correct use of ontologies could provide. As in silico experiments are also valid examples of biological investigations, another important outcome from Chapter V was the actual practical use of the ontology proposed in Chapter IV.
Discussions, conclusions and future work are presented in the remaining chapters of
this thesis. In part, this was done by using literary analogies, mostly with Shakespeare's
masterpiece "Romeo and Juliet" and with "One Hundred Years of Solitude" by García Márquez.
These analogies seemed ideal because they illustrate what, in my opinion, constitutes a central
problem in the development of biological ontologies, and more broadly in the development
of information systems, namely interdisciplinary work. The relationship between
bioinformatics and the semantic web is used as an introduction to the rest of discussions and
conclusions. Chapter VII presents some future work, using the literary analogies mentioned here.
PUBLISHED PAPERS
1. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of
conceptual maps for two ontology developments: nutrigenomics, and a
management system for genealogies. In: 8th Intl Protégé Conference: 2005;
Madrid, Spain; 2005: 59-62.
5. Garcia Castro A, Sansone AS, Taylor CF, Rocca-Serra P: A conceptual framework for
describing biological investigations. In: NETTAB: 2005; Naples, Italy; 2005.
8. Garcia Castro A: The Montagues and the Capulets, act two, scene two: from
Verona to Macondo via-La Mancha. Submitted for publication.
The author conceived the project, and identified those key issues elaborated here. The
manuscript was entirely written by Alex Garcia Castro.
1.1 INTRODUCTION
Biologists have been building classification systems since before Linnaeus. In the past,
biologists have understood classification systems as systems that allow them to identify, name,
and group organisms according to predefined criteria. This makes it possible for the
community as a whole to be sure they know the exact organism that is being examined and
discussed. More recently, the biological community has started to classify genes and gene
products; with this need in mind the Gene Ontology (GO) was created. The involvement of
the community has played a major role since the foundation of the GO consortium as it is a
collaborative effort that addresses the need for consistent descriptions of gene products in
different databases [8]. Initially GO provided a controlled vocabulary only for model
organism databases such as FlyBase (Drosophila) [9], the Saccharomyces Genome Database
(SGD) [10] and the Mouse Genome Database [11]. It has since been adopted as the de facto
standard ontology for describing genes and gene products.
The Plant Ontology (PO) [12] also illustrates a biological ontology for which
communities are central to its development. The Plant Ontology Consortium (POC)
(www.plantontology.org) is a collaborative effort that brings together several plant database
administrators, curators and experts in plant systematics, botany and genomics. A primary
goal of the POC is to develop simple yet robust and extensible controlled vocabularies that
accurately reflect the biology of plant structures and developmental stages. These vocabularies
form a network, linked by relationships, thus facilitating the construction and execution of
queries that cut across datasets within a database or between multiple databases [12]. The
developers of both GO and PO focus on providing controlled vocabularies, facilitating cross-
database queries, and having strong community involvement.
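The kind of relationship-aware query that such linked vocabularies support can be sketched in a few lines. The terms, identifiers and annotations below are hypothetical, chosen for illustration only; they are not actual GO or PO entries:

```python
# Toy ontology: each term maps to its direct "is_a" parents.
# Terms and annotations are hypothetical, for illustration only.
IS_A = {
    "leaf": ["shoot organ"],
    "petal": ["shoot organ"],
    "shoot organ": ["plant organ"],
    "plant organ": ["plant structure"],
}

def ancestors(term):
    """All terms reachable from `term` via transitive is_a links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# A query that cuts across datasets: annotations made at different
# levels of granularity can be grouped under one shared ancestor.
annotations = {"gene1": "leaf", "gene2": "petal"}
hits = [g for g, t in annotations.items()
        if t == "plant organ" or "plant organ" in ancestors(t)]
print(sorted(hits))  # ['gene1', 'gene2']
```

Because annotations made at different depths roll up to shared ancestors, a single query can group results across databases that annotate at different levels of detail.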
Despite these efforts, bio-ontologies still tend to be built on an ad hoc basis rather than
by following well-defined engineering processes. To this day, no standard methodology for
building biological ontologies has been agreed upon. The “hacking” process usually involves
gathering terminology and organizing it into a taxonomy, from which key concepts are
identified and related to create a concrete ontology. Case studies have been described for the
development of ontologies in diverse domains, although surprisingly only one of these has
been reported to have been applied in a domain allied to bioscience – the chemical ontology
[13] – and none in bioscience per se. The actual “how to build the ontology” has not been the main
research focus for the bio-ontological community [7].
Several approaches have been reported for developing ontologies; some of them
provide insights when developing de novo ontologies, whereas others pay more attention to
extending, transforming and re-using existing ontologies. Independently of the focus, neither
methods nor methodologies have yet been standardised. Not only are there several
different methodologies, but also there are numerous software tools aiming to assist
knowledge engineers during the process.
Several approaches are analyzed here. Strong points and shortcomings are reviewed
according to the following criteria (C), heavily influenced by the work of Fernandez [1],
Mirzaee [3] and Corcho et al. [19].
C2. Detail of the methodology. This criterion is used to assess the clarity with which
the methodology specifies the orchestration of methods and techniques.
C3. Strategy for building the ontology. This should provide information about the
purpose of the ontology, as well as the availability of domain experts. There are three main
strategic lines to consider: i) how tightly coupled the ontology is going to be in relation to the
application that should in principle use it; ii) the kinds of domain experts available; iii) the kind
of ontology to be developed. These matters are better explained from C3a to C3e.
C3d. Specialised domain experts: Both C3c and C3e have to do with the kind of
domain experts who are available and willing to participate in the development process. This
influences C4. Specialised domain experts are those with an in-depth knowledge of their field.
Within the biological context these are usually researchers with vast laboratory experience,
highly focused within a narrow domain of knowledge. The ontology is built from very
specific concepts; this is also known as a bottom-up approach.
C3f. Top-level ontologies: These describe very general concepts like space, time, event,
which are independent of a particular problem domain. Such unified top-level ontologies aim
at serving large communities [20]. These ontologies are also known as foundational
ontologies; see for instance [21].
C3i. Application ontologies: As Sure [20] describes them, application ontologies are
specialisations of domain and task ontologies as they form a base for implementing
applications with a concrete domain and scope.
C4. Strategy for identifying concepts. As has been previously mentioned in C3d and
C3e there are two strategies regarding the construction of the ontology and the kinds of terms
it is possible to capture [22]: The first is to work from the most concrete to the most abstract
(bottom-up), whereas the second is to work from the most abstract to the most concrete
(top-down). An alternative route is to work from the most relevant to the most abstract and
most concrete (middle-out) [1, 22, 23].
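The three strategies can be contrasted as traversal directions over the same parent-child structure. The toy taxonomy and function names below are illustrative assumptions, not part of any cited methodology:

```python
# Toy taxonomy as child -> parent; terms are illustrative only.
PARENT = {
    "entity": None,
    "organism": "entity",
    "plant": "organism",
    "flowering plant": "plant",
    "rose": "flowering plant",
}
CHILDREN = {}
for child, parent in PARENT.items():
    CHILDREN.setdefault(parent, []).append(child)

def generalise(term):
    """Upward walk: from a core term towards the most abstract."""
    path = []
    while PARENT.get(term) is not None:
        term = PARENT[term]
        path.append(term)
    return path

def specialise(term):
    """Downward walk: from a core term towards the most concrete."""
    found, stack = [], [term]
    while stack:
        for child in CHILDREN.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found

# Middle-out starts from the most relevant term and moves both ways;
# top-down would start at "entity", bottom-up at "rose".
print(generalise("plant"))  # ['organism', 'entity']
print(specialise("plant"))  # ['flowering plant', 'rose']
```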
C8. Community involvement. As has been pointed out before in this thesis (see
chapter one), it is important to know the level of involvement of the community. Phrasing
this as a question, is the community a consumer of the ontology or is the community taking
an active role in its development?
C9. Knowledge elicitation. As has been pointed out by [24], knowledge elicitation is a
major bottleneck when representing knowledge. It is therefore important to know if the
methodology assumes it to be an integral part of this process.
C2. Stages are identified, but no detail is provided. In particular the "Ontology Coding",
"Integration" and "Evaluation" sections are presented in a superficial manner [3].
C4. For Uschold and King the disadvantage of using the top-down approach is that by
starting with a few general concepts there may be some ambiguity in the final product.
Alternatively, with the bottom-up approach too much detail may be provided, and not all this
detail could be used in the final version of the ontology [22]. This in principle favors the
middle-out approach proposed by Lakoff [23]. The middle-out is not only conceived as a
middle path between bottom-up and top-down, but also relies on the understanding that
categories are not simply organised in hierarchies from the most general to the most specific,
but are rather organised cognitively in such a way that categories are located in the middle of
the general-to-specific hierarchy. Going up from this level is the generalisation and going
down is the specialisation [3, 23].
C7. The methodology was used to generate the Enterprise ontology [30].
C9. For those activities specified within the building stage, the authors do not propose
any specific method for representing the ontology (e.g. frames, description logics, etc.). The
authors place special emphasis on knowledge elicitation; however, they do not develop
this further.
C1. Gruninger and Fox propose a methodology which is heavily influenced by the
development of knowledge based systems using first order logic [19].
C2. Gruninger and Fox do not provide specifics on the activities involved.
completeness theorems used to evaluate the ontology. Once the competency questions have
been formally stated, the conditions under which the solutions to the questions must be
defined should be formalised. The authors do not present information about the kind of
domain experts they advise working with.
C6. Although Gruninger and Fox emphasised the importance of competency questions
they do not provide techniques or methods to approach this problem.
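A competency question can be read as a query the finished ontology must be able to answer. The following minimal sketch uses hypothetical facts and a single transitive reasoning step to illustrate the idea; it does not claim to reproduce Gruninger and Fox's first-order formalisation:

```python
# A competency question rendered as a query the ontology must answer.
# The facts and the question below are hypothetical, for illustration.
facts = {
    ("enzyme", "is_a", "protein"),
    ("protein", "is_a", "gene product"),
}

def answers(question):
    """Competency check: do the ontology's facts entail the question?"""
    subject, relation, obj = question
    if question in facts:
        return True
    # One step of transitive is_a reasoning.
    return any(s == subject and r == relation == "is_a"
               and (o, "is_a", obj) in facts
               for s, r, o in facts)

# Evaluation: the ontology is adequate only if every competency
# question is answerable from its axioms.
competency_questions = [("enzyme", "is_a", "gene product")]
print(all(answers(q) for q in competency_questions))  # True
```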
C7. The Toronto Virtual Enterprise (TOVE) ontology was built using this methodology.
C1. Bernaras’ work was developed as part of the KACTUS [27] project which aimed to
investigate the feasibility of knowledge reuse in technical systems. This methodology is thus
heavily influenced by knowledge engineering.
C2. The original paper by Bernaras et al. provides little detail about the methodology.
C5. As the ontology is highly coupled with the software that uses it, the life cycle of the
ontology is the same as the software life cycle.
C6. For the specific development of the ontology no particular methods or techniques
are provided. However, as this methodology was meant to support the development of an
ontology at the same time as the software it is reasonable to assume that some software
engineering methods and techniques were also applied to the development of the ontology.
C1. METHONTOLOGY has its roots in knowledge engineering. The authors aim to
define a standardisation of the ontology life cycle (development) with respect to the
requirements of the Software Development Process (IEEE 1074-1995 standard) [3].
C2. Detail is provided for the ontology development process; Figure 3 illustrates the
methodology. It includes the identification of the ontology development process, a life cycle
based on evolving prototypes, and particular techniques to carry out each activity [19]. This
methodology relies heavily on the IEEE software development process as described in [14].
Gomez-Perez et al. consider that all the activities carried out in an ontology development
process may be classified into one of the following three categories:
C7. This methodology has been used in the development of the Chemical OntoAgent
[33] as well as in the development of the Onto2Agent ontology [33].
C2. Although there is extensive documentation for the text-mining techniques and the
structures developed for conceptual machine translation [36-38], no detail is provided as to
how to build the ontology.
C3. As SENSUS makes extensive use of both text mining and conceptual machine
translation the methodology as such is application semi-independent. The methods and
techniques proposed by SENSUS may, in principle, be applied to several domains.
C4. SENSUS follows a bottom-up approach. Initially instances are gathered; as the
process moves forward, abstractions are identified.
C5. No life cycle is identified; in the reported experiences the ontology is deployed
on a one-off basis.
C6. Methods and techniques are identified for gathering instances. However, no further
detail is provided.
C7. SENSUS was the methodology followed for the development of knowledge-based
1.2.6 DILIGENT
“... an extension of the current Web in which information is given well-defined meaning, better
enabling computers and people to work in cooperation. It is the idea of having data on the Web defined
and linked in a way that it can be used for more effective discovery, automation, integration and
reuse across various applications... data can be shared and processed by automated tools as well as by
people.” [20, 40, 41]
“The goal of the Semantic Web initiative is as broad as that of the Web: to create a universal
medium for the exchange of data. It is envisaged to smoothly interconnect personal information
management, enterprise application integration, and the global sharing of commercial, scientific and
cultural data. Facilities to put machine-understandable data on the Web are quickly becoming a high
priority for many organizations, individuals and communities.” [41]
C2. DILIGENT provides some details specifically for those developments in which it
has been used.
C5. DILIGENT assumes an iterative life cycle in which the ontology is in constant
evolution.
C7. Some cases for which DILIGENT has been used have been reported, for instance
see [42].
The considerable number of methodologies and the little detail provided by each of
them make it difficult to find a melting point. Some similarities and shortcomings are
analyzed in this section. A summary of the comparison is given in Table 1.
Table 1. Comparison of the investigated methodologies against criteria C1-C9.

Methodology          C1          C2           C3    C4    C5    C6    C7                       C8    C9
Uschold and King     Partial     Very little  AI    MOut  N/A   N/A   Enterprise ontology      N/A   N/A
Gruninger and Fox    Small       Little       ASD   MOut  TBD   N/A   TOVE; business and       N/A   N/A
                                                                      foundational ontologies
Swartout (SENSUS)    Inexistent  Medium       ASD   BU    TBD   N/A   N/A                      N/A   N/A

C1 = Inheritance from knowledge engineering, C2 = Detail of the methodology, C3 = Strategy for
building the ontology, C4 = Strategy for identifying concepts, C5 = Recommended life cycle,
C6 = Recommended methods, techniques and technology, C7 = Applicability, C8 = Community
involvement, C9 = Knowledge elicitation.
Application-independent = AI, Application-semi-dependent = ASD, Application-dependent = AD,
Top-down = TD, Bottom-up = BU, Middle-out = MOut, Domain-expert-dependent = DED,
Terminology-extraction-dependent = TED, Not available = N/A, To be detailed = TBD,
Evolving prototypes = EP.
Although the investigated methodologies are different from each other, it was possible
to identify some commonalities amongst them. Figure 4 illustrates those shared stages across
all investigated methodologies except DILIGENT.
• Life cycle: Within the DILIGENT methodology the ontology is constantly evolving,
in a never-ending cycle. The life cycle of the ontology is understood as an open cycle
in which the ontology evolves in a dynamic manner.
• Collaboration: Within the DILIGENT methodology a group of people agrees on
the formal specification of the concepts, relations, attributes, and axioms that the
ontology should provide. This approach empowers domain experts in a way that
sets DILIGENT apart from the other methodologies.
• Knowledge elicitation: Due in part to the involvement of the community and in part
to the importance of an agreement within the DILIGENT methodology, knowledge
elicitation is assigned a high level of importance as it supports the process by which
consensus is reached.
From the analysis previously presented it is clear that no single methodology brings
together everything that is needed when developing ontologies; methodologies have been
developed on an ad hoc basis. Some of the methodologies, such as those of Bernaras, provide
information about the importance of the relationship between the final application using the
ontology and the process by which the ontology is engineered. This consideration is not
always taken into account from the beginning of the development; clearly, the kind of ontology that is
being developed heavily influences this relationship. For instance, foundational ontologies
rarely consider the software using the ontology as an important issue; these ontologies focus
more on fundamental issues affecting the classification system such as time, space, and events.
They tend to study the intrinsic nature of entities independently from the particular domain in
which the ontology is going to be used [20].
The final application in which the ontology will be used also influences the kind of
domain experts that should be considered for the development of the ontologies. For
instance, specialised domain experts are necessary when developing application ontologies,
domain ontologies or task ontologies, but they tend not to have such a predominant role
when building foundational ontologies. For these kinds of ontologies philosophers and
broader knowledge experts are usually more suitable.
None of the investigated methodologies provided real detail; the descriptions of the
processes were scarce and, where present, theoretical. No account was given of the
ontology-building sessions. The methods employed during the development of the ontologies
were not fully described. For instance the reasons for choosing a particular method over a
similar one were not presented. Similarly there was no indication as to what software should
be used to develop the ontologies. METHONTOLOGY was a particular case for which
there is a software environment associated with the methodology; the recommended software
WebODE [32] was developed by the same group to be used within the framework proposed
by their methodology.
Although the investigated methodologies have different views on the life cycle of the
ontology none of them, except for DILIGENT, considers the life cycle to be dynamic. This is
reflected in the processes these methodologies propose. The development happens in a
continuum; some parts within the methodologies are iterative processes, but the steps are
linear, taking place one after the other. In the case of DILIGENT the different view on the
life cycle is clear. However, there is no clear understanding as to how this life cycle is dynamic
and evolving; the authors do not present any such discussion.
The lack of support for the continued involvement of domain experts scattered around
the world is a shortcoming in the investigated methodologies. As the SW poses a scenario in
which information is highly decentralised, such a consideration is important. Biological
sciences pose a similar scenario, in which domain experts are geographically distributed and
the interaction takes place mostly on a virtual basis.
Ontologies in the semantic web should not only be domain and/or task specific but
also application oriented. Within the SW the construction of applications and ontologies will
not always take place as part of the same software development projects. It is therefore
important for these ontologies to be easily extensible; their life cycle is one in which the
ontologies are in constant evolution, highly dynamic and highly reusable. Ontologies in
biology have always supported a wide range of applications; MO, for instance, is used by
several unrelated microarray laboratory information systems around the world. In both
scenarios, SW and biology, not only is the structure of the ontology constantly evolving, but
also the role of the knowledge engineer is not that of a leader but more that of a facilitator of
collaboration and communication among domain experts.
Parallels can be drawn between the biological domain and the SW. Pinto and coworkers
[16] define SW-related scenarios as distributed, loosely controlled and evolving. As has been
pointed out by Garcia et al. [7] domain experts in biological sciences are rarely in one place;
they tend to form virtual organisations where experts with different but complementary skills
collaborate in building an ontology for a specific purpose. The structure of the collaboration
does not necessarily incorporate a central control and different domain experts join and leave
the network at any time and decide on the scope of their contribution to the joint effort.
Biological ontologies are constantly evolving, not only as new instances are added, but also as
new whole/part-of properties are identified as new uses of the ontology are investigated. The
rapid evolution of biological ontologies is due in part to the fact that ontology builders are
also those who will ultimately use the ontology [43].
Pinto and co-workers [16], as well as Garcia et al. [7], have summarised the differences
between classic proposals for building ontologies and the requirements added by the SW in
four key points:
In chapters three, four, and five the methodology, as well as its corresponding methods and
illustrative cases, will be presented. It is based on real cases worked out within the biological
domain as well as on a thoughtful analysis of previously proposed methodologies.
1.4 ACKNOWLEDGEMENTS
The author specially thanks Oscar Corcho, and Mariano Fernandez for their extremely
helpful suggestions.
1.5 REFERENCES
20. Sure Y: Methodology, Tools & Case Studies for Ontology based Knowledge
Management. Karlsruhe: Universitat Fridericiana zu Karlsruhe; 2003.
21. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L: Sweetening
ontologies with DOLCE. In: Proceedings of the 13th International Conference on Knowledge
Engineering and Knowledge Management Ontologies and the Semantic Web: 2002: Springer-
Verlag 2002: 166-181.
22. Uschold M, Gruninger M: Ontologies: Principles, methods and applications.
Knowledge Engineering Review 1996, 11(2):93-136.
23. Lakoff G: Women, fire, and dangerous things: what categories reveal about the
mind. Chicago: Chicago University Press; 1987.
24. Cooke N: Varieties of Knowledge Elicitation Techniques. International Journal of
Human-Computer Studies 1994, 41:801-849.
25. Uschold M, King M: Towards a Methodology for Building Ontologies. In:
Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95:
1995; Cambridge, UK; 1995.
26. Gruninger M, Fox MS: The Role of Competency Questions in Enterprise
Engineering. In: Proceedings of the IFIP WG5.7 Workshop on Benchmarking - Theory and
Practice: 1994; Trondheim, Norway; 1994.
27. Bernaras A, Laresgoiti I, Correa J: Building and Reusing Ontologies for Electrical
Network Applications. In: Proceedings of the European Conference on Artificial Intelligence
(ECAI96). Budapest; 1996.
28. Swartout B, Ramesh P, Knight K, Russ T: Toward Distributed use of Large-Scale
Ontologies. In: Symposium on Ontological Engineering of AAAI: 1997: Stanford,
California; 1997.
29. Vrandecic D, Pinto HS, Sure Y, Tempich C: The DILIGENT Knowledge
Processes. Journal of Knowledge Management 2005, 9(5):85-96.
30. Uschold M, King M, Moralee S, Zorgios Y: The Enterprise Ontology. The Knowledge
Engineering Review 1998, 13(Special issue on Putting Ontologies to Use).
31. Fernandez-Lopez M, Gomez-Perez A: Overview and Analysis of Methodologies
for Building Ontologies. The Knowledge Engineering Review 2002, 17(2):129-156.
32. Arpirez JC, Corcho O, Fernandez-Lopez M, Gomez-Perez A: WebODE in a
nutshell. AI Magazine 2003, 24(3):37-47.
33. Arpirez JC, Gomez-Perez A, Lozano A, Pinto HS: Reference Ontology and
ONTO2 Agent: The Ontology Yellow Pages. In: Workshop on applications of
Ontologies and Problem-solving Methods, European Conference on Artificial Intelligence
(ECAI98): 1998; Brighton, UK; 1998.
34. Fellbaum C: WordNet, An Electronic Lexical Database. Boston: The MIT Press;
2000.
35. ISI: Information Sciences Institute. In: SENSUS Ontology,
http://www.isi.edu/natural-language/projects/ONTOLOGIES.html. 2007.
36. Knight K, Luck S: Building a large knowledge base for machine translation. In:
Proceedings of the American Association of Artificial Intelligence: 1994; 1994: 773-778.
37. Knight K, Chander I: Automated Postediting of Documents. In: Proc of the National
Conference on Artificial Intelligence (AAAI): 1994; 1994.
38. Knight K, Graehl J: Machine Transliteration. In: Proc of the Conference of the Association
for Computational Linguistics (ACL): 1997; 1997.
39. Valente A, Russ T, McGregor R, Swartout B: Building and (Re)Using an
Ontology of Air Campaign Planning. IEEE Intelligent Systems & Their Applications
1999(January/February).
40. Berners-Lee T: Weaving the Web: HarperCollins; 1999.
41. W3C: Semantic Web Activity Statement. In: http://www.w3.org/2001/sw/Activity.
2007.
42. Pinto S, Staab S, Sure Y, Tempich C: OntoEdit Empowering SWAP: a Case Study
in Supporting DIstributed, Loosely-Controlled and evolvInG Engineering of
oNTologies (DILIGENT). In: ESWS 2004: 2004; 2004: 16-30.
43. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M,
Lewis S: A short study on the success of the Gene Ontology. Journal of Web
Semantics 2004, 1:235-240.
Not only is this methodology based upon real cases but also, and more importantly,
the steps, methods and techniques described here have been extensively tested. This is the
first methodology engineered for decentralised communities of practice in which the designers
of the technology and its users may be the same group. The use of concept maps throughout the
development process, the importance of argumentative structures, and the usefulness of
narratives and text-mining techniques are among the methods and techniques described here.
Subsequent chapters present those experiences that allowed the author not only to test and
extend the methodology but also to validate it.
The author engineered the methodology and defined the steps, methods, and techniques
involved. The investigation that gathered all the information and data supporting this
methodology was conducted entirely by Alex Garcia; the involvement of communities of
practice, as well as the identification of those areas in which there could be interest in
supporting this research, were also activities carried out by Alex Garcia. The manuscripts,
as well as the corresponding journal and conference publications, were written by Alex Garcia.
2.1 INTRODUCTION
As presented in the previous chapter, building ontologies has been more of an ad hoc
process rather than a well-engineered practice. It has been argued by several authors that to
this day there is no agreed-upon standard methodology for building ontologies [1-3].
Nonetheless, there exist generic components fundamental to the ontology-building process,
present in most or all ontology developments even if they are not explicitly identified. A
detailed study of methodologies and those generic components was presented in Chapter 1.
In the present chapter “The melting point, a methodology for developing ontologies within decentralized
environments” those generic components are orchestrated in a coherent manner not only with
the way communities build ontologies but also with the life cycle of these ontologies. The
description of features and interrelationships is based upon experimentation and observation
that took place during developments in real scenarios. It was possible not only to have direct
access to domain experts but also to monitor the evolution and intended use of the ontology.
Moreover, it was possible to study the processes by which the community got involved in the
development of the ontology.
reduced group of domain experts. This same situation is also true for the whole process in
which a reduced group of domain experts work together with the knowledge engineer during
the development of the ontology; the community is not widely involved.
Within the Semantic Web (SW), as well as within the biological domain, the
involvement of communities of practice is crucial not only for the development, but also for
the maintenance and evolution of ontologies. Domain experts in biological sciences are rarely
in one place; they tend to form virtual organisations in which experts with different but
complementary skills collaborate in building an ontology for a specific purpose. The structure
of the collaboration does not necessarily have a central control; different domain experts join
and leave the network at any time and decide on the scope of their contribution to the joint
effort. Biological ontologies are constantly evolving; new classes, properties, and instances
may be added at any time, and new uses for the ontology may be identified [2]. The rapid
evolution of biological ontologies is due in part to the fact that ontology builders are also
those who will ultimately use the ontology [4].
This chapter presents the methodology inferred from those scenarios for which it was
possible to conduct experiments; these allowed the author to understand the importance and
impact of the community, as well as the structure and orchestration of the fundamental
components of the ontology-building process. The initial section of this chapter presents a
brief introduction stressing the important points that will be elaborated throughout this
chapter; some terminological considerations are presented in the second section. This is
followed by the presentation of the methodology and related information; methods,
techniques, activities and tasks are also presented in section three. Section four presents the
incremental evolutionary spiral model of tasks, activities and processes consistent with the life
cycle. Sections five and six present discussion and conclusions.
Some of the common points across previously proposed methodologies have been
adapted for the present work. An important contribution to this methodology comes from
observations made by Gomez-Perez et al. [5, 6], Fernandez et al. [7], Pinto et al. [8, 9], and
Garcia et al. [2] (see chapters 3, 4, and 5 for more information on Garcia's observations). Both
Fernandez et al. and Gomez-Perez et al. emphasise the importance of complying with
Institute of Electrical and Electronics Engineers (IEEE) standards, more specifically with the
"IEEE standard for software quality assurance plans" [10]. In the context of the conclusions
drawn in the previous chapter, such concern is understandable; not only does standards
compliance ensure careful and systematic planning of the development, it also ensures
the applicability of the methodology to a broad range of problems.
Also, from the previous chapter it became clear that methodologies bring together
techniques and methods in an orchestrated way so that the work can be done. A method is
“an orderly process or procedure used in the engineering of a product or performing a service” [11]. A
technique is defined as a “technical and managerial procedure used to achieve a given objective” [10].
Figure 1 illustrates these relationships more comprehensively.
of a method and the way in which the method is executed” [13]. According to the IEEE [14] a process
is a “function that must be performed in the software life cycle. A process is composed by activities”. The
same set of standards defines an activity as “a constituent task of a process” [14]. A task is the
atomic unit of work that may be monitored, evaluated and/or measured; a task is “a well
defined work assignment for one or more project member. Related tasks are usually grouped to form activities”
[14].
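The containment relationship among these IEEE concepts can be sketched in a few lines of code. The fragment below is purely illustrative (plain Python; the class and field names are the author's choices and are not part of any IEEE standard):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Atomic unit of work that may be monitored, evaluated and measured."""
    name: str
    done: bool = False

@dataclass
class Activity:
    """A constituent part of a process; groups related tasks."""
    name: str
    tasks: list = field(default_factory=list)

@dataclass
class Process:
    """A function performed in the life cycle; composed of activities."""
    name: str
    activities: list = field(default_factory=list)

    def progress(self) -> float:
        # A process can only be measured through its tasks,
        # since the task is the atomic unit of work.
        tasks = [t for a in self.activities for t in a.tasks]
        return sum(t.done for t in tasks) / len(tasks) if tasks else 0.0

ka = Activity("Knowledge acquisition",
              [Task("interview domain experts", done=True),
               Task("collect use cases")])
development = Process("Development", [ka])
print(development.progress())  # 0.5
```

The sketch mirrors the quoted definitions: the task remains the only measurable unit, and a process reports progress solely by aggregating its tasks.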
For the purpose of the proposed methodology it was decided that the tasks involved
would be framed within activities and processes, as illustrated in Figure 1; this conception is
promoted by METHONTOLOGY [7] for centralised settings. As these activities were not
conceived within decentralised settings, their scope has been redefined so that they better fit
the life cycle of ontologies developed by communities. The methodology presented here
emphasises decentralised settings and community involvement. It also stresses the importance
of the life cycle these ontologies follow, and provides activities, methods and techniques
coherently embedded within that life cycle.
The methodology and the life cycle are illustrated in Figure 2. The overall process starts
with documentation and management processes; the development process immediately
follows. Managerial activities happen throughout the whole life cycle, as the interaction
amongst domain experts ensures not only the quality of the ontology, but also that those
predefined control activities take place. The development process has four main activities:
specification, conceptualisation, formalisation and implementation, and evaluation. Different
prototypes of the ontology are thus constantly being deployed. Initially these prototypes may
be unstable, as the classes and properties may drastically change. In spite of this, the process
evolves rapidly, achieving a stability that facilitates the use of the ontology; changes become
more focused on the inclusion of classes and instances, rather than on the redefinition of the
class hierarchy.
Chapter 2 - Figure 2. Life cycle, processes, activities, and view of the methodology.
• Scheduling: Gantt charts are useful when scheduling processes; simple
spreadsheets or Word documents may also be used.
• Control: flowcharts allow for a simple view of the process and of the points at
which a control activity is needed.
Although there are several software suites that assist in project management, some of
them offering workgroup capabilities (see for instance http://www.mindtools.com/), large
biological ontology projects use simpler solutions, such as the facilities Google offers for
networking. Scheduling and control activities can be carried out using Google Calendar
(http://www.google.com/calendar); by the same token, sharing documents is facilitated by
Google Documents (http://docs.google.com). When establishing communication and
exchanging information, email, wiki pages and voice-over-Internet-Protocol (IP) systems have
proven useful in projects such as the Ontology for Biomedical Investigations (OBI) [15]
and the Microarray Ontology (MO) [16, 17]. A more detailed description of the involvement
of communities by high-tech means was published by Garcia et al. [2]. For both scheduling
and controlling, the software tool(s) should in principle:
• mailing lists: discussions about why a class should be part of the ontology, why it
should be part of a particular branch, how it is being used by the community, and how
a property relates two classes; in general, all discussions relevant to the ontology
happen via email.
• On-the-ontology comments: in cases where domain experts are familiar with
the ontology editor, they usually comment directly on classes and properties.
• Use cases: this should be the main source of structured documentation provided by
domain experts. However, gathering use cases is often difficult and time-consuming.
The use cases should illustrate how a term is being used in a particular context, how
the term is related to other terms, and those different uses or meanings a term may
have. Guidance is available for the construction of use cases when developing
software; however, such guidance is not available when building ontologies. From
those experiences in which the author participated, some general guidelines can be drawn,
for instance:
o use cases should be brief,
o they should be based upon real-life examples,
o knowledge engineers have to be familiar with the terminology as well as with the
domain of knowledge because use cases are usually provided in the form of
narratives describing processes,
o graphical illustrations should be part of the use case, and also
o whenever possible concept maps, or other related KA artefacts, should be used.
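Under these guidelines, a use case gathered from a domain expert can be recorded in a simple structured form. The sketch below is illustrative only (plain Python; the field names and the example content are invented for this purpose, not taken from an actual project):

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    term: str                      # the term whose usage is being documented
    context: str                   # brief narrative based on a real-life example
    related_terms: list = field(default_factory=list)
    illustrations: list = field(default_factory=list)  # e.g. concept-map files

    def is_brief(self, max_words: int = 60) -> bool:
        # First guideline above: use cases should be brief.
        return len(self.context.split()) <= max_words

uc = UseCase(
    term="microarray",
    context="An expert hybridises labelled cDNA to a microarray to "
            "measure gene expression across two tissue samples.",
    related_terms=["hybridisation", "gene expression"],
    illustrations=["microarray_cmap.png"],
)
print(uc.is_brief())  # True
```

Such a record makes the guidelines checkable: brevity can be measured, and the related terms and illustrations fields prompt the expert to supply what the knowledge engineer needs.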
These processes start as soon as there is a decision to develop the ontology, and they continue
throughout the whole ontology-development process. Managerial processes aim to ensure the
successful development of the ontology by providing domain experts with all that is needed.
Managerial processes also define general policies that allow the orchestration of the whole
development. Some of the activities involved in the managerial processes are:
2.3.2.1 Scheduling
2.3.2.2 Control
2.3.2.3 Inbound-interaction
Inbound-interaction specifies how the interaction amongst domain experts will take
place, for instance by phone calls, mailing lists, wiki pages, and web publications.
2.3.2.4 Outbound-interaction
This activity defines minimal standards for the outputs of each and every process,
activity or task carried out within the development of the ontology.
For both inbound and outbound interactions, there are some key questions that should
be addressed:
Feasibility study: this first activity involves addressing straightforward questions such
as: what is the ontology going to be used for? How is the ontology ultimately going to be used
by the software implementation? What do we want the ontology to be aware of, and what is
the scope of the knowledge we want the ontology to hold?
The milestones for this activity are: competency questions, scenarios in which the
ontology is foreseen to be used, and a "go/no-go" decision for the ontology.
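Competency questions can be kept machine-checkable from the start. The following sketch is illustrative (plain Python; the questions and vocabulary are invented examples, not drawn from a real project); it merely verifies that every term a question requires is covered by the ontology's vocabulary:

```python
# Each competency question is paired with the terms the ontology must
# define for the question to be answerable.
competency_questions = {
    "Which genes are expressed in a given tissue?": {"gene", "tissue", "expression"},
    "Which samples were hybridised to a given array?": {"sample", "array", "hybridisation"},
}

ontology_terms = {"gene", "tissue", "expression", "sample", "array"}

def unanswerable(questions, terms):
    """Return the questions whose required terms are not all defined."""
    return [q for q, required in questions.items() if not required <= terms]

print(unanswerable(competency_questions, ontology_terms))
# ['Which samples were hybridised to a given array?']  (missing 'hybridisation')
```

A check of this kind can be re-run at every prototype release, turning the competency questions into a lightweight regression test for the ontology's scope.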
concepts into higher-level abstractions and validating these with domain experts.
Scaling the models involves the participation of both domain experts and the
knowledge engineer. It is mostly done by direct interview and confrontation with
the models from different perspectives. The participation of new, "fresh" domain
experts, as well as the intervention of experts from allied domains, allows the
models to be analysed from different angles. This participatory process allows the
models to be refactored by increasing the level of abstraction.
Throughout these activities, Gruber's design principles [20], such as those mentioned
below, have to be considered.
• First design principle: “The conceptualization should be specified at the knowledge level without
depending on a particular symbol-level encoding.”
• Second design principle: “Since ontological commitment is based on the consistent use of the
vocabulary, ontological commitment can be minimised by specifying the weakest theory and defining
only those terms that are essential to the communication of knowledge consistent with the theory.”
• Third design principle: “An ontology should communicate effectively the intended meaning of
defined terms. Definitions should be objective. Definitions can be stated on formal axioms, and a
complete definition (defined by necessary and sufficient conditions) is preferred over a partial
definition. All definitions should be documented with natural language.”
For the purpose of DA and KA it is critical to elicit and represent knowledge from
domain experts. They do not, however, have to be aware of knowledge representation
languages; this makes it important that the elicited knowledge is represented in a language-
independent manner. Researchers participating in knowledge elicitation sessions are not
always aware of the importance of the session; however they are aware of their own
operational knowledge. This is consistent with the first of Gruber’s design principles.
Regardless of the syntactic format in which the information is encoded, domain experts
have to communicate and exchange information. For this reason it is usually the case that
broad general theories, principles and wide-scope problem specifications are more useful when
engaging domain experts in discussions, as these tend to contain only essential basic terms,
known across the community and causing a minimal number of discrepancies (see the
second design principle). As the community engages in the development process and the
ontology grows, it becomes more important to have definitions that are usable by both
computer systems and humans (see the third design principle).
ACTIVITIES
The milestones, techniques and tasks identified for DA and KA related activities are:
Iterative building of informal ontology models helps to expand the glossary of terms,
relations, their definition or meaning, and additional information such as examples to clarify
the meaning where appropriate. Different models are built and validated with the domain
experts. There is a fine boundary between the baseline ontology and the refined ontology;
both are works in progress, but the community involved has agreed upon the refined
ontology.
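The iterative expansion of the glossary of terms can be pictured with a minimal sketch (plain Python; the entry structure and the genetics examples are the author's invention, intended only to show how entries are merged as models are iterated):

```python
# A glossary entry records a term, its agreed definition, and clarifying
# examples; iterative building merges new contributions into the shared glossary.
glossary = {
    "gene": {"definition": "a unit of heredity", "examples": ["BRCA1"]},
}

def merge_entry(glossary, term, definition, examples=()):
    """Add a term, or extend an existing entry with new examples."""
    entry = glossary.setdefault(term, {"definition": definition, "examples": []})
    entry["examples"].extend(e for e in examples if e not in entry["examples"])
    return glossary

merge_entry(glossary, "gene", "a unit of heredity", ["TP53"])
merge_entry(glossary, "allele", "a variant form of a gene", ["BRCA1*185delAG"])
print(sorted(glossary))  # ['allele', 'gene']
```

Each round of validation with domain experts corresponds to a batch of such merges; the glossary stabilises as the baseline ontology converges towards the refined ontology the community has agreed upon.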
The milestones, techniques and tasks identified for IBOM related activities are:
1 The Ontology Lookup Service (OLS) provides a user-friendly single entry point for querying publicly available ontologies in
the Open Biomedical Ontology (OBO) format. By means of the OLS it is possible to verify whether an ontology term has already
been defined, and in which ontology it is available.
2.3.3.4 Formalisation
Formalisation of the ontology is the activity during which the classes are constrained
and instances are attached to their corresponding classes; for example, "a male is constrained
to be an animal with a Y chromosome". During formalisation, domain experts and
knowledge engineers work with an ontology editor. As iterative models are built and the
ontology is formalised, the model grows in complexity; instances, classes and properties are
added, and logical expressions are built in order to obtain definitions with necessary and
sufficient conditions. For both formalisation and IBOM, Gruber's fourth design principle
is applicable, as are Noy and McGuinness's guidelines [22].
• Fourth design principle: “An ontology should be coherent: that is, it should sanction inferences that are
consistent with the definitions. […] If a sentence that can be inferred from the axioms contradicts a
definition or example given informally, then the ontology is inconsistent.”
• Noy and McGuinness’s first guideline: “The ontology should not contain all the possible
information about the domain: you do not need to specialise (or generalise) more than you need for
your application.”
• Noy and McGuinness’s second guideline: “subconcepts of a concept usually i) have
additional relations that the superconcept does not have, or ii) restrictions different from those of
superconcepts, or iii) participate in different relationships than superconcepts. In other words, we
introduce a new concept in the hierarchy usually only when there is something that we can say about
this concept that we cannot say about the superconcept. As an exception, concepts in terminological
hierarchies do not have to introduce new relations”.
• Noy and McGuinness’s third guideline: “If a distinction is important in the domain and we
think of the objects with different values for the distinction as different kinds of objects, then we
should create a new concept for the distinction”.
• Noy and McGuinness’s fourth guideline: “A concept to which an individual instance
belongs should not change often”.
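The "male" example can be made concrete: a definition with necessary and sufficient conditions allows individuals to be classified automatically, which in practice is the job of an OWL reasoner over the formalised ontology. The toy sketch below expresses the same idea in plain Python (the individuals and attribute names are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Individual:
    name: str
    is_animal: bool = False
    chromosomes: set = field(default_factory=set)

def is_male(x: Individual) -> bool:
    # Necessary and sufficient condition: an animal that has a Y chromosome.
    # Because the condition is sufficient as well as necessary, membership
    # can be inferred rather than asserted.
    return x.is_animal and "Y" in x.chromosomes

bull = Individual("bull", is_animal=True, chromosomes={"X", "Y"})
cow = Individual("cow", is_animal=True, chromosomes={"X"})
print([x.name for x in (bull, cow) if is_male(x)])  # ['bull']
```

In an ontology editor the same definition would be written as a logical class expression, and the reasoner, rather than hand-written code, would perform the classification.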
2.3.3.5 Evaluation
There is no unified framework to evaluate ontologies, and this remains an active field of
research [23]. When developing ontologies on a community basis three main evaluation
activities have been identified:
This activity was proposed by Gomez-Perez et al. [24]. The goal of the evaluation is to
determine what the ontology defines and how accurate these definitions are. Gomez-Perez et
al. provide the following criteria for the evaluation:
• Consistency: it is assumed that a given definition is consistent if, and only if, no
contradictory knowledge may be inferred from other definitions and axioms in the
ontology.
• Completeness: ontologies are assumed to be in principle incomplete [23, 24];
however, it should be possible to evaluate completeness within the context in
which the ontology will be used. An ontology is complete if and only if all that is
supposed to be in the ontology is explicitly stated, or can be inferred.
• Conciseness: an ontology is concise if it does not store unnecessary knowledge, and
the redundancy in the set of definitions has been properly removed.
This evaluation is usually carried out by means of reasoners such as RACER
[25] and Pellet [26]. The knowledge engineer checks for inconsistencies in the taxonomy;
these may be due to errors in the logical expressions that are part of the axioms.
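A naive version of these consistency and conciseness checks, which reasoners such as RACER and Pellet perform over real ontologies, can be sketched on a toy taxonomy (plain Python; the classes and axioms are invented examples, and a real reasoner is far more capable):

```python
# Toy taxonomy: class -> set of asserted direct superclasses.
superclasses = {
    "male": {"animal"},
    "bull": {"male", "animal"},   # 'animal' is redundant: implied via 'male'
}
disjoint = {("male", "female")}   # disjointness axioms

def ancestors(c):
    """All superclasses reachable from c (transitive closure)."""
    seen, stack = set(), [c]
    while stack:
        for p in superclasses.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def inconsistent(c):
    """Consistency: a class subsumed by two disjoint classes is inconsistent."""
    anc = ancestors(c) | {c}
    return any(a in anc and b in anc for a, b in disjoint)

def redundant_assertions(c):
    """Conciseness: direct superclasses already implied via another one."""
    direct = superclasses.get(c, set())
    return {p for p in direct for q in direct - {p} if p in ancestors(q)}

print(redundant_assertions("bull"))  # {'animal'}
```

The redundant assertion flagged here is exactly the kind of unnecessary knowledge the conciseness criterion asks to remove; the inconsistency check corresponds to the contradictions a description-logic reasoner would report.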
Ontologies, like software, evolve over time; specifications often change as the
development proceeds, making a straightforward path to the ontology unrealistic. Different
software process models have been proposed; for instance, linear sequential models, also
known as waterfall models [27, 28] are designed for straight-line development. The linear
sequential model suggests a systematic, sequential approach in which the complete system will
be delivered once the linear sequence is completed [28]. The role of domain experts is passive,
as end-users of technology. They are placed in a reacting role in order to give feedback to
designers about the product. The software or knowledge engineer leads the process and
controls the interaction amongst domain experts.
The prototyping model is more flexible, as prototypes are constantly being built.
Prototypes are built as a means of defining requirements [28]; this allows a more active
role for domain experts. A quick design is often obtained in a short period of time. The
model grows as prototypes are released [23]; engineers and domain experts work on
these quick designs. Domain experts focus on representational aspects of the ontology, while the main
development of the ontology (building the models, defining what is important, documenting,
etc.) is left to the knowledge engineer. A high-speed adaptation of the linear sequential model
is the Rapid Application Development (RAD) model [29, 30]. This emphasises short
development cycles for which it is possible to add new software components, as they are
needed. RAD also strongly suggests reusing existing program components, or creating
reusable ones [28].
KA = Knowledge Acquisition, DA = Domain Analysis, IBOM = Iterative Building of Ontology Models, F = Formalisation,
EVAL = Evaluation
Chapter 2 - Figure 3. An incremental evolutionary spiral model of tasks, activities and processes.
Figure 3 illustrates how tasks and activities are incremental in the spiral, and how the
process constantly evolves. Activities such as Knowledge Acquisition (KA), Domain
Analysis (DA), Iterative Building of Ontology Models (IBOM), Formalisation (F) and
Evaluation (EVAL) take place within the spiral, not necessarily in a strict order.
Initially, those processes related to management occur. As soon as there is a “GO” for the
ontology development, the activities start with KA, DA and IBOM. Once the first
prototype of the ontology has been modelled, activities, tasks and processes can coexist
simultaneously at some level of detail within the spiral. The process allows for its own
incremental growth by facilitating the incorporation of other activities and/or processes,
such as Evaluation and Formalisation.
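One possible rendering of this spiral behaviour, purely as an illustration (plain Python; the activity labels follow Figure 3, while the control flow is the author's simplification of the model):

```python
# Each turn of the spiral revisits the core activities; Formalisation and
# Evaluation join the turns once a first prototype of the ontology exists.
CORE = ["KA", "DA", "IBOM"]

def spiral(turns: int):
    prototype_exists = False
    history = []
    for _ in range(turns):
        activities = CORE + (["F", "EVAL"] if prototype_exists else [])
        history.append(activities)
        prototype_exists = True  # the first turn yields an initial prototype
    return history

print(spiral(3))
# [['KA', 'DA', 'IBOM'], ['KA', 'DA', 'IBOM', 'F', 'EVAL'],
#  ['KA', 'DA', 'IBOM', 'F', 'EVAL']]
```

The sketch captures only the incremental aspect; in the actual model the activities within a turn coexist and recur as needed, rather than running in a fixed sequence.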
2.5 DISCUSSION
group-oriented, making it necessary to re-evaluate the whole process as well as the way in
which it is described. The IEEE proposes a set of concepts that should in principle facilitate
the description of a methodology; however, these guidelines should be better scoped for
decentralised environments.
involvement of the community allows for rapid evolution, as well as for very high quality
standards; errors are identified and discussed, and corrections are made available within short
time frames.
The model upon which this proposed methodology is based brings together ideas from
linear sequential modelling [28, 35], prototyping, spiral [36], incremental [37, 38] and
evolutionary models [28, 39]. Due to the dynamic nature of the interaction when developing
ontologies on a community basis, the model grows rapidly and continuously. As this happens
prototypes are being delivered, documentation is constantly being generated, and evaluation
takes place at all times as the growth of the model is due to the argumentation amongst
domain experts. The development process is incremental as new activities may happen
without disrupting the evolution of the collaboration. The model is therefore an incremental
evolutionary spiral in which tasks and activities can coexist simultaneously at some level of
detail. As the process moves forward activities and/or tasks are applied recursively depending
on the needs. The evolution of the model is dynamic and the interaction amongst domain
experts and with the model happens all the time. Figure 3 illustrates the model as well as how
processes, activities and tasks are consistent with the model.
2.6 CONCLUSIONS
The methodology proposed in this chapter reuses some components that various
authors have identified as part of their methodologies. This thesis has investigated how to use
these components within decentralised settings such as the biomedical domain. The proposed
methodology is consistent with the challenges posed by the ontologies needed for the SW.
The importance of this chapter is a detailed description of methods, techniques, activities, and
tasks that could be used for developing community-based ontologies. Furthermore, this
chapter has also explained the model for the life cycle of these ontologies. Both the
methodology and the life cycle are consistent with the proposed processes, activities and tasks. The
fundamental contribution of this chapter is the involvement of communities as both domain
experts and subjects of study. This allowed the author to base his results on real-life cases.
Successive chapters present engineering experiments in which the components presented
in this chapter were studied.
2.7 ACKNOWLEDGEMENTS
The author especially thanks Oscar Corcho and Mariano Fernandez for their extremely
helpful suggestions.
2.8 REFERENCES
8. Pinto HS, Martins JP: Ontologies: how can they be built? Knowledge
and Information Systems 2004, 6:441-463.
9. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for
Distributed, Loosely-controlled and evolving engineering of ontologies. In:
European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
10. IEEE: IEEE standard for software quality assurance plans. In. Edited by IEEE,
vol. 730-1998: IEEE Computer Society; 1998.
11. IEEE: IEEE Standard Glossary of Software Engineering Terminology. In:
IEEE Standards vol. IEEE Std 610.12-1990: IEEE; 1991.
12. Greenwood E: Metodologia de la investigacion social. Buenos Aires: Paidos;
1973.
13. Gomez-Perez A, Fernandez-Lopez M, Corcho O: Ontological Engineering.
London: Springer-Verlag; 2004.
14. IEEE: IEEE Standard for Developing Software Life Cycle Processes. In. Edited
by IEEE, vol. IEEE Std 1074-1995: IEEE Computer Society; 1996.
15. OBI, Ontology for Biological Investigations [http://obi.sourceforge.net/]
16. Microarray Gene Expression Data [http://www.mged.org/]
17. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing
functional genomics experiments. Comparative and Functional Genomics 2003, 4:127-
132.
18. Cooke N: Varieties of knowledge elicitation techniques. International Journal of
Human-Computer Studies 1994, 41:801-849.
19. Gaines BR, Shaw MLG: Knowledge acquisition tools based on personal
construct psychology. The Knowledge Engineering Review 1993, 8(1):49-85.
20. Gruber TR: Toward principles for the design of ontologies used for
knowledge sharing. In: International Workshop on Formal Ontology: 1993; Padova, Italy;
1993.
21. Cote R, Jones P, Apweiler R, Hermjakob H: The Ontology Lookup Service, a
lightweight cross-platform tool for controlled vocabulary queries. BMC
Bioinformatics 2006, 7(97).
22. Noy NF, McGuinness DL: Ontology Development 101: a guide to creating your first
ontology. In: Protege Documentation. Stanford, CA: Stanford University; 2001.
23. Gomez-Perez A, Fernandez-Lopez M, Corcho O: Ontological Engineering: Springer;
2004.
24. Gomez-Perez A, Juristo N, Pazos J: Evaluation and assessment of knowledge sharing
technology. In: Towards Very Large Knowledge Bases: Knowledge Building and Knowledge
Sharing (KBK95): 1995; Amsterdam, The Netherlands: IOS Press; 1995: 289-296.
25. Haarslev V, Möller R: Racer: A Core Inference Engine for the Semantic Web. In:
Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools
(EON2003): October 20 2003; Sanibel Island, Florida, USA; 2003: 27-36.
26. Sirin E, Parsia B, Cuenca-Grau B, Kalyanpur A, Katz Y: Pellet: a practical OWL-
DL reasoner. Journal of Web Semantics 2007, 5(2).
27. Eden HA, Hirshfeld Y: Principles in formal specification of object oriented
design and architecture. In: Proceedings of the 2001 conference of the Centre for Advanced
Studies on Collaborative research: 2001; Toronto, Canada: IBM Press; 2001.
28. Pressman RS: Software Engineering: A Practitioner's Approach, Fifth edn:
Thomas Casson; 2001.
29. Kerr J, Hunter R: Inside RAD: Mc-Graw-Hill; 1994.
30. Martin J: Rapid Application Development: Prentice-Hall; 1991.
31. Gilb T: Principles of Software Engineering Management: Addison-Wesley
Longman; 1988.
32. Gilb T: Evolutionary Project Management: Multiple Performance, Quality and
Cost Metrics for Early and Continuous Stakeholder Value Delivery. In:
International Conference on Enterprise Information Systems: April 14-17 2004; Porto, Portugal.;
2004.
33. Sure Y: Methodology, Tools & Case Studies for Ontology based Knowledge
Management. Karlsruhe: Universitat Fridericiana zu Karlsruhe; 2003.
34. Fernandez M: Overview Of Methodologies For Building Ontologies. In: In
Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods(KRR5):
1999; Stockholm, Sweden; 1999.
35. Dagnino A: Coordination of hardware manufacturing and software
development lifecycles for integrated systems development. In: IEEE
International Conference on Systems, Man, and Cybernetics: 2001; 2001: 1850-1855.
36. Boehm B: A spiral model of software development and enhancement. ACM
SIGSOFT Software Engineering Notes 1986, 11(4):14-24.
37. McDermid J, Rook P: Software development process models. In: Software
Engineer's Reference Book. CRC Press; 1993: 15-28.
38. Larman C, Basili VR: Iterative and incremental development: a brief
history. Computer, IEEE Computer Society 2003, 36:47-56.
39. May EL, Zimmer BA: The Evolutionary Development Model for
Software. HP Journal 1996: http://www.hpl.hp.com/hpjournal/96aug/aug96a94.htm.
A critical assessment of the state of the art of methodologies for developing ontologies
was initially presented in this thesis work. It was subsequently followed by the presentation of
the proposed methodology; this methodology is the product of several experiments and
analyses addressing some of the key issues previously identified when developing ontologies in
communities of practice such as the biological domain.
This chapter is divided into two sections; the first presents the process, results and
conclusions for one of the experiments upon which the proposed methodology relies. Some
specific issues were addressed when conducting this experiment; for instance, how could
knowledge elicitation be supported throughout the entire process? How could
domain experts be engaged in such a way that interaction was facilitated? Which parts of
previously proposed methodologies could be applied within this setting? Important
information was gathered from this experiment: not only were methodological aspects
identified, but the importance of conceptual maps was also documented and well established
as part of the development process. The second part of this chapter presents another scenario
(an ontology for a genealogy management system) for which those identified steps were also
evaluated.
The contributions of this chapter are the thorough description of the suggested steps
for building an ontology, the example use of concept maps, consideration of applicability to the
development of lower-level ontologies, and application to decentralised environments. Other
authors had previously used conceptual maps when eliciting knowledge, but this was the first
reported use of concept maps with the specific aim of developing ontologies. It was also
found that, within the specific scenario presented, conceptual maps played an important role
in the development process. Another important outcome from this
experience was the evidence supporting the importance of communities and how these were
The author investigated and identified those reusable steps from other methodologies
applicable for this specific environment; Alex Garcia also identified and conceptualised the
use of conceptual maps when developing ontologies as well as those different stages within
the development process for which conceptual maps could play a role. As the knowledge
engineer in charge of this experiment Alex Garcia could also explore and document the role
of both domain experts and knowledge engineers. Manuscripts leading to the published
papers arising from this chapter were written by Alex Garcia.
AUTHORS' CONTRIBUTIONS
Susanna Sansone conceived of and coordinated the project. Alex Garcia Castro was a
knowledge engineer during his 11-month student project at EBI. Philippe Rocca-Serra
coordinated the nutrigenomics community within MGED RSBI, and organised and
participated in the knowledge elicitation exercises. Karim Nashar contributed to the
knowledge elicitation exercises. Robert Stevens assisted Alex Garcia Castro in conceptualising
the methodology, Susanna Sansone and Philippe Rocca-Serra supervised the knowledge
elicitation exercises and, with Chris Taylor, the associated meetings. Alex Garcia Castro wrote
the initial version of the manuscript; contributions and critical reviews by the other authors, in
particular Susanna Sansone and Robert Stevens, delivered the final manuscript.
Garcia Castro A, Sansone S, Rocca-Serra P, Taylor C, Ragan MA: The use of concept
maps for two ontology developments: nutrigenomics, and a management system for
genealogies. In: 8th Intl Protégé Conference Protégé: 2005; Madrid, Spain; 2005: 59-62.
Abstract. Incorporation of ontologies into annotations has enabled ‘semantic integration’ of complex data,
making explicit the knowledge within a certain field. One of the major bottlenecks in developing bio-
ontologies is the lack of a unified methodology. Different methodologies have been proposed for different
scenarios, but there is no agreed-upon standard methodology for building ontologies. The involvement of
geographically distributed domain experts, the need for domain experts to lead the design process, the
application of the ontologies and the life cycles of bio-ontologies are amongst the features not considered by
previously proposed methodologies. Here, we present a methodology for developing ontologies within the
biological domain. We describe our scenario, competency questions, results and milestones for each
methodological stage. We introduce the use of concept maps during knowledge acquisition phases as a
feasible transition between domain expert and knowledge engineer. The contributions of this paper are the
thorough description of the steps we suggest when building an ontology, example use of concept maps,
consideration of applicability to the development of lower-level ontologies and application to decentralised
environments. We have found that within our scenario concept maps played an important role in the
development process.
3.1.1 Background
Many methodologies for building ontologies have been described [5] and seminal work
in the field of anatomy provides insights into how to build a successful ontology [6, 7].
Extensive work about the nature of the relations that can be used also provides solid grounds
for consistent development for building ontologies [8]. However, despite these efforts, bio-
ontologies still tend to be built on an ad hoc basis rather than by following a well-defined
engineering process. To this day, no standard methodology for building ontologies has been
agreed upon. Usually terminology is gathered and organised into a taxonomy, from which key
concepts are identified and related to create a concrete ontology. Case studies have been
described for the development of ontologies in diverse domains, although surprisingly only
one of these has been reported to have been applied in a domain allied to bioscience – the
chemical ontology [9] – and none in bioscience per se. Most of the literature focuses on issues
such as the suitability of particular tools and languages for building ontologies, with little
attention being given to how it should be done. This is almost certainly because the main
interest has been in reporting content and use, rather than engineering methodology.
Nevertheless, it is apparent that most ontologies are built with the ontological equivalent of
“hacking”.
demonstrated a need for bio-ontologies and several characteristics highlight the lack of
support for these requirements:
interviews, process tracing, conceptual methods, and card sorting. Unfortunately, none of
them was described within the context of ontology development in a decentralised setting.
We drew parallels between the biological domain and the Semantic Web (SW). This is a
vision in which the current, largely human-accessible Web is annotated with ontologies such
that the vast content of the Web is available for machine processing [20]. Pinto and co-workers
[21] define these scenarios as distributed, loosely controlled and evolving. Domain experts in
biological sciences are rarely in one place; they tend to form virtual organisations where
experts with different but complementary skills collaborate in building an ontology for a
specific purpose. The structure of the collaboration does not necessarily have a central control
and different domain experts join and leave the network at any time and decide on the scope
of their contribution to the joint effort. Biological ontologies are constantly evolving, not only
as new instances are added, but also as new whole/part-of properties are identified when new
uses of the ontology are investigated. The rapid evolution of biological ontologies is due in
part to the fact that ontology builders are also those who will ultimately use the ontology [22].
Some of the differences between classic proposals from Knowledge Engineering (KE)
and the requirements of the SW, have been presented by Pinto and co-workers [21], who
summarise these differences in four key points:
2. Domain expert-centric design: within the SW scenario, domain experts guide the
effort while the knowledge engineer assists them. There is a clear and dynamic
separation between the domain of knowledge and the operational domain. In
contrast, traditional KE approaches relegate the expert to the role of an informant to the
knowledge engineer.
3.1.2 Methods
A key feature of our methodology is the use of CMs throughout our knowledge
elicitation process. CMs are graphs consisting of nodes representing concepts, connected by
arcs representing the relationships between those nodes [23]. Nodes are labelled with text
describing the concept that they represent, and the arcs are labelled (sometimes only
implicitly) with a relationship type. CMs proved, within our development, useful both for
sharing and capturing activities, and in the formalisation of use cases. Figure 1 illustrates a
CM.
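The node-arc-node structure described above can be sketched as a tiny data structure. The class and the nutrigenomics-flavoured concept names below are illustrative inventions, not the project's actual CMs:

```python
# A minimal concept map: a list of (concept, relation, concept) propositions.
# Concept and relation names are invented for illustration only.
class ConceptMap:
    def __init__(self):
        self.propositions = []  # each entry is a node-arc-node triple

    def add(self, source, relation, target):
        self.propositions.append((source, relation, target))

    def relations_from(self, concept):
        """All (relation, target) arcs leaving a given concept node."""
        return [(r, t) for s, r, t in self.propositions if s == concept]

cm = ConceptMap()
cm.add("Investigation", "has_part", "Study")
cm.add("Study", "has_part", "Assay")
cm.add("Assay", "measures", "Compound")

print(cm.relations_from("Study"))  # arcs leaving the 'Study' node
```

The labelled arcs are what later carry the semantic load; an unlabelled drawing tool would lose exactly the information that makes the CM useful for ontology work.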
Our methodology strongly emphasises: (i) capturing knowledge, (ii) sharing knowledge,
(iii) supporting needs with well-structured use cases, and (iv) supporting collaboration in
distributed (decentralised) environments. Figure 2 presents those steps and milestones that we
envisage to occur during our ontology development process.
software implementation? What do we want the ontology to be aware of, and what is the
scope of the knowledge we want to have in the ontology?
Step 3: Domain analysis and knowledge acquisition are processes by which the
information used in a particular domain is identified, captured and organised for the purpose
of making it available in an ontology. This step may be seen as the ‘art of questioning’, since
ultimately all relevant knowledge is either directly or indirectly in the heads of domain experts.
This step involves the definition of the terminology, i.e. the linguistic phase. This starts by the
identification of those reusable ontologies and terminates with the baseline ontology, i.e. a
draft version containing few but seminal elements of an ontology. We found it important to
maintain the following criteria during knowledge acquisition:
• Accuracy in the definition of terms. The linguistic part of our development was also
meant to support the sharing of information/knowledge. Table 2 presents the
structure of our linguistic definitions. The availability of context as part of the
definition proved to be useful when sharing knowledge.
• Coherence: as CMs were being enriched it was important to ensure the coherence of
the story we were capturing. Domain experts were asked to use the CMs as a means
to tell a story; consistency within the narration was therefore crucial.
• Extensibility: Our approach may be seen as an aggregation problem; CMs were
constantly gaining information, which was always part of a bigger narration.
Extending the conceptual model was not only about adding more details to the
existing CMs, nor was it just about generating new CMs; it was also about
grouping concepts into higher-level abstractions and validating these with domain
experts. Scaling the models involved the participation of both domain experts and
the knowledge engineer. It was mostly done by direct interview and confrontation
with the models from different perspectives. The participation of new “fresh”
domain experts as well as the intervention of experts from allied domains allowed us
to analyse the models from different angles. This participatory process allowed us to
re-factorise the models by increasing the level of abstraction.
Word:       Investigation
Verb/Noun:  Noun
Definition: An Investigation is a set, a collection of related studies and assays; a
            self-contained unit of scientific enquiry.
Context:    Evaluating the effect of an ingredient in a diet traditionally relies on one
            or more related studies, for example where the subjects receive different
            concentrations of the ingredient. The concept of investigation provides a
            container that allows us to group these studies together.
Notes:      When can we consider an investigation completed? Ongoing discussion. For
            instance, according to the Minimal Information About a Microarray
            Experiment (MIAME), an Experiment is a set of hybridisations that are in
            some way related (e.g. related to the same publication). In the case of the
            Investigation, we do not want to tie this concept to a publication, a
            deposition to a database, or a submission to a regulatory authority. The
            decision should be left to the individual investigator.
Chapter 3 - Table 2. Example of the structure of linguistic definitions.
The goal determines the complexity of the process. Creating an ontology intended only
to provide a basic understanding of a domain may require less effort than creating one
intended to support formal logical arguments and proofs in a domain. We must answer
questions such as: Why are we building this ontology? What do we want to use it for? How is
it going to be used by the software layer? The subsections 'Identification of purpose, scope,
competency questions and scenarios' to 'Iterative building of informal ontology models'
explain these steps in detail.
Step 4: Iterative building of informal ontology models helped to expand our glossary of
terms, relations, their definition or meaning, and additional information such as examples to
clarify the meaning where appropriate. Different models were built and validated with the
domain experts.
Step 5: Formalisation of the ontology was the step during which the classes were
constrained, and instances were attached to their corresponding classes. For example: “a male
is constrained to be an animal with a y-chromosome”. This step involves the use of an
ontology editor.
Step 6: There is no unified framework to evaluate ontologies, and this remains an active
field of research. We consider that ontologies should be evaluated according to their fitness
for purpose, i.e. an ontology developed for annotation purposes should be evaluated by the
quality of the annotation and the usability of the annotation software. By the same token, the
recall and precision of the data, and the usability of the conceptual query builder, should form
the basis of the evaluation of an ontology designed to enable data retrieval.
The methodology we report herein has been applied during the knowledge elicitation
phase with the European nutrigenomics community (NuGO) [24]. Nutrigenomics is the
study of the response of a genome to nutrients, using “omics” technologies such as genomic-
scale mRNA expression (transcriptomics), cell and tissue-wide protein expression
(proteomics), and metabolite profiling (metabolomics) in combination with conventional
methods. NuGO includes twenty-two partner organisations from ten European countries,
and aims to develop and integrate all facets of resources, thereby making future nutrigenomics
research easier. An ontology for nutrigenomics investigations would be one of these
resources, designed to provide semantics for those descriptors relevant to the interpretation
and analysis of the data. When developing an ontology involving geographically distributed
domain experts, as in our case, the domain analysis and knowledge acquisition phases may
become a bottleneck due to difficulties in establishing a formal means of communication (i.e.
in sharing knowledge).
3.1.2.2.1 IDENTIFICATION OF PURPOSE, SCOPE, COMPETENCY QUESTIONS
Whilst the high-level framework of the nutrigenomics ontology will be built as a
collaborative effort with the other MGED RSBI groups, the lower-level framework aims to
provide semantics for those descriptors specific to the nutritional domain.
Having defined the scope of the ontology we discussed the competency questions with
our nutrigenomics researchers (henceforth our domain experts); these were used at a later
stage in order to help evaluate our model. Examples of those competency questions are
presented in Table 3.
For our particular purposes, we followed a ‘top-down’ approach where experts in the
biological domain work together to identify key concepts, then postulate and capture an initial
high-level ontology. We decided to follow this approach because of the availability of high-
level domain experts who could provide a more general picture. We identified for example
the Microarray Gene Expression Data (MGED) Ontology (henceforth, MO) [27] as a
possible ontology from which we could recycle (extrapolate from one context to another)
some terms and/or structure for investigations employing other omics technologies in addition
to expression microarrays. The Open Biomedical Ontologies project (OBO) [28, 29] was an
invaluable source of information for the identification of possible orthogonal ontologies.
Domain experts and the knowledge engineer worked together in this task; in our scenario, it
was a process where we focused on those high-level concepts that were part of MO and
relevant for the description of a complete investigation. We also studied the structure that
MO proposes, and by doing so came to appreciate that some concepts could be linguistically
different but in essence mean very similar things. This is an iterative process currently done as
part of the FuGO project. FuGO will expand the scope of MO, drawing in large numbers of
experimentalists and developers, and will draw upon the domain-specific knowledge of a wide
range of biological and technical experts.
We hosted a series of meetings during which the domain experts discussed the
terminology and structure used to describe nutrigenomics investigations. For us, domain
analysis is an iterative process that must take place at every stage of the development process.
We focused our discussions on specific descriptions about what the ontology should support,
and sketched the planned area in which the ontology would be applied. Our goal was also to
guide the knowledge engineer and involve that person in a more direct manner.
An important outcome from this phase was an initial consensus reached on those terms
that could potentially have a meaning for our intended users. The main aim of these informal
linguistic models was to build an explanatory dictionary; some basic relations were also
established between concepts. We decided to use two separate tools (Protégé [30] and
CMAP-tools [10]) because none of the existing Protégé plug-ins provided direct manipulation
capabilities over the concepts and the relations among them the way CMAP-tools does.
Additionally, we studied different elicitation experiences with CMs such as [31, 32]. Our
knowledge formalism was Description Logic (DL), we used the Protégé OWL plug-in.
CMs were used in two stages of our process: capturing knowledge, and testing the
representation. Initially we started to work with informal CMs; although they are not
computationally enabled, for a human they appear to have greater utility than other forms of
knowledge representation such as spreadsheets or word processor tables. As the models gained
semantic richness through the formalisation of 'is-a' and 'whole/part-of' relationships between
the concepts, the CMs evolved and became more complex. Using CMs, our domain experts were able to
identify and represent concepts, and declare relations among them. We used CMAP-tools
version 3.8 [10] as a CM editor.
The goal of these sessions was to identify both the high-level and low-level domain
concepts, why these concepts were needed, and how they could be related. A secondary goal
was to identify reusable ontologies where possible.
In the first sessions, it was important to see clearly the ‘what went where’, as well as the
structure of the relationships that ‘glued’ the information together. We were basically working
with informal artefacts (CMs, word processor documents, spreadsheets and drawings); it was
only at a later stage that we achieved some formalisation.
Some sessions took place by teleconference; these were supported by iterative use of
WEBEX (web, video, and teleconferencing software) [33] and Protégé. CMs were also used
to present structural aspects of the concepts. We found it important to set specific goals for
each teleconference, with these goals ideally specified as questions that are distributed prior to
the meeting. In our case, most of the teleconferences focused on specific concepts, with
questions of the form “how does A relate to B?”, “why do we need A here instead of B?”, and “how does
A impact on B?”. Cardinality issues were also discussed.
We also used CMs to represent conceptual queries. We observed that domain experts
are used to querying information systems using keywords, rather than building structured
queries. In formalising the conceptual queries, CMs provided the domain experts with a tool
that allowed them to go from an instance to the appropriate class/concept, at the same time
identifying the relationships. For example, within the nutrigenomics domain some
investigations study the health status of human volunteers looking at the level of zinc in their
hair. These investigations may take place in different research institutes, but all the
information may be stored in just one central repository. In order to correlate all those
investigations the researcher should be able to formulate a simple query “what is the zinc
concentration in hair across three different ethnic groups”. Figure 3 illustrates this query. Conceptually
this query relates compounds, health function and ethnicity. The concept of compound implies a
measurement; by the same token the concept of health function implies a particular part of
the organism.
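One hedged way to picture how such a conceptual query could be evaluated is as a triple-pattern intersection over a toy fact base. All identifiers and predicate names below are invented for illustration, not drawn from the actual repository:

```python
# Invented (subject, predicate, object) facts about hypothetical investigations.
facts = [
    ("inv1", "measures_compound", "zinc"),
    ("inv1", "sampled_from", "hair"),
    ("inv1", "ethnic_group", "group_a"),
    ("inv2", "measures_compound", "zinc"),
    ("inv2", "sampled_from", "hair"),
    ("inv2", "ethnic_group", "group_b"),
    ("inv3", "measures_compound", "iron"),
    ("inv3", "sampled_from", "hair"),
    ("inv3", "ethnic_group", "group_c"),
]

def subjects_matching(facts, *conditions):
    """Subjects that have a matching triple for every (predicate, object) pair."""
    result = None
    for pred, obj in conditions:
        hits = {s for s, p, o in facts if p == pred and o == obj}
        result = hits if result is None else result & hits
    return sorted(result or set())

# "Which investigations measure zinc concentration in hair?"
print(subjects_matching(facts,
                        ("measures_compound", "zinc"),
                        ("sampled_from", "hair")))
# ['inv1', 'inv2']
```

Grouping the matched investigations by their ethnic_group value would then answer the cross-group comparison in the example query.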
The collected competency questions could be used as a starting point for building the
conceptual queries. Competency questions are informal, whereas conceptual queries are used
to identify the ‘class-relation-instance’ and thus improve the understanding of how users may
ultimately query the system. Conceptual queries may be understood as a formalisation of
competency questions.
3.1.2.2.4 ITERATIVE BUILDING OF INFORMAL ONTOLOGY MODELS
Domain experts represented their knowledge in different CMs that they were
generating. Their representation was very specific; they were providing instances and relating
these instances with very detailed whole/part-of relations. Figure 4 presents an example from
the nutrigenomics domain that illustrates how we used the CMs in order to move from
instances to classes, to identify is_a relationships, and to define the whole/part-of relationships more precisely.
By gathering use cases in the form of CMs, we could identify the classes and subclasses,
for example: beverage is_a food, juice is_a non-alcoholic beverage. The has_attribute/is_attribute_of
property attached to the instance was also discussed. Moving from instances to classes was an
iterative process in which domain experts were representing their knowledge by providing a
narration full of instances, specific properties, and relationships. The knowledge engineer
analysed all the material. By doing so, different levels of abstractions that could be used in
order to group those instances were identified; ultimately domain experts validated this
analysis.
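The move from instances to classes via is_a links can be illustrated with a minimal subsumption check. The hierarchy reuses the beverage example from the text, while the instance and its name are hypothetical:

```python
# is_a links taken from the example in the text; the instance is invented.
is_a = {
    "beverage": "food",
    "non-alcoholic beverage": "beverage",
    "juice": "non-alcoholic beverage",
}
instance_of = {"orange juice sample 42": "juice"}  # hypothetical instance

def is_subclass(cls, ancestor):
    """Walk the is_a chain upwards to test subsumption (transitive is_a)."""
    while cls in is_a:
        cls = is_a[cls]
        if cls == ancestor:
            return True
    return False

# The instance inherits class membership through the is_a chain.
print(is_subclass(instance_of["orange juice sample 42"], "food"))  # True
```

Validating such chains with the domain experts is exactly the grouping-into-higher-abstractions step the methodology describes.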
As the nutrigenomics work contributes to the development of FuGO, the final steps -
formalisation and evaluation- will be possible only at a later stage, after our results (e.g. new
concepts and/or structures) are evaluated and integrated into the structure of the functional
genomics investigation ontology. However, we will continue to evaluate our framework with
our nutrigenomics users and the other RSBI groups, to see if it accurately captures the
information we need, and if our terminology and definitions are sufficiently clear to assist the
annotation process.
3.1.3.1 Formalisation
Moving from informal models to formal models with accurate is-a and whole/part-of
relationships will be done using Protégé. FuGO will also be developed in Protégé because it
has a strong community support, multiple visualisation facilities, and it can export the
ontology in different formats (e.g. OWL, RDF, XML, and HTML). Partly because Protégé
and CMAP-tools are not currently integrated and partly because they aim to assist different
stages during the process of developing an ontology, this has to be done, mostly, by hand. We
envisage that integration of these two tools may help knowledge engineers in this process;
semi-automated translation from CMs into OWL structures through the provision of
assistance, in order to allow developers to formally encode bio-ontologies, would be desirable.
Hayes and co-workers [34] addressed the problem of moving from CMs into OWL
models. They extended CMAP-tools so that it supports the import and export of machine-interpretable
knowledge formats such as OWL. Their approach assumes that the construction of the
ontology starts from the CM and that the CM evolves naturally into the ontology. This makes
it difficult for large ontologies where several CMs shape only a part of the whole ontology.
Furthermore, adding asserted conditions (such as necessary, necessary and sufficient) was not
possible; formalisation involves the encoding of the CM into a valid OWL structure by
identifying and properly declaring classes and properties. Based on those experiences in which
we have used CMs, we are designing a tool that supports such transition.
Difficulties arise from the divergence of syntactic formats between CMs and OWL
models: CMs have no logical constraints, whereas OWL structures partially rely on them; the
lack of a direct correspondence between concepts as understood in CMs and OWL classes
should also be noted. During the elicitation process, the information gathered by means of
CMs was usually incomplete in the sense that it tended to be too narrow, meaningful only
within the context of a particular researcher. Moreover, CMs initially pictured processes; at
later stages, as they gained specificity, the identification of terms and relationships was
progressively enriched. All of this adds to the difference between the information one can
gather in a CM and in an OWL model, and also emphasises the complementary relationship
between the two. The node-arc-node structure of a CM may be assimilated to an RDF
representation, as well as to an embryonic OWL model. This proximity between CMs and
OWL models allows a CM to be arranged directly into the syntactic structure of an OWL
file, thereby avoiding some of the inconveniences of translating between unrelated models.
The transition from a CM model to an OWL model may be made easier by allowing domain
experts to develop parts of the ontology with the assistance of knowledge engineers.
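As a rough illustration of the node-arc-node to RDF correspondence, a naive serialiser can emit Turtle-style triples from CM propositions. The namespace and labels are invented, and a real CM-to-OWL translation would also need class and property declarations and the constraint handling this sketch omits:

```python
# Naive translation of concept-map propositions into Turtle-style triples.
# Namespace and term names are invented for illustration only.
def to_turtle(propositions, prefix="ex"):
    def term(label):
        # Crude conversion of a CM label into a QName-safe local name.
        return f"{prefix}:{label.strip().replace(' ', '_')}"
    lines = [f"@prefix {prefix}: <http://example.org/nutri#> ."]
    for s, p, o in propositions:
        lines.append(f"{term(s)} {term(p)} {term(o)} .")
    return "\n".join(lines)

cm = [("Investigation", "has part", "Study"),
      ("Study", "has part", "Assay")]
print(to_turtle(cm))
```

The triviality of the mapping is the point: the hard part of formalisation is not the syntax but deciding which CM nodes become classes, which become instances, and which arcs become constrained properties.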
The assistance of the knowledge engineer should focus on the consistency of the
whole/part-of properties in order to ensure orthogonality. Domain experts express in their
CMs their different views of the world; the fragmentation of the domain of knowledge is
mostly done by means of is-a relationship and whole/part-of properties. Once these
properties and relationships are properly defined, combining complementary CMs may be
much easier; also by doing so, the consistency of the OWL model may be assured.
Integrating CM functionality into Protégé will not, by itself, better support the knowledge
acquisition process or speed up the formalisation and encoding of ontologies. It is also
important to harmonise CMs and OWL models both syntactically and semantically. The
construction of the class hierarchy should be done
in parallel with the definition of its properties. This will allow us to identify potential
redundancies and inconsistencies in the ontology. Domain analysis will thus be present
throughout the whole development process.
3.1.3.2 Evaluation
Before putting the ontology into use, we will need to evaluate how accurately it could
answer our competency questions and conceptual queries. To accomplish this, we will use
CMs as well as some functionalities included in Protégé.
Because our CMs represent the conceptual scaffold of the knowledge we are
representing, we will use them to evaluate how this discourse may be mapped into the
concepts and relationships we have captured. The rationale behind this is simple: the concepts
and relationships, if accurate, may then be mapped into the actual discourse. By doing this we
hope to identify:
Ultimately the ontology may also be evaluated by using the PAL (Protégé Axiom
Language) plug-in provided by Protégé. PAL allows the construction of more-sophisticated
queries. Of the methods described in [35], we checked consistency using only RACER [36].
3.1.4 Discussion
Building ontologies is a non-trivial task that depends heavily on domain experts. The
methodology presented in this paper may be used in different domains with scenarios similar
to ours. We used concept maps at different stages during this process, and in different ways.
The beauty of CMs is that they are informal artefacts; introducing formal semantics into them
remains a matter for further investigation. The translation from CMs to OWL remains
manual, and we acknowledge that some information may be lost, or even created, in this step
despite the constant participation of domain experts. An ideal ontology development tool
would assist users not only during knowledge elicitation, as CMAP-tools does well, but also
during the formalisation process, so that everything could be done within one software tool.
Unfortunately, too little attention has been paid in the bio-ontological literature to the
nature of such relations and of the relata that they join together [8]. This is especially true for
ontologies about processes. OBO provides a set of guidelines for structuring the
relationships, as well as for building the actual ontology. We are considering these and will
follow these guiding principles in our future development. We will also consider the issue of
orthogonality very carefully, as we have always thought about those ontologies that could, at a
later stage, be integrated into our proposed structure.
During the development of the GMS ontology, a narrative approach was also
investigated in conjunction with semi-automatic text extraction methods. The approach taken
was simple: domain experts were asked to build stories as they were providing vocabulary.
Empirical evidence from this experience suggests that CMs may provide us with a framework
for larger terminology extraction and validation efforts. A paper describing these experiences
is in preparation. Despite the differences between those domains, the CMs proved to be
useful when capturing and sharing knowledge, both as an external representation of the topic
being discussed, and as an organisational method for knowledge elicitation. It should be
noted, however, that only time will tell whether this methodology can be transposed into
other domains.
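As a rough illustration of the kind of semi-automatic extraction used alongside the narratives, a frequency-based candidate-term extractor might look like the following sketch. This is a stand-in only, not the actual tool used; the example narrative and stopword list are invented:

```python
import re
from collections import Counter

def extract_candidate_terms(narrative, stopwords, top_n=10):
    """Collect frequent unigrams and bigrams as candidate domain terms.

    A crude stand-in for the semi-automatic extraction step: a real
    pipeline would add POS tagging and noun-phrase chunking.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]+", narrative)]
    words = [w for w in words if w not in stopwords]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    terms = [(" ".join(b), c) for b, c in bigrams.items() if c > 1]
    terms += [(w, c) for w, c in unigrams.items() if c > 1]
    return sorted(terms, key=lambda t: -t[1])[:top_n]

# An invented narrative of the kind a domain expert might tell:
story = ("A germplasm sample is derived from a seed lot; each seed lot "
         "belongs to a germplasm line maintained by the genebank.")
print(extract_candidate_terms(story, {"a", "is", "from", "each", "by", "the"}))
```

Repeated multi-word units such as "seed lot" surface quickly even with this naive counting, which is what makes such lists useful for framing a concept-mapping session.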
3.1.5 Conclusions
Hayes and Diaz [39] propose interesting solutions. However, we consider that collaboration
emerges naturally when domain experts are provided with tools that allow them to
represent and share their knowledge in a way that makes it easy to promote and support
discussion, and to concentrate on concepts and constraints. There is a need to support
collaborative work from the perspective of allowing users to make use of a virtual working
place; cognitive support is therefore needed. The design and development of such a
collaborative environment and an accompanying CM plug-in for Protégé that supports both
the knowledge acquisition phase and the translation from the CM to an OWL structure are
clearly desirable. The development of this plug-in, as well as a more comprehensive
collaborative environment, is currently in progress.
Ontologies are constantly evolving, and their conceptual structures should be flexible
enough to accommodate this dynamism. It is important to report methodological issues (or just
“methodology”) as part of those papers presenting ontologies, in a section analogous to the
“methods and materials” sections required in experimental papers. The added clarity and
rigour that such presentation would bring would help the community extend and better adapt
existing methodologies, including the one we describe here.
3.1.6 Acknowledgements
We gratefully acknowledge our early discussions with Jennifer Fostel and Norman
Morrison, leaders of the toxicogenomics and environmental genomics communities within
MGED RSBI. We thank Ruan Elliot (Institute of Food Research) and Anne-Marie Minihane
(Reading University) for their expertise in nutritional science. We also acknowledge Mark
Wilkinson, Oscar Corcho, Benjamin Good, and Sue Robathan for their comments. Finally,
we thank Mark Green (EBI) for his constant support. This work was partly supported by the
student exchange grants of the EU Network of Excellence NuGO (NoE 503630) to SAS, the
EU Network of Excellence Semantic Interoperability and Data Mining in Biomedicine (NoE
507505) to RS, and Australian Research Council grant (CE0348221) to MAR.
3.1.7 References
11. Uschold M, King M: Towards a Methodology for Building Ontologies. In: Workshop on
Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995;
Cambridge, UK; 1995.
12. Fox M: The TOVE Project: Towards a Common-sense Model of the Enterprise. In:
Industrial and Engineering Applications of Artificial Intelligence and Expert Systems: 1992:
Springer-Verlag; 1992: 25-34.
13. Gruninger M, Fox MS: The Design and Evaluation of Ontologies for Enterprise
Modelling. In: Workshop on Implemented Ontologies, European Workshop on Artificial
Intelligence: 1994; Amsterdam, NL; 1994.
14. Uschold M: Building Ontologies: Towards a Unified Methodology. In: 16th Annual
Conf of British Computer Society Specialist Group on Expert Systems: 1996;
Cambridge, UK; 1996.
18. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for
Ontologies. The Agricultural Ontology Service (UN FAO) 2003.
22. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M, Lewis
S: A short study on the success of the Gene Ontology. Journal of Web Semantics
2004, 1:235-240.
23. Cañas A, Leake DB, Wilson DC: Managing, Mapping and Manipulating Conceptual
Knowledge. In: AAAI Workshop Technical Report WS-99-10: Exploring the Synergies
of Knowledge Management & Case-Based Reasoning. Menlo Park, California: AAAI Press;
1999.
26. Whetzel P, Brinkman RR, Causton HC, Fan L, Fostel J, Fragoso G, Heiskanen M,
Hernandez-Boussard T, Morrison N, Parkinson H, Rocca-Serra P, Sansone SA,
Schober D, Smith B, Stevens R, Stoeckert C, Taylor C, White J, and members of the
communities collaborating in the FuGO project: Development of FuGO: an Ontology
for Functional Genomics Investigations. OMICS: A Journal of Integrative Biology
2006 (in press).
27. Whetzel P, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen
M, Morrison N, Rocca-Serra P, Sansone SA, Taylor C, White J, Stoeckert CJ Jr: The
MGED Ontology: a resource for semantics-based description of microarray
experiments. Bioinformatics 2006, 22(7):866-873.
31. Briggs G, Shamma DA, Cañas AJ, Carff R, Scargle J, Novak JD: Concept Maps
Applied to Mars Exploration Public Outreach. In: Proceedings of the First
International Conference on Concept Mapping: 2004; Pamplona, Spain; 2004.
36. Haarslev V, Möller R: Racer: A Core Inference Engine for the Semantic Web. In:
Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools
(EON2003): October 20 2003; Sanibel Island, Florida, USA; 2003: 27-36.
Abstract. We briefly describe the methodology we have adopted in order to develop ontologies. Because our
scenarios involved geographically distributed domain experts, the domain analysis and knowledge acquisition
phases used different independent technologies that were not always integrated into the Protégé suite;
groupware capabilities were achieved through these external tools. From these experiences we identify conceptual maps (CMs) as an
important collaborative and knowledge acquisition tool for the development of ontologies. Direct
manipulation and collaborative facilities that currently exist in Protégé can be improved with those lessons
learnt from this and similar experiences. Here we describe our scenario, competency questions, results, and
milestones for each methodological stage, use of CMs, and vision for a collaborative environment for
ontology development. This presentation is based on two different sets of experiences, one within
nutrigenomics and the other in plant genealogy management systems.
3.2.1 Introduction
Traditionally, ontologies have been built by highly trained knowledge engineers with the
assistance of domain specialists. It is a time-consuming and laborious task. Ontology tools are
available to support this work, but their use requires training in knowledge representation and
predicate logic [2]. Bio-ontologies are developed primarily by biologists. Domain experts are
rarely available in one place, so the development of bio-ontologies is usually a distributed
effort in which teleconferences, email, commentary-tracking systems, and videoconferences
are used at all stages. During our ontology building efforts, we identified the lack of an
integrated environment in which at least some of these technologies come together to
facilitate both knowledge representation and sharing as a major bottleneck. CMs may help to
overcome these issues.
Conceptual maps are graphs that consist of nodes, with connecting arcs that represent
relationships between nodes [3]. The nodes are labeled with descriptive text representing the
"concept", and the arcs are labeled (sometimes only implicitly) with a relationship type. We
used CMs in two stages of our process, the capture of knowledge and testing the structure of
the ontology. Initially we worked with informal CMs; although these are not
computationally enabled, for humans they appear to have greater "computational efficiency"
than other forms of knowledge representation, e.g. Excel™ spreadsheets or Microsoft
Word™ tables. As our models gained semantic richness, the CMs evolved and became more
complex, formalising the knowledge in our ontologies.
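A CM of this kind is naturally modelled as a small labelled graph. The minimal sketch below is illustrative only (the class and relation names are not part of the tools described here):

```python
from dataclasses import dataclass, field

@dataclass
class ConceptMap:
    """A concept map: nodes are concept labels, arcs are labelled relations."""
    concepts: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (source, label, target)

    def add(self, source, label, target):
        self.concepts.update({source, target})
        self.relations.append((source, label, target))

    def neighbours(self, concept):
        """All (label, target) arcs leaving a concept."""
        return [(l, t) for s, l, t in self.relations if s == concept]

cm = ConceptMap()
cm.add("BioSource", "is_a", "BioMaterial")
cm.add("BioMaterial", "has_characteristics", "BioMaterialCharacteristics")
print(cm.neighbours("BioMaterial"))  # [('has_characteristics', 'BioMaterialCharacteristics')]
```

The point of the representation is that both halves of the map, concepts and labelled arcs, are first-class: exactly the two elements that later become classes and properties.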
We found that the CMs made it possible for domain experts to identify and represent
concepts, and to declare relations among them. More importantly, they helped clarify the
difference between the ontological model, ER (Entity relationship) models and the possible
object model (OM). For biologists, ontologies have a concrete representation in dictionaries,
whereas they view object models as being more related to implementation. Implementation
details were thus separated from ontologically related issues. We used CMAP
(http://cmap.ihmc.us/) [1] as a CM editor.
The ontologies we are developing are asymmetric and complementary. In one we want
to ease the process of accurately capturing nutrigenomics data via web-forms, whereas in the
other we want to facilitate the building of queries over large genealogy databases
(http://cropwiki.irri.org/icis/index.php/Germplasm_Ontology). They are two different
experiences with similar problems, and a common bottleneck, knowledge acquisition. From
both ontologies we identified the importance of cognitive support over the groupware facility.
3.2.2 Methodology
so we also made decisions about inclusion, exclusion and the first draft of the hierarchical
structure of concepts in the ontology.
An important outcome from this phase was the consensus that we reached on terms
that could potentially have a meaning for our intended users. The main aim of these informal
linguistic models was to build an explanatory dictionary; some basic relations between
concepts were also established.
We built different models throughout our analyses of available knowledge sources and
information gathered in previous steps. First a “baseline ontology” was assembled, i.e. a draft
version containing few but seminal elements of an ontology. Typically, the most important
concepts and relations were identified somewhat informally. We could assimilate this
“baseline ontology” into a taxonomy, in the sense of a structure of categories and
classifications. We consider a taxonomy as “a controlled vocabulary which is arranged in a
concept hierarchy”, and ontology as “a taxonomy where the meaning of each concept is
defined by specifying properties, relations to other concepts, and axioms narrowing down the
interpretation”. As the process of domain analysis and knowledge acquisition evolves, the
taxonomy takes the shape of an ontology. During this step, the ontologist worked primarily
with only a few of the domain experts; the others were involved in weekly meetings. In
this phase the ontologist sought to provide the means by which the domain experts he or she
was working with could express their knowledge. Some deficiencies in the available
technology were identified, and for the most part were overcome by our use of CMs.
For subsequent steps (i.e. formalisation and evaluation), different needs may be
identified.
Our knowledge acquisition phase took place in different stages, for some of which the
domain experts were not together. CMs proved very useful in facilitating the visualisation and
discussion, and in providing domain experts with a tool that could be used to declare the
primary elements of their knowledge. OWLviz [5] was initially tested to support domain
experts in this task, but this plug-in did not provide direct manipulation (DM) capabilities
over the concepts and the relations among them. We also tested Jambalaya [6] before deciding
to use two separate tools (i.e. Protégé [7] and the CMAP tools). Since CMs support the
declaration of nodes and relationships, it was easy to assimilate these to classes and properties.
The conversion was a straightforward, albeit manual, process.
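The pattern behind that manual conversion is mechanical enough to sketch: CM nodes map to OWL classes, `is_a` arcs to subclass links, and other labelled arcs to object properties. The sketch below emits Turtle directly; the namespace and node names are hypothetical, and a real converter would also need to handle annotations and restrictions:

```python
def cm_to_turtle(triples, base="http://example.org/onto#"):
    """Translate concept-map triples (node, link-label, node) into Turtle/OWL.

    'is_a' arcs become rdfs:subClassOf; any other arc label becomes an
    owl:ObjectProperty relating the two classes.
    """
    lines = [f"@prefix ex: <{base}> .",
             "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
             "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ."]
    classes, statements = set(), []
    for source, label, target in triples:
        classes.update({source, target})
        if label == "is_a":
            statements.append(f"ex:{source} rdfs:subClassOf ex:{target} .")
        else:
            statements.append(f"ex:{label} a owl:ObjectProperty ; "
                              f"rdfs:domain ex:{source} ; rdfs:range ex:{target} .")
    lines += [f"ex:{c} a owl:Class ." for c in sorted(classes)]
    return "\n".join(lines + statements)

print(cm_to_turtle([("BioSample", "is_a", "BioMaterial"),
                    ("BioMaterial", "has_type", "MaterialType")]))
```

Even this toy mapping shows why the conversion stayed manual in practice: deciding whether an arc is taxonomic or an object property is a modelling judgement, not a syntactic one.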
The main feature we identified from our work with CMs was the DM capability
provided to us by the software. This functionality had several advantages, which we list below.
Interestingly, all of these advantages had previously been identified by Shneiderman:
• The user should be presented with an empty canvas on which concepts, linking
phrases and properties can be declared by a direct click.
Since our methodology involves participatory design activities, it is important for the
tool to support this range of activities. We consider that CMs may play a crucial role in
assisting users in these activities. Our development inherits many of the features already
available in CMapTools; we are extending it to additionally allow users to
“discuss” on-line while manipulating the OWL file. We are thus extending
the capabilities currently available in Protégé, not just to enhance browsing but, more deeply,
to promote a collaborative environment for the development of ontologies. Since Protégé was
mainly developed as a desktop tool, its web implementation lacks some groupware features.
In order to implement an integrated web-based ontology development environment, Human-
Computer Interaction studies need to be conducted.
3.2.5 Acknowledgements
The authors would like to thank Robert Stevens and Karim Nashar for the useful
discussions and collaboration. A. Garcia is supported by Institute for Molecular Bioscience,
Australian Centre for Plant Functional Genomics, the ARC Centre in Bioinformatics and the
EMBL-EBI. SA Sansone is supported by the ILSI-HESI Genomics Committee and Philippe
Rocca-Serra by the European Commission NuGO project.
3.2.6 References
2. Seongwook Y, et al.: Survey about ontology development tools for ontology-based
knowledge management. 2003.
5. Knublauch H: OWLviz: a visualisation plugin for the Protégé OWL plugin.
http://www.co-ode.org/downloads/owlviz/.
7. Gennari JH, et al.: The evolution of Protégé: an environment for knowledge-based
systems development. International Journal of Human-Computer Studies 2003, 58(1):
89-123.
The importance of conceptual maps, as well as their use, was studied at length in the
experiences reported in chapters 3 and 5. Although the benefits of concept maps were well
understood, it was also clear that, in order to better facilitate communication amongst
domain experts and with the knowledge engineer, it was important to have an argumentative
structure. Interaction amongst domain experts generates large amounts of data and
information, not always usable or understandable by the knowledge engineer; for this,
conceptual maps could be used. This chapter addresses the problem of supporting the
argumentative structure that was the result of the interaction amongst domain experts; it also
studies the argumentative structure within the context of developing ontologies within
decentralised settings.
The main contribution of this paper is not only to present a practical use for
argumentative structures, but also to support this structure by means of conceptual maps. In
this chapter the use of concept maps is proposed as a means to support and scaffold an
argumentative structure during the development of ontologies within loosely centralised
communities. This novel use of conceptual maps had not previously been studied.
The author conceived and coordinated the project. The proposed use of conceptual
maps, as well as the extensions for the argumentative structure was the product of the analysis
the author carried out during those experiences reported in this thesis. Alex Garcia wrote the
published paper based on this chapter.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro conceived and coordinated the project; he also wrote the
manuscripts for this paper. Angela Noreña and Andrés Betancourt were domain experts in
the knowledge elicitation exercises and also assisted Alex Garcia Castro in the implementation
of the first version of the plug-in. Mark A. Ragan supervised the project, and assisted Alex
Garcia Castro in the preparation of the final manuscript.
Garcia Castro A: Cognitive support for an argumentative structure during the ontology
development process. In: 9th Intl Protégé Conference: July, 2006; Stanford, CA, USA; 2006.
Abstract: Structuring and supporting the argumentative process that takes place within the knowledge
elicitation process is a major problem when developing ontologies. Knowledge elicitation relies heavily on the
argumentative process amongst domain experts. The involvement of geographically distributed domain
experts, and the need for domain experts to lead the design process, add an interesting layer of complexity to
the whole process. We consider that the argumentative structure should facilitate the elicitation process and
serve as documentation for the whole process; it should also facilitate the evolution and contextualisation of
the ontology. We propose the use of concept maps as a means to support and scaffold an argumentative
structure during the development of ontologies within loosely centralised communities.
4.1 INTRODUCTION
The applications of knowledge engineering are growing larger and more systematic,
now encompassing more ambitious ontologies—sizes in the hundreds of thousands of
concepts will not be uncommon [1]. Furthermore, the development of those ontologies is
usually a participatory exercise in which different experts interact via virtual means, resembling
thereby a loosely centralised community. We believe the requirements of the Semantic Web
(SW) bring with them an associated need for enhanced cognitive support in the tools we use.
Cognitive support leverages innate human abilities, such as visual information
processing, to increase human understanding and cognition of challenging problems [2].
Developing ontologies in loosely centralised environments such as those described by Pinto et al.
[3] poses challenges not previously considered by most existing methodologies. This user-
centric design relies heavily on the ability of domain experts to interact with each other and
with the knowledge engineer. By doing so the ontology evolves. Mailing lists, web forums,
and WIKI pages usually support this interaction. Despite this combination of tools (none of
them an ontology editor per se, nor a knowledge engineering tool), information is lost,
documentation is poorly structured, and the process is not always easy to follow. This results
in a decreased participation by the domain experts.
Central to ontology development is the process by which domain experts and the
knowledge engineer argue about terms/types and relationships. This collaborative interaction
generates threads of arguments [3, 4, 7], and there is a need to support the evolution and
maintenance of this argumentative process in a way that makes it easy to follow and, more
importantly, links to evidence and provides room for conflicting points of view. Figure 1
presents the argumentative structure proposed by [4].
Chapter 4 - Figure 1. The major concepts of the argumentation ontology and their relations.
Reproduced with permission from [4]
CMs are semantically valid artefacts without OWL constraints; concepts and
relationships are the main scaffold of a CM. At any given point during the argumentative
process one has a concept/class and a relationship/property. As the discussions evolve, the
amount of information attached to the concept or relationship increases; the
argumentative structure is enriched as domain experts provide arguments and ground them
in evidence, which may be a paper, a commentary, or more generally a file of any kind (i.e. an
information source). The different views of the world can be represented with a CM, and the
evidence may be attached to the particular concept/class or relationship/property at hand.
This graphic representation facilitates the continuous exchange of information amongst
domain experts – sharing knowledge. Following the threads of the discussions is not always
easy for domain experts. The information exchanged is usually structured as an email-based
chat. The knowledge engineer has to follow these text-based discussions in which there is
mostly verbal knowledge, filter them, and at some point “formalise” that implicit knowledge.
Moving from verbal knowledge into formalised-shared knowledge is difficult; some
information is usually lost, the evidence supporting those different positions is not always
provided by domain experts, and most importantly keeping domain experts engaged
throughout the entire process is not always possible. Cognitive support is thus required so we
may facilitate the useful flow/exchange of information and at the same time record the entire
process.
Concepts and relationships resemble the two key components within an argumentative
structure: arguments and positions. During the development process we argue in relation to a
concept and/or a relationship. Positions are supported upon evidence, and the simple
argumentative structure is by itself a particular view of the world that is being modelled.
Figure 2 illustrates the basics behind the relationship between CMs and an argumentative
structure.
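This correspondence between CM elements and argumentative elements can be made concrete with a few record types. The sketch below is illustrative only (all names are hypothetical); it shows positions over an issue being grounded in attached evidence files:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    description: str
    file: str  # e.g. a paper (pdf), a commentary, or any attached file

@dataclass
class Position:
    author: str
    statement: str
    evidence: List[Evidence] = field(default_factory=list)

@dataclass
class Issue:
    """An issue raised over a CM concept/class or relationship/property."""
    subject: str
    positions: List[Position] = field(default_factory=list)

    def supported_positions(self):
        """Positions that are actually grounded in attached evidence."""
        return [p for p in self.positions if p.evidence]

issue = Issue("BioMaterial")
issue.positions.append(Position("expert-1", "Adopt the MGED sense of biomaterial",
                                [Evidence("MO class definition", "mged_onto.pdf")]))
issue.positions.append(Position("expert-2", "Too narrow for plant genomics"))
print(len(issue.supported_positions()))  # 1
```

Filtering on attached evidence mirrors the requirement in the text that positions be supported, not merely asserted.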
As the ontology grows, so does the complexity of the information available for each
and every component of the ontology (e.g. classes, properties, instances). Although having an
ontology that represents the structure of the argumentative process helps the knowledge
engineer in the classification of the information, in order for the evidentiary material to be
useful it needs to be attached to the relevant piece of the ontology. For instance, when
discussing “biomaterial” during the development of a laboratory information
management system for functional plant genomics, one feasible starting point for the
discussion would be to adopt the same understanding of biomaterial as is available in the
microarray ontology.
2 Concertation. From the French concertation: a conciliatory process by which two parties reach an agreement.
class BioMaterial
definition:
Description of the processing state of the biomaterial for use in the microarray hybridisation.
superclasses:
BioMaterialPackage
known subclasses:
BioSample
BioSource
LabeledExtract
properties:
unique_identifier MO_226
class_role abstract
class_source mage
constraints:
restriction: has_type has-class MaterialType
restriction: has_biomaterial_characteristics has-class BioMaterialCharacteristics.
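The two `has-class` restrictions at the end of the listing constrain the fillers of `has_type` and `has_biomaterial_characteristics`. A toy checker, assuming a flat property-to-filler representation of instances (the sample data below is invented), illustrates how such restrictions could be validated:

```python
# Restrictions from the BioMaterial listing: each property filler must be an
# instance of the named class ("has-class", roughly owl:allValuesFrom).
RESTRICTIONS = {
    "has_type": "MaterialType",
    "has_biomaterial_characteristics": "BioMaterialCharacteristics",
}

def violations(instance, types):
    """Return (property, value) pairs whose filler has the wrong class.

    `instance` maps property names to lists of filler identifiers;
    `types` maps identifiers to their asserted class.
    """
    bad = []
    for prop, expected in RESTRICTIONS.items():
        for value in instance.get(prop, []):
            if types.get(value) != expected:
                bad.append((prop, value))
    return bad

sample = {"has_type": ["whole_organism"],
          "has_biomaterial_characteristics": ["cultivar_x"]}
types = {"whole_organism": "MaterialType", "cultivar_x": "Genotype"}
print(violations(sample, types))  # [('has_biomaterial_characteristics', 'cultivar_x')]
```

In a real setting this check is what a description-logic reasoner performs; the sketch only makes the intent of the listed restrictions visible.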
A very important part of the whole process is the management of the history: tracing the
argumentation process back from the position_on_issue to the elaboration of a particular
argument, and being able to “see” the argumentative structure in order to “stand” in a particular
place. The history should also allow us to go back and take an alternative route; we thus see
the history not just as a simple “undo” but as a more complex feature.
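Seen this way, the history is a tree of states rather than a linear undo stack: going back and taking an alternative route creates a branch instead of discarding the earlier one. A minimal sketch (names hypothetical):

```python
class HistoryNode:
    """One state of the argumentative structure; children are alternatives."""
    def __init__(self, state, parent=None):
        self.state, self.parent, self.children = state, parent, []

class History:
    """A branching history: go back to any state and take an alternative route."""
    def __init__(self, initial):
        self.current = HistoryNode(initial)

    def record(self, state):
        node = HistoryNode(state, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def back(self):
        if self.current.parent:
            self.current = self.current.parent
        return self.current.state

h = History("issue raised")
h.record("argument added")
h.back()                           # return to "issue raised"
h.record("alternative argument")   # branch instead of overwrite
print(len(h.current.parent.children))  # 2: both routes are kept
```

Because both branches survive, the knowledge engineer can later inspect the route not taken, which a plain undo stack would have destroyed.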
For any given issue there is an argument that is elaborated by presenting the
conflicting positions; the elaboration provides instances – concrete examples. Once a consensus
is reached there is a position on the issue initially at hand. The issue is well focused and
specific, and the same is true for the argument: it supports a position in a few simple words,
whereas the elaboration of the argument tends to be longer and supported by files of different
kinds (e.g. pdf, ppt, doc, xls). Although there may be more than one argument for any given issue,
there is only one elaboration for each argument. The dispute-resolution process (also known
as the conciliatory process) produces a position on the particular issue; within this process the
knowledge engineer acts as a facilitator. Discussions over terminology, and over conceptual
models, tend to address one issue at a time and this is highly dependent on the knowledge
engineer. Within this context conceptual maps provided a scaffold upon which the
argumentative ontology may be instantiated.
4.5 REFERENCES
1. Ernst NA, Storey M-A, Allen P: Cognitive support for ontology modeling. Int J
Human-Computer Studies 2005, 62:553-577.
5. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone
S: The use of concept maps during knowledge elicitation in ontology development
processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
7. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of
conceptual maps for two ontology developments: nutrigenomics, and a
management system for genealogies. In: 8th Intl Protégé Conference Protégé: 2005;
Madrid, Spain; 2005: 59-62.
This chapter has two sections, the first one “The use of narratives and text-mining extraction
techniques to support the knowledge elicitation process during two ontology developments” presents an
insightful study about the narratives that were being gathered during the knowledge elicitation
process.
Conceptual maps could be used to support the argumentative structure; they were also
quite useful when eliciting knowledge. However, eliciting knowledge was not always a
straightforward question-answer process. Very often domain experts were building narratives
in order to explain in a more illustrative manner their scenarios to the knowledge engineer.
Moreover, it was observed that once the baseline ontology was built, domain experts tended to
support their discussions with these narratives. Empirically, they were using conceptual maps as
they were “drawing” their ideas. Although some instances were being gathered from these
narratives, it was important to better frame the elicitation exercises. How could the narratives
and the elicitation exercises be better framed, and how could these narratives be better
used and supported when eliciting knowledge? These are the two main issues this section
addresses.
The second section of this chapter, “A proposed semantic framework for reporting OMICS
investigations” addresses the issue of describing biological investigations, “How to provide
semantics for upper-level elements relevant to the representation and interpretation of omics-
based investigations?” This section presents an upper-level ontology for the representation of
biological investigations. The experience reported here was useful, as it was important for the
author to test the proposed methodology with domain experts from different disciplines
(Nutrigenomics, Toxicogenomics, Environmental Genomics); it was equally important to
study how these domain experts were reaching their consensuses after debating on their
conceptual models. Chapter 7 follows up on the issue of describing biological investigations,
not from the semantic perspective but by studying practical issues when describing these
investigations. The ontology described in this section is the product of the work between the
author and domain experts from the MGED-RSBI working group.
Alex Garcia conceived and coordinated the work presented in this chapter. He
identified the need to make better use of the narratives as they were being gathered. Alex Garcia
also investigated how to use text-mining techniques to support the development of
ontologies. The author conducted several meetings with members of the MGED-RSBI
working group in order to develop the presented ontology.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro conceived and coordinated both projects; he also wrote the
manuscripts for this paper. Susanna Sansone provided useful discussion, and assisted Alex
Garcia in the preparation of those submitted manuscripts. Philippe Rocca-Serra and Chris
Taylor provided useful discussion.
Garcia Castro A, Sansone AS, Taylor CF, Rocca-Serra P: A conceptual framework for
describing biological investigations. In: NETTAB: 2005; Naples, Italy; 2005.
Abstract. Extracting terminology is not always an integral part of methodologies for building
ontologies. Moreover, the use of terms extracted from literature relevant to the domain of knowledge for
which the ontology is being built has not been extensively studied within the context of knowledge elicitation.
We present here some extensions to the methodology proposed by Garcia et al. (BMC Bioinformatics 7:267,
2006); two important advances over the initially proposed methodology are the use of extracted terminology to
frame the building of conceptual maps, and the use of narratives during the knowledge elicitation phases.
5.1.1 Introduction
At a glance, an ontology represents some kind of world view with a set of concepts and
relations amongst them, all of these defined with respect to the domain of interest. Some
scholars redefine the term in an effort to capture an absolute view of the world. For instance,
Sowa [1] defines ontologies as “The study of existence, of all kind of things (abstract and concrete) that
make up the world”. A more pragmatic definition is given by Neches et al. [2], who consider
that an ontology “defines the basic terms and relations comprising the vocabulary of a topic
area, as well as the rules for combining terms and relations to define extensions to the
vocabulary”. For practical reasons we agree with this definition, as the main aim of the GMS
(Genealogy Management System) ontology is to define a set of basic terms that may
accurately describe Germplasm, within the context of crop information systems, more
specifically within the International Crop Information System (ICIS) [3].
In this paper we present our early ontology for the GMS, as well as the methodology we
followed. Our scenario involved the development of an ontology with direct physical access
to domain experts within the Australian Centre for Plant Functional Genomics (ACPFG) and
the International Center for Tropical Agriculture (CIAT). We thus decided to adapt and
reuse steps from existing methodologies.
This paper is organised as follows: Section 5.1.1 presents an introduction and some
background information along with a brief description of our scenario. A survey of the
methodologies we investigated is given in Section 5.1.2. Section 5.1.3 presents the extensions
to the methodology we used; descriptions of the steps we took are also given in this
section. We place special emphasis on terminology extraction and conceptual mapping during the
elicitation process. Results (i.e. our ontology) are presented in Section 5.1.5. Our discussion
and conclusions are presented in Section 5.1.6.
A range of methods and techniques has been reported in the literature regarding
ontology building methodologies. However, there is an ongoing argument amongst those in
the ontology community about the best method to build them [7, 9].
Most ontology-building methodologies are inspired by work done in the field
of knowledge engineering to create methodologies for developing Knowledge-Based
Systems (KBS). For instance, the Enterprise Methodology [10], like most KBS development
methodologies, distinguishes between the informal and formal phases of ontology
development. METHONTOLOGY [11] adapts work done in the area of knowledge-base
evaluation for the ontology-evaluation phase. The “Distributed, Loosely-controlled and
evolving engineering of ontologies” (DILIGENT) methodology [12] offers a set of
considerations and steps suitable for loosely centralised environments in which domain experts
are geographically distributed. Table 1 presents a summary of our comparison. GM
(henceforth, the Graph-based Methodology [4]) provided us with some detail for the knowledge-elicitation
process; however, our scenario was not entirely one in which domain experts were
geographically distributed and thus some of the techniques described by GM could not be
directly applied to our case.
We analysed these approaches according to the criteria proposed by Mirazee [13]. Most
of the methodologies do not provide details as to how one actually goes about building the
ontology. Although GM reported the use of concept maps as well as details for
knowledge elicitation, it is not entirely clear how, within the process of eliciting knowledge,
the narrative provided by different but complementary CMs may be reached or used. Nor
is there any illustration of the relationship between CMs and terminology extraction;
empirically, we could see how these two techniques complement each other.
support other methods. We studied Protégé [15], HOZO [16], and pOWL [17] as software
tools for developing ontologies. None of them supports any particular methodology in a
special way. Moreover, none of these software packages provides support for terminology
extraction or conceptual mapping. All of these methods and techniques are still determined to
some extent by the particular circumstances in which they are applied. We must note that in
any given circumstance there may be no available guideline for deciding which techniques
and methods to apply [18].
Since none of the reported methodologies could be fully applied to our particular
scenario and needs, we decided to adapt and reuse some of the steps described in the
methodologies we investigated. The modifications we introduced to the methodology
proposed by Garcia et al. were mostly due to the close relationship that our domain experts
had with the implemented software, ICIS. This familiarity brought about some situations not fully
addressed by Garcia et al., such as:
• Confusion between database schemata and ontology: domain experts were not fully
aware of the difference between the conceptual and the relational model
• Difficulties with the extracted terms
• Domain experts were at the same time users, designers, developers, and policy
makers of a particular kind of GMS; their vision was too broad on the process but
at the same time too narrow on the software
Since most of those steps we took have been described by Garcia et al., we will only
present details for those variations we introduced. A schematic representation of our process
is given in Figure 1.
When building ontologies it is important to gather not only classes but also
instances, so we decided to investigate how we could better support our process by means of
terminology extraction. Initially we only wanted to have classes and instances within our
ontological corpus; however, terminology extraction also proved to be useful during
knowledge elicitation, more specifically when combined with conceptual mapping. We
used Text2Onto [19] as our terminology-extraction tool because it allowed us to use
documents in their original formats (PDF, XLS, DOC, TXT, etc.) as the main source of
information. Text2Onto also facilitated the process of constraining the terminology by
allowing the domain experts and the knowledge engineer to inspect the models inferred
from the extracted terminology. In parallel to our terminology-extraction exercises, our
domain experts were building informal ontology models. By informal we mean basic
representations of their particular view of the world with no logical constraints; a
“free-drawing” exercise that helped to establish communication between the knowledge engineer
and the domain experts.
The terms were extracted using the TermExtractor component of the TextToOnto
ontology-engineering workbench. TermExtractor uses the C-value method to identify
candidate multi-word terms in a corpus and to estimate confidence in them [22]. It utilises linguistic
methods to identify the candidate terms and then statistical methods to assign each
term a "C-value" indicating confidence in its "termhood". This C-value is derived using a
combination of "the total frequency of occurrence of the candidate string in the corpus, the
frequency of the candidate string as part of other longer candidate terms, the number of these
longer candidate terms, and the length of the candidate string (in number of words)" [22]. For
additional details about the algorithm see [22]; for the implementation see [20].
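The C-value computation just described can be sketched as follows. This is an illustrative simplification, not the TermExtractor code itself: the function name, the representation of candidates (tuples of words mapped to corpus frequencies) and the nesting test are our assumptions, following the basic C-value formula of [22].

```python
import math

def c_value(candidates):
    """Compute C-value termhood scores for multi-word candidate terms.

    `candidates` maps each candidate term (a tuple of words) to its corpus
    frequency. A candidate that is nested inside longer candidates has its
    frequency discounted by the mean frequency of those longer terms.
    """
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous sub-sequence of `longer`
        n, m = len(longer), len(shorter)
        return any(longer[i:i + m] == shorter for i in range(n - m + 1))

    scores = {}
    for term, freq in candidates.items():
        # Frequencies of the longer candidates in which `term` is nested
        nesting = [f for other, f in candidates.items()
                   if len(other) > len(term) and contains(other, term)]
        if nesting:
            adjusted = freq - sum(nesting) / len(nesting)
        else:
            adjusted = freq
        # Longer candidate strings weigh more, via log2 of the word count
        scores[term] = math.log2(len(term)) * adjusted
    return scores
```

In this sketch, a term such as "soft computing" that also appears inside "adaptive soft computing" has its raw frequency reduced before weighting, which is the essence of the discounting described above.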
The main goal of the GMS ontology is to describe the breeding history of germplasm;
phenotypic and genotypic aspects of the germplasm are not considered by this
ontology. The function of the GMS in ICIS is to provide a unique identifier for all packets of
seed for a given germplasm. It should be noted that although almost all progenitors of
present germplasm no longer exist, we must still record them in order to trace pedigrees. The GMS
also manages all the names attached to a packet of seed: homonyms, synonyms, and
abbreviations. Most importantly, the GMS provides a breeding history for the germplasm so that
questions such as those listed below may be easily answered.
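The kind of pedigree question the GMS is meant to answer can be illustrated with a minimal sketch. The identifiers and the parent mapping below are hypothetical; ICIS's actual data model is richer than this.

```python
def trace_pedigree(gid, parents, depth=0):
    """Recursively list the progenitors of a germplasm entry.

    `parents` maps a germplasm identifier to the identifiers of its
    immediate progenitors (an empty tuple when the origin is unknown).
    Returns (generation, identifier) pairs in pre-order, so the full
    breeding history can be read off even for progenitors that no
    longer physically exist.
    """
    lineage = [(depth, gid)]
    for parent in parents.get(gid, ()):
        lineage.extend(trace_pedigree(parent, parents, depth + 1))
    return lineage
```

For example, with `parents = {"G3": ("G1", "G2"), "G2": ("G0",)}`, tracing "G3" walks back through both parents and the grandparent of "G2", reconstructing the genealogy from the stored links alone.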
Chapter 5 - Figure 2. Classes, instances, and relationships gathered by bringing together extracted
terms and previously built ontological models.
The result of the elicitation stage within the functional plant genomics context is
illustrated in Figure 3. We gathered ten different, yet related, concept maps from two domain
experts; this graphic represents the consensus. The main aim of the process modelled here is
to improve the corresponding plant material; traditionally this improvement has dealt with
specific phenotypic features such as yield, abiotic and biotic stresses, nutritional quality,
and market preferences. From the elicitation sessions we could identify several orthogonal
ontologies needed to represent the processes that form part of the narrative we were
working with. For instance, ontologies describing “stress” and “plant yield” were needed to
complement the model.
In order to assist the knowledge engineer in harmonising the concept maps
gathered, domain experts were asked to tell a unified story that could bring together the
different concept maps. As a guide, domain experts had access to the list of extracted
terminology. Interestingly, the story had a direct relationship with the main aim of the
laboratory process; some of the GMS ontology terms were used, but the narrative was not
limited to genealogies. A broader picture could thus be produced.
Chapter 5 - Figure 3. Narrative, as seen from those concept maps and ontology models domain
experts were building.
Our baseline ontology has classes, instances, and relationships; initially, domain experts
organised the classes with no consideration for time and space. For them it was important to
have a coherent is-a structure they could relate to, and consequently use to describe
the genealogy of a given germplasm. Figure 4 illustrates the structure of our baseline ontology.
By showing one feasible use of text mining when building ontologies, not only did we
extend the methodology proposed by Garcia et al. [19], but we also developed a deeper
understanding of how concept maps and text mining can be used together to build narratives
that can later be used in the construction of ontologies. These narratives were used not only
(by us) to ease the understanding of this particular domain but also, at a later stage, by the
knowledge engineer to assess the ontological corpus gathered in the different models
provided by domain experts.
Domain experts were requested to match some of the provided narratives against the
concept maps. This exercise made it possible not only to extend our lexicon but also to
evaluate the informal models. Engaging domain experts in the process of building the
ontology was also simplified by the use of narratives. Domain experts were telling a story
in a structured manner, and this allowed them to better understand the is-a relationships within
the class hierarchy.
In our experimental method for building an ontology to describe genealogies within the
context of plant breeding, we constructed several ontological models by combining terms and
relationships from the mined texts. Orthogonal ontologies were easily identified as domain
experts represented their narratives as CMs. For instance, developmental stages
described in the Plant Ontology were present in some of the CMs, as were
anatomical parts of the plant. This helped us to see more clearly how to better
describe germplasm within the context of an information system that was tightly coupled to a
Laboratory Information Management System (LIMS).
At the time of writing this chapter, our approach was also being used by the
International Center for Tropical Agriculture (CIAT) as part of their methodology for
building their LIMS, paying particular attention to the identification of the orthogonal
ontologies needed by that system. An important feature of narratives is the use of more
than one elemental vocabulary to describe complex terms. The result is the creation of
a relationship between the combinatorial vocabulary and each of the vocabularies
used in its construction.
The rationale behind this approach is that a plant’s anatomical vocabulary should
completely describe the anatomy of the plant, and a developmental process vocabulary should
completely describe all of the general biological processes involved in development.
Therefore, we should be able to combine the concepts from the two vocabularies to describe
all of the processes involved in the development of all of the anatomical parts of the plant.
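The combination described above can be sketched as a cross-product of the two elemental vocabularies, with each composite term keeping a relationship back to its components. This is a hedged illustration of the idea only; the function and field names are ours, not part of any of the ontologies discussed, and the terms are invented.

```python
from itertools import product

def combine_vocabularies(anatomy_terms, process_terms):
    """Build a combinatorial vocabulary of composite terms.

    Each composite term (e.g. "leaf elongation") records the elemental
    terms it was built from, creating the relationship between the
    combinatorial vocabulary and its source vocabularies.
    """
    return [
        {"term": f"{part} {process}", "anatomy": part, "process": process}
        for part, process in product(anatomy_terms, process_terms)
    ]
```

With an anatomical vocabulary of n terms and a developmental vocabulary of m terms, this yields n × m composite terms, each traceable back to its two sources, which is exactly the coverage claim made above.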
These structures are represented in the CMs as well as in the baseline ontologies gathered.
Initially, these models contained a myriad of relationships; as the process evolved and the
hierarchy became better structured, the whole/part-of relationships between structures and
substructures were better defined, and in this the narratives proved to be very useful.
From this experience we could also identify the gap between two ontological models
built with two different software packages, Protégé and KAON. As KAON serves as the
"platform" on top of which Text2Onto runs, the models it produces are not readable by
Protégé. It was not possible for us to exploit all the functionality of KAON, due mostly to
incompatibility problems between Protégé and KAON.
5.1.7 References
2. Neches R, Fikes R, Finin T, Gruber T, Patil R, Senator T, Swartout WR: Enabling
Technology for Knowledge Sharing. AI Magazine 1991, 12(3):36-56.
5. Cañas AJ, Hill G, Carff R, Suri N, Lott J, Eskridge T, Gómez G, Arroyo M, Carvajal R:
CmapTools: A Knowledge Modeling and Sharing Environment. In: Proceedings of
the First International Conference on Concept Mapping: 2004; Pamplona, Spain; 2004.
6. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of
conceptual maps for two ontology developments: nutrigenomics, and a
management system for genealogies. In: 8th Intl Protégé Conference Protégé: 2005;
Madrid, Spain; 2005: 59-62.
7. Noy NF, Hafner CD: The state of the art in ontology design - A survey and
comparative review. AI Magazine 1997, 18(3):53-74.
8. Lopez MF, Perez AG: Overview and Analysis of Methodologies for Building
Ontologies. Knowledge Engineering Review 2002, 17(2):129-156.
9. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for
Ontologies. In.: The Agricultural Ontology Service (UN FAO); 2003.
10. Uschold M, King M: Towards a Methodology for Building Ontologies. In: Workshop on
Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995;
Cambridge, UK; 1995.
12. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for
Distributed, Loosely-controlled and evolving engineering of ontologies. In:
European conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
15. Noy NF, Fergerson RW, Musen MA: The knowledge model of Protege-2000:
Combining interoperability and flexibility. In: 12th International Conference on
Knowledge Engineering and Knowledge Management (EKAW'2000): 2000; Juan-les-
Pins, France; 2000.
18. Uschold M: Building Ontologies: Toward a Unified Methodology. In: 16th Annual
Conf of British Computer Society Specialist Group on Expert Systems: 1996;
Cambridge, UK; 1996.
19. Cimiano P, Völker J: Text2Onto - A Framework for Ontology Learning and Data-driven
Change Discovery. In: International Conference on Applications of Natural
Language to Information Systems (NLDB): 2005; Alicante, Spain: Springer; 2005: 227-
238.
20. Volz R, Oberle D, Staab S, Motik B: KAON SERVER - A Semantic Web Management
System. In: Alternate Track Proceedings of the Twelfth International World Wide Web
Conference, WWW2003: May 2003; Budapest, Hungary: ACM; 2003: 20-24.
23. Card sorting to Discover the Users' Model of the Information Space.
[http://www.useit.com/papers/sun/cardsort.html]
Abstract. The current science landscape is rapidly evolving and is increasingly driven by computational
tasks. The deluge of data unleashed by omics technologies, such as transcriptomics, proteomics and
metabolomics, requires systematic approaches for reporting and storing the data and the experimental
processes in a standard format, relating the biological information and the technology involved. Ontology-based
knowledge representations have proved successful in providing the semantics for standardised
annotation, integration and exchange of data. The framework proposed by the MGED RSBI working group
would provide semantics for upper-level elements relevant to the representation and interpretation of
omics-based investigations.
5.2.1 Introduction
When the first microarray experiments were published, it became apparent that the lack
of robust quality-control procedures and of adequate biological metadata impeded the
exchange and reporting of array-based transcriptomics experiments. The MIAME checklist
(Brazma et al. [1]) was written in response to this lack by a group of biologists, computer
scientists, and data analysts; it aims to define the minimum information required to
unambiguously interpret, and potentially reproduce and verify, a microarray experiment. This
group then went on to make its composition official and founded the Microarray Gene
Expression Data (MGED) Society. The response from the scientific community has been
extremely positive, and currently most of the major scientific journals and funding agencies
require publications describing microarray experiments to comply with the MIAME standard.
The adoption of this standard by public and community databases, Laboratory Information
Management Systems (LIMS) and several microarray informatics tools has greatly improved
the interpretation of microarray experiments described in a structured manner.
become apparent that analogous minimal descriptors should be identified for these
applications. There have been several extensions to MIAME. MIAME/Tox is an array-based
toxicogenomics standard developed by the EBI in collaboration with the ILSI Health and
Environmental Sciences Institute (HESI), the National Institute of Environmental Health
Sciences (NIEHS) National Center for Toxicogenomics, and the FDA National Center for
Toxicological Research (NCTR). MIAME/Env has been developed by the Natural
Environment Research Council (NERC) Data Centre to fulfill the diverse needs of those
working in functional genomics of ecosystems, invertebrates and vertebrates that are not
covered by the model-organism community. MIAME/Tox and MIAME/Env have initiated
several discussions in academic settings as well as in the industrial and regulatory arenas
(OECD Toxicogenomics Guidelines [3]).
However, it has become evident that as other omics technologies are used in
combination with microarrays, these MIAME-based checklists will soon be insufficient to
serve the scope of experimenters' needs. The toxicogenomics, nutrigenomics and
environmental genomics communities soon recognised the need for a strategy that capitalises
on synergy, forming the Reporting Structure for Biological Investigations (RSBI [4]) working
group under the MGED [5] umbrella. The RSBI working group feels that it is very important
to agree on a single source of basic conceptual information relating to the process of reporting
complex biological investigations employing omics technologies. This unified approach to
describing the upper-level elements relevant to the representation and interpretation of
these investigations should encompass any specific application. The possibility of enabling
‘semantic integration’ of complex data, facilitating data mining and information retrieval, is
the rationale for developing an ontologically grounded conceptual framework. Ultimately, the
effort by the RSBI working group aims to constitute the foundation of a standard reporting
structure for publications and for submission to public repositories and knowledge bases. The
need for information on which to base the evaluation and interpretation of results
underlies the objective of presenting sufficient detail to readers and/or reviewers.
5.2.2 Methodology
Our scenario involves communities that are geographically distributed, and for the domain-analysis
and knowledge-acquisition phases the group has used different independent
technologies that were not always integrated into the Protégé suite (Noy et al. [9]). From these
experiences, members of RSBI are also working with others on a collaborative knowledge-acquisition
tool for the development of ontologies, integrated in Protégé (Garcia et al. [10]).
As the process of domain analysis and knowledge acquisition evolves, the taxonomy takes the
shape of an ontology. During this step, the ontologist worked primarily with only a very few of
the domain experts; the others were involved in weekly meetings. In this phase the ontologist
sought to provide the means by which the domain experts he or she was working with could
express their knowledge. Some deficiencies in the available technology were identified, and
for the most part these were overcome by our use of concept maps (CMs).
Our approach is that of an upper ontology that provides high-level semantics for
the representation of omics-based investigations and serves as a conceptual scaffold onto
which other ontologies may be hooked. An example of the latter could be an ontology
specific to the microarray technology, such as the MGED Ontology, and/or specific to an
application, such as toxicology. In order to describe the interaction of different technologies
during the course of a scientific endeavour, we considered that there was a need for a high-level
container in which to place the information relevant to the biology as well as that relevant to
the different assays. Our high-level concept is an Investigation, a self-contained unit
of scientific enquiry, containing information for Study(-ies) and Assay(s). We consider a Study to
be the set of steps and descriptions performed on the Subject(s). In cases where the Subject
is a piece of tissue and no steps have been performed but just an Assay has been carried out,
the Study contains only the descriptors of the Subject (e.g. provenance, treatments,
storage, etc.). We consider an Assay to be the container for the test(s) performed and the data
produced, for computational purposes. There are different AssayType(s), and the different omics
technologies fall within this category. A view of the RSBI upper ontology is shown in Figure
6, and the ontology is available from the RSBI webpage (http://www.mged.org/
Workgroups/rsbi/rsbi.html).
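The containment just described can be sketched in code. The class and field names below are illustrative assumptions, not the RSBI ontology itself; they only mirror the Investigation → Study/Assay structure described above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Assay:
    """Container for the test(s) performed and the data produced."""
    assay_type: str                                  # e.g. "transcriptomics"
    data_files: List[str] = field(default_factory=list)

@dataclass
class Study:
    """Steps and descriptors applied to the Subject(s).

    `steps` may be empty when only subject descriptors (provenance,
    treatments, storage, ...) exist, as in the tissue-only case above.
    """
    subject: str
    steps: List[str] = field(default_factory=list)

@dataclass
class Investigation:
    """Self-contained unit of scientific enquiry, holding studies and assays."""
    title: str
    studies: List[Study] = field(default_factory=list)
    assays: List[Assay] = field(default_factory=list)
```

The point of the sketch is that the omics technologies appear only as values of the assay type, so new technologies slot in without changing the upper-level structure.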
Since our framework will allow the use of different ontologies, the definition of
whole/part relationships should be consistent across those different ontologies. However,
there are currently no standards or guidance for defining whole/part-of relationships, adding
another layer of complexity when developing an upper-level ontology.
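One concrete reason this consistency matters is that whole/part reasoning is usually treated as transitive, so part-of links contributed by different ontologies compose with one another. A minimal sketch of that composition follows; the relation data and function name are invented for illustration.

```python
def transitive_part_of(pairs):
    """Compute the transitive closure of a set of (part, whole) links.

    Links may come from different ontologies; once merged, a part of a
    part is itself a part, so inconsistently defined whole/part links
    surface quickly in the closed relation.
    """
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for p, w in list(closure):
            for p2, w2 in list(closure):
                # Chain (p part-of w) with (w part-of w2)
                if w == p2 and (p, w2) not in closure:
                    closure.add((p, w2))
                    changed = True
    return closure
```

For example, merging ("cell", "leaf") from one ontology with ("leaf", "plant") from another entails ("cell", "plant"); if the two ontologies define part-of differently, such entailed links are where the disagreement shows up.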
Upper-level, or top-level, ontologies describe very general concepts, such as space, time and
event, which are independent of any particular problem domain. Such unified top-level
ontologies aim at serving large communities [11, 12]. For instance, the Standard Upper
Ontology (SUO) [13] provides definitions for general-purpose terms, and it acts as a
foundation for more specific domain ontologies. General-purpose ontologies such as the
RSBI provide a more specific semantic framework within which it is, in principle, possible to
integrate other biological ontologies. As the RSBI aims to facilitate the annotation of
biological investigations, the generality of its concepts is constrained to the three
specific domains of knowledge for which it was created (toxicogenomics, nutrigenomics and
environmental genomics). The principles recommended by Niles and Pease [13] for
developing upper-level ontologies were considered during the development of the RSBI
ontology; however, as the RSBI ontology aims to facilitate the description of biological
investigations, some practical considerations were also taken into account.
Ultimately, the RSBI upper-level ontology should be able to answer a few questions
and position almost anything approximately in the right place, even where no specific
ontology yet exists for that spot. The relationship between Study and Assay defines an Investigation;
different things participate in different processes, and by the same token some things retain
their form over time. Study and Assay contain information about those processes. It is
particularly important to maintain minimal commitment when developing upper-level ontologies:
only those concepts providing a common scaffold should be considered.
5.2.5 References
1. Brazma, A., Hingamp, P., Quackenbush et al. 2001. Minimum information about a
microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 29
(4): 365-71.
2. Quackenbush J. 2004. Data standards for 'omic' science. Nat Biotechnol. 22:613-614.
8. SMRS: http://www.smrsgroup.org
9. Noy, N.F., Crubezy, M., Fergerson, R.W. et al. 2003. Protege-2000: an open-source
ontology-development and knowledge-acquisition environment, AMIA Annu Symp
Proc, 953.
10. Garcia Castro, A., Sansone S.A., Rocca-Serra, P., Taylor, C., Ragan, M.A. 2005. The
use of conceptual maps for two ontology developments: nutrigenomics, and a
management system for genealogies. Proceedings of the 8th International Protege
Conference. (Accepted for Publication)
11. Sure, Y. 2003. Methodology, Tools & Case Studies for Ontology based Knowledge
Management. Karlsruhe: Universität Fridericiana zu Karlsruhe.
13. Niles, I., Pease, A. 2001. Towards a standard upper ontology. Proceedings of the
international conference on Formal Ontology in Information Systems-Volume 2001, 2-9,
2001. ACM Press New York, NY, USA.
As has previously been pointed out in this research, ontologies in the biosciences are to be
used by software applications that ultimately seek to facilitate the integration of information in
the molecular biosciences. Previous chapters of this research have explored how to develop
ontologies within a highly decentralised environment such as the bio-community. However,
as the involvement of the community does not end at the time the ontology is deployed,
it is also important to consider those scenarios in which ontologies are used by software
that supports some of the activities carried out by biologists. This chapter introduces the
reader to some of the problems encountered when integrating information in the biosciences. Not only
technical issues concerning the integration of heterogeneous data sources and the
corresponding semantic implications, but also the integration of analytical results, are
presented in this chapter. Within the broad range of strategies for the integration of data and
information, platforms and developments are here distinguished.
The main contribution of this chapter is to present a view of the state of the art in data
and information integration in molecular biology that is general and comprehensive, yet based
on specific examples. The perspective this review gives the reader is critical, and offers
insights and categorisations not previously considered by other authors. This chapter
concludes by identifying some open issues for data and information integration in
the molecular-biosciences domain, and argues that with a wider application of ontologies and
semantic web technologies some of these issues can be overcome.
This chapter contains an original critical assessment made entirely by the author, who
conceived its structure, organisation and scope. The manuscripts that led to the published
paper were written by Alex Garcia; the analysis and classification presented, as well as the
critical insights, were worked out by Alex Garcia.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro conceived the project and wrote the manuscripts for this paper. Yi-
Ping Phoebe Chen provided useful discussion. Mark Ragan supervised the project, provided
useful discussion and assisted Alex Garcia Castro in the preparation of the final manuscript.
Abstract. Integrating information in the molecular biosciences involves more than the cross-referencing of
sequences or structures. Experimental protocols, results of computational analyses, annotations and links to
relevant literature form integral parts of this information, and impart meaning to sequence or structure. In this
review, we examine some existing approaches to integrating information in the molecular biosciences. We
consider not only technical issues concerning the integration of heterogeneous data sources and the
corresponding semantic implications, but also the integration of analytical results. Within the broad range of
strategies for integration of data and information, we distinguish between platforms and developments. We discuss
two current platforms and six current developments, and identify what we believe to be their strengths and
limitations. We identify key unsolved problems in integrating information in the molecular biosciences, and
discuss possible strategies for addressing them including semantic integration using ontologies, XML as a data
model, and Graphical User Interfaces (GUIs) as integrative environments.
scope, organisation [3, 4] and functionality. Both the databases themselves, and individual
entries within them, may be incomplete.
Sequences and their descriptors are meant to inform us about organisms and life
processes. In this sense, a sequence is not merely an isolated entity, but, on the contrary, is
part of a highly interconnected network. Initially at least, MBDBs were intended merely as
data repositories. Later, some databases were developed to facilitate the retrieval of connected
information that goes beyond sequences – for example, metabolic pathway databases
(MPDBs), in which sequences are nodes of networks (subgraphs) linked by edges that
represent biochemical reactions. In such a context, the meaning of a sequence is given by the
way it relates to other sequences and to data beyond sequences and reactions. Only by
understanding this context can a user formulate an intelligent query.
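As a minimal illustration of this graph view (with invented metabolite and enzyme names, not drawn from any actual MPDB), a pathway can be held as an adjacency structure in which metabolites are nodes and enzyme-catalysed reactions are edges, and then queried for the context of a node:

```python
# Sketch: a metabolic pathway as a directed graph. Metabolites are nodes;
# each edge is a reaction labelled with its catalysing enzyme.
# All identifiers below are illustrative.
from collections import defaultdict

pathway = defaultdict(list)

def add_reaction(substrate, product, enzyme):
    """Record a reaction edge labelled with the catalysing enzyme."""
    pathway[substrate].append((product, enzyme))

add_reaction("glucose", "glucose-6-phosphate", "hexokinase")
add_reaction("glucose-6-phosphate", "fructose-6-phosphate", "phosphoglucose isomerase")

def reachable(start):
    """All metabolites reachable downstream of `start` -- one notion of
    the 'context' that gives a sequence its meaning in a pathway."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for product, _enzyme in pathway.get(node, []):
            if product not in seen:
                seen.add(product)
                stack.append(product)
    return seen
```

Here the meaning-bearing relations are explicit edges, so a query such as `reachable("glucose")` asks about context rather than about an isolated entry.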
We believe that the integration of both data and information in molecular bioscience
should embody more-holistic views: how do molecules, pathways and networks interact to
build functional cells, tissues or organisms? How can health, development or diseases be
modeled so all relevant information is accessible in meaningful context? Developing
computational solutions that allow biologists to query multiple data sources in meaningful
ways is a fundamental challenge in modern bioinformatics [5], one with important
implications for the success of biomedicine, agriculture, environmental biotechnology and
most other areas of bioscience.
This review is organised as follows. In the first section we present an overview of issues
and technologies relevant to integration of information in the molecular biosciences, and
distinguish platforms from developments. Next, we focus on data integration (Section 6.2),
describing some existing platforms and developments and considering the extent to which
they can be considered to integrate information. In the third section (6.3) we address deeper
issues of semantic integration of molecular-biological information, highlighting the role of
ontologies. Section 6.4 presents XML not only as a format for data exchanges, but also as a
technology for introducing and managing semantic content. In Section 6.5 we describe how
Graphical User Interfaces (GUIs) can provide integrative frameworks for data and analysis. In
Section 6.6 we further detail metabolic pathway databases as a special case of integration –
one in which data become more valuable in the context of other data, and in which a formal
description of information relatedness and flow helps shape the description of biological
processes. Section 6.7 summarises and concludes our analysis and presents what we consider
to be key unsolved problems.
data availability, control of data quality and standardisation of formats) also have domain-
specific features.
The scope and coordination of public databases such as those organised for the
research community by the U.S. National Center for Biotechnology Information (NCBI), the
European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ) are a
characteristic feature of molecular bioscience. However, extensive data are also held privately,
some proprietarily within commercial enterprises (e.g. pharmaceutical or agrichemical
companies) and others available on a subscription basis. Public data can be integrated into
private databases, but (although the technical issues are presumably no different than for
integration among public databases) because of access policies the reverse does not happen.
Initiatives such as the International Crop Information System (http://www.icis.cgiar.org) and
the Global Biodiversity Information Facility (http://www.gbif.org) will likewise come up
against boundaries between public and private information.
The major public sequence databases have instituted measures to ensure that the data
they provide to the research community are of high quality. These include standard formats
and software for data submission, automated processing of submissions, and the availability
of human assistance. Nonetheless, as open, comprehensive repositories, these databases
necessarily contain instances of incomplete or poor-quality data, missing fields and legacy
formats that can only create problems for data and information integration. These problems
should be largely absent from databases that are expert-curated (e.g. Mouse Genome Database
[http://www.informatics.jax.org/], UniProt/SwissProt [6]) or derived from a curated
database (e.g. ProDom [http://protein.toulouse.inra.fr/prodom/current/html/home.php]),
based around one or a few large projects (e.g. Ensembl [7], FlyBase
[http://flybase.bio.indiana.edu/]), or otherwise narrowly focused (e.g. Protein Kinase
Resource [http://pkr.sdsc.edu/html/index.shtml], Snake Neurotoxin Database
6.1.3 Standardisation
International efforts through the 1990s, in part through the Committee on Data for
Science and Technology (CODATA; http://www.codata.org) of the International Council for
Science, led to highly coherent data formats for molecular sequence data at EBI, NCBI and
DDBJ. This is despite the evolution, during that decade, of sequencing technology from
manual slab gels with autoradiographic detection to automated slab gels with fluorescent
detection, to today’s capillary-based technologies. However, data standardisation remains a
major issue in fields where it may be less obvious what experimental conditions are relevant
to interpretation and where alternative technologies may be intrinsically less compatible. The
MIAME/MGED (minimum information about a microarray experiment/Microarray Gene
Expression Data Society; http://www.mged.org) and MAGE [8] (microarray and gene
expression) initiatives among the expression microarray community, and the Proteomics
Standards Initiative (PSI; http://psidev.sourceforge.net), exemplify the efforts being
undertaken to establish data standards for newer types of molecular data.
Technological issues cut across the integration of information in diverse ways, many of
which are discussed, in greater or lesser detail, in the sections that follow. Two others bear
further mention here: language and access.
6.1.4 Language
6.1.5 Access
“Grid” initiatives refer to a vision of the future in which data, resources and services
will be seamlessly accessible through the Internet, in the same sense that the electricity
delivered to our homes and offices is generated and transmitted via diverse power plants,
transmission lines, substations and the like that electric-power users rarely have to think
about. “The grid” will actually be multiple grids (data grid, computation grid, services grid)
and will be useful only to the extent that relevant components communicate with each other.
Many existing grid initiatives are coordinated through the Global Grid Forum
(http://www.ggf.org) in which the bioinformatics community is actively represented. It is
envisioned that the computational grid will be implemented using a standard “toolkit” of
reference software (http://www.globus.org). Other initiatives focus on how data can be most
efficiently shared across a data grid. The Life Science Identifier (http://lsid.sourceforge.net),
for example, has been proposed as a uniform resource name (URN) for any biologically
relevant resource. It is being offered as a formal standard that would be served on top of, not
as a replacement for, existing formats or schemata.
6.2.1 Platforms
We recognise two broad strategies for integration of data and information in molecular
biology: (i) provision of a general platform (framework, backbone) and (ii) addressing a specific
problem via a specific development. A platform for data integration offers a technological
framework within which it is possible to develop point solutions; usually platforms provide
non-proprietary languages, data models and data exchange/exporting systems, and are highly
customizable. By contrast, a development may not provide technology applicable to other
problems even in the same domain. A platform is meant to be a deeper layer over which
several heterogeneous solutions may share a common background, and in this way some
degree of interoperability can be achieved. Kleisli and DiscoveryLink® [15] are examples of
platforms over which heterogeneous data can be integrated. Platforms provide a data model
and query optimisation procedures, and offer a general query language as well as flexible
data exchange mechanisms.
queries to the appropriate data source and then combines the answers from the various data
sources to produce an answer to the global query. Kleisli is one example of this type of
integrative strategy.
Kleisli and DiscoveryLink® can be considered platforms for data integration. Although
not fully integrated strategies, they address data integration under a broader perspective than
does an individual development.
Kleisli: Kleisli is a mediator system, encompassing a nested relational data model, a high-
level query language, and a query optimiser. It provides a high-level query language, sSQL,
which can be used to express complicated transformations across multiple data sources. The
sSQL module can be replaced with other high-level query languages. It is possible to add new
data types if an appropriate wrapper is available or if one can be added. Kleisli does not have
its own Data Base Management System (DBMS); instead, it has functionality to convert many
types of database systems into its nested relational data model. Kleisli does not require
schemata; its relational data model and its data exchange format can be translated by external
databases [14]. Kleisli is thus a backbone that is not limited in application to the
biological domain.
several advantages over other proposed solutions, basically because it relies on a de facto data
model for most commercial DBMSs. For example, users are able to carry out post-query
manipulations, and can use explicit SQL statements to define their queries; query optimisation
is also a feature of DiscoveryLink®. However, adding new data sources or analysis tools into
the system is not a straightforward process. DiscoveryLink® wrappers are written in C++,
which is not necessarily the most suitable programming language for wrappers [14]. Extensive
knowledge of SQL and of relational database technology is needed. DiscoveryLink® is built
over IBM's DB2® technology, which is a commercial product. Although new data
sources can often readily be incorporated within the DiscoveryLink® federation, it may be
much more difficult to integrate DiscoveryLink® per se with non-DiscoveryLink®
environments.
Wrappers, such as those used in both Kleisli and DiscoveryLink®, mediate between
query system and specific data source (or type of data source). Systems that wrap
multiple heterogeneous data sources thus translate their data into a common integrated data
representation [19]; wrappers provide a kind of lingua franca through which two different
databases communicate and produce a result. Retrieval components in wrappers map queries
onto common gateway interface (CGI) calls. Changes in data sources make wrappers
difficult to maintain.
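A minimal sketch of this wrapper pattern, with invented record formats and field names (not the actual Kleisli or DiscoveryLink® interfaces):

```python
# Illustrative wrapper layer: each wrapper translates a source-specific
# record into one shared representation, so a mediator can merge answers
# from several heterogeneous sources. Formats and fields are invented.

def wrap_genbank(record):
    # hypothetical flat-file style record: "ACCESSION|DESCRIPTION"
    acc, desc = record.split("|", 1)
    return {"id": acc, "description": desc, "source": "genbank"}

def wrap_swissprot(record):
    # hypothetical dict-style record with different field names
    return {"id": record["AC"], "description": record["DE"], "source": "swissprot"}

def federated_query(term, sources):
    """Send one query to each wrapped source and merge the answers
    into the common representation."""
    results = []
    for fetch, wrap in sources:
        results.extend(wrap(r) for r in fetch(term))
    return results
```

The point of the sketch is that the mediator sees only the common dictionaries; when a source changes its format, only its wrapper must be rewritten, which is exactly the maintenance burden noted above.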
6.2.2 Developments
There have been many developments, some of which offer a framework over which
the integrating of analytical tools is possible, whereas others were designed to provide GUI
capabilities only for very specific algorithms. Commercial packages such as VectorNTI®
(http://www.invitrogen.com), Lasergene® (http://www.dnastar.com) and SeqWeb®
(http://www.accelrys.com/products/gcg/seqweb.html) basically offer additional
functionality (e.g. access to databases of plasmid sequences, tools for immediate visualisation
of 3-dimensional structure, etc.) and facilities for direct manipulation of specialised hardware
devices.
SRS is an information indexing and retrieval system designed for libraries with a flat-file
format such as the EMBL [20] Nucleotide Sequence Database, the Swiss-Prot [6] protein
sequence databank or the PROSITE [21] library of protein subsequence consensus patterns.
SRS was intended to be a retrieval tool that allows the user to access as many different
biological data sources as possible via a common GUI. It is relatively easy to integrate new
data into SRS. SRS wraps data sources via a specialised, built-in wrapping programming
language called ICARUS. It can be argued, however, that parsers should be written in a
general purpose language, rather than a language being built around a parser [22].
As a relatively popular attempt at a unified GUI to heterogeneous data sources, SRS may
provide important information about Human-Computer Interaction (HCI) in the
bioinformatics field.
Neither Kleisli nor DiscoveryLink® is deeply comparable with SRS. This is because
SRS focuses more on an integration of molecular data, whereas DiscoveryLink® and Kleisli
were, from their beginnings, products designed to allow the biomedical community to access
data in a wide variety of file formats.
6.2.2.2 GeneCards®
6.2.2.3 Entrez
Entrez [23, 24] is an integral part of the NCBI portal system, and as such is an
integrative solution within the NCBI problem framework. It provides a single portal for
access to most existing genomes, along with some analysis tools and database querying
capacities for genomic, protein and bibliographic information about specific genes. Graphical
displays for chromosomes, contig maps and integrated genetic and physical maps are also
available. Entrez also links each data element to neighbors of the same type [23].
6.2.2.4 Ensembl
Ensembl is not by itself a data integration effort, but rather an automatic annotation
tool. The task of annotation involves integrating information from different data sources, at
different levels and using different methods in concert. Ensembl provides general
visualisation tools and the ability to work with different data sources. Ensembl relies on open-
source projects such as BioJava and BioPerl. Raw data are loaded into the MySQL®-based
(http://www.mysql.com) internal schema of Ensembl and processed through its annotation
pipeline; results can be visualised using, for example, the Apollo [25] genome browser. Query
capacities in Ensembl are limited, but more flexible capacities may be achieved by addition of
Perl or Python scripts. Ensembl does not inherently provide alternative data models (or the
flexibility to supply alternative models), data exchange formats or substantial data exchange
capacities. Thus, Ensembl is not a data integration platform, but rather a solution that
addresses a specific problem (genome browsing and automatic annotation).
6.2.2.5 BioMOBY
6.2.2.6 myGrid
6.2.2.7 Others
Other projects also provide different functional capabilities and integrate information
from heterogeneous sources for a particular purpose or for a specific community. FlyBase
and WormBase (http://www.wormbase.org/) are examples of such integrative efforts that
aim to provide ‘all’ the available information related to a particular organism.
Syntactic integration basically deals with heterogeneity of form - the structure but not
the meaning. Semantic integration, on the other hand, fundamentally deals with the meaning
of information and how it relates within a specific field. It addresses the problem of
identifying semantically related objects in different databases, then resolving the schematic
(schema-related) differences among them [28]. A simple scenario where semantic implications
matter is one in which a protein may be identified in a particular databank with a certain
accession number, but may have a different identifier, or even a different annotation, in
another database (i.e. may appear non-synonymous). In the case of bioinformatics, semantic
integration could (and we argue, should) be seen to encompass not only the taxonomy of
terms (controlled vocabulary) or the resolution of semantic disagreements (e.g. between
synonyms in different databases), but also the discovery of services (databases and/or analysis
algorithms).
Semantic integration of MBDBs thus focuses, at some level, on how a database entry
can be related to other information sources in a meaningful way. Our previous descriptions of
database integration (in Section 6.2) addressed the problem of querying, and extracting data
from, multiple heterogeneous data sources. If this can be done via a single query, the data
sources involved are considered interoperable. We have not so far considered how a
particular biological entity might be meaningfully related to others; only location and
accessibility have been at issue. In the same way, the complexity of a query would be largely a
function of how many different databases must be queried, and, from these, how many
internal subqueries must be formed and exchanged for the desired information to be
extracted. If a deeper layer that embeds semantic awareness were added, it is probable that
query capacities would be improved. Not only could different data sources be queried, but
(more importantly) interoperability would then arise naturally as a consequence of semantic
awareness. At the same time, it should be possible to automatically identify and map the
various entries that constitute the knowledge relationship, empowering the user to visualise a
more descriptive landscape.
A richer example of why semantics matters may be seen with the word ‘gene’, a term
that has different meanings in different databases. In GenBank® [29], a gene is a “region of
biological interest with a name that carries a genetic trait or phenotype” and includes
non-coding regions such as introns, promoters and enhancers. In the Genome
Database [30], a gene is just a “DNA fragment that can be transcribed and translated into a
protein”. The RIKEN Mouse Full-length cDNA Encyclopedia [31], which focuses on full-
length transcripts, refers to the transcriptional unit instead of the gene. Queries involving
databases that, among themselves, present such semantic issues necessarily have limited
operability. Ontology can provide a guiding framework within which the user can restrict the
query to the context in which it makes sense, and can navigate intelligently across terms.
Semantic approaches thus depend heavily on ontology.
What is ontology? Notions of what ontology is, and how it should be implemented,
differ but include the following: (i) a system of categories accounting for a particular vision of
the world [32]; (ii) specification of conceptualisations [33]; (iii) a concise and unambiguous
description of principal relevant entities with their potential, valid relations to each other [34];
and (iv) a means of capturing knowledge about a domain, such that it can be used by both
humans and computers [35]. In the molecular biosciences, an ontology should capture, in
axioms, the relations among concepts. These axioms might then be used to extract implicit
knowledge such as the transitive closure of relations (if an enzyme is a type of protein and a
protein a type of polypeptide, then an enzyme is a type of polypeptide) [36].
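The transitive-closure inference just described can be sketched in a few lines; the is-a assertions are the illustrative ones from the text, not an excerpt from any real ontology:

```python
# Sketch of inference over is-a axioms: from asserted relations, derive
# the implied (transitively closed) ones. Terms are the text's examples.
is_a = {
    "enzyme": "protein",
    "protein": "polypeptide",
}

def ancestors(term):
    """All broader terms implied by following is-a links upward."""
    result = []
    while term in is_a:
        term = is_a[term]
        result.append(term)
    return result

print(ancestors("enzyme"))  # ['protein', 'polypeptide']
```

Although only "enzyme is-a protein" and "protein is-a polypeptide" are asserted, the query recovers the implicit axiom "enzyme is-a polypeptide", which is the kind of hidden knowledge a controlled vocabulary alone cannot yield.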
Ontology may also provide a framework for describing living systems in terms of
information. For example, metabolic pathways describe many different chains of reactions
that relate different biological entities. These complex networks reflect a deep layer of
concepts that describe the system and, if represented appropriately, could support
visualisation, querying, and implementation of further analyses.
Thus, we see that an ontology is not simply a controlled vocabulary, nor merely a
dictionary of terms. A controlled vocabulary per se describes neither relations among entities nor
relations among concepts, and consequently cannot support inference processes. Database
schemata describe categories, and provide an organisational model of a system, but do not
necessarily represent relations among entities. Database schemata can be derived from
ontologies, but the reverse step is not so straightforward. An ontology might better be
considered as a type of knowledge base in which concepts and relations are stored, and which
makes inference capacities available.
Can the use of ontology improve query capacities? We believe it can, but much
optimisation of query logic will be required if full benefits are to be won. An ontological
veneer over existing databases will achieve little. With the current state of MBDBs, ontology
might be helpful primarily as a flexible guidance system, supporting the user in building
queries by relating concepts. To avoid the philosophically difficult question of what
constitutes a related concept, we prefer to think not in terms of related concepts in general,
but rather about restricting relations to a defined context.
Molecular biology has an emerging de facto standard ontology. The Gene Ontology™
[40] (GO) consortium, established in 1998, provides a structured, precisely defined, common
controlled vocabulary for describing the roles of genes and gene products in eukaryotic cells.
GO embodies three different views of these roles: as functions, as processes and as cellular components.
TAMBIS therefore provides a level of interaction between the user and the external
sources that removes the need for the user to be aware of the schema. It is based on a three-
layer mediator/wrapper architecture [46] and uses Kleisli as a back end. TAMBIS is
intended to improve query capacities in MBDBs via its supplied conceptual model, a
knowledge-driven user interface and a flexible representation of biological knowledge that
supports inference processes over the relations among concepts. The representation is
implemented using GRAIL Description Logic (http://www.openclinical.org/
dld_galenGrail.html). With TAMBIS, the user is guided over an ontologically informed map
of concepts related to a given query. This is done by exposing the user to the terminological
model, and by providing a guided query formulation system implemented in a graphical tool.
environment, XML and XML-based tools may mature into an alternative data integration
platform comparable with Kleisli and DiscoveryLink®.
Today, researchers in the biosciences rely mostly on string matching and link analysis
for database searching; computer functionality (operations relevant to different data types) in
MBDBs is very limited. However, if computers were able to understand the semantic
implications not only of data but also of queries, then more functionality and accuracy might
be added to database searching operations. XML provides a general framework for such
tasks. Although not a descriptive language, XML provides a representational framework for
semantic values to be introduced in the description of content.
XML has been extensively used in bioinformatics as an exchange format, but complete
XML integrative solutions have not yet been developed. We believe that XML should be
understood as a powerful data model, since XML allows flexible definition of sets of tags as
well as the hierarchical nesting of tags. BioXML came about as an effort to develop standard
biological XML schemata and DTDs. It was intended to become a central repository, part of
the Open Bioinformatics Foundation (http://www.open-bio.org). However, the BioXML
project appears to be inactive. BioMOBY inherited some of the desired features of BioXML;
within MOBY, lightweight XML documents comprise a set of descriptor objects that are
passed from MOBY Central to the clients.
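To illustrate how nested tags carry a hierarchical data model (the element names here are invented, and do not follow the BioXML or BioMOBY schemata):

```python
# XML as a data model rather than a mere exchange format: a sequence
# entry with nested feature annotations. Element and attribute names
# are illustrative only.
import xml.etree.ElementTree as ET

doc = """
<entry accession="X00001">
  <description>hypothetical kinase</description>
  <features>
    <feature type="domain" start="10" end="120"/>
    <feature type="active_site" start="55" end="55"/>
  </features>
</entry>
"""

root = ET.fromstring(doc)
# The nesting itself expresses the entry->features->feature hierarchy,
# with no external schema needed to recover the structure.
features = [(f.get("type"), int(f.get("start")), int(f.get("end")))
            for f in root.find("features")]
```

The hierarchy is self-describing: a consumer can walk the tree and recover both the entry's attributes and its nested features without knowing the producer's internal schema.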
Querying databases is only one part of the research process in molecular biosciences.
Once relevant data have been retrieved, analysis must be undertaken. Very often the process
is not clear in advance, and the user must iteratively query, retrieve, analyse and compare until
the desired endpoint is attained. These steps are most easily carried out in an integrated
environment within which the functionality of MBDBs is brought together with appropriate
analysis tools, allowing the user to specify and carry out computational experiments, record
intermediate and final data, and annotate experiments.
Analysis tools in molecular bioscience are likewise heterogeneous, and may typically
include remote webtools, locally installed executables, and scripts in, for example, Perl,
Python and/or SQL. Especially among, but even within, application fields, interoperability
tends to be limited or nonexistent. Instead, the output of one program must usually be
reformatted for input into the next; this piping is mostly done using purpose-written Perl
parsers, of which there is no central library or listing.
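The kind of glue parser described above can be sketched as follows; the tabular "hit list" and the FASTA-style output are simplified stand-ins rather than any real tool's formats:

```python
# Sketch of a typical piping parser: reformat one tool's tabular output
# into FASTA-style input for the next program in the analysis chain.
# Both formats are simplified stand-ins for illustration.

def hits_to_fasta(hit_lines):
    """Convert 'id<TAB>sequence' lines into FASTA records."""
    records = []
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comment and blank lines
        seq_id, seq = line.split("\t")
        records.append(f">{seq_id}\n{seq}")
    return "\n".join(records)
```

Each pair of tools in a chain typically needs one such converter; because no central library of them exists, they are rewritten ad hoc, which is the maintenance problem the text identifies.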
The GCG® [51] and EMBOSS [52] suites are two well-known sequence analysis
software packages that group many methods commonly used in molecular biology. They are
both command-line driven, which requires users to have at least a basic familiarity with
UNIX® command-line syntax. Therefore, in recent years some groups have developed GUI
systems [53, 54] that suppress the syntactic complexity of UNIX® commands, thereby
promoting the coordinated use of the programs in these packages. A list of some of the
existing GUIs for EMBOSS [52] and GCG® is given in Table 2.
The graph of all possible paths may be seen at the Pise website at http://www-
alt.pasteur.fr/~letondal/Pise/gensoft-map.html. Pise provides two different ways by which a
macro can be customised: either a form (supplied within Pise) can be filled out with
parameters supplied by the user or the user can save an entire procedure as a macro. The user
is presented with a ‘Register the whole procedure’ button that builds the scripts, and allows
users to repeat their actions [54]. G-PIPE (http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html)
is a development built on top of Pise. It provides an automatic
workflow generator using Pise as a GUI framework. The workflow descriptions are stored as
XML documents that can later be loaded and run across different G-PIPE servers. This
automation is possible thanks to a set of Perl modules that check the syntactic consistency
of the different files in order to evaluate them as possible input files for the different steps in
a given workflow.
W2H [53] is one of the oldest and most powerful GUIs in bioinformatics. In a sense, it
has evolved from a GUI into an environment, as it provides not only GUI capacities but also
some functionality for file handling. W2H was developed making extensive use of the
metadata files that describe the applications available in the former GCG® package. In W2H,
these files are used to generate on-the-fly HTML documents with forms for entering values
of command-line parameters [53]. W2H embodies a classical tool-oriented approach;
combinations of tools were not initially supported [55]. W2H now provides some problem-
oriented tools (a task framework), allowing users to define data workflows. For this, W2H
again makes use of metadata, as well as descriptions of workflow and dataflow. Workflow in
this context refers to the sequence of tasks (methods, programs) that are part of a user’s
analysis chain. Dataflow is basically the parsing of one output into the subsequent task.
Using the existing W2H, the dataflow description is used by the web interface to
dynamically create HTML input forms for the task data. With the given metadata, the web
interface can collect input from the user, determine if all minimum requirements are fulfilled,
and provide the data to the task system. The name given to this task framework is W3H [55].
W3H reduces the programming skill required; however, the definition of tasks is
not an automatic process. Sharing of tasks with other GUI environments is not
possible under W3H. This feature was considered from the beginning in the design of
Pise/G-PIPE, where the whole task or workflow can be exported as an XML file (which may
later be loaded into the same system as Pise/G-PIPE) or as a Perl script that can be
customised by the user. Both W3H and Pise make extensive use of Bioperl. W3H is
immersed within the HUSAR (Heidelberg Unix Sequence Analysis Resources) environment,
and has pre-built parsers that enable connectivity with different datasets available on the local
HUSAR installation of SRS, GeneCards® and other databases/facilities.
PATH (Phylogenetic Analysis Task in HUSAR) was developed within the framework
provided by W3H [56]. Dependencies among applications, descriptions of program flow, and
merging of individual outputs (or parts thereof) into a common output report are provided to
the system. cDNA2Genome [57] is another task developed under the W3H framework. It
allows high-throughput mapping and characterisation of cDNAs. cDNA2Genome can be
divided into three main categories: database homology searches, gene finders, and sequence-feature predictors (e.g. start/stop codons, open reading frames). cDNA2Genome is available
at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar/.
SeWeR [58] (SEquence analysis using WEb Resources) is a GUI for a different scenario
from EMBOSS or GCG®. It was designed to make extensive use of JavaScript and dynamic HTML (DHTML), and thus provides a very lightweight solution. It presents a
uniform interface to most common services in bioinformatics, including polymerase chain
reaction-related analyses, sequence alignment, database searching, protein structure prediction
and sequence assembly. It provides for several levels of customisation to the interface and is
highly amenable for batch processing and automation.
Jemboss [59] is yet another GUI for use with EMBOSS. In this case, a web launch
tool (Java Web Start) must be installed on the client’s computer. The user is presented with an
intuitive window that gives access to his or her assigned area on the server on which
EMBOSS is running. Jemboss uses SOAP (Simple Object Access Protocol;
http://www.w3.org/TR/soap/), reducing security risks by allowing the user to access the
EMBOSS application as a client. The display area gives the user complete control over the
environment; analyses are run on the defined EMBOSS server. Via a job manager, it is
possible to run and monitor batch processes [59].
All of the GUIs described above provide graphical access to a specific set of analysis
tools; they do not provide integration with retrieval systems such as GeneCards® or SRS. The coded functionality available in W3H in this respect is very limited; sequences
are identified and users can get intermediate access to databank entries. However, this limited
integration is not enough, as even simple operations such as automatic presentation of
analysis options over a set of previously identified sequences are not available. Such
integration (context menus embedded within the GUI) would define an environment within
which query capacities and analytical tools coexist in a single, unified working area. The
selection of one of these GUIs over another depends entirely on the problem at hand. All
provide in essence the same features, and source code is available for each. The definition of
analysis pipelines remains limited among these solutions.
A pathway can be defined as a linked set of biochemical reactions, such that the product of one reaction is a reactant of, or an enzyme that catalyses, a subsequent reaction [60]. A metabolic pathway database (MPDB) is a bioinformatics development that describes biochemical pathways and their reactions, components, associated experimental conditions and related relationships. To the extent that it is sufficiently comprehensive, an MPDB can be seen as a description of an
organism at the metabolic level. In the same way, other aspects of an organism can be
described in gene regulation databases, protein-protein interaction databases, signal
transduction databases and so forth.
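The "linked set of reactions" property in the pathway definition above can be expressed as a small predicate. This is an illustrative sketch: the dictionary layout and the compound/enzyme names are assumptions, not a published schema:

```python
# Sketch of the pathway definition: a pathway is a linked set of reactions,
# such that the product of one reaction is a reactant of (or an enzyme
# catalysing) some subsequent reaction.

def is_linked(reactions):
    """Check that every non-terminal reaction's products feed a later one."""
    for i, rxn in enumerate(reactions[:-1]):
        downstream = reactions[i + 1:]
        feeds = any(
            p in later["reactants"] or p == later.get("enzyme")
            for p in rxn["products"]
            for later in downstream
        )
        if not feeds:
            return False
    return True

# Two well-known steps at the start of glycolysis (illustrative only)
glycolysis_fragment = [
    {"reactants": ["glucose", "ATP"],
     "products": ["glucose-6-phosphate"],
     "enzyme": "hexokinase"},
    {"reactants": ["glucose-6-phosphate"],
     "products": ["fructose-6-phosphate"],
     "enzyme": "phosphoglucose isomerase"},
]
linked = is_linked(glycolysis_fragment)
```

A curation pipeline could use such a predicate as a consistency check before a pathway is admitted to the database.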
The techniques used for building metabolic pathways range from manual analysis to
automated computational methods. The resulting databases differ in the types of information
they contain and in the software tools they make available for queries, visualisation and
analysis [42]. Quality is assured by following combinations of manual and automatic curation
processes. This is the case for BRENDA® [61] (BRaunschweig ENzyme DAtabase); this
manually curated database contains information for all those molecules that have been
assigned an Enzyme Commission (EC) number. By querying the database it is possible to
retrieve information about an enzyme for all the organisms in which it is present.
BRENDA® is rich in literature references; these are parsed for relevant key phrases directly
from PubMed and are then associated with the corresponding enzymes.
Another example of an MPDB is the KEGG [62] (Kyoto Encyclopedia of Genes and
Genomes) pathway database. This database aims to link genomic information with higher
order functional information by computerisation of current knowledge on cellular processes
and by standardising gene annotations [63]. Within KEGG, genomic information is stored in
the GENES database (a collection of gene catalogues), while higher order functional
information is stored in the PATHWAY database. The WIT [64] (What is There) database is
another example of an MPDB. WIT has been designed to extract functional content from
genome sequences and organise it into a coherent system. It supports comparative analysis of sequenced genomes and generates metabolic reconstructions based on chromosomal sequences and metabolic modules from the Enzymes and Metabolic Pathways Database (EMP)/Metabolic Pathways Database (MPW) family of databases. WIT provides a set of
tools for the characterisation of gene structures and functions. After genes have been assigned
initial functions, they are then ‘attached’ to pathways by choosing templates from the
metabolic database (MPW) that best incorporate all observed functions. When this basic
model has been created, a (human) curator evaluates this model against biochemical data and
phenotypes known from the literature. Textual and graphical representations are fully linked
with underlying data.
The Pathway Tools software [65, 66] constitutes an environment for creating a
metabolic pathway database for a given organism or genome. Pathway Tools has three
components: PathoLogic, which facilitates the creation of new pathway/genome databases
from GenBank® entries; Pathway/Genome Navigator, for query, visualisation and analysis;
and Pathway/Genome Editor, which provides interactive editing capabilities. Some of the
computationally derived pathway/genome databases today are AgroCyc (Agrobacterium
tumefaciens; http://biocyc.org/AGRO/organism-summary?object=AGRO), MpneuCyc
(Mycoplasma pneumoniae; http://biocyc.org/MPNEU/organism-summary?object=MPNEU),
HumanCyc (Homo sapiens; http://humancyc.org/); a more detailed list can be found at
http://www.biocyc.org.
Ideally, MPDBs should integrate information about the genome and the metabolic
networks of a particular organism. The metabolic network can be described in terms of four
bio-object types: the pathways that compose the network, the reactions that compose the
pathway, metabolic compounds, and the enzymes that catalyse the reactions [42]. Literature
citations are typically provided for most of the information units. However, very often this
information is incomplete; extraction of information from GenBank® and PubMed in order
to assist systematic annotation of gene functions is not a trivial process. Figure 2 exemplifies
the relationships among these four biological data types.
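A minimal sketch of these four bio-object types and their relationships can be written as dataclasses. The field names are illustrative, not any published MPDB schema:

```python
# The four bio-object types of a metabolic network: pathways, reactions,
# compounds and enzymes. Field names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Compound:
    name: str

@dataclass
class Enzyme:
    name: str
    ec_number: str                                  # Enzyme Commission number
    citations: List[str] = field(default_factory=list)  # literature support

@dataclass
class Reaction:
    substrates: List[Compound]
    products: List[Compound]
    enzyme: Enzyme

@dataclass
class Pathway:
    name: str
    reactions: List[Reaction]

    def compounds(self):
        """All distinct compounds participating in the pathway's reactions."""
        seen = []
        for r in self.reactions:
            for c in r.substrates + r.products:
                if c.name not in [s.name for s in seen]:
                    seen.append(c)
        return seen

glucose = Compound("glucose")
g6p = Compound("glucose-6-phosphate")
hexokinase = Enzyme("hexokinase", "2.7.1.1")
pathway = Pathway("toy", [Reaction([glucose], [g6p], hexokinase)])
```

Note how literature citations hang off the individual information units (here, the enzyme), matching the description above.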
Semantic and syntactic issues are equally important, but defining the boundaries
between them is not a simple task. Projects such as myGrid and BioMOBY require well
Finally, analytical tools have not yet been fully integrated with indexing, data and
workflow management systems. As discussed in Section 5, GUIs are available for a wide
variety of implementations of diverse analytical methods, but we are far from having access to
a unified, platform-independent analytical environment. One bioinformatics company, LION
bioscience, has taken some steps in this direction with its SRS version 6.0. We continue to
believe, however, that real information integration in molecular bioscience requires a unified
analytical and data-handling environment for users. A relevant analogy may be, for example,
the diversity of operations available for a particular data type within the Windows® operating
system environment; all of the possible operations are presented to the user via a contextual
menu displayed at the time the user right-clicks on the icon of interest. In the same way,
operations over biological data types should be identified in advance, presented to the user
and then executed; currently all these operations are done either by coding them or by
copy/paste procedures. Simplification of coding operations should be enabled within GUI
frameworks (e.g. direct manipulation interfaces). We think that concepts from projects such as
Haystack [67] should be more carefully considered in bioinformatics.
Automation of data handling and knowledge extraction, along with tools that support
the interpretation of extracted knowledge, are likewise not yet available to bioinformaticians.
Such a set of tools should support the selection and planning of ‘wet’ experiments in the
laboratory. Computer models are increasingly used to complement laboratory experiments,
and tools that extract and integrate knowledge would be powerful adjuncts to these models at
all stages of their implementation and use. Biological knowledge is spread not only over many
databases but also (and in a more complicated way) across thousands of papers, patents and
technical reports. It is in these latter documents that facts are described in the context in
which the underlying biological entities have been studied; real integration, therefore, should
consider conceptual queries over fully integrated views of relevant data sources.
Ideally, future biological information systems (BISs) will require neither frequent (and
difficult) data and software updates nor local data integration (warehouses); they should allow
semantically based data integration through ontologies (improving data integration) and
should support monitoring of the evolution of information sources. Future BISs should also
allow each researcher to ask questions within the context of his or her own problem domain
(and between domains), unconstrained by local or external data repositories. They should
proactively inform the user about new, relevant information, based on individual needs, and
support collaboration by matching researchers who have relevant expertise and/or interests.
Achieving this level of integration – as data in the molecular biosciences continue inexorably
to increase and diversify – will continue to provide challenges on many levels.
6.8 ACKNOWLEDGMENTS
We thank Dr. Limsoon Wong and the reviewers for extremely helpful suggestions.
Financial support for ARC Discovery Project DP0342987 and the ARC Centre in
Bioinformatics CE0348221 is acknowledged.
The authors have no conflicts of interest that are directly relevant to the content of this
review.
6.9 REFERENCES
1. Sirotkin, K., NCBI: Integrated Data for Molecular Biology Research. 1999, Norwell,
MA: Kluwer Academic Publishers.
2. Karp, P.D. and S. Paley, Integrated access to metabolic and genomic data. Journal
of Computational Biology, 1996. 3(1): p. 191-212.
3. Keen, G., et al., The Genome Sequence DataBase (GSDB): Meeting the challenge
of genomic sequencing. Nucleic Acids Research, 1996. 24(1): p. 13-16.
4. Benson, D.A., et al., GenBank. Nucleic Acids Research, 1997. 25(1): p. 1-6.
7. Hubbard, T., et al., The Ensembl genome database project. Nucleic Acids Research,
2002. 30(1): p. 38-41.
9. Lord, P., et al. PRECIS: An automated pipeline for producing concise reports about
proteins. in IEEE International Symposium on Bioinformatics and Biomedical
Engineering. 2001. Washington: IEEE Press.
11. Etzold, T. and P. Argos, Transforming a Set of Biological Flat File Libraries to a Fast
Access Network. Computer Applications in the Biosciences, 1993. 9(1): p. 59-64.
12. Zdobnov, E.M., et al., The EBI SRS server - recent developments. Bioinformatics,
2002. 18(2): p. 368-373.
13. Davidson, S., et al., BioKleisli: A digital library for biomedical researchers.
International Journal of Digital Libraries, 1997. 1: p. 36-53.
14. Wong, L., Kleisli, a Functional Query System. Journal of Functional Programming,
2000. 10(1): p. 19-56.
15. Haas, L., et al., DiscoveryLink: A system for integrated access to life sciences data
sources. IBM Systems Journal, 2001. 40: p. 489-511.
16. Davidson, S.B., et al., K2/Kleisli and GUS: Experiments in integrated access to
genomic data sources. IBM Systems Journal, 2001. 40(2): p. 512-531.
17. Rebhan, M., et al., GeneCards: a novel functional genomics compendium with
automated data mining and query reformulation support. Bioinformatics, 1998.
14(8): p. 656-664.
18. Stein, L.D., Integrating biological databases. Nature Reviews Genetics, 2003. 4(5): p.
337-345.
19. Lacroix, Z., Biological Data Integration: Wrapping Data and Tools. IEEE Transactions
on Information Technology in Biomedicine, 2002. 6(2): p. 123-128.
20. Stoesser, G., et al., The EMBL Nucleotide Sequence Database: major new
developments. Nucleic Acids Research, 2003. 31(1): p. 17-22.
21. Sigrist, C., et al., PROSITE: A documented database using patterns and profiles as
motif descriptors. Briefings in Bioinformatics, 2002. 3: p. 265-274.
22. Chenna, R., SIR: a simple indexing and retrieval system for biological flat file
databases. Bioinformatics, 2001. 17(8): p. 756-758.
23. Macauley, J., H.J. Wang, and N. Goodman, A model system for studying the
integration of molecular biology databases. Bioinformatics, 1998. 14(7): p. 575-582.
24. Tatusova, T.A., I. Karsch-Mizrachi, and J.A. Ostell, Complete genomes in WWW
Entrez: data representation and analysis. Bioinformatics, 1999. 15(7-8): p. 536-543.
25. Lewis, S.E., et al., Apollo: a sequence annotation editor. Genome Biology, 2002.
3(12): p. 1-14.
26. Wilkinson, M. and M. Links, BioMOBY: An Open Source Biological Web Services
Proposal. Briefings in Bioinformatics, 2002. 3: p. 331-341.
27. Stevens, R., J. Robinson, and C. Goble, myGrid: personalised bioinformatics on the
information grid. Bioinformatics, 2003. 19: p. 302-304.
29. Benson, D.A., et al., GenBank. Nucleic Acids Research, 1999. 27(1): p. 12-17.
30. Attwood, T.K. and C.J. Miller, Which craft is best in bioinformatics? Computers &
Chemistry, 2001. 25(4): p. 329-339.
31. Okazaki, Y., et al., Analysis of the mouse transcriptome based on functional
annotation of 60,770 full-length cDNAs. Nature, 2002. 420(6915): p. 563-573.
32. Guarino, N. Some Ontological Principles for Designing Upper Level Lexical
Resources. in the First International Conference on Language Resources and
Evaluation. 1998. Granada, Spain.
37. Friedman, N.N. and C.D. Hafner, The State of the Art in Ontology Design: A Survey
and Comparative Review. AI Magazine, 1997. 18: p. 53-74.
38. Erdmann, M. and R. Studer. Ontologies as Conceptual Models for XML Documents.
in 12th Workshop on Knowledge Acquisition, Modeling and Management (KAW-99).
1999. Banff, Canada.
40. Ashburner, M., et al., Gene Ontology: tool for the unification of biology. Nature
Genetics, 2000. 25(1): p. 25-29.
41. Yeh, I., et al., Knowledge acquisition, consistency checking and concurrency
control for Gene Ontology (GO). Bioinformatics, 2003. 19(2): p. 241-248.
42. Karp, P.D., EcoCyc: The Resource and the Lessons Learned. Bioinformatics
Databases and Systems, 1999: p. 47-62.
45. Wiederhold, G., Integration of Knowledge and Data Representation. IEEE Computer,
1992. 21: p. 38-50.
46. Paton, N.W., et al. Query Processing in the TAMBIS Bioinformatics Source
Integration System. in 11th Int. Conf. on Scientific and Statistical Database
Management (SSDBM). 1999: IEEE Press.
47. Klein TE, Chang JT, Cho MK, et al. Integrating genotype and phenotype
information: an overview of the PharmGKB project. Pharmacogenomics J 2001;
1:167-70
48. Rubin DL, Farhad S, Oliver DE, et al. Representing genetic sequence data for
pharmacogenomics: an evolutionary approach using ontological and relational
models. Bioinformatics 2002; 18: 207-15
49. Wong, L., Technologies for Integrating Biological Data. Briefings in Bioinformatics,
2002. 3(4): p. 389-404.
50. Bry, F. and P. Kröger, A Computational Biology Database Digest: Data, Data
Analysis, and Data Management. International Journal of Distributed and Parallel
Databases, 2003. 13: p. 7-42.
52. Rice, P., I. Longden, and A. Bleasby, EMBOSS: The European molecular biology
open software suite. Trends in Genetics, 2000. 16(6): p. 276-277.
53. Senger, M., et al., W2H: WWW interface to the GCG sequence analysis package.
Bioinformatics, 1998. 14(5): p. 452-457.
54. Letondal, C., A Web interface generator for molecular biology programs in Unix.
Bioinformatics, 2001. 17(1): p. 73-82.
55. Ernst, P., K.H. Glatting, and S. Suhai, A task framework for the web interface W2H.
Bioinformatics, 2003. 19(2): p. 278-282.
56. Del Val, C., et al., PATH: a task for the inference of phylogenies. Bioinformatics,
2002. 18(4): p. 646-647.
57. Del Val, C., K.H. Glatting, and S. Suhai, cDNA2Genome: A tool for mapping and
annotating cDNAs. BMC Bioinformatics, 2003. 4: p. 39.
58. Malay, K.B., SeWeR: a customizable and integrated dynamic HTML interface to
bioinformatics services. Bioinformatics, 2001. 17: p. 577-578.
59. Carver, T.J. and L.J. Mullan, Website update: A new graphical user interface to
EMBOSS. Comparative and Functional Genomics, 2002. 3(1): p. 75-78.
60. Karp, P.D., Pathway databases: A case study in computational symbolic theories.
Science, 2001. 293(5537): p. 2040-2044.
61. Schomburg, I., A. Chang, and D. Schomburg, BRENDA, enzyme data and metabolic
information. Nucleic Acids Research, 2002. 30(1): p. 47-49.
62. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids
Research, 1999. 27(1): p. 29-34.
63. Kanehisa, M. and S. Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes.
Nucleic Acids Research, 2000. 28(1): p. 27-30.
64. Overbeek, R., et al., WIT: integrated system for high-throughput genome sequence
analysis and metabolic reconstruction. Nucleic Acids Research, 2000. 28(1): p. 123-
125.
65. Karp, P.D., et al., The MetaCyc database. Nucleic Acids Research, 2002. 30(1): p. 59-
61.
66. Karp, P.D., S. Paley, and P. Romero, The Pathway Tools Software. Bioinformatics,
2002. 18: p. 225-232.
67. Quan, D., D. Huynh, and D.R. Karger. Haystack: A Platform for Authoring End User
Semantic Web Applications. in 2nd International Semantic Web Conference. 2003.
Sanibel Island, Florida: Springer-Verlag, Heidelberg.
68. EcoCyc. E. coli K-12 pathway: valine biosynthesis [online]. Available from URL:
http://biocyc.org/ECOLI/new-image?type=PATHWAY&object=VALSYN-PWY [Accessed
2005 Sep 30]
In the previous chapter it was argued that wider application of ontologies and Semantic Web technologies in bioinformatics would make it possible to overcome some of the issues encountered when integrating information. Workflows are identified as a fundamental component when integrating information in molecular biosciences, as researchers need to interleave information access and algorithm execution in a problem-specific workflow. Within this problem-specific workflow there are syntactic issues as well as semantic ones. Allowing the concrete execution of the workflow is a syntactic problem; describing this in silico experiment is, however, a semantic one, for which an ontology with characteristics similar to those presented in Chapter 5, Section 5.2, is required. Having well-defined syntax and semantics not only eases some technical aspects, but also allows for better reusability of the workflow in a larger context: a community of users.
Alex Garcia was responsible for the conceptualisation, initial investigation and
finalisation of the research described in this chapter. Alex Garcia conceived the workflow
generator, graphical user interface, and semantic structures. He also participated in the
development of the tool, and wrote the corresponding papers.
AUTHORS' CONTRIBUTIONS
Alex Garcia Castro was responsible for design and conceptualisation, took part in
implementation, and wrote a first draft of the manuscript. Samuel Thoraval was the main
developer of G-PIPE. Leyla Jael Garcia Castro assisted with server issues and FCA. Mark A.
Ragan supervised the project and participated in writing the manuscript.
Garcia Castro A, Thoraval S, Garcia LJ, Chen Y-PP, Ragan MA: Bioinformatics
workflows: G-PIPE as an implementation. In: Network Tools and Applications in Biology
(NETTAB), 5-7 October 2005, Naples, Italy, pages 61-64
Abstract. Computational methods for problem solving need to interleave information access and algorithm
execution in a problem-specific workflow. The structures of these workflows are defined by a scaffold of
syntactic, semantic and algebraic objects capable of representing them. Despite the proliferation of GUIs
(Graphical User Interfaces) in bioinformatics, only some of them provide workflow capabilities; surprisingly, no
meta-analysis of workflow operators and components in bioinformatics has been reported. We present a set
of syntactic components and algebraic operators capable of representing analytical workflows in
bioinformatics. Iteration, recursion, the use of conditional statements, and management of suspend/resume
tasks have traditionally been implemented on an ad hoc basis and hard-coded; by having these operators
properly defined it is possible to use and parameterise them as generic re-usable components. To illustrate
how these operations can be orchestrated, we present G-PIPE, a prototype graphic pipeline generator for
PISE that allows the definition of a pipeline, parameterisation of its component methods, and storage of
metadata in XML formats. This implementation goes beyond the macro capacities currently in PISE. As the
entire analysis protocol is defined in XML, a complete bioinformatics experiment (linked sets of methods,
parameters and results) can be reproduced or shared among users. Availability: http://if-
web1.imb.uq.edu.au/Pise/5.a/gpipe.html (interactive), ftp://ftp.pasteur.fr/pub/GenSoft/unix/misc/Pise/
(download). From our meta-analysis we have identified syntactic structures and algebraic operators common
to many workflows in bioinformatics. The workflow components and algebraic operators can be assimilated
into re-usable software components. G-PIPE, a prototype implementation of this framework, provides a GUI
builder to facilitate the generation of workflows and integration of heterogeneous analytical tools.
7.1 BACKGROUND
Workflow management systems (WFMS) are systems that control the sequence of activities in a given process [1]. In molecular bioscience, these activities can be divided between those that address query formulation and those that focus more on analysis.
At this abstract level, WFMS could serve to control the execution of both query and analytical
procedures. All of these procedures involve the execution of activities, some of them manual,
some automatic. Dependency relationships among them can be complex, making the
synchronisation of their execution a difficult problem.
Systems such as W2H/W3H [2] and PISE [3] provide some tools that allow methods
to be combined. W3H is a task framework that allows the methods available under W2H [4]
to be integrated; however, those tasks have to be hard-coded. In the case of PISE, the user can either define a macro using Bioperl (http://www.bioperl.org), or use the interface provided and register the resulting macro. In either case, it is assumed that the user can program or script in Perl. Macros cannot be exchanged between PISE and W2H, although these two systems
provide GUIs for more or less the same set of methods (EMBOSS [5]). Indeed, macros
cannot be easily shared even among PISE users. Biopipe (http://www.biopipe.org), on the other hand, provides integration for some analytical tools via the Bioperl API (Application Programming Interface), using MySQL to store results as well as the workflow definition; in this way, users can store results in MySQL and monitor the execution of the pre-defined workflow.
G-PIPE provides a real capacity for users to define and share complete analytical
workflows (methods, parameters, and meta-information), substantially mitigating the syntactic
complexity that this process involves. Our approach addresses overall collaborative issues as
well as the physical integration of tools. Unlike TAVERNA, G-PIPE provides an
implementation that builds on a flexible syntactic structure and a set of algebraic operations
for analytical workflows. The definition of operators as part of the workflow description
allows a flexible set-up when executing it; operators also facilitate the reproducibility of the
workflow as they allow researchers to share experimental conditions in the form of
parameters.
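The value of an explicit workflow description for reproducibility can be sketched as a simple XML round-trip: methods, parameters and ordering are serialised, so another user (or the same system later) can reload and re-run them. The element names below are hypothetical; they do not reproduce G-PIPE's actual schema:

```python
# Sketch: serialising a workflow (methods + parameters + ordering) to XML
# and reloading it. The schema is an assumption for illustration only.
import xml.etree.ElementTree as ET

def workflow_to_xml(name, stages):
    """Serialise an ordered list of stages to an XML string."""
    root = ET.Element("workflow", name=name)
    for i, stage in enumerate(stages):
        s = ET.SubElement(root, "stage", order=str(i), method=stage["method"])
        for key, value in stage.get("params", {}).items():
            ET.SubElement(s, "param", name=key, value=str(value))
    return ET.tostring(root, encoding="unicode")

def workflow_from_xml(text):
    """Reload the stages, preserving order and parameters."""
    root = ET.fromstring(text)
    return [
        {"method": s.get("method"),
         "params": {p.get("name"): p.get("value") for p in s.findall("param")}}
        for s in sorted(root, key=lambda s: int(s.get("order")))
    ]

stages = [
    {"method": "clustalw", "params": {"gapopen": 10}},
    {"method": "protdist", "params": {"model": "JTT"}},
]
xml_text = workflow_to_xml("phylogeny", stages)
reloaded = workflow_from_xml(xml_text)
```

Because the experimental conditions travel with the workflow as parameters, sharing the XML file is sufficient to reproduce the analysis set-up.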
Although G-PIPE was not conceived as an environment for testing usability aspects in the design of bioinformatics tools, empirical observations suggested that the arrangement of functional objects in the interface (e.g. interfaces to algorithms and the workflow representation) was simpler and easier for researchers to use than that provided by TAVERNA. An important issue raised by these observations was the high level of complexity involved in parameterisation, as researchers usually run algorithms with default settings. Unlike G-PIPE, TAVERNA assumes users have an understanding of web services; part of the necessary steps when defining a workflow in TAVERNA involves selecting the algorithm as a web service. Another interesting aspect we could observe was the
importance of having a tool that involves fewer steps in the definition and execution of the workflow. TAVERNA requires too many details and involves too many steps when defining and executing a workflow; some of the required information is technical, and thus more related to the operational domain than to the domain of knowledge; this places unnecessary stress on the researcher. Surprisingly, there are no usability methods for bioinformatics, nor are usability studies performed throughout the software development process in bioinformatics; the application of usability engineering could potentially benefit the development of bioinformatics tools by bringing them closer to the needs of end-users. The workflow-generation facility provided by G-PIPE aims to hide the complexity of the workflow by allowing researchers to concentrate on the minimal necessary procedural details (e.g. input files, parameters, where to pipe).
7.2 RESULTS
The bioinformatics community has developed a great diversity of GUIs for EMBOSS and GCG, but a meta-analysis of the
processes within which these analytical implementations are immersed is not yet fully
available. Some of the existing GUIs have been developed to make use of grammatical
descriptions of the analytical methods, but there exists no standard meta-data framework for
GUI and workflow representation in bioinformatics.
Our workflow conceptualisation (Figure 1) closely follows those of Lei and Singh [8]
and Stevens et al. [9]. We have adapted these meta-models to processes in bioinformatics
analysis. We consider an input/output data object as a collection of input/output data. For us
a transformer is the atomic work item in a workflow. In analytical workflows, it is an
implementation of an analytical algorithm (analytical method). A pipe component is the entity
that contains the required input-output relation (e.g. information about the previous and
subsequent tasks); it assures syntactic coherence. Our workflow representation has tasks,
stages, and experimental conditions (parameters). In our view, protocols are sets of
information that describe an experiment. A protocol contains workflows, annotations, and
information about the raw data; therefore we understand a workflow to be a group of stages
with interdependencies. It is a process bound to a particular resource that fulfils the analytical
necessities.
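This meta-model can be sketched in code. The class and field names are illustrative, and the pipe component is reduced here to a linear output-to-input hand-off:

```python
# Minimal sketch of the workflow meta-model described above: a transformer
# is the atomic work item (an analytical method implementation); stages
# carry experimental conditions (parameters); a protocol groups the
# workflow with its annotations. Names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Transformer:
    """Atomic work item: an implementation of an analytical method."""
    name: str
    run: Callable[..., Any]

@dataclass
class Stage:
    transformer: Transformer
    parameters: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Workflow:
    """A group of stages with (here, linear) interdependencies."""
    stages: List[Stage]

    def execute(self, data):
        # Implicit pipe component: each stage's output becomes the
        # next stage's input, ensuring syntactic coherence.
        for stage in self.stages:
            data = stage.transformer.run(data, **stage.parameters)
        return data

@dataclass
class Protocol:
    """Describes an experiment: workflow plus annotations and raw-data info."""
    workflow: Workflow
    annotations: Dict[str, str] = field(default_factory=dict)

trim = Transformer("trim", lambda s, length=4: s[:length])
upper = Transformer("upper", lambda s: s.upper())
wf = Workflow([Stage(trim, {"length": 3}), Stage(upper)])
result = wf.execute("acgtacgt")
```

Keeping parameters on the stage rather than inside the transformer is what lets the same method implementation be reused under different experimental conditions.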
Iteration is the operator that enables processes in which one transformer is applied
over a multiple set of inputs. A special case for this operator occurs when it is applied over a
blank transformer; this case results in replicates of the input collection. Consider an analytical
method, or a workflow, in which the same input is to be used several times; the first step
would be to use as many replicates of the input collection as needed. The recursion operation
takes place when one transformer is applied with parameters defined not as a single value, but
as a range or as a set of values. The conditional operator governs the conditional execution of transformers. This operation can be attached to a function evaluated over the
application of a recursion, or of an iteration; if the stated condition is true, then the workflow
executes a certain path. Conditional statements may also be applied to cases where an
argument is evaluated on the input; the result then affects not a path, but the parameter space of
the next stage. The suspension/resumption operation stands for the capacity of the workflow
to stop and later re-capture its jobs.
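The iteration, replicate, recursion, and conditional operators described above can be given a minimal functional sketch. The function names and signatures below are illustrative assumptions; the thesis defines the operators conceptually and does not prescribe this implementation.

```python
def iterate(transformer, collections):
    """Iteration: apply one transformer over multiple input collections."""
    return [transformer(c) for c in collections]

def replicate(collection, n):
    """Iteration over a blank (identity) transformer yields n replicates
    of the input collection."""
    return [list(collection) for _ in range(n)]

def recur(transformer, collection, param_values):
    """Recursion: apply one transformer once per value in a parameter
    range or set, rather than with a single parameter value."""
    return [transformer(collection, p) for p in param_values]

def conditional(predicate, then_transformer, else_transformer, collection):
    """Conditional: choose the execution path based on a test over the input."""
    chosen = then_transformer if predicate(collection) else else_transformer
    return chosen(collection)
```

Suspension/resumption is omitted here because it concerns the engine's runtime state rather than the composition of transformers.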
Formal Concept Analysis (FCA) is a mathematical theory based on ordered sets and
complete lattices. Numerous investigations have shown the usefulness of concept lattices for
information retrieval combining query and navigation, learning and data mining, and visual
constructors and visual programming [10]. FCA helps one to define valid objects and to identify
behaviours for them. We are currently working on a complete FCA for biological data types
and operations (database and analytical). Here we define operators in terms of pre- and post-
conditions, as a step toward eventual logical formalisation. We focus on those components of
the discovery process not directly related to database operations; a good integration system
will "hide" the underlying heterogeneity, so that one can query using a simple language (which
views all data as if they were already in the same memory space). Selection of the query
language depends only on the data model: for the XML "data model", XML-QL, XQL and
other XML query languages are available; for the nested relational model there are nested
relational calculi and nested relational algebras; for the relational model, SQL and relational
algebras are available. For database operations, the issues that arise are lower-level
(e.g. expression of disk layout, latency cost, etc. in the context of query optimisation), and it is
not clear that any particular algebra offers a significant advantage.
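The pre-/post-condition style of operator definition mentioned above can be sketched with a simple contract wrapper. Both the `contract` decorator and the toy `translate` operator (with its deliberately tiny two-entry codon table) are hypothetical illustrations, not part of the FCA formalisation itself.

```python
def contract(pre, post):
    """Attach a pre-condition on the arguments and a post-condition on the
    result of an operator, as a step toward logical formalisation."""
    def wrap(fn):
        def inner(*args):
            assert pre(*args), f"pre-condition failed for {fn.__name__}"
            result = fn(*args)
            assert post(result), f"post-condition failed for {fn.__name__}"
            return result
        return inner
    return wrap

# Hypothetical operator: translation requires a nucleotide string whose
# length is a multiple of three, and must yield a string result.
@contract(pre=lambda s: len(s) % 3 == 0 and set(s) <= set("ACGTU"),
          post=lambda p: isinstance(p, str))
def translate(seq):
    codon_table = {"ATG": "M", "TGG": "W"}   # tiny illustrative table
    return "".join(codon_table.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq), 3))
```

Declaring the conditions separately from the body is what lets an integration system reason about operator compatibility without executing the operators themselves.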
Implementation of both of these workflows was a manual process. Using PISE as our
GUI generator simplified the inclusion of new analytical methods as needed, but database
calls had to be coded manually in both cases. Choreographing the execution of the workflows
was not simple, as neither has a real workflow engine. It proved easier to give users the ability
to manipulate parameters and data with PISE/G-PIPE, partly due to the wider range of
methods within BioPerl, and partly because algebraic operators were readily available as part
of PISE/G-PIPE. From this experience we have concluded that, given the immaturity of
currently available web service engines, it is still most practical to implement simple XML
workflows that allow users to manipulate parameters, use conditional operators, and carry out
read and write operations over databases. This balance
will presumably shift as web services mature in the bioinformatics applications domain.
We have developed G-PIPE, a flexible workflow generator for PISE. G-PIPE extends
the capabilities of PISE to allow the creation of customised, reusable and shareable analytical
workflows. So far we have implemented and tested G-PIPE only over the EMBOSS package,
although extension to other algorithmic implementations is possible wherever an XML file
describing the command-line interface exists.
rules drive the interaction between these entities (e.g. to ensure syntactic coherence between
heterogeneous file formats). G-PIPE also oversees the execution of the workflow, and makes it
possible to distribute different jobs over a grid of servers. G-PIPE addresses these
requirements using mostly Bioperl.
In G-PIPE, each analysis protocol (including any annotations, i.e. meta-data) is defined
within an XML file. A Java applet provides the user with an exploratory tool for browsing and
displaying methods and protocols. Synchronisation is maintained between client-side display
and server-side storage using Javascript. Server-side persistence is maintained through
serialised Perl objects that manage the workflow execution. G-PIPE supports independent
branched tasks in parallel, and reports errors and results into an HTML file. The user selects
the methods, sets parameters, defines the chaining of different methods, and selects the
server(s) on which these will be executed. G-PIPE creates an XML file and a Perl script, each
of which describes the experiment. The Perl file may later be used on a command-line basis,
and customised to address specific needs. The user can monitor the status of workflow
execution, and access intermediary results. A workflow built with G-PIPE can distribute its
analyses onto different, geographically dispersed G-PIPE/PISE servers.
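As a sketch of the kind of protocol file just described, the following generates a small XML document for a two-step workflow. The element and attribute names (`protocol`, `task`, `parameter`, `after`) are illustrative assumptions rather than the actual G-PIPE schema, and the sketch is in Python rather than the Perl that G-PIPE itself emits.

```python
import xml.etree.ElementTree as ET

def protocol_to_xml(name, tasks):
    """Serialise a protocol to XML. Each task is a (method, params, after)
    triple; 'after' gives the index of the preceding task, or None."""
    root = ET.Element("protocol", name=name)
    for i, (method, params, after) in enumerate(tasks):
        task = ET.SubElement(root, "task", id=str(i), method=method)
        if after is not None:
            task.set("after", str(after))      # chaining between methods
        for key, value in params.items():
            ET.SubElement(task, "parameter", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_doc = protocol_to_xml("phylogeny", [
    ("clustalw", {"output": "phylip"}, None),  # alignment first
    ("protdist", {}, 0),                       # distances, chained to task 0
])
```

An XML serialisation of this kind is what makes a protocol shareable and reloadable independently of the engine that executes it.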
The overall architecture of G-PIPE is shown in Figure 5. The user interacts with
HTML forms to define a protocol; the serialised Perl object describing the experiment is
translated into two user-accessible files: an XML file to share and reload protocols, and a Perl
script. A new lightweight PISE/Bioperl module, PiseWorkflow, lets workflows be built
and run atop PiseApplication instances. This module supports independent branched tasks in
parallel, and reports errors and results into an HTML file.
computational workflow, as illustrated in Figure 6. In this case, syntactic elements have been
given a clear semantics that allows developers to manipulate the constructs depending
on the needs of the application; there is thus a semantic scaffold from which the syntactic
aspects make sense. For G-PIPE it is enough to allow users to manipulate
parameters, transformers, pipe components, and data collections. However, when annotating complete
biological investigations, the design of SNPs, or any computational method involved, is just a
small part of a larger effort. In these cases the annotation is not only
about those identified constructs; the workflow is part of a whole, and the workflow
constructs therefore have to be annotated within the new context.
The RSBI ontology, in principle, allows this integration. The collection component, as
understood by Garcia et al. [1] and discussed in Section 7.2, can be assimilated to the
concept of Biomaterial; the transformer can be assimilated to the assay. Figure 7 illustrates
how, for a particular segment of the workflow presented in Figure 6, the RSBI ontology
together with the workflow constructs represents the use of TBLAST in a meaningful way. It
is important to notice that the larger the effort, the more complex the annotation. Biomaterial
is an elusive concept: for every assay, be it computational, in vivo, or in vitro, there is the
potential to fragment or even mutate (transform) the biomaterial; however, there is always
the need to trace the sample back to its original source, allowing researchers to inspect the
process at different levels of detail.
7.5 DISCUSSION
Semantic issues are particularly important with these kinds of workflows. An example
may be derived from Figure 3, where three different phylogenetic analysis workflows are
executed. These may be grouped as semantically equivalent, but they are syntactically different.
Selection should be left in the hands of the user, but the system should at least inform the user
of this similarity.
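One way such similarity could be surfaced is to canonicalise each tool in a workflow to the ontology class it instantiates and compare the resulting signatures. The mapping below is an illustrative assumption, not a real ontology, and the tool names are examples from the phylogenetics domain.

```python
# Map concrete tool names to ontology classes (mapping is illustrative).
METHOD_CLASS = {
    "protdist": "distance_estimation", "dnadist": "distance_estimation",
    "neighbor": "tree_building", "fitch": "tree_building",
    "protpars": "tree_building",
}

def semantic_signature(workflow):
    """Reduce a workflow (a list of tool names) to the sequence of
    ontology classes its tools instantiate."""
    return tuple(METHOD_CLASS.get(tool, tool) for tool in workflow)

def semantically_equivalent(wf_a, wf_b):
    """Flag two syntactically different workflows as candidates for
    semantic equivalence; the user still makes the final selection."""
    return semantic_signature(wf_a) == semantic_signature(wf_b)
```

The system would report the match to the user rather than silently substituting one workflow for another, in keeping with the point above.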
Despite agreement on the importance of semantic layers for integrative systems, such a
level of sophistication is far from being achieved. The distance between syntactic and semantic
verification is well illustrated by a traditional and well-studied product: Microsoft Word®.
With Word, syntactic verification can take place as the user composes text, but no semantic
corroboration is done: for two words like "purpose" and "propose", Word advises on syntactic
issues but gives no guidance concerning the context in which the words are used. Semantic
issues in bioinformatics workflows are more complex still, and it is not clear whether existing
technologies can effectively overcome these problems.
Transformers and grid components are intrinsically related because the services are de
facto linked to a grid component. It has been demonstrated that the use of ontologies
facilitates interoperability and the deployment of software agents [11]; correspondingly, we
envision agents supported by semantic technology forming the foundation of future
workflow systems in bioinformatics. The semantic layer should make the agents more aware
of the information they handle.
More and more GUIs are available in bioinformatics; this can be seen in the number of
GUIs for EMBOSS and GCG alone. Some of them incorporate a degree of workflow
capability, though typically a simple chaining of analytical methods rather than flexible
workflow operations. A unified metadata model for GUI generation is lacking in the
bioinformatics domain. Web services are relatively easy to implement, and are becoming
increasingly available as GUI systems are published as web services. However, web services
were initially developed to support processes for which the business logic is widely agreed
upon, well-defined and properly structured, and the extension of this paradigm to
bioinformatics may not be straightforward.
Automatic service discovery is an intrinsic feature of web services. The accuracy of the
discovery process necessarily depends on the ontology supporting this service. Systems such
as BioMoby and TAVERNA make extensive use of service discovery; however, due to the
difficulty in describing biological data types, service discovery is not yet accurate. It is not yet
clear whether languages such as OWL can be developed to describe relations between
biological concepts with the required accuracy. Integrating information is as much a syntactic
as a semantic problem, and in bioinformatics these boundaries are particularly ill defined.
Semantic and syntactic problems were also identified in the case workflow described
in Figure 3. There we saw that, to support the extraction of meaningful information and its
presentation to the user, formats should be ontology-based and machine-readable, e.g. in
XML. The lack of these features makes manipulation of the output a difficult task, usually
addressed by parsers specific to each individual case. For workflow development, human
readability can be just as important. Consider, for example, a ClustalW output in which valid
elements could be identified by the machine and presented to the user together with
contextual menus offering different options over the different data types. In this way the user
would be able to decide what to do next, where to split a workflow, and over which part of
the output to continue or extend the analysis. Inclusion of this functionality would allow the
workflow to become more concretely defined as it is used.
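To illustrate the kind of typed access described above, a minimal parser for Clustal-format output might look as follows. This is a simplified sketch: it assumes the standard name/sequence block layout, skips the conservation line, and ignores the many format variants a production parser must handle.

```python
def parse_clustal(text):
    """Parse Clustal-format alignment text into named sequences, so that
    downstream stages can operate on typed elements rather than raw text."""
    sequences = {}
    for line in text.splitlines()[1:]:           # skip the CLUSTAL header
        parts = line.split()
        # Keep name/chunk pairs; drop the conservation line ('*', ':', '.').
        if len(parts) == 2 and not set(parts[1]) <= set("*:. "):
            name, chunk = parts
            sequences[name] = sequences.get(name, "") + chunk
    return sequences

aln = """CLUSTAL W (1.83) multiple sequence alignment

seqA  MKT-AL
seqB  MKTSAL
      *** **
"""
```

Once the output is held as named, typed elements, a workflow system can offer contextual actions (extract a sequence, branch the analysis) instead of leaving the user with opaque text.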
Failure management is an area in which we can see a clear difference between the
business world and bioinformatics. In the former, processes rarely take longer than an hour
and are not so computationally intensive, whereas in bioinformatics, processes tend to be
computationally intensive and may take weeks or months to complete. How failures can be
managed to minimise losses will clearly differ between the two domains. Due to the
immaturity of both web services and workflows in bioinformatics, it is still in most cases
more practical to hard-code analytical processes. Improved failure management is one of the
domain-specific challenges that face the application of workflows in bioinformatics.
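One domain-appropriate failure-management strategy is checkpointing between stages, so that a weeks-long analysis resumes where it failed instead of restarting. The sketch below is an illustrative assumption about how such an engine might behave, not a description of an existing system; it assumes stage outputs are JSON-serialisable.

```python
import json
import os

def run_with_checkpoints(stages, inputs, state_file="workflow.state.json"):
    """Run a linear workflow, persisting intermediate results after each
    stage so that a failed long-running analysis can be resumed."""
    # Resume from a previous run if a checkpoint file exists.
    if os.path.exists(state_file):
        with open(state_file) as fh:
            state = json.load(fh)
    else:
        state = {"next_stage": 0, "data": inputs}
    for i in range(state["next_stage"], len(stages)):
        state["data"] = stages[i](state["data"])
        state["next_stage"] = i + 1
        with open(state_file, "w") as fh:   # checkpoint after every stage
            json.dump(state, fh)
    os.remove(state_file)                   # clean completion
    return state["data"]
```

A real engine would also checkpoint parameters, logs, and partial files, but even this minimal scheme changes the cost of a failure from months to the length of one stage.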
7.6 CONCLUSION
7.7 ACKNOWLEDGEMENTS
7.8 REFERENCES
2. Ernst P, Glatting K-H, Suhai S: A task framework for the web interface W2H.
Bioinformatics 2003, 19:278-282.
7. Shah SP, He DYM, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GXY, Xu T, Ouellette
BFF: Pegasys: software for executing and integrating analyses of biological
sequences. BMC Bioinformatics 2004, 5:40.
10. Ganter B, Kuznetsov SO: Formalizing hypotheses with concepts. In Conceptual
Structures: Logical, Linguistic, and Computational Issues. Proceedings of the 8th
International Conference on Conceptual Structures (ICCS 2000), Darmstadt, Germany,
14-18 August 2000. Lecture Notes in Computer Science 1867. Edited by Mineau G,
Ganter B. Springer-Verlag; 2000:342-356.
11. Sowa JF: Top-level ontological categories. International Journal of Human-Computer
Studies 1995, 43:669-685.
8.1 SUMMARY
This thesis has primarily dealt with how ontologies are actually built, and with how to gain
the involvement of the community. This research presents a detailed description of three
different but complementary ontology developments, along with the issues surrounding
the development of the corresponding ontologies. As ontologies are living systems, constantly
evolving, the maintenance and life cycle of an ontology have also been investigated in order to
arrive at a consistent methodology. It has been largely accepted by the biological community that
ontologies play a prominent role when integrating information; however, very few studies have
focused on the relationship between the syntactic structure and the semantic scaffold. This
thesis has also explored that relationship.
As the integration of information has different facets, the present work has also covered the
workflow nature of bioinformatics. Within this context a syntactic structure was proposed in
order to allow in silico experiments to be replicable and reproducible. More importantly, from
this experience it was possible to study the relationship between syntax and semantics. This
research is based upon real cases in which researchers were involved; this allowed the author
to benefit from a direct relationship not only with the subject of study, but also with the
context in which solutions were expected to play a role.
The discussion and conclusions are organised as follows: initially, a summary of the
thesis is presented. In Sections 6.2 and 6.3 the similarity between the semantic web and the
biological domain is illustrated within the context of information systems. Issues related to the
construction of biological ontologies are discussed in Section 6.4 and, finally, references are
given in Section 6.5.
LIMS are a special kind of biological information system, as they in principle organise
the information produced by laboratories. Once this information has been organised, the
analysis process takes place, and discovering relations becomes more and more important. Within
the plant context, plant-related descriptors such as those provided by the Plant Ontology (PO) [1]
and Gramene [2] are being consumed by object models in a variety of software systems, such
as LIMS, in order to support the annotation of sequences and experiments. These object models
are meant to support an integrative approach; the use of orthogonal ontologies is therefore
essential.
Relating phenotypic information to its corresponding genotypes, and vice versa, should
in principle be possible. For instance, with a saline-stress-related query one should be able to
retrieve not only sequences but also experimental designs and conditions, locations, morphological
features of the plants involved, etc. Common biological descriptors should be identified, and
ontologies addressing the specific needs of the plant community need to be developed.
Molecular information may be described independently of the domain, for instance by the
Gene Ontology (GO) [3]; phenotypic information, however, is highly specific to the type of
organism being described.
Ideally, LIMSs should consume core and domain-specific terminology in order to allow
for the annotation of experiments; these vocabularies should be shared across the community
so that exchanging information becomes a simpler task. For information to be shared, the
vocabulary used should be independent of the LIMS; different LIMSs should be able to
share a standard vocabulary. This ensures independence between the conceptual and
the functional model: researchers may use different LIMSs but still name things with a
consistent vocabulary. In the same vein, this may allow experiments to be shared in the form of
customisable "templates".
Some attempts have been made in order to define what an investigation is, what the
difference between a test and an assay is, how we can classify experiments and how to annotate
research endeavours in order to facilitate contextualised information retrieval. One of the first
ontologies addressing the problem of describing experiments was the MGED Ontology
(henceforth MO) [4]; it was developed as a collaborative effort by members of the MGED
Ontology working group in order to provide those descriptors required to interpret
microarray experiments. Although these concepts were derived from the MicroArray and
Gene Expression Object Model (MAGE-OM), which is a framework to represent gene
expression data and relevant annotations [5], in principle any software capable of consuming
the ontology can use these descriptors. There is thus a separation between the functional
and the declarative (ontological) models.
Throughout this thesis, the need to support integrative approaches through rich and useful
graphical environments has been clearly stated. Designing these environments is a research
topic not sufficiently studied within the context of bioinformatics. There have been very few
Human-Computer Interaction (HCI) evaluations of bioinformatics tools; moreover, HCI
and cognitive aspects are rarely considered when designing biological information systems.
Information foraging, which refers to activities associated with assessing, seeking, and
handling information sources [6], has also not been considered in bioinformatics. Such search
is adaptive to the extent that it makes optimal use of knowledge about the expected value of
the information and the expected costs of accessing and extracting it. Humans adopt different
strategies when gathering information and extracting knowledge from the results of their
searches. A relationship between the user and the information is then built. The relationship is
easy if the data are presented to the user in a clear way, if extraction tools are provided, and
especially if the information is understandable, structured, and immersed in the right context.
Value and relevance are not intrinsic properties of information-bearing representations, but
can be assessed only in relation to the environment in which the task is embedded.
Graphical User Interfaces (GUIs) should facilitate managing and accessing information.
Graphical environments should spare users steep learning curves and the difficulty of
command-line interfaces. There is a need to establish a clear separation between the
operational domain and the domain of knowledge. For a researcher,
finding a protein defined by some specific features along with all the relevant bibliographic
references should not be a daunting task. Integrative approaches should therefore be integral.
Ontologies may help in creating coherent visual environments, as has already been shown by
Stevens et al. with the TAMBIS project [7].
The field of bio-ontology development has been surprisingly active in recent years,
partly because of the premise that it will encourage and enable knowledge sharing and reuse,
but also because the biological community is gradually adopting a holistic approach for which
context is critical (a paradigm shift, some would say). In order to achieve this "holistic view",
it is indispensable to develop ontologies that accurately describe the reality of the world.
Different groups will develop this ontological corpus, as is currently happening. Those
efforts already in place are independent of one another, and made mostly in response to ad
hoc necessities. Ironically, the biological community may be re-writing an old story:
database integration in molecular biology has long been a problem, partly because
most approaches to data integration have been driven by necessity. By the same token,
biological ontologies have been developed as momentary responses to particular needs. This
has led the bio-communities to describe their worlds from their particular perspectives, without
taking into account that at a later stage these ontologies will be needed to describe the "big
picture". This approach also carries negative implications for the maintenance and evolution
of the ontologies.
This situation should change within the coming years, not only because of those lessons
learned, but also because the “big picture” will drive biology more and more, making it
necessary to have articulated descriptions by using well-harmonised ontologies. More
importantly, ontologies are being, slowly but firmly, separated from object models. This
independence should allow ontologies to be used across a wide range of applications. For
instance, any Laboratory Information Management System should be able to use the same
descriptors for those processes for which it was designed, thereby enabling data sharing and
to some extent knowledge sharing. Full experiments could then be easily replicated.
Ontologies should be independent of computer realisations.
Are we heading towards a semantic web (SW) in bioinformatics? It was Tim Berners-Lee
who initially presented the vision of a unified data source where, as a consequence of highly
integrated systems, complex queries could be formulated [8]. It has been a long time since this
vision was presented, and many different approaches have been developed in order to make it
operative; even so, it is still hard to define what the semantic web really means. The SW may be
seen as a knowledge base in which semantic layers allow reasoning and the discovery of hidden
relations, contextualising the information and thereby delivering personalised services. In the
development of the semantic web there is thus a pivotal role for ontologies to play, since they
provide a representation of a shared conceptualisation of a particular domain that can be
communicated between people and applications.
In the same way, the complexity of a query would be largely a function of how many
different databases must be queried, and from these how many internal sub-queries must be
formed and exchanged, for the desired information to be extracted. If a deeper layer that
embeds semantic awareness were added, it is probable that query capacities would be
improved. This can be envisioned as the provision, within an ontological layer, of just enough
connective tissue to allow semi-intelligent agents or search engines to execute simplified
queries against hundreds of sites (McEntire, 2002). Not only could different data sources be
queried, but also (more importantly) interoperability would then arise naturally as a
consequence of semantic awareness. At the same time, it should be possible to automatically
identify and map the various entries that constitute the knowledge relationship, empowering
the user to visualise a more descriptive landscape. What is semantic integration of molecular
biology databases? What does it mean to have a semantic web for the biological domain?
Not surprisingly, as has already been mentioned and as is discussed in more detail
in Chapter 2, the biological community is heading towards a semantic web. This is not
entirely new, as the biological community has faced all of the problems the syntactic web has
always had. The semantic web in biology poses an interesting and not so well known challenge
to the semantic web community: that of knowledge representation within communities of
practice. Representing and formalising knowledge for semantic web purposes has usually been
studied within closed, complete contexts (Amazon3, insurance companies, administrative
environments) for which the business logic is not only known in advance, but for which the
communities are also more prone to follow rules. The biological community is different, and
these idiosyncratic factors must be taken into account. Moreover, it is not clear what constitutes
knowledge, in a broad sense, for the biological community. One could say that a database entry
may be considered data; however, as the database entry is annotated with meaningful
information that places it within a valid context for the researcher, the boundaries between
data, information and knowledge become difficult to see.
3 http://www.amazon.com
Simple guidance and criteria, such as how best to define the class structure within the
bio-domain, may make a huge difference. Names should not matter much, as they will
proliferate; ontology classes, on the other hand, should offer a more enduring structure. Ideally
the class structure should follow axes based on time and space, continuants and occurrents.
There is thus a need to disentangle meanings from names; in this way we may
achieve a modular, accurate description of the world, based on facts and evidence rather than
perceptions. Quoting Barry Smith4:
“As Leibniz pointed out several centuries back in his criticism of John Locke's statement (roughly
summarised): Since names are arbitrary and our understanding of the world is based on the names we give to
things, our understanding of the world is arbitrary. Leibniz agreed names are arbitrary but our description of
the real world is based on our best effort to describe facts as we see them - not on names. It's these aspects of
Leibniz epistemology that have been used to great effect by the evo-devo researchers who have developed the
concept of modularity/complementarity when describing the constraints on evolution - to wit - the possibilities -
the search space in which evolution functions - are not limitless, but are in fact constrained by the limits of
POSSIBLE interactions amongst the many constituent entities.”
As the need to have integral descriptions for research endeavours grows, so does the
effort to cope with such a task. FuGO [9] has started to address issues related to modularity
and ontology integration. Different groups such as FuGO within the functional genomics
context and the Generation Challenge Program (GCP) [10] within the plant world are
currently evaluating some existing standards for ontologies and metadata; Dublin Core [11],
SKOS [12], ISO/IEC [13] and others have been part of this assessment. Some guidance will
become available for bio-communities so that there is a unified criterion for defining classes,
properties and metadata in general. This is certainly a step in the right direction, but it is still
too soon to predict future outcomes.
This thesis has demonstrated the role of communities of domain experts in
developing ontologies; as ontologies imply the contribution and agreement of a community,
they may be understood as "social agreements" for describing the reality of a particular
domain of knowledge. The whole process resembles in many ways an exercise in participatory
design and, even more interestingly, it follows the main precept of user-centric design, which
states that designs should always focus on users' perceptions. Chapters 2 and 4 not only
present methodological aspects of building community ontologies, but also important
details of the processes in which these methods were applied.
Within the context of designing technology for biological researchers, what is the role
of the domain expert? An interesting parallel may be drawn from the field of designing
children’s technology. Three main methodologies have been applied: User Centric Design
(UCD) [14], Participatory Design (PD) [15], and Informant Design (ID) [16]. All three focus
on describing the kind of relationship between children and designers, which affects the input
obtained. Interestingly, the relationships described by these authors, as well as the dynamics
that emerge from the relationship between the children and the designers, proved to be
applicable when designing technology for the biological community.
The UCD approach involves children in the design process as testers. This is the
traditional role of children as end-users of technology, where they are placed in a reactive role
in order to give feedback to designers about the product [17]. In this approach the designers
define what is suitable for children; they reach an advanced point in the design process
before getting input from the users.
The fundamental assumption in the PD approach is that users and designers can view
each other as equals. Both therefore take on active roles in the design [17]. Following the
same line of thinking, Druin and Solomon [18] have proposed including children as part of the
design team, in particular suggesting metaphors to the designers and sharing, to some extent,
responsibilities and decision making.
On the other hand, the ID perspective considers children’s input to play a fundamental
role in the design process, thus seeing children not just as testers of technology. The
participation of children in this process is defined according to the different phases of design
and their goals. This approach is placed somewhere between UCD and PD; children are
informants but cannot be considered as co-designers [17].
When developing technology within the biological domain, the predominant approach
has been to use the domain expert as an informant on requirements, as well as a tester of the
end product. Research to determine what the role of the domain expert should be when
developing his/her technology is therefore sorely needed, as the current approach has
proven not to be very successful. From our experiences, as reported in Chapters 2 and 4, the
constant input and interaction of the domain experts is crucial for the success of the
information system. Domain experts should be involved throughout the entire process. This
involvement is not only needed when developing ontologies within biological communities,
but also during software development. Participatory design is thus the most suitable
methodology as the control is shared by all of the design team members, and their research
agendas are open to changes and redefinitions. The position of the designers is that of
someone who is interested in learning about the domain experts, someone who is willing to
reshape his/her own ideas. This perspective supports a closer relationship, where everyone is
learning. Designers working within an ID approach assume a position mediated by the goals
of the different stages of the process. The research agenda is defined according to the
informants' input across the process, hence in those stages where domain experts take part as
informants, the relationship resembles that promoted by PD: designers want to learn facts
they do not yet know about the domain experts. The “control” should therefore be shared
whenever possible; domain experts should “lead” the whole process by establishing an
egalitarian relationship.
8.5 REFERENCES
5. Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock
G, Ball C, Lepage M, Swiatek M et al: Design and implementation of microarray
gene expression markup language (MAGE-ML). Genome Biology 2002.
6. Pirolli P, Card SK: Report on Information Foraging. Palo Alto: Palo Alto Research
Center; 2006.
14. Norman D, Draper S: User centered system design: new perspectives on Human-
Computer Interaction. New Jersey: Lawrence Erlbaum Associates; 1986.
15. Schuler D, Namioka A: Participatory design: principles and practices. New Jersey:
Lawrence Erlbaum Associates; 1993.
16. Scaife M, Rogers Y, Aldrich F, Davies M: Designing for or designing with? Informant
design for interactive learning environments. In: Conference on human factors in
computing systems: 1997; Atlanta, Georgia, USA: ACM; 1997.
18. Druin A, Solomon C: Designing multimedia environments for children. New York:
John Wiley; 1996.
9 Future work
9.1.1 Introduction
As the need for integrated biological research grows, ontologies become more and
more important within the life sciences. The biological community has a need not only for
controlled vocabularies but also for guidance systems for annotating experiments, better and
more reliable literature mining tools, and most of all a consistent, shared understanding of
what the information means. Ontologies should thus be understood as a starting point, not
as an end in themselves. Although several efforts to provide biological communities with
these required ontologies are currently in progress, some of them have thus far proven to be
too slow, too expensive, and too error-prone to meet the demand.
The difficulties in these developments are due not only to the ambiguity of natural
language and the fact that biology is a highly fragmented domain of knowledge, but also to
the lack of consistent methodologies for ontology building in loosely centralised
environments such as the biological domain [1, 2]. Biologists need methodologies and tools in
the same way that computer scientists need real-life problems to work on. Collaboration
would thus be the easiest way to move forward. However, such interaction has proven
difficult, as the two “houses” of Biology and Computer Science continue to fight each other in
the field of windmills where Don Quixote is pointing a way towards the horizon.
The first two of our three houses have been described by Goble and Wroe [3]. Firstly, “The
Montagues”: “one, comforted by its logic’s rigour/Claims ontology for the realm of pure”. Goble and
Wroe define this house as the one of computer science, knowledge management, and
Artificial Intelligence (AI). This community essentially works with well-scoped, well-behaved
problems; they work with generalisations, and expect to have broadly applicable results.
Our second house, “The Capulets”: “The other, with blessed scientist’s vigour/acts hastily on
models that endure.” As Goble and Wroe defined it, this is the house of Life Sciences. Within
this community the purpose of bioinformatics is to support their research endeavours. This is
a community with a pragmatic, led-by-need vision of computer science, with a strong
application pull. Ontologies for the Capulets are basically controlled vocabularies, taxonomies
that allow them to classify things, very much in accordance with a very old tradition in this
domain, one that started with the likes of Aristotle and Linnaeus. Within this house, the role
of the knowledge engineer is that of someone who promotes collaboration in a loosely
centralised environment. Biologists are thus not only leading the process but also designing
the ontology and the software that will ultimately utilise the ontology. Their ontologies are
living entities, constantly evolving.
Following Goble and Wroe’s analogy (henceforth Act 1, Prologue) we also have a
third house: The Philosophers. For narrative purposes we name it the house of Don
Quixote. For this house the essence of “things” is important, as its members seek a
single model of truth itself. Some tangible contributions of this house are those studies of the
part/whole relationship, how to model time, criteria for distinguishing among mutations,
transformations, perdurance, and endurance. Thanks to its heavy emphasis on theory, the
work of this house has provided us with a conceptual corpus for understanding ontologies.
The same story will be used as a baseline for the remainder of this chapter. Although
the houses endure, we may be shifting acts and scenarios. As the Montagues and Capulets
dig deeper into their discrepancy, are we moving from Verona, via La Mancha, to Macondo,
where we all may face a hundred years of solitude? This literary analogy thus introduces a
possible ending point, Macondo, where we all may find the land of endogenous agreements.
A brief history of this drama is presented in Section 9.1.2; Section 9.1.3 presents some of the
duels between the two main houses of our narrative. The dilemma between marriage and poison is
discussed in Section 9.1.4, where we argue the potential danger of heading towards
Macondo versus remaining in Verona and finally living happily ever after.
Act 1, Scene 1: “Verona. A public place.” This is indeed true for both of our dignified
households. Computer scientists and biologists actively promote open source initiatives. The
Capulets have a long-standing tradition where sharing code is an everyday activity. The
OpenBio initiatives are a clear example of this fact; however, these initiatives are not resources
for workbench biologists but are meant to support bio-programmers and bioinformaticians.
The Montagues also have an interesting record of collaborative efforts, the development of
the Linux kernel and KDE (K Desktop Environment) to name two. Sharing code is,
however, different from sharing knowledge. Unfortunately, little attention has been paid to
exactly “how” these communities have carried out the process of knowledge management in
their corresponding projects [4].
Act 1, Scene 2: “Halls and rooms in our households’ houses.” During the last several
years The Capulets have been developing different ontologies: The Gene Ontology (GO), the
Microarray Gene Expression Data (MGED) Ontology (henceforth MO) [5]; a
comprehensive list is provided by OBO [6]. By the same token, The Montagues have several
ontological initiatives such as OpenCyc [7], a general knowledge base, and SUO
(Standard Upper Ontology) [8]. The SUO WG (working group) is developing a standard,
which aims to specify an upper ontology to support computer applications such as data
interoperability, information search and retrieval, automated inference, and natural language
processing.
Act 1, Scene 3: “A lane by the wall of our household’s orchard.” While the Capulets
focus on standardising the words and their meanings, our Montagues embrace relationships,
logical descriptions, agent technology, and give serious consideration to some of the insights
from the House of Don Quixote such as Time, Matter, Substance, Mutability and many other
essential properties of a concept. The Montagues and the Quixotes tend to see the Capulets’
ontologies as dictionaries, and are prone to point out those deficiencies with accuracy [9, 10].
The Capulets defend their efforts with passion; a valid point in their favour is the lack of
knowledge that Montagues and Quixotes have about the biological community.
Act 2, Scene 2: “What's in a name? That which we call a rose by any other name would
smell as sweet.” It was over two years ago that Hunter [11] responded to Brenner's [12]
comment in Genome Biology. The interesting issue in those discussions was that the
fundamental question both were trying to address was never stated explicitly: What is the role
of ontologies in the life sciences?
Since ontologies may also be understood as social agreements, the way Hunter responds
to Brenner, arguing that ontologies are for programs and not for people, is quite descriptive
of their purpose. It is also true that Brenner misses the point by portraying ontologies
solely as taxonomies of words. Conceptual objects have concrete representations; they are in a
way tangible objects, and therefore should enable computational tasks.
Act 2, Scene 2, part 2: “A fertile and dangerous playground.” Recently Soldatova and King
[10] published a series of shortcomings related to MO. It did not take long for
Stoeckert et al. to respond [13]. One interesting issue in this scene is that neither party actually
addressed a key point: an ontology for describing microarray experiments should provide a
conceptual scaffold. For instance, such an ontology should provide not just the
minimal descriptors but also the logical constraints that make inference possible.
Ontologies are not just controlled vocabularies; they should also provide support for
reasoning processes. In order to describe a biological investigation it is necessary to use many
different ontologies; how can we integrate these orthogonal ontologies so the final narrative
makes sense?
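The distinction argued here, between a controlled vocabulary and an ontology that supports inference, can be sketched with a toy example. The is-a chain below is hypothetical and greatly simplified; real ontologies use far richer logical constraints, but the principle is the same:

```python
# A minimal sketch (hypothetical axioms) of what "support for reasoning" adds
# over a controlled vocabulary: a flat word list can only confirm that a term
# exists, while transitive is-a axioms let us infer facts never asserted
# directly.

IS_A = {  # asserted axioms: child -> direct parent
    "vascular endothelial growth factor receptor": "transmembrane receptor",
    "transmembrane receptor": "receptor",
    "receptor": "protein",
}

def ancestors(term):
    """Walk the transitive is-a chain to infer every superclass of `term`."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found

# Competency question: "is a VEGF receptor a protein?" -- never stated as a
# single axiom, yet inferable by chaining the three axioms above.
print("protein" in ancestors("vascular endothelial growth factor receptor"))  # True
```

A controlled vocabulary would answer only "is this term in the list?"; the logical layer is what lets an agent answer the competency question.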
So far ontologies have focused on describing things and processes. However, the
relationship between the entity in question and the process by which it is studied has not yet
been fully explored. A “thing” is immersed in a context in which it is informative. The
fragmentation of processes and things should be consistent (i.e. the definition of whole/part
relations) so that agents are able to “cut through” the myriad of information annotated with
ontologies. How shall we encompass those apparently unrelated descriptions? The other side
of the coin deals with the broader context in which the “thing” is of interest.
For instance, when studying a disease we need to gather information not only about those
experimental processes, but also the different responses of the system to the alterations we
have caused. We don’t only need to describe the “thing” we are studying but also the context
in which it is being studied. A disease may be seen as alteration of one or more metabolic
pathways, with the subsequent molecular implications. It may be described as a series of
objects with individual states, individual disease instances, and with relationships between
particular objects. Disease representation requires capturing individual object states as well as
the relationships between different objects. For example, one can use the GO term GO:0005021,
vascular endothelial growth factor receptor, as a partial descriptor of the gene FLT3, fms-related tyrosine
kinase. This also allows an ATP-binding activity to be imputed to this gene, as is
understood from [14].
This, however, says nothing about the circumstances of the gene/protein product in a
disease state, or in an individual disease instance. The same can be said for disease objects,
which can also be effectively described by ontologies, but without state or relationship
provision. Is it possible with existing ontologies to accurately describe a disease from both
phenotypic and genotypic perspectives? Since ontologies offer what Brenner defines as
dictionaries, such a representation is not yet possible. To some extent, the solution seems to
be a linguistic exercise, utilising curated data sources and biomedical texts to first define the
relevant objects as they are both officially and commonly expressed, and then to both define
and determine the syntactic and semantic relationships between objects. To us, the existing
and emerging ontologies play a key role in tethering the objects to an objective structure.
However, the object states and relationships are what truly represent disease states and
instances. In our view, the dynamic nature of individualised disease states requires a more
flexible conceptual model, which encompasses the bridging of separate ontologies through
relationships.
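A minimal sketch of the conceptual model suggested above, disease instances as objects carrying individual states plus explicit relationships that bridge terms from separate ontologies, might look as follows. All class names, fields, and the disease-ontology placeholder are hypothetical illustrations, not an existing schema:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedObject:
    """An object tethered to ontology terms but carrying its own state."""
    name: str
    ontology_terms: list                        # e.g. GO identifiers (shared meaning)
    state: dict = field(default_factory=dict)   # individual state, not in any ontology

@dataclass
class Relationship:
    """An explicit link bridging objects described by separate ontologies."""
    subject: AnnotatedObject
    predicate: str
    target: AnnotatedObject

flt3 = AnnotatedObject("FLT3", ["GO:0005021"], {"expression": "elevated"})
disease = AnnotatedObject("disease instance 1", ["(disease-ontology term)"],
                          {"stage": "II"})
links = [Relationship(flt3, "implicated_in", disease)]

# The ontology terms anchor each object to a shared structure; the state and
# relationship layer carries what the ontologies alone cannot express.
print(links[0].subject.state["expression"])  # elevated
```

The design point is the separation of concerns: shared ontology terms give objectivity, while per-instance state and cross-ontology relationships capture the individualised disease picture.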
Act 2, Scene 3: “Lingua Franca: a sanctorum by the orchard: He discovered then that he could
understand written English and that between parchments he had gone from the first page to the last of the six
volumes of the encyclopedia as if it were a novel.” As Brenner states explicitly in his comment, we
need to be fluent in our own language. But what does it mean to be fluent in one's own
language? If someone learned a few phrases so that they could read menus in restaurants and
ask for directions on the street, would you consider them fluent in the language? Certainly
not. That type of phrase-book knowledge is equivalent to the way most people use computers
today. Is such knowledge useful? Yes. But it is not fluency. To be truly fluent in a foreign
language, you must be able to articulate a complex idea or tell an engaging story; in other
words, you must be able to “make things” with language. Analogously, being digitally fluent
involves not only knowing how to use technological tools, but also knowing how to construct
things of significance with those tools [15]. Learning how to use Protégé does not make you
an ontologist; by the same token knowing GO does not make you a biologist. Respect and
understanding for others' motivations, contributions and needs are fundamental for a
successful marriage.
“The world was so recent that many things lacked names, and in order to indicate them it was necessary
to point”. Verona and Macondo are an apt metaphor for Bio-ontologies today. On one hand
Verona represents the possible starting point from which we may all do business and thus
engage in win-win situations, as described by Stein [16]. Alternatively, Macondo represents the
undesired possible arrival point, a magical realism in which man's astonishment before the
wonders of the real world is expressed in isolation. Since the real world encompasses
different views, codes of practice, rules, values, and areas of interest, we should focus on our
common point: fostering interdisciplinary collaboration and communication and thus
engaging in business. GONG (Gene Ontology Next Generation) [17] and FuGO (Functional
Genomics Investigation Ontology) [18] may illustrate how to work together. They may
eventually teach us important lessons not only from the ontological perspective but also from
the community perspective; however, it is still too soon to fairly evaluate those lessons.
Some practical realism would consequently come in quite handy if we are all to avoid a
hundred years of solitude.
9.1.5 References
3. Goble C, Wroe C: The Montagues and the Capulets. Comparative and Functional
Genomics 2004, 5:623-632.
6. OBO [http://obo.sourceforge.net/]
7. OpenCyc [http://www.opencyc.org/]
9. Smith B, Williams J, Schulze-Kremer S: The Ontology of the Gene Ontology. In: AMIA
Annual Symposium; 2003.
10. Soldatova LN, King RD: Are the current ontologies in biology good
ontologies? Nature Biotechnology 2005, 23:1095-1098.
12. Brenner S: Life sentences: Ontology recapitulates philology. Genome Biology 2002,
3(4).
13. Stoeckert CJ, Ball C, Brazma A, Brinkman R, Causton H, Fan L, Fostel J: Wrestling
with SUMO and bio-ontologies. Nature Biotechnology 2006, 24:21-22.
15. Resnick M: Rethinking Learning in the Digital Age. In The Global Information
Technology Report: Readiness for the Networked World: Oxford University Press;
2002.
APPENDICES
GLOSSARY
Communities of practice: Communities of practice are the basic building blocks of a social
learning system because they are the social ‘containers’ of the competences that make up such
a system. Communities of practice define competence by combining three elements. First,
members are bound together by their collectively developed understanding of what their
community is about and they hold each other accountable to this sense of joint enterprise. To
be competent is to understand the enterprise well enough to be able to contribute to it.
Second, members build their community through mutual engagement. They interact with one
another, establishing norms and relationships of mutuality that reflect these interactions. To
be competent is to be able to engage with the community and be trusted as a partner in these
interactions. Third, communities of practice have produced a shared repertoire of communal
resources—language, routines, sensibilities, artefacts, tools, stories, styles, etc. To be
competent is to have access to this repertoire and be able to use it appropriately.
Competency questions: Understood here as those questions for which we want the
ontology to be able to provide support for reasoning and inference.
Concept maps: A concept map is a diagram showing the relationships among concepts.
Concepts are connected with labelled arrows, in a downward-branching hierarchical structure.
The relationship between concepts is articulated in linking phrases, e.g., "gives rise to",
"results in", "is required by," or "contributes to".
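As an illustrative sketch, a concept map of this kind can be represented as a set of labelled edges, each linking two concepts through a linking phrase. The concepts and phrases below are invented examples:

```python
# A concept map as labelled edges (invented example concepts): each entry
# connects two concepts through a linking phrase, mirroring the labelled
# arrows of the diagram form.
concept_map = [
    ("pollination", "gives rise to", "seed"),
    ("seed", "is required by", "germination"),
    ("germination", "contributes to", "plant development"),
]

def phrases_from(concept):
    """Read every outgoing link of a concept as a short sentence."""
    return [f"{a} {link} {b}" for a, link, b in concept_map if a == concept]

print(phrases_from("seed"))  # ['seed is required by germination']
```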
Domain expert: A domain expert or subject matter expert (SME) is a person with
special knowledge or skills in a particular area. Domain experts are individuals who are both
knowledgeable about and extremely experienced with their application domains.
GCG: Formally known as the GCG Wisconsin Package, GCG contains over 140
programs and utilities covering the cross-disciplinary needs of today’s research environment.
MOBY: The MOBY system for interoperability between biological data hosts and
analytical services.
Relevant scenarios: Scenarios in which it was considered that the term was going to be used.
Semantic Web (SW): The semantic web is an evolving extension of the World Wide
Web in which web content can be expressed not only in natural language, but also in a format
that can be read and used by software agents, thus permitting them to find, share and
integrate information more easily.
Task: The atomic unit of work that may be monitored, evaluated and/or measured. A
task is a well-defined work assignment for one or more project members. Related tasks are
usually grouped to form activities.
Task ontologies: Those ontologies that describe vocabulary related to tasks, processes,
or activities.
TAVERNA: The Taverna project aims to provide a language and software tools to
facilitate easy use of workflow and distributed compute technology within the Science
community.
Text mining: Text mining, sometimes alternately referred to as text data mining, refers
generally to the process of deriving high quality information from text.
W2H: W2H is a free WWW interface to sequence analysis software tools such as the GCG
Package (Genetics Computer Group), EMBOSS (European Molecular Biology Open
Software Suite), or to derived services (such as HUSAR, Heidelberg Unix Sequence Analysis
Resources).
ACRONYMS
This version of the RSBI ontology represents high-level concepts usually found in the
description of biological investigations. Protégé, version 3.1, was the ontology editor software
used during the development of this ontology.
This list of terms was gathered by using Text2Onto, part of the KAON ontology
framework. In total, eight documents were scanned with this software.
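As a rough illustration of the kind of candidate-term counting that underlies such a list, consider the sketch below. This is not Text2Onto's actual relevance measure, only a simplified frequency count over invented sample text:

```python
# A simplified sketch of candidate-term extraction: count lower-cased word
# occurrences across a small document collection, in the spirit of the
# frequency column of the term list (NOT Text2Onto's actual algorithm).
import re
from collections import Counter

def candidate_terms(documents):
    """Tally every alphabetic token across the collection."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

docs = ["Germplasm lists record parentage.",
        "Each germplasm record links a method and a name."]
freqs = candidate_terms(docs)
print(freqs["germplasm"], freqs["record"])  # 2 2
```

Real term-extraction tools additionally weight such counts into relevance scores and filter out stop words, which is why the list above mixes domain terms with generic ones.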
-12,62642133 1 12 ascii
-17,62642133 1 66 n
-12,62642133 1 11 production
-13,62642133 1 28 call
-17,62642133 1 68 crop
-13,62642133 1 20 development
-22,62642133 1 116 cf
-14,62642133 1 39 tree
-14,62642133 1 37 increase
-68,62642133 1 576 germplasm
-12,62642133 1 17 origin
-12,62642133 1 17 genesis
-12,62642133 1 10 polyploid
-24,62642133 1 137 icis
-12,62642133 1 11 evaluation
-13,62642133 1 22 second
-12,62642133 1 12 range
-25,62642133 1 146 gen
-12,62642133 1 12 case
-12,62642133 1 11 parentage
-12,62642133 1 14 see
-15,62642133 1 44 output
-12,62642133 1 15 descent
-33,62642133 1 224 population
-12,62642133 1 13 standardisation standard
-25,62642133 1 146 parent
-12,62642133 1 10 log
-13,62642133 1 26 starting start
-12,62642133 1 11 key
-15,62642133 1 43 command
-12,62642133 1 13 ltype
-12,62642133 1 15 dll
-12,62642133 1 14 display
-16,62642133 1 50 g
-12,62642133 1 10 ntype
-15,62642133 1 45 text
-15,62642133 1 47 gidx
-21,62642133 1 108 c
-12,62642133 1 14 column
-12,62642133 1 10 lgms
-12,62642133 1 12 session
-14,62642133 1 36 male
-17,62642133 1 66 half
-12,62642133 1 11 copy
-36,62642133 1 250 database
-13,62642133 1 24 genealogy genealogies
-12,62642133 1 11 exe exes
-15,62642133 1 49 browse
-12,62642133 1 17 inger
-19,62642133 1 87 derivative derivation
-13,62642133 1 24 mating mate
-16,62642133 1 53 menu
-12,62642133 1 13 instance
-18,62642133 1 74 cultivar
-12,62642133 1 10 aim
-13,62642133 1 20 pollination
-14,62642133 1 33 order
-12,62642133 1 10 factor
-15,62642133 1 41 directory
-12,62642133 1 14 purification
-17,62642133 1 62 dsp
-14,62642133 1 30 link
-13,62642133 1 20 abbreviation
-16,62642133 1 57 syntax
-15,62642133 1 46 general generation generative
-26,62642133 1 155 gid gids
-12,62642133 1 10 k
-13,62642133 1 20 will
-19,62642133 1 82 date
-12,62642133 1 18 meth
-21,62642133 1 102 self selfing selfs
-12,62642133 1 19 variety
-17,62642133 1 65 default
-12,62642133 1 17 gmsinput
-14,62642133 1 34 cv
-12,62642133 1 17 print
-23,62642133 1 123 plant planting
-13,62642133 1 21 end
-23,62642133 1 121 information
-27,62642133 1 163 o
-45,62642133 1 349 list listing
-14,62642133 1 39 variable variability
-25,62642133 1 144 field
-13,62642133 1 23 top
-12,62642133 1 15 gene
-19,62642133 1 87 installation
-17,62642133 1 60 pedigree
-20,62642133 1 97 integer
-15,62642133 1 41 section
-12,62642133 1 15 descriptor
-12,62642133 1 14 tool
-14,62642133 1 36 d
-13,62642133 1 27 history
-12,62642133 1 13 double doubling
-23,62642133 1 127 value
-13,62642133 1 22 release
-12,62642133 1 15 export exporting
-12,62642133 1 10 methn
-13,62642133 1 29 expansion
-18,62642133 1 79 collection
-13,62642133 1 23 point
-32,62642133 1 219 user
-14,62642133 1 30 l
-20,62642133 1 96 structure
-29,62642133 1 182 data
-12,62642133 1 16 gidy
-17,62642133 1 67 diallel
-28,62642133 1 179 source
-12,62642133 1 11 destination
-26,62642133 1 155 ids id
-12,62642133 1 17 help
-15,62642133 1 48 ini
-12,62642133 1 15 mass
-13,62642133 1 21 odbc
-12,62642133 1 13 mixture
-12,62642133 1 13 reason
-12,62642133 1 13 listbox
-12,62642133 1 11 szbuffer
-17,62642133 1 65 character
-12,62642133 1 11 item
-12,62642133 1 16 form
-12,62642133 1 11 h
-16,62642133 1 52 check checking
-31,62642133 1 200 click
-13,62642133 1 27 import
-16,62642133 1 50 tester
-12,62642133 1 19 auto
-13,62642133 1 25 password
-13,62642133 1 25 right
-14,62642133 1 30 convention
-30,62642133 1 197 type
-13,62642133 1 25 site
-13,62642133 1 20 run running
-17,62642133 1 62 bulk bulking
-18,62642133 1 78 man
-16,62642133 1 56 fertilising fertilisation
-14,62642133 1 31 culture
-21,62642133 1 106 figure
-14,62642133 1 33 find_next
-12,62642133 1 13 return
-13,62642133 1 25 element
-12,62642133 1 14 entity
-13,62642133 1 23 identification
-12,62642133 1 14 t
-15,62642133 1 47 administrator administration
-13,62642133 1 23 length
-16,62642133 1 51 use
-13,62642133 1 24 level
-23,62642133 1 128 clone
-12,62642133 1 12 termination terminal terminator
-22,62642133 1 110 group
-13,62642133 1 28 implementation
-12,62642133 1 15 parse parsing
-12,62642133 1 16 day
-12,62642133 1 18 ir
-13,62642133 1 20 target
-12,62642133 1 18 replacement
-12,62642133 1 10 array
-12,62642133 1 12 follows
-12,62642133 1 14 material
-12,62642133 1 10 multiple multiplication
-26,62642133 1 154 access accessing accession
-12,62642133 1 11 initialisation
-13,62642133 1 23 germplsm
-13,62642133 1 28 wheat
-12,62642133 1 10 test testing
-14,62642133 1 31 cytoplasm
-12,62642133 1 13 cd
-12,62642133 1 12 bw bws
-12,62642133 1 17 fieldbook
-32,62642133 1 215 method
-23,62642133 1 128 record recording
-18,62642133 1 74 example
-16,62642133 1 54 system
-27,62642133 1 164 description
-13,62642133 1 25 identifier
-12,62642133 1 18 back
-12,62642133 1 13 cycle
-51,62642133 1 405 table
-12,62642133 1 13 mutation
-12,62642133 1 15 works working work
-17,62642133 1 67 file
-12,62642133 1 18 a
-13,62642133 1 20 size
-13,62642133 1 21 month
-12,62642133 1 10 buffer
-17,62642133 1 62 e
-12,62642133 1 10 download
-13,62642133 1 23 match matching
-12,62642133 1 12 find_first
-12,62642133 1 11 definition
-20,62642133 1 93 argument
-12,62642133 1 12 batch
-13,62642133 1 23 dialog
-13,62642133 1 24 button
-12,62642133 1 19 char
-12,62642133 1 16 term
-23,62642133 1 124 process processing
-13,62642133 1 26 female
-12,62642133 1 19 i
-13,62642133 1 26 status
-12,62642133 1 15 local
-64,62642133 1 537 name naming
-13,62642133 1 24 maintenance
-16,62642133 1 54 management manager
-15,62642133 1 44 application
-21,62642133 1 104 set setting
-16,62642133 1 55 progenitor progenitors
-18,62642133 1 74 programming program
-12,62642133 1 11 part
-12,62642133 1 18 layout
-13,62642133 1 26 box
-30,62642133 1 197 window
-13,62642133 1 24 spp
-13,62642133 1 23 gms_germplasm
This version of the GMS ontology corresponds to the work done by Patrick Ward and
Mark Wilkinson. Domain experts from the International Center for Tropical Agriculture
(CIAT) worked with this version at a later stage. Protégé version 3.1 was the ontology
editor used during the development of this ontology.
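The class hierarchies pictured in the figures below can also be inspected programmatically once the Protégé project is exported to OWL (RDF/XML). The sketch below is illustrative only: the inline document and the class names `Germplasm` and `GermplasmMethod` stand in for the real exported file, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

OWL = "{http://www.w3.org/2002/07/owl#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"

# A tiny OWL/RDF fragment standing in for the exported GMS ontology;
# the class names are illustrative, not taken from the real file.
doc = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
                  xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Germplasm"/>
  <owl:Class rdf:ID="GermplasmMethod">
    <rdfs:subClassOf rdf:resource="#Germplasm"/>
  </owl:Class>
</rdf:RDF>"""

def class_hierarchy(xml_text):
    """Return {class_id: parent_id_or_None} for every owl:Class found."""
    root = ET.fromstring(xml_text)
    tree = {}
    for cls in root.iter(OWL + "Class"):
        cid = cls.get(RDF + "ID")
        parent = cls.find(RDFS + "subClassOf")
        tree[cid] = (parent.get(RDF + "resource").lstrip("#")
                     if parent is not None else None)
    return tree

print(class_hierarchy(doc))
```

The same traversal applies to any Protégé-exported RDF/XML file, though a full OWL toolkit would be preferable for anything beyond simple listing.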
Appendix 3 - Figure 1. A portion of the first version of the GMS ontology, Germplasm.
Appendix 3 - Figure 2. The Germplasm Method section of the first version of the GMS ontology.
Appendix 3 - Figure 3. The Germplasm Identifier section of the first version of the GMS ontology.
This version of the GMS ontology mostly corresponds to the work done together with
domain experts from the Australian Centre for Plant Functional Genomics and the
International Center for Tropical Agriculture (CIAT). Protégé version 3.1 was the
ontology editor used during the development of this ontology.
Appendix 4 - Figure 3. Germplasm Breeding Stock, a portion of the second version of the GMS
ontology.
Appendix 4 - Figure 4. Naming convention according to the second version of the GMS ontology.
Appendix 4 - Figure 5. Plant Breeding Method according to the second version of the GMS
ontology.
<parameter name="window"/>
<parameter name="tossgaps"/>
<parameter name="strandgap"/>
<parameter name="helixendin"/>
<parameter name="helixendout"/>
<parameter name="terminalgap"/>
<parameter name="hgapresidues"/>
<parameter name="strandendout"/>
<parameter name="secstrout"/>
<parameter name="pwgapopen"/>
<parameter name="pgap"/>
<parameter name="actions"/>
<parameter name="endgaps"/>
<parameter name="output"/>
<parameter name="seed"/>
<parameter name="gapext"/>
<parameter name="matrix"/>
<parameter name="kimura"/>
<parameter name="dnamatrix"/>
<parameter name="quicktree"/>
<parameter name="outputtree"/>
<parameter name="topdiags"/>
</task>
</stage>
<stage>
<annotation>The alignment result from the previous step is used to build two phylogenies
using two different methods from the Phylip set of methods</annotation>
<task id="2" email="a.garcia@imb.uq.edu.au">
<annotation>Parsimony method</annotation>
<transformer name="dnapars" version="3.6a2"
server="http://kun.homelinux.com/cgi-bin/Pise/5.a/dnapars.pl" />
<pipe_component/>
<parameter name="print_steps"/>
<parameter name="print_treefile"/>
<parameter name="use_threshold"/>
<parameter name="indent_tree"/>
<parameter name="outgroup"/>
<parameter name="printdata"/>
<parameter name="print_tree"/>
<parameter name="threshold"/>
<parameter name="use_transversion"/>
<parameter name="replicates"/>
<parameter name="seqboot_seed"/>
<parameter name="method"/>
<parameter name="print_sequences"/>
<parameter name="jumble"/>
<parameter name="user_tree"/>
<parameter name="weights"/>
<parameter name="seqboot"/>
<parameter name="consense"/>
<parameter name="times"/>
<parameter name="jumble_seed"/>
</task>
<task id="3" email="a.garcia@imb.uq.edu.au">
<annotation>Distance method</annotation>
<transformer name="dnadist" version="3.6a2"
server="http://gpipe.majorlinux.com/cgi-bin/Pise/5.a/dnadist.pl" />
<pipe_component/>
<parameter name="matrix_form"/>
<parameter name="ratio"/>
<parameter name="printdata"/>
<parameter name="replicates"/>
<parameter name="gamma"/>
<parameter name="seqboot_seed"/>
<parameter name="method"/>
<parameter name="distance"/>
<parameter name="weights"/>
<parameter name="one_category"/>
<parameter name="seqboot"/>
<parameter name="empirical_frequencies"/>
</task>
</stage>
</protocol>
INDEX

B
BRENDA 180, 192, 241

D
Domain expert 50, 57, 64, 65, 87, 90, 93, 94, 97, 100, 110, 125, 133, 140, 226, 238, 259
Domain ontologies 35, 238
DTD 241

E
EMBOSS 175, 176, 178, 179, 191, 196, 199, 205, 211, 215, 240, 241
evolution xi, 44, 45, 49, 50, 56, 57, 72, 74, 75, 87, 88, 116, 119, 120, 121, 155, 158, 166, 186, 221, 224

G
G-PIPE xxvi, xxvii, 177, 178, 194, 195, 197, 198, 204,

J
Jemboss 176, 179, 241

K
KAON 135, 142, 144, 247
KEGG 180, 192, 241
Knowledge xviii, xxvii, xxviii, 33, 34, 36, 42, 43, 46, 47, 51, 52, 53, 63, 71, 76, 77, 78, 87, 106, 107, 108, 116, 119, 127, 131, 142, 143, 151, 172, 189, 190, 235, 238
Knowledge acquisition 63, 71, 77, 190
Knowledge elicitation 36, 42, 43, 46, 47, 63, 119, 238

L
Life Cycle 77, 238

M
MAGE xxix, 158, 188, 220, 227, 241
MAGPIE 158, 188
Mailing lists 119
Management processes 62
Method 76, 239, 260, 265
MGED xvi, xix, xx, xxviii, xxix, 77, 80, 92, 93, 94, 105, 108, 125, 129, 145, 146, 147, 148, 150, 158, 193, 220, 227, 231, 235, 238, 241
MIAME xxi, xxix, 91, 145, 146, 150, 158, 241
MO xvi, xix, 49, 61, 94, 125, 220, 231, 232, 241
MOBY 164, 166, 175, 239

O
On-the-Ontology comments 62
Ontology 239
OQL 160, 242
Outbound-interaction 63

P
PATH 178, 191, 201, 203, 242
PISE xxvii, 193, 195, 196, 204, 205, 206, 242
PO 32, 52, 218, 227, 242
PRECIS 158, 188
Process 41, 78, 176, 239
Protégé xxvi, xxvii, 81, 95, 96, 99, 100, 101, 104, 110, 111, 113, 114, 115, 116, 118, 120, 127, 132, 133, 142, 147, 234, 239, 243, 259, 262
PSI 147, 151, 158, 242

R
Relevant scenarios 239
RSBI xxi, xxv, xxvii, 80, 92, 93, 99, 105, 108, 129, 145, 146, 147, 148, 149, 150, 193, 242, 243, 244, 245, 263, 266

S
Scheduling 60, 61, 62
SOAP 164, 179, 242
SQL 160, 162, 164, 175, 185, 202, 242
SRS 158, 160, 163, 164, 165, 178, 179, 183, 186, 188, 242
SW 44, 49, 50, 57, 73, 75, 87, 88, 119, 222, 239, 242

T
TAMBIS 171, 172, 183, 190, 221, 227, 242
Task 35, 178, 239, 242
Task ontologies 35, 239
TAVERNA 197, 204, 211, 240
Technique 240
Terminology extraction 85, 86, 132, 240
Text mining 240
Text2ONTO 134, 240
The Bernaras methodology 36, 40
The DILIGENT methodology 36
The Enterprise Methodology 36
The METHONTOLOGY methodology 36, 41

U
UNIX 175, 240

W
W2H 176, 177, 178, 183, 191, 196, 215, 240
W3H 176, 177, 178, 179, 196, 240
wiki pages 61, 62
WIT 180, 192
Workflow 177, 196, 205, 240

X
XML 99, 154, 156, 162, 164, 173, 174, 176, 177, 178, 183, 185, 190, 195, 202, 204, 205, 206, 212, 214, 242
Xpath 173, 242
XQL 173, 202, 242