
DEVELOPING ONTOLOGIES IN THE BIOLOGICAL DOMAIN

A thesis submitted for the degree of Doctor of Philosophy at The University of Queensland, Institute for Molecular Bioscience

Alexander García Castro

September 2007

STATEMENT OF ORIGINALITY

I declare that the work presented in this thesis is, to the best of my knowledge and
belief, original and my own work, except as acknowledged in the text. The material (presented
as my own) has not been submitted previously, either in whole or in part, for a degree at this
or any other institution.

Alexander García Castro

STATEMENT OF CONTRIBUTION OF OTHERS

In those cases in which the work presented in this thesis was the product of collaborative efforts, I declare that my contribution was substantial and prominent, involving the development of original ideas as well as the definition and implementation of subsequent work. Detailed information about the participation of other researchers in parts of this thesis is provided in the section "Author's contributions" at the beginning of each chapter.

Mark A. Ragan Alexander García Castro



UNIVERSITY OF QUEENSLAND

ABSTRACT

Developing Ontologies In The Biological Domain

by Alexander Garcia Castro

Chairperson of Supervisory Committee: Professor Mark Ragan

The development of "omics" technologies and their application in the biological sciences have increased the need for an integrated view of bio-related information. The flood of information, together with the availability of new technologies, has made it more necessary than ever for researchers to share resources and join efforts in order to understand the function of genes, proteins and biological systems in general. Integrating biological information has been addressed mainly from a syntactic perspective. However, as we enter the post-genomic era, integration has acquired a meaning more related to the capacity for inference (finding hidden information) and shareability in large web-based information systems. Ontologies play a central role when addressing both syntactic and semantic aspects of information integration.

The purpose of this research has been to investigate how the biological community could develop these highly needed ontologies in a way that ensures both maintainability and usability. Although the need for ontologies, as well as the benefits of having them, is obvious, it has proven difficult for the biological community not only to develop them but also to use them effectively. Why? How should they be developed so that they are maintainable and usable by existing and novel information systems? A feasible methodology, elucidated from careful study while developing biological ontologies, is proposed. Methodological extensions gathered from the acquired experience are also presented. Throughout the chapters of this thesis diverse integrative approaches have also been analysed from different perspectives; a workflow-based solution to the integration of analytical tools was consequently proposed. This made it possible to better understand the need for well-defined semantics in biological information systems, as well as the importance of a thoughtful
understanding of the relationship between the semantic structure and the syntactic scaffold
that should ultimately host the former.

The role of communities in the construction of biological ontologies, as well as the argumentative structure that arises during their development and maintenance, has been extensively studied in this thesis. What is the role of the domain expert when developing ontologies within the biological domain? Different scenarios in which ontologies were developed have been studied in order to answer this question. The relationship between domain experts and knowledge engineers was analysed during the development of loosely centralised ontologies. As a consequence of these direct experiences developing ontologies, a viable use for concept maps supporting collaboration and annotation was anticipated; consequent software developments are also part of this investigation.

From this investigation several conclusions have been drawn; one of particular significance is the relevance of collaboration between two asymmetric, yet not antagonistic, communities: computer scientists and biologists may work and achieve results in different ways, yet both communities hold valuable information that could be of mutual benefit. Within the context of biological ontologies, "Romeo and Juliet" proved to be an apt metaphor that illustrates not only the importance of this collaboration, but also how we may avoid heading towards "One Hundred Years of Solitude".

TABLE OF CONTENTS

TABLE OF CONTENTS .............................................................................................................I

LIST OF FIGURES ................................................................................................................ VII

LIST OF TABLES ..................................................................................................................... X

ACKNOWLEDGMENTS......................................................................................................... XI

INTRODUCTION ................................................................................................................. XIII

OVERVIEW ............................................................................................................................ XIII


WHAT IS AN ONTOLOGY?........................................................................................................ XIV
CONTROLLED VOCABULARIES AND ONTOLOGIES ...................................................................... XVI
WHY ONTOLOGIES?............................................................................................................... XVII
WHY COMMUNITIES? ........................................................................................................... XVIII
BRINGING IT ALL TOGETHER .................................................................................................... XX
RESEARCH PROBLEM ............................................................................................................. XXII
CONTRIBUTIONS OF THIS THESIS............................................................................................ XXIII
OUTLINE OF THIS THESIS.......................................................................................................XXIV
PUBLISHED PAPERS ..............................................................................................................XXVI
SOFTWARE DEVELOPED, INCLUDING ONTOLOGIES. ................................................................XXVII
REFERENCES ......................................................................................................................XXVII

1 CHAPTER I - COMMUNITIES AT THE MELTING POINT WHEN BUILDING ONTOLOGIES .................................................................................... 31

1.1 INTRODUCTION ........................................................................................................... 31


1.2 METHODS AND METHODOLOGIES FOR BUILDING ONTOLOGIES ........................................ 33
1.2.1 The Enterprise Methodology................................................................................... 36
1.2.2 The TOVE Methodology ......................................................................................... 38
1.2.3 The Bernaras methodology ..................................................................................... 40
1.2.4 The METHONTOLOGY methodology ..................................................................... 41
1.2.5 The SENSUS methodology...................................................................................... 43
1.2.6 DILIGENT ............................................................................................................ 44
1.3 WHERE IS THE MELTING POINT? ................................................................................... 45
1.3.1 Similarities between methodologies......................................................................... 46


1.3.2 Shortcoming of the methodologies........................................................................... 47


1.4 ACKNOWLEDGEMENTS ................................................................................................ 51
1.5 REFERENCES .............................................................................................................. 51

2 CHAPTER II - THE MELTING POINT, A METHODOLOGY FOR DEVELOPING ONTOLOGIES WITHIN DECENTRALISED SETTINGS ............................................. 56

2.1 INTRODUCTION ........................................................................................................... 56


2.2 TERMINOLOGICAL CONSIDERATIONS ............................................................................ 58
2.3 THE METHODOLOGY AND THE LIFE CYCLE .................................................................... 59
2.3.1 Documentation processes ....................................................................................... 60
2.3.1.1 Activities for documenting the management processes .................................................... 60
2.3.1.2 Documenting classes and properties ............................................................................... 61
2.3.2 Management processes .......................................................................................... 62
2.3.2.1 Scheduling ................................................................................................................... 62
2.3.2.2 Control ........................................................................................................................ 62
2.3.2.3 Inbound-interaction ...................................................................................................... 62
2.3.2.4 Outbound-interaction .................................................................................................... 63
2.3.2.5 Quality assurance ......................................................................................................... 63
2.3.3 Development-oriented processes............................................................................. 63
2.3.3.1 Feasibility study and milestones .................................................................................... 63
2.3.3.2 Activities for the conceptualisation ................................................................................ 63
2.3.3.2.1.1 Milestones, techniques and tasks for the ka and da activities............................. 66
2.3.3.3 Iterative Building of Ontology Models (IBOM). ............................................................. 66
2.3.3.3.1 Methods, Techniques and Milestones for the IBOM. ............................................... 66
2.3.3.4 Formalisation ............................................................................................................... 67
2.3.3.5 Evaluation.................................................................................................................... 67
2.3.3.5.1 Application-dependent evaluation .......................................................................... 68
2.3.3.5.2 Terminology evaluation. ....................................................................................... 68
2.3.3.5.3 Taxonomy evaluation............................................................................................ 68
2.3.3.6 A summary of the process. ............................................................................................ 69
2.4 AN INCREMENTAL EVOLUTIONARY SPIRAL MODEL OF TASKS, ACTIVITIES AND PROCESSES .................... 69
2.5 DISCUSSION................................................................................................................ 72
2.6 CONCLUSIONS........................................................................................................ 75
2.7 ACKNOWLEDGEMENTS ................................................................................................ 76
2.8 REFERENCES .............................................................................................................. 76


3 CHAPTER III - THE USE OF CONCEPT MAPS DURING KNOWLEDGE ELICITATION IN ONTOLOGY DEVELOPMENT PROCESSES ................................. 82

3.1 THE USE OF CONCEPT MAPS DURING KNOWLEDGE ELICITATION IN ONTOLOGY DEVELOPMENT PROCESSES – THE NUTRIGENOMICS USE CASE .......................................... 82

3.1.1 Background ........................................................................................................... 82


3.1.1.1 A survey of methodologies............................................................................................ 84
3.1.2 Methods ................................................................................................................ 88
3.1.2.1 General view of our methodology .................................................................................. 88
3.1.2.2 Scenarios and ontology development process ................................................................. 92
3.1.2.2.1 Identification of purpose, scope, competency questions and scenarios....................... 93
3.1.2.2.2 Identification of reusable and recyclable ontologies ................................................. 94
3.1.2.2.3 Domain analysis and knowledge acquisition ........................................................... 94
3.1.2.2.3.1 Attributes of the domain experts..................................................................... 95
3.1.2.2.3.2 The knowledge elicitation sessions ................................................................. 95
3.1.2.2.3.3 Representing conceptual queries .................................................................... 96
3.1.2.2.4 Iterative building of informal ontology models........................................................ 97
3.1.3 Future work........................................................................................................... 99
3.1.3.1 Formalisation ............................................................................................................... 99
3.1.3.2 Evaluation.................................................................................................................. 101
3.1.4 Discussion ........................................................................................................... 102
3.1.5 Conclusions ......................................................................................................... 104
3.1.6 Acknowledgements............................................................................................... 105
3.1.7 References ........................................................................................................... 106
3.2 THE USE OF CONCEPT MAPS FOR TWO ONTOLOGY DEVELOPMENTS: NUTRIGENOMICS, AND A
MANAGEMENT SYSTEM FOR GENEALOGIES. ................................................................. 110
3.2.1 Introduction......................................................................................................... 110
3.2.2 Methodology........................................................................................................ 112
3.2.3 CM plug-in for Protégé ........................................................................................ 113
3.2.4 Conclusions and future work. ............................................................................... 115
3.2.5 Acknowledgements............................................................................................... 115
3.2.6 References ........................................................................................................... 115

4 CHAPTER IV - COGNITIVE SUPPORT FOR AN ARGUMENTATIVE STRUCTURE DURING THE ONTOLOGY DEVELOPMENT PROCESS .......................................... 119

4.1 INTRODUCTION ......................................................................................................... 119


4.2 ARGUMENTATIVE STRUCTURE AND CMS .................................................................... 120


4.3 ARGUMENTATION VIA CMS ....................................................................................... 122
4.4 DISCUSSION AND CONCLUSIONS ................................................................................. 126
4.5 REFERENCES ............................................................................................................ 126

5 CHAPTER V -NARRATIVES AND BIOLOGICAL INVESTIGATIONS .................... 130

5.1 THE USE OF CONCEPT MAPS AND AUTOMATIC TERMINOLOGY EXTRACTION DURING THE
DEVELOPMENT OF A DOMAIN ONTOLOGY. LESSONS LEARNT. ........................................ 130
5.1.1 Introduction......................................................................................................... 130
5.1.2 Survey of methodologies....................................................................................... 131
5.1.3 General view of our methodology. ........................................................................ 133
5.1.4 Our scenario and development process ................................................................. 136
5.1.5 Results: GMS baseline ontology............................................................................ 137
5.1.6 Discussion and conclusions .................................................................................. 140
5.1.7 References ........................................................................................................... 142
5.2 A PROPOSED SEMANTIC FRAMEWORK FOR REPORTING OMICS INVESTIGATIONS. ............. 145
5.2.1 Introduction......................................................................................................... 145
5.2.2 Methodology........................................................................................................ 147
5.2.3 The RSBI Semantic Framework ............................................................................ 148
5.2.4 Conclusions and Future Directions ....................................................................... 149
5.2.5 References ........................................................................................................... 150

6 CHAPTER VI - INFORMATION INTEGRATION IN MOLECULAR BIOSCIENCE 154

6.1 OVERVIEW OF ISSUES AND TECHNOLOGIES.................................................................. 156


6.1.1 Data availability .................................................................................................. 157
6.1.2 Data quality......................................................................................................... 157
6.1.3 Standardisation ................................................................................................... 158
6.1.4 Language ............................................................................................................ 158
6.1.5 Access ................................................................................................................. 159
6.2 STRATEGIES FOR DATA INTEGRATION ......................................................................... 160
6.2.1 Platforms ............................................................................................................ 160
6.2.2 Developments ...................................................................................................... 163
6.2.2.1 Sequence Retrieval System (SRS)................................................................................ 163
6.2.2.2 GeneCards® .............................................................................................................. 165
6.2.2.3 Entrez........................................................................................................................ 165


6.2.2.4 Ensembl..................................................................................................................... 165


6.2.2.5 BioMOBY ................................................................................................................. 166
6.2.2.6 myGrid ...................................................................................................................... 167
6.2.2.7 Others........................................................................................................................ 167
6.3 SEMANTIC INTEGRATION OF INFORMATION IN MOLECULAR BIOSCIENCE ...................... 167
6.4 XML AS A DESCRIPTION OF DATA AND INFORMATION.................................................. 173
6.5 GRAPHICAL USER INTERFACES (GUIS) AS INTEGRATIVE ENVIRONMENTS...................... 175
6.6 METABOLIC PATHWAY DATABASES AS AN EXAMPLE OF INTEGRATION......................... 179
6.7 SUMMARY, CONCLUSIONS AND UNSOLVED PROBLEMS................................................ 183
6.8 ACKNOWLEDGMENTS................................................................................................ 187
6.9 REFERENCES ............................................................................................................ 187

7 CHAPTER VII - WORKFLOWS IN BIOINFORMATICS: META-ANALYSIS AND PROTOTYPE IMPLEMENTATION OF A WORKFLOW GENERATOR ...................... 195

7.1 BACKGROUND .......................................................................................................... 195


7.2 RESULTS .................................................................................................................. 198
7.2.1 Syntactic and algebraic components ..................................................................... 199
7.2.2 Workflow generation, an implementation .............................................................. 205
7.3 ARCHITECTURAL DETAILS......................................................................................... 206
7.4 SEMANTIC AND SYNTACTIC ISSUES ............................................................................. 207
7.5 DISCUSSION.............................................................................................................. 210
7.6 CONCLUSION ............................................................................................................ 214
7.7 ACKNOWLEDGEMENTS .............................................................................................. 214
7.8 REFERENCES ............................................................................................................ 215

8 CONCLUSIONS AND DISCUSSION............................................................................. 216

8.1 SUMMARY ............................................................................................................... 216


8.2 BIOLOGICAL INFORMATION SYSTEMS AND ONTOLOGIES. ............................................. 217
8.3 TOWARDS A SEMANTIC WEB IN BIOLOGY .................................................................... 220
8.4 DEVELOPING BIO-ONTOLOGIES AS A COMMUNITY EFFORT. ........................................... 224
8.5 REFERENCES ............................................................................................................ 226

9 FUTURE WORK ............................................................................................................ 228

9.1 BIO-ONTOLOGIES: THE MONTAGUES AND THE CAPULETS, ACT TWO, SCENE TWO: FROM
VERONA TO MACONDO VIA LA MANCHA. ................................................................... 228


9.1.1 Introduction......................................................................................................... 228


9.1.2 Some background information .............................................................................. 230
9.1.3 The Duels and the duets. ...................................................................................... 231
9.1.4 Marriage, Poison, and Macondo .......................................................................... 233
9.1.5 References ........................................................................................................... 234

APPENDIXES ........................................................................................................................ 236

GLOSSARY ............................................................................................................................ 236


ACRONYMS .......................................................................................................................... 240
APPENDIX 1 – RSBI ONTOLOGY ............................................................................................. 242
APPENDIX 2 – EXTRACTED TERMINOLOGY ............................................................................. 246
APPENDIX 3 – GMS BASELINE ONTOLOGY (VERSION 1) .......................................................... 258
APPENDIX 4 - GMS BASELINE ONTOLOGY (VERSION 2) ........................................................... 261
APPENDIX 5 – PROTOCOL DEFINITION FILE GENERATED BY G-PIPE ......................................... 266
INDEX .................................................................................................................................. 269


LIST OF FIGURES

Introduction - Figure 1. Controlled vocabularies and ontologies.................................................xvi

Chapter 1 - Figure 1. Uschold and King methodology........................................................................37

Chapter 1 - Figure 2. The TOVE methodology..............................................................................39

Chapter 1 - Figure 3. METHONTOLOGY ...................................................................................42

Chapter 1 - Figure 4. Similarities amongst methodologies. ............................................................47

Chapter 2 - Figure 1. Terminological relationships. ........................................................................58

Chapter 2 - Figure 2. Life cycle, processes, activities, and view of the methodology..................60

Chapter 2 - Figure 3. An incremental evolutionary spiral model of tasks, activities and processes. ................................................................ 71

Chapter 2 - Figure 4. Adding a term. ................................................................................................74

Chapter 3 - Figure 1. View of a concept map. .................................................................................89

Chapter 3 - Figure 2. Steps (1-6) and milestones (boxes). ..............................................................89

Chapter 3 - Figure 3. CMs as means to structure a conceptual query. ..........................................97

Chapter 3 - Figure 4. Elicitation of Is_a, whole/part-of, and classes............................................98

Chapter 3 - Figure 5. Methodology, milestones, and phases........................................................112

Chapter 4 - Figure 1. The major concepts of the argumentation ontology and their relations.121


Chapter 4 - Figure 2. A simplification of the argumentative structure presented by Tempich et al. ................................................................ 123

Chapter 4 - Figure 3. Biomaterial from MGED ............................................................................125

Chapter 5 - Figure 1. A schematic representation of our process, extending GM. ...................134

Chapter 5 - Figure 2. Classes, instances, and relationships gathered by bringing together extracted terms and previously built ontological models. ............................................... 137

Chapter 5 - Figure 3. Narrative, as seen from those concept maps and ontology models
domain experts were building. ...........................................................................................139

Chapter 5 - Figure 4. Baseline ontology..........................................................................................140

Chapter 5 - Figure 5. Our methodology. ........................................................................................148

Chapter 5 - Figure 6. A view of a section of the RSBI ontology. ................................................149

Chapter 6 - Figure 1. Schematic representation of the architecture of TAMBIS. .....................172

Chapter 6 - Figure 2. Valine biosynthetic pathway in Escherichia coli .......................................182

Chapter 7 - Figure 1. Syntactic components describing bioinformatics analysis workflows. ...199

Chapter 7 - Figure 2. Syntactic components and algebraic operators. ........................................200

Chapter 7 - Figure 3. Phylogenetic analysis workflow ..................................................................204

Chapter 7 - Figure 4. Case workflow ..............................................................................................205

Chapter 7 - Figure 5. G-PIPE Architecture. ..................................................................................207

Chapter 7 - Figure 6. Designing SNPs............................................................................................208

Chapter 7 - Figure 7. Mapping the RSBI ........................................................................................209


Chapter 7 - Figure 8. G-PIPE..........................................................................................................213

Appendix 1 - Figure 1. Identified properties for the RSBI ontology. .........................................243

Appendix 1 - Figure 2. RSBI ontology ...........................................................................................244

Appendix 1 - Figure 3. A concept map for RSBI ontology. ........................................................245

Appendix 3 - Figure 1. A portion of the first version of the GMS ontology, Germplasm. .....258

Appendix 3 - Figure 2. The Germplasm Method section of the first version of the GMS
ontology. ...............................................................................................................................259

Appendix 3 - Figure 3. The Germplasm Identifier section of the first version of the GMS
ontology. ...............................................................................................................................260

Appendix 4 - Figure 1. Identified properties for the RSBI ontology. .........................................262

Appendix 4 - Figure 2. Genetic Constitution, as understood by the GMS ontology................263

Appendix 4 - Figure 3. Germplasm Breeding Stock, a portion of the second version of the
GMS ontology......................................................................................................................263

Appendix 4 - Figure 4. Naming convention according to the second version of the GMS
ontology. ...............................................................................................................................263

Appendix 4 - Figure 5. Plant Breeding Method according to the second version of the GMS
ontology. ...............................................................................................................................264

Appendix 4 - Figure 6. PlantPropagationProcesses according to the second version of the GMS ontology. ................................................................ 264

Appendix 4 - Figure 7. Some of the parent classes in the RSBI ontology..................................265


LIST OF TABLES

Chapter 1 - Table 1. Summary of methodologies............................................................................46

Chapter 2 - Table 1. A summary of the development process. .....................................................69

Chapter 2 - Table 2. Methodology compliance with IEEE ...........................................................73

Chapter 3 - Table 1. Comparison of methodologies.......................................................................86

Chapter 3 - Table 2. Example of the structure of linguistic definitions. .......................................91

Chapter 3 - Table 3. Examples of competency questions ..............................................................93

Chapter 5 - Table 1. Comparison of methodologies.....................................................................132

Chapter 6 - Table 1. Some existing developments in database integration in molecular biology .................... 164

Chapter 6 - Table 2. Some of the most commonly used Graphical User Interfaces (GUIs) for
EMBOSS and GCG® ........................................................................................................176

Chapter 7 - Table 1. Algebraic operators........................................................................................201

Chapter 7 - Table 2. Operator specifications. ................................................................................203


ACKNOWLEDGMENTS

“Acknowledgments” is usually the part of the thesis in which the author mentions those who have participated in the development and evolution of the research work. Expressing gratitude to all those who had any kind of participation in the development of this work is, in my opinion, mandatory. I certainly do thank all of them for their understanding, consideration, patience, and constant support throughout these almost four years. However, it is usually the case that some people acquired a more prominent role, and I am reserving this section to express my gratitude for their actions in a special way.

Firstly, I thank my mother, without whose example and constant support I would never have found the courage to go through the whole doctoral process. I thank my sister, who taught me an important lesson that helped me to understand the value of family at those times when I may not fully have appreciated it. My deepest gratitude goes to my entire family, for those obvious things, but most of all for their unconditional love.

For having taught me how important it is to have a non-dogmatic, conciliatory attitude, as well as respect and trust for the written word, I would like to express my gratitude to my supervisor, Mark Ragan. The present work would never have been possible without an understanding of those factors that make our work as knowledge engineers so interesting; human factors are also those that make it so hard to represent and formalise knowledge. Is there any piece of knowledge that exists independently of a human being? In my opinion the answer is a straight no. For having helped me to understand this in particular, I would especially like to thank Susana Sansone and Sue Roberthone. For advice on spelling and grammar, I thank Kieran O'Neill.

Robert Stevens, Mark Wilkinson, Limsoong Wong, Vladimir Brusic and Kaye Basford are persons for whom I feel deep gratitude for having understood the importance of my work, but most of all for having had trust in me.


Finally, I am reserving my words to say “thanks”; not so much for the knowledge that we all shared throughout these years, but for the humanity that allowed us to relate to each other as human beings. Fortunately this research work proved to have a direct impact not only on my understanding of the domain of knowledge, but also, and more importantly, on the importance of those human factors within all of us.


INTRODUCTION

OVERVIEW

High-throughput techniques have allowed the production of massive amounts of data in modern biology. Sequencing full genomes is now part of a bigger task, that of identifying the functional regions of genomes, or functional genomics. As modern biology becomes more and more dependent on information technology, it also poses new challenges to computer scientists. The integrative view that functional genomics demands, as it relates information from different sources, may not be fully supported by today’s technology. Answering users’ queries by providing them with an integrated, contextualised view has long been considered one of the greatest challenges in natural language processing and information retrieval [1].

In order to integrate heterogeneous information effectively, a number of approaches applied to the bio-domain have been studied. Some of these are analysed in the first chapter of this thesis. Syntactic issues have largely been resolved as different standardisation efforts have been launched. However, modern biology still lacks the integrated view that is required. Semantic issues have been identified as extremely important, and consequently the biological community has organised a number of consortia that have taken care of developing biological ontologies.

This introductory portion of the thesis is organised as follows. Initially, a brief overview is given. The main components and concepts of this thesis (ontologies and communities) are presented on pages XVII and XIX; these sections illustrate the broad problem-space within which this research is situated. The next section presents the thesis outline; then, page XXIII presents the hypotheses and research questions addressed by this investigation. A list of the publications as well as the software products that have arisen from this thesis is given in the last section of this introductory chapter.


WHAT IS AN ONTOLOGY?

Definitions for the word ontology vary depending on the field; computer scientists tend
to understand the term in a more utilitarian way, whereas philosophers tend to have a more
holistic understanding of it. The term “ontology” (Greek on=being, logos=to reason) has its
roots in philosophy; it has traditionally been defined as the philosophical study of “what
exists”: the study of the kinds of entities in reality, and the relationships that these entities
bear to one another [2, 3]. Guarino [4] beautifully summarises the meaning of ontology as
being “a branch of metaphysics which deals with the nature and organization of reality”. The meaning of
the word ontology in philosophy is “the metaphysical study of the nature of being and existence” [5].
While within the philosophy community there is consensus on the definition for ontology,
there is still some dispute amongst members of the artificial intelligence (AI) community. This
is partly due to their goal, which is not always to study the nature of “what exists” but how to
classify, manage and organise information.

For those within the AI community the context in which the ontology is going to be
used largely influences the definition of the term. At a glance, an ontology represents a view
of the world with the set of concepts and relations amongst them, all of these defined with
respect to the domain of interest. For instance, John F. Sowa [6], defines the term as:

“The subject of ontology is the study of the categories of things that exist or may exist in some
domain. The product of such a study, called an ontology, is a catalog of the types of things that are
assumed to exist in a domain of interest D from the perspective of a person who uses a language L
for the purpose of talking about D.”
Computer scientists tend to view ontologies as terminologies with associated axioms and definitions, structured so as to support software applications [7], or, as explained in more detail by Gruber:

“vocabularies of representational terms, classes, relations, functions and object constants with
agreed-upon definitions in the form of human-readable text and machine-enforceable, declarative constraints on
their well-formed use.” [8]

Even more succinctly, Gruber defines: “An ontology is a formal specification of a conceptualization”.


In order to understand this definition, Gruber et al. as well as Studer et al. agree on the following terminology [9, 10]:

• “Conceptualization” is an abstract, simplified model of concepts in the world, usually limited to a particular domain of interest.
• “Explicit” indicates that the types of domain concepts and the constraints imposed on their use are explicitly defined.
• “Formal” means that the ontology specification must be machine readable.
Others, such as Neches [11], by contrast, consider an ontology to be:

“The definition of the basic terms and relations comprising the vocabulary of a topic area, as well as
the rules for combining terms and relations to define extensions to the vocabulary.”
Depending on the understanding of conceptualisation and context, there are different interpretations of the term “ontology”. Independently of the understanding of these terms, it could be said that every ontology model for knowledge representation is either explicitly or implicitly committed to some conceptualisation. As this thesis’s context is that of information systems, the definition of an ontology that best serves our purpose is:

“An ontology is a not-necessarily-complete, formal classification of types of information, structured by relationships defined by the vocabulary of the domain of knowledge and by the canonical formulations of its theories.”
This definition is heavily influenced by Guarino and Smith. It agrees with Guarino in that an ontology is, possibly, an incomplete agreement about a conceptualisation and not a specification of the conceptualisation. Ontologies should therefore be understood as agreements amongst people within a community sharing interest in a common domain. By “incomplete” it is understood that the classification of types of information should be left open for interoperability purposes. By “formal” it is meant that the ontology specification can easily be translated into machine-readable code, as the ontology should support inference processes within those information systems using it. It should be noted, however, that the latter is not mandatory when defining ontologies on an abstract level.
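
To make concrete the sense in which a formal specification supports inference, consider a minimal description-logic fragment (the classes and the relation here are hypothetical illustrations, not drawn from any existing ontology):

    Mitochondrion ⊑ Organelle
    Organelle ⊑ ∃ partOf.Cell

From these two axioms a reasoner can derive, although it is nowhere explicitly stated, that Mitochondrion ⊑ ∃ partOf.Cell. It is this kind of derivation that a machine-readable ontology is meant to make available to the information systems using it.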


CONTROLLED VOCABULARIES AND ONTOLOGIES

Controlled vocabularies (CVs) are taxonomies of words built upon an is-a hierarchy; as such they are not meant to support any reasoning process. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes. CVs may be part of ontologies when they instantiate classes. As the process of developing ontologies moves forward, the hierarchy is formalised not only by means of is-a and part-of; other relations are also used, along with logical operators and description logic constructs. Figure 1 illustrates the important role CVs play within the process of developing ontologies.
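
The contrast can be sketched in Python (a minimal sketch; the terms and relations are hypothetical and not drawn from GO, MO, or any published vocabulary):

    # A controlled vocabulary per se: terms arranged only in an is-a hierarchy.
    is_a = {
        "enzyme": "protein",
        "protein": "macromolecule",
    }

    # An ontology layers further typed relations (and, in practice, logical
    # constraints) on top of such a hierarchy.
    part_of = {
        "protein": "cell",
    }

    def inferred_part_of(term):
        """A toy inference: part-of statements propagate down the is-a
        hierarchy, so an enzyme, being a protein, is also part of a cell."""
        found = set()
        while term is not None:
            if term in part_of:
                found.add(part_of[term])
            term = is_a.get(term)
        return found

    print(inferred_part_of("enzyme"))  # prints {'cell'}

The is_a dictionary alone supports no such derivation about entities; only once additional relations are declared can statements that were never explicitly asserted be inferred.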

Within the biological sciences, ontologies have been understood to be highly related to controlled vocabularies. The Gene Ontology (GO) [12] as well as the Microarray Gene Expression Data (MGED) ontology (henceforth MO) [13, 14] have been used primarily to annotate and unify data across biological databases. These controlled vocabularies have evolved over time; the hierarchies upon which they have been built have used two kinds of properties, is-a and part-of. Thus an ontology is not simply a controlled vocabulary, nor merely a dictionary of terms.

Introduction - Figure 1. Controlled vocabularies and ontologies.


Independently of the methodology for developing ontologies, controlled vocabularies are used at different stages during the development of the ontology.
For instance, some methodologies suggest the use of lists of words from the beginning as a means to facilitate the identification of classes [15]. Others, such as Good et al. [16], use lists of words to frame the knowledge elicitation process when developing the ontology. A more in-depth analysis of methodologies for developing ontologies is presented in chapter one.

WHY ONTOLOGIES?

Several authors have extensively discussed the “whys” of ontologies. Within the computer science community these reasons have been summarised as follows [17, 18].

• To clarify and share the structure of knowledge


Different information systems might follow different business logics. However, they are considered interoperable if they can exchange data and information. Such heterogeneous applications are able to share information only if there is an agreed common vocabulary to describe the items these information systems are meant to manage (a sketch at the end of this section illustrates this point).

• To allow reusing knowledge


This is particularly evident within the biological domain. As large domains of
knowledge are highly fragmented, communities of experts have developed their own
ontologies that should in principle allow others to reuse them whenever needed. Thus the
ability to integrate and reuse an existing ontology, without needing to rebuild it, provides a
great benefit. Although reuse is accepted as one of the major advantages of using ontologies, it is not clear how a merger or integration of ontologies should be carried out. Ontology interoperability has been recognised as a challenging and as yet unachieved task. No current ontology-building methodology really addresses this issue or deals with it explicitly. There is no consensus on the methods used for merging and integration; these remain unclear and more of an art than a methodology [19]. These issues are still part of ongoing research in the area [20, 21].

• To make the assumptions used to create the domain model explicit


• To allow a clear differentiation between domain knowledge and operational
knowledge

c u-tr a c k c u-tr a c k

Operational knowledge should here be understood as knowledge arising from everyday practice. Domain knowledge, by contrast, is the kind of knowledge that allows for the creation and generation of complementary discourse.
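
The interoperability point made at the beginning of this list can be sketched in Python (the systems, field names and shared vocabulary are all hypothetical):

    # Two systems with different local schemas exchange records by mapping
    # their local field names onto an agreed, shared vocabulary.
    SHARED_VOCABULARY = {"gene_symbol", "organism", "tissue"}

    # Each system declares how its local fields map onto the shared terms.
    system_a_mapping = {"gene": "gene_symbol", "species": "organism", "organ": "tissue"}
    system_b_mapping = {"symbol": "gene_symbol", "taxon": "organism", "sample_site": "tissue"}

    def to_shared(record, mapping):
        """Translate a local record into the shared vocabulary."""
        return {mapping[field]: value for field, value in record.items() if field in mapping}

    record_from_a = {"gene": "TP53", "species": "Homo sapiens", "organ": "liver"}
    shared_record = to_shared(record_from_a, system_a_mapping)
    assert set(shared_record) <= SHARED_VOCABULARY  # system B can now interpret it

Neither system needs to know the other's internal schema; the agreed vocabulary, rather than the local business logic, is what carries the shared meaning.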

WHY COMMUNITIES?

A particularly recurrent and important term throughout this thesis is “community”, and more broadly “community of practice”. Wenger defines communities of practice as follows:

“Communities of practice are the basic building blocks of a social learning system because they
are the social ‘containers’ of the competences that make up such a system… Communities of practice
define competence by combining three elements. First, members are bound together by their
collectively developed understanding of what their community is about and they hold each other
accountable to this sense of joint enterprise. To be competent is to understand the enterprise well
enough to be able to contribute to it. Second, members build their community through mutual
engagement. They interact with one another, establishing norms and relationships of mutuality that
reflect these interactions. To be competent is to be able to engage with the community and be trusted
as a partner in these interactions. Third, communities of practice have produced a shared repertoire of
communal resources—language, routines, sensibilities, artefacts, tools, stories, styles, etc. To be
competent is to have access to this repertoire and be able to use it appropriately.” [22-24].

Interestingly, Wenger emphasises the “shared repertoire of resources” such as “language, techniques, artifacts”; this part of his definition has a remarkable parallel within an apparently unrelated field, knowledge management.

Knowledge is defined by Davenport and Prusak as:

“Knowledge is a mix of framed experience, values, contextual information, expert insight and
grounded intuition that provides an environment and framework for evaluating and incorporating new
experiences and information. It originates and is applied in the minds of knowers. In organisations, it often
becomes embedded not only in documents or repositories but also in organisational routines, processes, practices
and norms.”[25]

c u-tr a c k c u-tr a c k

Davenport and Prusak place emphasis on “organisational routines, processes, practices and norms”. The repertoires shared between these two definitions make it clear that communities of practice are brought together by their intersecting knowledge. Ontologies in bioinformatics have been developed by communities of practice with a common need. For instance, the MGED Society, an international organisation of biologists, computer scientists and data analysts that aims to facilitate the sharing of microarray data generated by functional genomics and proteomics experiments [14], develops and maintains MO. It initially focused on establishing standards for microarray data annotation and exchange, facilitating the creation of microarray databases and related software implementing these standards [14]. This does not mean that other omics technologies are not currently being considered.

Annotating microarray experiments has been made possible by means of MO, as it provides a controlled vocabulary for describing microarray experiments. MO is in principle independent of any software development using it. As microarray investigations can be interpreted only in the context of the experimental conditions under which the samples used in each hybridisation were generated [26], MO makes it possible not only to share results but also to better understand the context in which they were generated. Within this context the communities of practice are brought together by a common need and interest (e.g. the use of a particular kind of technology), as well as by “organizational routines, processes, practices and norms”. Their interaction takes place mostly via electronic means such as wiki pages, email, concurrent version systems (CVS), and phone conferences. As the goal in the life sciences is to make information available and exchangeable in the form of virtual knowledge, a more suitable and complementary definition for communities of practice is:

“Virtual communities of practice are communities of practice (and the social ‘places’ that they
collectively create) that rely primarily (though not necessarily exclusively) on networked
communication media to communicate, connect, and carry out community activities”
Biological communities are indeed communities of practice: not only do they have their own ways of interacting (e.g. papers, conferences) but also, and more importantly, no matter how fragmented they are, they share a common vocabulary. Electronic means have facilitated not only the
interaction but also the fragmentation of this community; interestingly, they have also facilitated standardisation across the entire domain of knowledge by making explicit the need for a holistic approach. For instance, data entries in GenBank [27] may encode a human Mendelian disease for which there are both metabolic pathways and reported single nucleotide polymorphisms (SNPs) of interest. Such information may be scattered across GeneCards [28], BioCyc [29] and possibly other databases. Despite the divisions and specialisations of the field, the systems studied by biological sub-communities interact in reality, and it is precisely because of this that the community needs ontologies. As the knowledge is not owned by any particular group, this knowledge should be captured and represented from and by the community [15, 30].

BRINGING IT ALL TOGETHER

Communities have been developing ontologies in order to describe the entities that we study (genes, proteins, DNA-binding factors) as well as biomaterials and technology-dependent artefacts. The Gene Ontology [31] is an example of an ontology that aims to describe the things biologists study, whereas the MGED ontology [32] may be seen as one that aims to describe the process by which we study those “things”. It is by using descriptors provided by both ontologies that accurate representations of research endeavours become possible. At any given time a toxicological study may use part of the liver of a rat in order to profile the response of genes to a certain perturbation. In order to describe such an effort, different ontologies are needed: some to describe the biology of the research endeavour (e.g. cells, cellular compartments, animal, organism) and some to describe the techniques used (e.g. microarrays, proteomics, PCR, chromatography).
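
As an illustration of how descriptors from both kinds of ontology combine, the study just described might be annotated along the following lines (a sketch in Python; the descriptor names and values are illustrative placeholders, not actual GO or MO terms):

    # Hypothetical annotation of the toxicological study described above.
    study_annotation = {
        "biology": {    # the kind of descriptors GO aims to provide
            "organism": "Rattus norvegicus",
            "tissue": "liver",
            "process": "response to chemical stimulus",
        },
        "technique": {  # the kind of descriptors MO aims to provide
            "assay": "transcription profiling by microarray",
            "perturbation": "compound treatment",
        },
    }

Neither block alone describes the study adequately; it is the combination of “what was studied” and “how it was studied” that makes the representation accurate.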

Some of the required ontologies exist; however, they are not always sufficient, nor are they used for annotation in all biological investigations. Differing views on the same issue may mean that the mechanisms for involving the biological community in the development of its own ontologies should be improved. The lack of methodologies, and of software tools supporting these methodologies, is a bottleneck in the development of biological ontologies.

This research has analyzed the biological community as well as the intended use of
some of those ontologies currently under development. Different scenarios arise from three
ontology developments in which the author took part. These cases allowed a careful and exhaustive study of the dynamics and features these kinds of developments have. The
nutrigenomics community permitted the author to understand the behavior of communities
when developing ontologies, as well as the significance of groupware technology for
developing loosely centralised ontologies. From this initial experience it was also possible to
identify and illustrate how concept maps could be used to support knowledge elicitation
during the development of ontologies. A methodology describing how biological ontologies
could be better developed was consequently proposed.

Two other scenarios were studied. The Reporting Structure for Biological
Investigations (RSBI) case was one that aimed to define the structure and semantics for
reporting a biological investigation. Influenced by MIAME, the RSBI working group
addressed the issue of investigations in a broader sense; the context was not limited to
describing a microarray experiment, but any biological experiment. This experience was
interesting not only because of the involvement of three different communities (toxicogenomics, environmental genomics, and nutrigenomics) but most importantly because it made clear how difficult it is to describe an investigation. How could technology be classified in a way that inference is possible within any given Laboratory Information Management System? For a high-level container such as investigation, what minimal descriptors should accompany it in order to provide an insightful, useful and comprehensive view of the whole investigation?

Finally, another ontology was also supported during its development, the Genealogy
Management System (GMS) Ontology. The GMS Ontology provided us with a fertile ground
in which it was possible to extend the methodology proposed from the nutrigenomics case.
Conceptual maps facilitate knowledge elicitation and sharing, but it is not easy to frame the view of the domain experts: sometimes they tended to be quite specific, and at other times quite general. By combining terminology extraction and conceptual mapping it became
possible to constrain the elicitation exercises with domain experts; thus making it possible to
capture classes and differentiate them from instances at early stages of the elicitation process.
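
As an illustration only, the extraction step can start as crudely as ranking frequent tokens in domain text, with the candidate list then refined during concept-mapping sessions. The sketch below is a minimal frequency-based version; the stop-word list and the corpus are placeholders, not the tooling actually used in the GMS work.

# A minimal sketch of terminology extraction: rank candidate terms in a
# corpus of domain text by frequency, so that elicitation sessions with
# experts can start from a shared, constrained vocabulary.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "for", "with",
              "was", "from", "each"}

def candidate_terms(corpus, top_n=10):
    """Return the most frequent non-stop-word tokens as term candidates."""
    tokens = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return counts.most_common(top_n)

corpus = ("The germplasm sample was derived from a landrace accession; "
          "each accession records pedigree and germplasm passport data.")
print(candidate_terms(corpus))  # e.g. [('germplasm', 2), ('accession', 2), ...]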

The three previously mentioned ontology developments permitted the study of existing
software from the perspective of both users and knowledge engineers in decentralised
settings. Also from these experiences it was possible to identify the argumentative structure
that takes place when developing ontologies.

RESEARCH PROBLEM

This thesis has laid down a series of questions not previously considered when studying biological ontologies: how to develop them, and their uses when integrating information.
Throughout this thesis, integration of information in bioinformatics is studied mainly from
the semantic perspective, placing particular attention on the actual process by which the
ontology is being developed. Despite this emphasis, other aspects related to integration of
information have also been considered. The overall research problem in this thesis is:

“The participation of communities in the development of biological ontologies poses challenges not previously considered by existing methodologies for developing ontologies.”

To address this problem, the author presents a series of hypotheses and questions.
These seek to explore and analyze methodological and practical challenges in developing
ontologies within the bioscience community.

1. If ontologies are to be developed by communities then the ontology development life cycle should be better understood within this context.

2. When eliciting knowledge for developing an ontology on a community basis there is an increasing need not only to support the process as such, but also to facilitate the communication, structure and exchange of information amongst the participants of the process.

3. If biological investigations bring together a mix of disciplines then the descriptions for such research endeavours should encompass all those different views.

4. How should a well-engineered methodology facilitate the development of ontologies within communities of practice? What methodology should be used?

By answering these questions this doctoral work addresses the ontology development, information integration and study description problems in modern biology, and proposes different methods to facilitate information integration across various information systems.

Initial chapters focus on providing answers to the main research question in this thesis.
By investigating existing methodologies and analyzing them within the context of biological
communities of practice it was possible to propose a methodology and understand the life
cycle of these ontologies. Those experiences that allowed the author to gather information,
test and improve the methodology are presented in chapters 3, 4, and 5. As it was important for the successful conclusion of this thesis to constantly test those research outcomes, a simple yet quite illustrative scenario was laid down. This scenario (chapter 7) allowed the author to study not just the development of ontologies, but also the use of ontologies by software layers within an integrative environment that also had a community of users.

CONTRIBUTIONS OF THIS THESIS

This doctoral work makes four main contributions:

• Improving our understanding of the role of ontologies in the domain of bioinformatics and laboratory information management,
• Developing a way to engineer ontologies within this domain,
• Developing several ontologies of substantial complexity and of practical use in real
applications, and
• Developing a workflow system of substantial sophistication for which syntactical
and semantic aspects are easily observable and manageable.
Throughout the development of this thesis work special emphasis was placed on studying cases for which this work could have a direct impact. The search for and interest in real scenarios allowed me to collaborate extensively with other groups such as the EBI (European Bioinformatics Institute), the Pasteur Institute, CGIAR (Consultative Group on International Agricultural Research) and the ACPFG (Australian Centre for Plant Functional Genomics). It also gave us an
additional reason to publish our work; this active communication, via different means, enabled
us to receive relatively rapid feedback regarding our work. The combination of collaborations
and publications enriched this work, but more importantly permitted the rapid use and testing
of those intermediary products of this work. A list of research outcomes and outputs of this
thesis is given below.

OUTLINE OF THIS THESIS

This thesis begins by addressing the problem of information integration, and examines
the syntactic and semantic factors that should be taken care of when describing experiments.
A particular task was always considered crucial throughout the development of this thesis: intelligent information retrieval.

Attention was focused on semantic issues associated with information integration: How
could the reproducibility of biological experiments be ensured? How could experiments be
effectively shared? How could ontologies be built while ensuring the participation of a wider
community? Special attention was given to the involvement of the community when
developing ontologies. Such agreement is critical as it, to some extent, assures the use and, to some degree, the correctness of the ontology.

This thesis is organised into a series of chapters that address some aspects related to
semantic issues in integration of information in bioinformatics. Chapter I, “Communities at the
melting point when building ontologies”, is a critical analysis of those existing methodologies for
developing ontologies; not only are existing methodologies presented, but it is also analyzed how these methodologies could be used within the biological domain, as well as which issues should be considered in order to propose a new methodology. Chapter II, “The melting point, a methodology for developing ontologies within decentralised settings”, presents a novel methodology that has been engineered upon cases extracted from real scenarios. In principle this methodology
may be used not only within the bio domain but also in other contexts. Chapter III, “The use
of concept maps during knowledge elicitation in ontology development processes”, presents
the development of biological ontologies, factors associated with this process as well as a
process that was followed. Chapter IV presents how cognitive support may be provided by
means of concept maps during the argumentative process that takes place when developing
ontologies. For this particular task we used two unrelated scenarios: Reporting Structure for
Biological Investigations (RSBI) and Genealogy Management Systems (GMS). It is important to note that both Chapters III and IV proved to be a fertile playground in which the methodology presented in Chapter II was engineered; these two experiences were, for the development of this thesis, experiments from which valuable information was gathered. Chapter V presents a literature review in which different approaches to integration of
molecular data, as well as analytical tools are analyzed; this chapter aims to facilitate the
transition into chapter VI in which a different scenario, extracted mostly from in silico biology,
is studied from both syntactic and semantic perspectives. Interestingly, during the
development of this part of the doctoral work it became possible to understand better how
syntactically based solutions, despite being workable tools, still lack some important features
that only the correct use of ontologies could provide. As in silico experiments are also valid
examples of biological investigations, another important outcome from Chapter VI was the actual practical use of the ontology proposed in Chapter IV.

Discussions, conclusions and future work are presented in the remaining chapters of
this thesis. In part, this was done by using literary analogies, mostly with Shakespeare’s masterpiece “Romeo and Juliet” and also with “One Hundred Years of Solitude” by Garcia Marquez. These analogies seemed ideal because they illustrate what, in my opinion, constitutes a central problem in the development of biological ontologies, and more broadly in the development of information systems, namely interdisciplinary work. The relationship between bioinformatics and the semantic web is used as an introduction to the rest of the discussions and conclusions. Chapter VII presents some future work, using the literary analogies mentioned here.

PUBLISHED PAPERS

1. Garcia Castro A, Sansone AS, Rocca-Serra P, Taylor C, Ragan MA: The use of
conceptual maps for two ontology developments: nutrigenomics, and a
management system for genealogies. In: 8th Intl Protégé Conference Protégé: 2005;
Madrid, Spain; 2005: 59-62.

2. Garcia Castro A, Chen Y-PP, Ragan MA: Information integration in molecular bioscience: a review. Applied Bioinformatics 2005, 4(3):157-173.

3. Garcia Castro A, Chen Y-PP, Ragan MA: Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator. BMC Bioinformatics 2005, 6:87.

4. Garcia Castro A, Thoraval S, Garcia Castro L-J, Ragan MA: G-PIPE, an implementation. In: NETTAB: 2005; Naples, Italy; 2005.

5. Garcia Castro A, Sansone AS, Taylor CF, Rocca-Serra P: A conceptual framework for
describing biological investigations. In: NETTAB: 2005; Naples, Italy; 2005.

6. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.

7. Garcia Castro A: Cognitive support for an argumentative structure during the ontology development process. In: 9th Intl Protégé Conference: July 2006; Stanford, CA, USA; 2006.

8. Garcia Castro A: The Montagues and the Capulets, act two, scene two: from
Verona to Macondo via-La Mancha. Submitted for publication.

9. Fostel J, Choi D, Zwickl C, Morrison N, Rashid A, Hasan A, Bao W, Richard A, Tong W, Garcia Castro A, Bushel P et al: Chemical effects in biological systems - data dictionary (CEBS-DD): a compendium of terms for the capture and integration of biological study design description, conventional phenotypes and omics data. Toxicological Sciences 2005, 88(2):585-601.

10. O'Neill K, Schwegmann A, Jimenez R, Jacobson D, Garcia Castro A: OntoDas - integrating DAS with ontology-based queries. In: Bio-ontologies SIG, ISMB 2007; Vienna, Austria; 2007.

SOFTWARE DEVELOPED, INCLUDING ONTOLOGIES.

1. G-PIPE: a workflow generator for PISE: http://if-web1.imb.uq.edu.au/Pise/gpipe.html

2. Reporting Structure for Biological Investigations (RSBI), see Appendix 1

3. An ontology for Genealogy Management Systems, available at: http://cropwiki.irri.org/icis/index.php/ICIS_Domain_Models, see Appendix 3 and 4

4. Conceptual mapping plug-in for Protégé, available at: http://if-web1.imb.uq.edu.au/plug-in.html

REFERENCES

1. Chagoyen-Quiles M: Integration of biological data: systems, infrastructures and programmable tools. Doctoral Thesis. Madrid: Universidad Autonoma de Madrid, Escuela Politecnica Superior; 2005.
2. Smith B, Ceusters W: Ontologies as the core discipline of biomedical informatics: legacies of the past and recommendations for the future. In: Computing, Philosophy, and Cognitive Sciences. Edited by Crnkovic GD, Stuart S. Cambridge: Cambridge Scholars Press; 2006.
3. Smith B: Ontology. In: Guide to Philosophy of Computing and Information. Edited by Floridi L. Oxford: Blackwell; 2004: 155-166.
4. Guarino N, Giaretta P: Ontologies and Knowledge Bases: Toward a
Terminological Clarification. Towards Very Large Knowledge Bases 1995, In N.J.I
Mars (ed.):25-32.
5. WordNet. In: http://wordnet.princeton.edu/. 2007.
6. Sowa JF: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, CA: Brooks Cole Publishing Co; 2000.
7. Smith B, Williams J, Schulze-Kremer S: The ontology of the gene ontology. In:
AMIA Annual Symposium: 2003; 2003: 609-613.
8. Gruber T: The role of knowledge representation in achieving sharable, reusable
knowledge bases. In: Second International conference in Principles of Knowledge Representation
and Reasoning. Cambridge, MA; 1991.
9. Gruber TR: Toward principles for the design of ontologies used for knowledge
sharing. International Journal of Human-Computer Studies 1995, 43(5-6):907-928.
10. Studer R, Benjamins VR, Fensel D: Knowledge engineering: principles and methods. Data & Knowledge Engineering 1998, 25(1-2):161-197.
11. Neches R, Fikes R, Finin T, Gruber T, Patil R, Senator T, Swartout WR: Enabling technology for knowledge sharing. AI Magazine 1991:36-55.
12. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP,
Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of
biology. Nat Genet 2000, 25(1):25-29.
13. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing
functional genomics experiments. Comparative and Functional Genomics 2003,
4(1):127-132.
14. Microarray Gene Expression Data [http://www.mged.org/]
15. Garcia CA, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S:
The use of concept maps during knowledge elicitation in ontology
development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
16. Good B, Tranfield EM, Tan PC, Shehata M, Singhera GK, Gosselink J, Okon EB, Wilkinson M: Fast, Cheap, and Out of Control: A Zero Curation Model for Ontology Development. In: Pacific Symposium on Biocomputing: 2006.
17. Guarino N: Understanding, building and using ontologies. International Journal of
Human-Computer Studies 1997, 46(2-3):293-310.
18. Noy NF, McGuinness DL: Ontology Development 101: a Guide to Creating
Your First Ontology. In. Stanford, CA: Stanford University; 2001.
19. Mirzaee V: An Ontological Approach to Representing Historical Knowledge.
PhD Thesis. Vancouver: Department of Electrical and Computer Engineering,
University of British Columbia.; 2004.
20. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools
for Ontologies. In.: The Agricultural Ontology Service (UN FAO); 2003.
21. Pinto S, Perez AG, Martins JP: Some issues on ontology integration. In: Workshop
on Ontologies and Problem-Solving Methods: Lessons Learned and Future Trends (IJCAI99):
1999; Stockholm, Sweden; 1999.
22. Wenger E, McDermott R, Snyder S: Cultivating communities of practice: a guide to managing knowledge. Boston: Harvard Business School Press; 2002.
23. Wenger E: Communities of practice: learning, meaning, and identity. Cambridge, UK: Cambridge University Press; 1998.
24. Wenger E: Communities of practice and social learning systems. Organization 2000, 7(2).
25. Davenport TH, Prusak L: Working knowledge: how organizations manage what they know. Boston, Massachusetts: Harvard Business School Press; 1998.
26. Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Brazma A: A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 2006, 7:489.
27. Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouellette BFF, Rapp BA, Wheeler
DL: GenBank. Nucleic Acids Res 1999, 27(1):12-17.
28. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional
genomics compendium with automated data mining and query reformulation
support. Bioinformatics 1998, 14(8):656-664.
29. Karp PD, Riley M, Paley SM, Pellegrini-Toole A: The MetaCyc database. Nucleic
Acids Res 2002, 30(1):59-61.
30. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for
Distributed, Loosely-controlled and evolving engineering of ontologies. In:
European conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
31. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K,
Dwight S, Eppig J et al: Gene ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29.
32. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing
functional genomics experiments. Comparative and Functional Genomics 2003, 4:127-
132.

Communities at the melting point when building ontologies

Although several methodologies have been proposed addressing the problem of building ontologies, ontology engineering does not have a standard methodology. There is an
ongoing debate amongst those in the ontology community about the best methodology to
build them. Several groups have engineered particular methodologies to solve their specific
problems. Some may have been more interested in using the ontology than in how it was
built. Others have proposed methodologies without having a specific problem to solve; the
methodology itself was the main purpose of their research. These proposed methodologies
differ in the stages, steps, methods and techniques; all of them have been conceptualised for
scenarios in which domain experts are in one place. None of them explicitly addresses the
problem of decentralised settings, furthermore none of them specifically targets domains such
as the biological one for which domain experts are at the same time designers and users of the
technology. From the investigated methodologies it was possible to identify precisely these
stages, steps, methods and techniques that were common, and in principle applicable when
developing ontologies within the biological domain. This chapter presents a detailed analysis
of these methodologies in order to have a unified comparison criterion by which it was
possible to do an in-depth analysis that facilitated the identification of reusable components,
shortcomings and strong points in those studied methodologies. Terms such as knowledge
engineer, method, domain expert, domain ontology, and many others are here explained.
Descriptions of previously proposed methodologies can be found here.

The author conceived the project, and identified those key issues elaborated here. The
manuscript was entirely written by Alex Garcia Castro.

1 Chapter I - Communities at the melting point when building ontologies

1.1 INTRODUCTION

Building well-developed ontologies represents an important and difficult challenge as ontology engineering is still in its infancy. It is precisely the availability of standard and
broadly applicable methodologies in a particular discipline which represents its “adulthood”
stage [1]. Currently ontology engineering has no standard methodology for developing
ontologies [2, 3]; there is an ongoing debate amongst those in the ontology community about
the best methodology to build them [4-6]. Several groups have engineered particular
methodologies to solve their specific problems. Some may have been more interested in using
the ontology than in how it was built. Others have proposed methodologies without having a
specific problem to solve; the methodology itself was the main purpose of their research.
Most of the literature focuses on issues such as the suitability of particular tools and languages
for building ontologies, with little attention being given to how ontologies should be built. This
is almost certainly because the main interest has been in reporting content and use, rather
than in engineering methodologies [7].

Biologists have been building classification systems since before Linnaeus. In the past,
biologists have understood classification systems as systems that allow them to identify, name,
and group organisms according to predefined criteria. This makes it possible for the
community as a whole to be sure they know the exact organism that is being examined and
discussed. More recently, the biological community has started to classify genes and gene
products; with this need in mind the Gene Ontology (GO) was created. The involvement of
the community has played a major role since the foundation of the GO consortium as it is a
collaborative effort that addresses the need for consistent descriptions of gene products in
different databases [8]. Initially GO provided a controlled vocabulary only for model organism databases such as FlyBase (Drosophila) [9], the Saccharomyces Genome Database
(SGD) [10] and the Mouse Genome Database [11]. It has since been adopted as the de facto standard ontology for describing genes and gene products.

The Plant Ontology (PO) [12] also illustrates a biological ontology for which
communities are central to its development. The Plant Ontology Consortium (POC)
(www.plantontology.org) is a collaborative effort that brings together several plant database
administrators, curators and experts in plant systematics, botany and genomics. A primary
goal of the POC is to develop simple yet robust and extensible controlled vocabularies that
accurately reflect the biology of plant structures and developmental stages. These vocabularies
form a network, linked by relationships, thus facilitating the construction and execution of queries that cut across datasets within a database or between multiple databases [12]. The
developers of both GO and PO focus on providing controlled vocabularies, facilitating cross-
database queries, and having strong community involvement.
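
As an illustration only, the sketch below shows the kind of cross-database query that such shared vocabularies make trivial: two toy dictionaries stand in for independent model organism databases, and a real GO identifier is the join key. The gene identifiers and database contents are placeholders.

# Two independent "databases" annotate records with the same GO terms,
# so a single query cuts across both.
flybase_like = {"FBgn0001": ["GO:0006915"], "FBgn0002": ["GO:0007049"]}
sgd_like     = {"YGR098C":  ["GO:0006915"], "YAL040C":  ["GO:0007049"]}

def genes_annotated_with(term, *databases):
    """Return (database index, gene id) pairs sharing one ontology term."""
    return [(i, gene)
            for i, db in enumerate(databases)
            for gene, terms in db.items()
            if term in terms]

# GO:0006915 (apoptotic process) retrieves hits from both sources at once.
print(genes_annotated_with("GO:0006915", flybase_like, sgd_like))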

Despite these efforts, bio-ontologies still tend to be built on an ad hoc basis rather than
by following well-defined engineering processes. To this day, no standard methodology for
building biological ontologies has been agreed upon. The “hacking” process usually involves
gathering terminology and organizing it into a taxonomy, from which key concepts are
identified and related to create a concrete ontology. Case studies have been described for the
development of ontologies in diverse domains, although surprisingly only one of these has
been reported to have been applied in a domain allied to bioscience – the chemical ontology
[13] – and none in bioscience per se. The actual “how to build the ontology” has not been the main
research focus for the bio-ontological community [7].

This chapter presents a description of the previously proposed methodologies in section two. A summary of the comparison, as well as discussion and conclusions, is presented in section three.

1.2 METHODS AND METHODOLOGIES FOR BUILDING ONTOLOGIES

Several approaches have been reported for developing ontologies; some of them
provide insights when developing de novo ontologies, whereas others pay more attention to
extending, transforming and re-using existing ontologies. Independently of the focus, both
methods and methodologies have not yet been standardised. Not only are there several
different methodologies, but also there are numerous software tools aiming to assist
knowledge engineers during the process.

A methodology is a “comprehensive integrated series of techniques or methods creating a general system theory of how a class of thought-intensive work ought to be performed” [14]. Methodologies are
composed of both techniques and methods. A method is an “orderly” process or procedure
used in the engineering of a product or performing a service [15]. A technique is a “technical
and managerial procedure used to achieve a given objective” [14]. Methodologies bring together
techniques and methods in an orchestrated way so that the work can be done. From those
experiences reported by Garcia et al. [7] as well as by Pinto et al. [16] the knowledge engineer is
understood as being a person who applies knowledge engineering techniques to transfer
human knowledge into artificial intelligent systems; not only by modelling the knowledge and
problem solving techniques of the domain expert into the system but also by promoting the
collaboration amongst domain experts. This definition is also influenced by [17, 18].

Several approaches are analyzed here. Strong points and shortcomings are reviewed according to the following criteria (C), heavily influenced by the work done by Fernandez [1], Mirzaee [3] and Corcho et al. [19].

C1. Inheritance from knowledge engineering. As most ontology building methodologies are inspired by work done in the field of Knowledge Engineering (KE) to create methodologies for developing knowledge-based systems (KBS), this criterion considers the influence traditional KE has had on the studied methodologies.

C2. Detail of the methodology. This criterion is used to assess the clarity with which
the methodology specifies the orchestration of methods and techniques.

C3. Strategy for building the ontology. This should provide information about the
purpose of the ontology, as well as the availability of domain experts. There are three main
strategic lines to consider: i) how tightly coupled the ontology is going to be in relation to the application that should in principle use it; ii) the kinds of domain experts available; iii) the kind of ontology to be developed. These matters are better explained in C3a to C3i.

C3a. Application-dependent: The ontology is built on the basis of an application knowledge base, by means of a process of abstraction [1].

C3b. Application-semidependent: Possible scenarios of ontology use are identified in the specification stage [1].

C3c. Application-independent: The process is totally independent of the uses to which the ontology will be put in knowledge-based systems, agents, etc.

C3d. Specialised domain experts: Both C3d and C3e have to do with the kind of domain experts who are available and willing to participate in the development process. This influences C4. Specialised domain experts are those with an in-depth knowledge of their field. Within the biological context these are usually researchers with vast laboratory experience, narrowly focused within their domain of knowledge. The ontology is built from very specific concepts; this is also known as a bottom-up approach.

C3e. Broader-knowledge domain experts: Broader-knowledge domain experts are those who tend to have a broader picture. Having this kind of domain expert usually facilitates capturing concepts more related to high-level abstraction and general processes, rather than specific vocabulary describing those processes. The ontology may be built from high-level abstractions downwards to specifics. This facilitates the approach known as top-down.

C3f. Top-level ontologies: These describe very general concepts like space, time, event,
which are independent of a particular problem domain. Such unified top-level ontologies aim
at serving large communities [20]. These ontologies are also known as foundational
ontologies; see for instance [21].

C3g. Domain ontologies: These describe the vocabulary of a specific domain of knowledge.

C3h. Task ontologies: These describe vocabulary related to tasks, processes, or activities.

C3i. Application ontologies: As Sure [20] describes them, application ontologies are
specialisations of domain and task ontologies as they form a base for implementing
applications with a concrete domain and scope.

C4. Strategy for identifying concepts. As has been previously mentioned in C3d and
C3e there are two strategies regarding the construction of the ontology and the kinds of terms
it is possible to capture [22]: The first is to work from the most concrete to the most abstract
(bottom-up), whereas the second is to work from the most abstract to the more concrete
(top-down). An alternative route is to work from the most relevant to the most abstract and
most concrete (middle-out) [1, 22, 23].

C5. Recommended life cycle. Analysis of whether the methodology implicitly or explicitly proposes a life cycle [1].

C6. Recommended methods and techniques. This criterion evaluates whether or not there are methods and techniques as part of the methodology. This is closely related to C2. An important issue to be considered is the availability of software supporting either the entire methodology or a particular method within it. This criterion also deals with the methods or software tools available within the methodology for representing the ontology, whether these be OWL (Web Ontology Language), frames, RDF (Resource Description Framework), etc.
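
For concreteness, a minimal sketch of what such a representation choice looks like in practice is given below, using the rdflib library to emit a two-class OWL fragment in Turtle; the namespace URI and class names are placeholders, and the same content could equally be coded in frames.

# A tiny domain-ontology fragment rendered as OWL/RDF with rdflib.
from rdflib import Graph, Namespace, RDF, RDFS, Literal
from rdflib.namespace import OWL

EX = Namespace("http://example.org/bio-ontology#")  # placeholder namespace
g = Graph()
g.bind("ex", EX)

g.add((EX.Gene, RDF.type, OWL.Class))
g.add((EX.ProteinCodingGene, RDF.type, OWL.Class))
g.add((EX.ProteinCodingGene, RDFS.subClassOf, EX.Gene))
g.add((EX.ProteinCodingGene, RDFS.label, Literal("protein-coding gene")))

print(g.serialize(format="turtle"))  # Turtle rendering of the fragment

The point of the criterion is exactly this choice: a methodology may or may not tell its users which formalism and which supporting tool to adopt at this step.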

C7. Applicability. As knowledge engineering is still in its infancy it is important to evaluate the methodology in the context of those ontologies for which it has been used.

C8. Community involvement. As has been pointed out before in this thesis (see
chapter one), it is important to know the level of involvement of the community. Phrasing
this as a question: is the community a consumer of the ontology, or does the community take an active role in its development?

C9. Knowledge elicitation. As has been pointed out by [24], knowledge elicitation is a
major bottleneck when representing knowledge. It is therefore important to know if the
methodology assumes it to be an integral part of this process.

The methodologies reviewed under the above criteria are:

• The Enterprise Methodology proposed by Uschold and King [25].
• The TOVE Methodology proposed by Gruninger and Fox [26]
• The Bernaras methodology proposed by Bernaras et al. [27]
• The METHONTOLOGY methodology proposed by Fernandez et al. [2]
• The SENSUS methodology proposed by Swartout et al. [28]
• The DILIGENT methodology proposed by Pinto et al. [16, 29]

1.2.1 The Enterprise Methodology

Uschold and King propose a set of four activities:

• Identify the purpose and scope of the ontology
• Build the ontology, for which they specify three activities
o Knowledge capture
o Development -coding
o Integrating with other ontologies
• Evaluate
• Document the ontology

Chapter 1 - Figure 1. Uschold and King methodology.


C1. The methodology does not especially inherit methods from knowledge engineering.
Although Uschold and King identify steps that are in principle related to some methodologies
from knowledge engineering the authors do not comply with some of the principles in the
field. Neither a feasibility study nor a prototype method is proposed.

C2. Stages are identified, but no detail is provided. In particular the “Ontology Coding”, “Integration” and “Evaluation” sections are presented in a superficial manner [3].

C3. Very little information is provided. The proposed methodology is application-independent and very general; in principle it is applicable to other domains. The authors do not present information about the kind of domain experts they advise working with.

C4. For Uschold and King the disadvantage of using the top-down approach is that by
starting with a few general concepts there may be some ambiguity in the final product.
Alternatively, with the bottom-up approach too much detail may be provided, and not all this
detail could be used in the final version of the ontology [22]. This in principle favors the
middle-out approach proposed by Lakoff [23]. The middle-out is not only conceived as a
middle path between bottom-up and top-down, but also relies on the understanding that
categories are not simply organised in hierarchies from the most general to the most specific,
but are rather organised cognitively in such a way that categories are located in the middle of
the general-to-specific hierarchy. Going up from this level is the generalisation and going
down is the specialisation [3, 23].

C5. No life cycle is recommended.

C6. No techniques or methods are recommended. The authors mention the importance of representing the captured knowledge but do not make explicit recommendations as to which knowledge formalism to use. This methodology does not support any particular software as a development tool. The integration with other ontologies is not described; no method is recommended to address this issue, nor is it explained whether this integration involves extending the generated ontology or merging it with an existing one.

C7. The methodology was used to generate the Enterprise ontology [30].

C8. Communities are not involved in this methodology.

C9. For those activities specified within the building stage the authors do not propose any specific method for representing the ontology (e.g. frames, description logics, etc.). The authors place special emphasis on knowledge elicitation. However, they are not specific in developing this further.

1.2.2 The TOVE Methodology

The Toronto Virtual Enterprise (TOVE) methodology involves building a logical model of the knowledge that is to be specified by means of an ontology. The steps involved as well as their corresponding outcomes are illustrated in figure 2.

C1. Gruninger and Fox propose a methodology which is heavily influenced by the development of knowledge-based systems using first-order logic [19].

C2. Gruninger and Fox do not provide specifics on the activities involved.

C3. The authors emphasise competency questions as well as motivating scenarios as important components in their methodology. This methodology is application-semidependent as specific terminology is used not only to formalise questions but also to build the
completeness theorems used to evaluate the ontology. Once the competency questions have
been formally stated, the conditions under which the solutions to the questions must be
defined should be formalised. The authors do not present information about the kind of domain experts they advise working with.
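
As an illustration only: a competency question becomes a test once it can be run against the ontology. Gruninger and Fox formalise theirs in first-order logic; the sketch below substitutes a SPARQL query over a toy rdflib graph like the one shown earlier, with placeholder names throughout.

# Competency question: "Which classes specialise Gene?" If the ontology
# cannot answer its competency questions, it is incomplete by this test.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/bio-ontology#")  # placeholder namespace
g = Graph()
g.add((EX.Gene, RDF.type, OWL.Class))
g.add((EX.ProteinCodingGene, RDFS.subClassOf, EX.Gene))

COMPETENCY_QUESTION = """
    SELECT ?subclass WHERE { ?subclass rdfs:subClassOf ex:Gene . }
"""
for row in g.query(COMPETENCY_QUESTION, initNs={"rdfs": RDFS, "ex": EX}):
    print(row.subclass)  # -> http://example.org/bio-ontology#ProteinCodingGene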

C4. This methodology adopts a middle-out strategy.

C5. No indication about a life cycle is given.

C6. Although Gruninger and Fox emphasised the importance of competency questions
they do not provide techniques or methods to approach this problem.

C7. The Toronto Virtual Enterprise ontology was built using this methodology.

C8. Communities are not involved in this methodology.

C9. No particular indication for eliciting knowledge is given.

Chapter 1 - Figure 2. The TOVE methodology.

1.2.3 The Bernaras methodology

C1. Bernaras’ work was developed as part of the KACTUS [27] project which aimed to
investigate the feasibility of knowledge reuse in technical systems. This methodology is thus
heavily influenced by knowledge engineering.

C2. The original paper by Bernaras et al. provides little detail about the methodology.

C3. This methodology is application-dependent. As the development of this methodology took place within a larger engineering effort, ontologies were being developed hand-in-hand with the corresponding software. This implies that domain experts were being used for both tasks: for requirements interviews and studies as well as for ontology development. This, however, does not mean that domain experts were taking an active role.
The authors present very little information about the kind of domain experts they advise
working with.

C4. This methodology adopts a bottom-up approach [19].

C5. As the ontology is highly coupled with the software that uses it the life cycle of the
ontology is the same as the software life cycle.

C6. For the specific development of the ontology no particular methods or techniques
are provided. However, as this methodology was meant to support the development of an
ontology at the same time as the software it is reasonable to assume that some software
engineering methods and techniques were also applied to the development of the ontology.

C7. It has been applied within the electrical engineering domain.

C8. Communities are not involved in this methodology.

C9. No particular indication for knowledge elicitation is provided.

1.2.4 The METHONTOLOGY methodology

C1. METHONTOLOGY has its roots in knowledge engineering. The authors aim to
define a standardisation of the ontology life cycle (development) with respect to the
requirements of the Software Development Process (IEEE 1074-1995 standard) [3].

C2. Detail is provided for the ontology development process; Figure 3 illustrates the
methodology. It includes the identification of the ontology development process, a life cycle
based on evolving prototypes, and particular techniques to carry out each activity [19]. This methodology heavily relies on the IEEE software development process as described in [14].
Gomez-Perez et al. consider that all the activities carried out in an ontology development
process may be classified into one of the following three categories:

• Management activities: Including planning, control and quality assurance. Planning activities are those aiming to identify tasks, time and resources.
• Development activities: Including the specification of the states, conceptualisation,
formalisation, implementation and maintenance. From those activities related to the
specification knowledge engineers should understand the context in which the
ontology will be used. Conceptualisation activities are mostly those activities in
which different models are built. During the formalisation phase the conceptual
model is transformed into a semi-computable model. Finally, the ontology is
updated, and corrected during the maintenance phase [31].
• Support activities: these include knowledge elicitation, evaluation, integration,
documentation, and configuration management.

Chapter 1 - Figure 3. METHONTOLOGY, with permission from [19].

C3. Application-independent. No indication is provided as to the kind of domain
experts they advise working with. In principle METHONTOLOGY could be applied to the
development of any kind of ontology.

C4. This methodology adopts a middle-out approach.

C5. METHONTOLOGY adopts an evolving-prototype life cycle.

C6. No specific methods or techniques are recommended. METHONTOLOGY heavily relies on WebODE [32] as the software tool for coding the ontology. However, this methodology is in
principle independent from the software tool.

C7. This methodology has been used in the development of the Chemical OntoAgent
[33] as well as in the development of the Onto2Agent ontology [33].

C8. No community involvement is considered.

C9. Knowledge elicitation is part of the methodology; however no indication is provided as to which method to use.

1.2.5 The SENSUS methodology

The SENSUS-based methodology [28] builds on the experience gathered from building the SENSUS ontology. SENSUS is an extension and reorganisation of WordNet [34]; this 70,000-node terminology taxonomy may be used as a framework into which additional knowledge can be placed [35]. SENSUS emphasises merging pre-existing ontologies, and mining other sources such as dictionaries.

C1. SENSUS is not influenced by knowledge engineering as this methodology mostly relies on methods and techniques from text mining.

C2. Although there is extensive documentation for those text mining techniques and for developing structures for conceptual machine translation [36-38], no detail is provided as to how to build the ontology.

C3. As SENSUS makes extensive use of both text mining and conceptual machine translation the methodology as such is application-semidependent. The methods and techniques proposed by SENSUS may, in principle, be applied to several domains.

C4. SENSUS follows a bottom-up approach. Initially instances are gathered; as the process moves forward, abstractions are then identified.

C5. No life cycle is identified; from those reported experiences the ontology is deployed
on a one-off basis.

C6. Methods and techniques are identified for gathering instances. However, no further
detail is provided.

C7. SENSUS was the methodology followed for the development of knowledge-based applications for the air campaign planning ontology [39].

C8. No community involvement is considered.

C9. Knowledge elicitation is not considered explicitly.

1.2.6 DILIGENT

DILIGENT (DIstributed, Loosely-controlled and evolvInG Engineering of oNTologies) is one of the few methodologies engineered primarily for the Semantic Web (SW). The SW is a vision in which the current, largely human-accessible Web is annotated using ontologies such that the vast content of the Web is available for machine processing [40]:

“... an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. It is the idea of having data on the Web defined and linked in a way that it can be used for more effective discovery, automation, integration and reuse across various applications... data can be shared and processed by automated tools as well as by people.” [20, 40, 41]

The goal of the SW is:

“The goal of the Semantic Web initiative is as broad as that of the Web: to create a universal
medium for the exchange of data. It is envisaged to smoothly interconnect personal information
management, enterprise application integration, and the global sharing of commercial, scientific and
cultural data. “Facilities to put machine-understandable data on the Web are quickly becoming a high priority
for many organizations, individuals and communities” [41]

DILIGENT was conceived as a methodology for developing ontologies on a community basis. Although the DILIGENT approach assumes the active engagement of the community of practice throughout the entire process, it does not give extensive details. Some particulars are reported for those cases in which DILIGENT has been used; see for instance [42].

C1. DILIGENT is influenced by knowledge engineering as this methodology has been developed assuming the ontologies will be used by knowledge-based systems. However, DILIGENT introduces novel concepts such as the importance of the evolution of the ontology and the participation of communities within the development and life cycle of the ontology.

C2. DILIGENT provides some details specifically for those developments in which it
has been used.

C3. DILIGENT is application-dependent. There is no indication about the kind of domain experts they advise working with.

C4. The selection between top-down, bottom-up or middle-out is problem-dependent. No indication is given as to which strategy would be best to follow.

C5. DILIGENT assumes an iterative life cycle in which the ontology is in constant
evolution.

C6. In principle DILIGENT does not recommend methods or techniques. By the same token, DILIGENT is not linked to any software supporting either the development or the collaboration.

C7. Some cases for which DILIGENT has been used have been reported, for instance
see [42].

C8. The involvement of communities is considered in this methodology.

C9. Although knowledge elicitation is considered in this methodology no special emphasis is placed on it.

1.3 WHERE IS THE MELTING POINT?

The considerable number of methodologies and the little detail provided by each of
them makes it difficult to find a melting point. Some similarities and shortcomings are
analyzed in this section. A summary of the comparison is given in Table 1.

Chapter 1 - Table 1. Summary of methodologies. Reproduced and extended with permission from Fernandez, M. [1]. For each methodology the entries are: inheritance from knowledge engineering; detail of the methodology; strategy for building the ontology; strategy for identifying concepts; recommended life cycle; recommended methods, techniques and technology; applicability; community involvement; knowledge elicitation.

Uschold and King: Partial; Very little; AI; MOut; N/A; N/A; N/A; N/A; N/A.

Gruninger and Fox: Small; Little; ASD; MOut; TBD; N/A; Business and foundational ontologies; N/A; N/A.

Bernaras: A lot; Very little; AD; TD; N/A; N/A; N/A; N/A; N/A.

Fernandez: A lot; A lot; AI; MOut; EP; Some activities missing, technology recommended; Multiple developments reported; N/A; Partially.

Swartout: Inexistent; Medium; ASD; N/A; TBD; N/A; N/A; N/A; N/A.

Pinto: Small; Small; AI; DED/TED; TBD; N/A; N/A; Community involvement; Partially.

Garcia: Small; A lot; ASD; DED/TED; EP; Some techniques and methods recommended, technology recommended; Developments reported for the bio domain; Community involvement; Supported.

Abbreviations: Application-Independent = AI; Application-SemiDependent = ASD; Application-Dependent = AD; Top-Down = TD; Bottom-Up = BU; Middle-Out = MOut; Domain Expert Dependent = DED; Terminology Extraction Dependent = TED; N/A = Not available; To Be Detailed = TBD; Evolving Prototypes = EP.

1.3.1 Similarities between methodologies

Although the investigated methodologies are different from each other, it was possible
to identify some commonalities amongst them. Figure 4 illustrates those shared stages across
all investigated methodologies except DILIGENT.

Chapter 1 - Figure 4. Similarities amongst methodologies.


DILIGENT presents some fundamental differences as it was engineered as a
methodology for developing ontologies within geographically non-centralised settings. Those
identified differences are listed below:

• Life cycle: Within the DILIGENT methodology the ontology is constantly evolving,
in a never-ending cycle. The life cycle of the ontology is understood as an open cycle
in which the ontology evolves in a dynamic manner.
• Collaboration: Within the DILIGENT methodology a group of people agrees on
the formal specification of the concepts, relations, attributes, and axioms that the
ontology should provide. This approach empowers domain experts in a way that
sets DILIGENT apart from the other methodologies.
• Knowledge elicitation: Due in part to the involvement of the community and in part
to the importance of an agreement within the DILIGENT methodology knowledge
elicitation is assigned a high level of importance as it supports the process by which
consensus is reached.

1.3.2 Shortcomings of the methodologies

From the analysis previously presented it is clear that no single methodology brings
together everything that is needed when developing ontologies; methodologies have been
developed on an ad hoc basis. Some of the methodologies, such as those of Bernaras, provide
information about the importance of the relationship between the final application using the
ontology and the process by which the ontology is engineered. This consideration is not
always taken from the beginning of the development; clearly the kind of ontology that is
being developed heavily influences this relationship. For instance, foundational ontologies
rarely consider the software using the ontology as an important issue; these ontologies focus
more on fundamental issues affecting the classification system such as time, space, and events.
They tend to study the intrinsic nature of entities independently from the particular domain in
which the ontology is going to be used [20].

The final application in which the ontology will be used also influences the kind of
domain experts that should be considered for the development of the ontologies. For
instance, specialised domain experts are necessary when developing application ontologies,
domain ontologies or task ontologies, but they tend not to have such a predominant role
when building foundational ontologies. For these kinds of ontologies philosophers and
broader knowledge experts are usually more suitable.

None of the investigated methodologies provided real detail; the descriptions for the
processes were scarce and, where present, theoretical. No recollection was given about the
ontology building sessions. The methods employed during the development of the ontologies
were not fully described. For instance the reasons for choosing a particular method over a
similar one were not presented. Similarly there was no indication as to what software should
be used to develop the ontologies. METHONTOLOGY was a particular case for which
there is a software environment associated to the methodology; the recommended software
WebODE [32] was developed by the same group to be used within the framework proposed
by their methodology.

Although the investigated methodologies have different views on the life cycle of the
ontology none of them, except for DILIGENT, considers the life cycle to be dynamic. This is
reflected in the processes these methodologies propose. The development happens in a
continuum; some parts within the methodologies are iterative processes, but the steps are
linear, taking place one after the other. In the case of DILIGENT the different view on the
life cycle is clear. However, there is no clear understanding as to how this life cycle is dynamic
and evolving; the authors do not present any such discussion.

The lack of support for the continued involvement of domain experts scattered around
the world is a shortcoming in the investigated methodologies. As the SW poses a scenario in
which information is highly decentralised, such a consideration is important. Biological
sciences pose a similar scenario, in which domain experts are geographically distributed and
the interaction takes place mostly on a virtual basis.

Ontologies in the Semantic Web should not only be domain and/or task specific but
also application oriented. Within the SW the construction of applications and ontologies will
not always take place as part of the same software development projects. It is therefore
important for these ontologies to be easily extensible; their life cycle is one in which the
ontologies are in constant evolution, highly dynamic and highly reusable. Ontologies in
biology have always supported a wide range of applications; MO, for instance, is used by several unrelated microarray laboratory information systems around the world. In both
scenarios, SW and biology, not only is the structure of the ontology constantly evolving, but
also the role of the knowledge engineer is not that of a leader but more that of a facilitator of
collaboration and communication among domain experts.

Parallels can be drawn between the biological domain and the SW. Pinto and co-workers [16] define SW-related scenarios as distributed, loosely controlled and evolving. As has been pointed out by Garcia et al. [7], domain experts in biological sciences are rarely in one place; they tend to form virtual organisations where experts with different but complementary skills collaborate in building an ontology for a specific purpose. The structure of the collaboration does not necessarily incorporate central control; different domain experts join and leave the network at any time and decide on the scope of their contribution to the joint effort. Biological ontologies are constantly evolving, not only as new instances are added, but also as new whole/part-of properties are identified as new uses of the ontology are investigated. The
rapid evolution of biological ontologies is due in part to the fact that ontology builders are
also those who will ultimately use the ontology [43].

Pinto and co-workers [16], as well as Garcia et al. [7], have summarised the differences between classic proposals for building ontologies and the requirements added by the SW in four key points:

• Distributed information processing with ontologies: within the SW scenario, ontologies are developed by geographically distributed domain experts willing to collaborate, whereas KE deals with centrally developed ontologies.
• Domain expert-centric design: within the SW scenario, domain experts guide the effort while the knowledge engineer assists them. There is a clear and dynamic separation between the domain of knowledge and the operational domain. In contrast, traditional KE approaches relegate the expert to the role of an informant to the knowledge engineer.
• Ontologies are in constant evolution in the SW, whereas in KE scenarios ontologies are simply developed and deployed.
• Additionally, within the SW scenario, fine-grained guidance should be provided by the knowledge engineer to the domain experts.
The lack of unified criteria makes it difficult to amalgamate methodologies; each group applies its own methodology, adapting it to the specific problem being addressed. Unfortunately, due to the scarce detail on the methods and techniques used in the investigated methodologies, a unification of criteria is difficult. Collaboration is considered only by DILIGENT; however, this methodology does not propose methods for engaging the collaborators. Moreover, knowledge elicitation, whether within the context of collaboration or as a focus-group activity, is not addressed. METHONTOLOGY considers knowledge elicitation as part of the methodology, but makes no recommendations regarding knowledge elicitation methods.

Collaboration, knowledge elicitation, a better understanding of the ontology life cycle, and more detail on the steps involved are important information that should be described so that methodologies can be replicated. There is also an increasing need to reuse methodologies rather than to develop ad hoc de novo methodologies. These are precisely the issues that the methodology proposed in this thesis addresses. Throughout
chapters three, four, and five the methodology as well as its corresponding methods and
illustrative cases will be presented. It is based on real cases worked out within the biological
domain as well as on a thoughtful analysis of previously proposed methodologies.

In this chapter a comparison and analysis of existing methodologies has been presented. The framework that Fernandez [1] proposed has been extended, and other dimensions have been added to the comparison. The corresponding summary is presented in Table 1.

1.4 ACKNOWLEDGEMENTS

The author especially thanks Oscar Corcho and Mariano Fernandez for their extremely helpful suggestions.

1.5 REFERENCES

1. Fernandez M: Overview Of Methodologies For Building Ontologies. In: Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5): 1999; Stockholm, Sweden; 1999.
2. Fernández M, Gómez-Pérez A, Juristo N: METHONTOLOGY: From
Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering
Spring Symposium Series AAAI97: 1997; Stanford; 1997.
3. Mirzaee V: An Ontological Approach to Representing Historical Knowledge.
MSc Thesis. Vancouver: Department of Electrical and Computer Engineering,
University of British Columbia; 2004.
4. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools
for Ontologies. The Agricultural Ontology Service (UN FAO) 2003.
5. Lopez MF, Perez AG: Overview and Analysis of Methodologies for Building
Ontologies. Knowledge Engineering Review 2002, 17(2):129-156.
6. Noy NF, Hafner CD: The state of the art in ontology design - A survey and
comparative review. AI Magazine 1997, 18(3):53-74.

7. Garcia CA, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
8. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K,
Dwight S, Eppig J et al: Gene Ontology: tool for the unification of biology. The
Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29.
9. Crosby MA, Goodman JL, Strelets VB, Zhang P, Gelbart WM, FlyBase-Consortium.:
FlyBase: genomes by the dozen. Nucleic Acids Res 2007, 35:486-491.
10. Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L,
Mortimer RK et al: Genetic and physical maps of Saccharomyces cerevisiae.
Nature 1997, 387(6632 Suppl):67-73.
11. Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Mouse-Genome-Database-
Group.: The Mouse Genome Database (MGD): from genes to mice—a
community resource for mouse biology. Nucleic Acids Res 2005, 33:471-475.
12. Jaiswal P, Avraham S, Ilic K, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, Schaeffer M et al: Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages. Comparative and Functional Genomics 2005, 6(7-8):388-397.
13. Lopez FM, Perez G-A, Sierra JP, Pazos SA: Building a Chemical Ontology Using
Methontology and the Ontology Design Environment. IEEE Intelligent Systems &
Their Applications 1999, 14(1):37-46.
14. IEEE: IEEE standard for software quality assurance plans. In. Edited by IEEE,
vol. 730-1998: IEEE Computer Society; 1998.
15. IEEE: IEEE Standard Glossary of Software Engineering Terminology. In:
IEEE Standards vol. IEEE Std 610.12-1990: IEEE; 1991.
16. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for
Distributed, Loosely-controlled and evolving engineering of ontologies. In:
European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
17. Negnevitsky M: Artificial Intelligence: A Guide to Intelligent Systems, 2nd Rev
Ed edition (27 Sep 2004) edn: Addison Wesley; 2004.
18. Kendal S, Creen M: An Introduction to Knowledge Engineering. London:
Springer-Verlag; 2006.
19. Corcho O, Fernadez-Lopez M, Gomez-Perez A: Methodologies, tools, and
languages for building ontologies. Where is their meeting point? Data and
Knowledge Engineering 2003, 46(1):41-64.

20. Sure Y: Methodology, Tools & Case Studies for Ontology based Knowledge
Management. Karlsruhe: Universitat Fridericiana zu Karlsruhe; 2003.
21. Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L: Sweetening
ontologies with DOLCE. In: Proceedings of the 13th International Conference on Knowledge
Engineering and Knowledge Management Ontologies and the Semantic Web: 2002: Springer-
Verlag 2002: 166-181.
22. Uschold M, Gruninger M: Ontologies: Principles, methods and applications.
Knowledge Engineering Review 1996, 11(2):93-136.
23. Lakoff G: Women, fire, and dangerous things: what categories reveal about the
mind. Chicago: Chicago University Press; 1987.
24. Cooke N: Varieties of Knowledge Elicitation Techniques. International Journal of
Human-Computer Studies 1994, 41:801-849.
25. Uschold M, King M: Towards a Methodology for Building Ontologies. In:
Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95:
1995; Cambridge, UK; 1995.
26. Gruninger M, Fox MS: The Role of Competency Questions in Enterprise
Engineering. In: Proceedings of the IFIP WG57 Workshop on Benchmarking - Theory and
Practice: 1994; Trondheim, Norway; 1994.
27. Bernaras A, Laresgoiti I, Correa J: Building and Reusing Ontologies for Electrical
Network Applications. In: Proceedings of the European Conference on Artificial Intelligence
(ECAI96). Budapest; 1996.
28. Swartout B, Ramesh P, Knight K, Russ T: Toward Distributed use of Large-Scale
Ontologies. In: Symposium on Ontological Engineering of AAAI: 1997: Stanford,
California; 1997.
29. Vrandecic D, Pinto HS, Sure Y, Tempich C: The DILIGENT Knowledge
Processes. Journal of Knowledge Management 2005, 9(5):85-96.
30. Uschold M, King M, Moralee S, Zorgios Y: The Enterprise Ontology. The Knowledge
Engineering Review 1998, 13(Special issue on Putting Ontologies to Use).
31. Fernadez-Lopez M, Gomez-Perez A: Overview and Analysis of Methodologies
for Building Ontologies. The Knowledge Engineering Review 2002, 17(2):129-156.
32. Arpirez JC, Corcho O, Fernadez-Lopez M, Gomez-Perez A: WebODE in a
nutshell. AI Magazine 2003, 24(3):37-47.
33. Arpirez JC, Gomez-Perez A, Lozano A, Pinto H, S.: Reference Ontology and
ONTO2 Agent: The Ontology Yellow Pages. In: Workshop on applications of
Ontologies and Problem-solving Methods, European Conference on Artificial Intelligence
(ECAI98): 1998; Brighton, UK; 1998.

34. Fellbaum C: WordNet, An Electronic Lexical Database. Boston: The MIT Press;
2000.
35. ISI: Information Sciences Institute. In: SENSUS Ontology, http://www.isi.edu/natural-language/projects/ONTOLOGIES.html. 2007.
36. Knight K, Luck S: Building a large knowledge base for machine translation. In:
Proceedings of the American Association of Artificial Intelligence: 1994; 1994: 773-778.
37. Knight K, Chander I: Automated Postediting of Documents. In: Proc of the National
Conference on Artificial Intelligence (AAAI): 1994; 1994.
38. Knight K, Graehl J: Machine Transliteration. In: Proc of the Conference of the Association
for Computational Linguistics (ACL): 1997; 1997.
39. Valente A, Russ T, McGregor R, Swartout B: Building and (Re)Using an
Ontology of Air Campaign Planning. IEEE Intelligent Systems & Their Applications
1999(January/February).
40. Berners-Lee T: Weaving the Web: HarperCollins; 1999.
41. W3C: Semantic Web Activity Statement. In: http://www.w3.org/2001/sw/Activity.
2007.
42. Pinto S, Staab S, Sure Y, Tempich C: OntoEdit Empowering SWAP: a Case Study
in Supporting DIstributed, Loosely-Controlled and evolvInG Engineering of
oNTologies (DILIGENT). In: ESWS 2004: 2004; 2004: 16-30.
43. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M,
Lewis S: A short study on the success of the GeneOntology. Journal of Web
Semantics 2004, 1:235-240.

The melting point, a methodology for developing ontologies within decentralised settings

This chapter addresses two research questions: "How should a well-engineered methodology facilitate the development of ontologies within communities of practice, and what methodology should be used?" and "If ontologies are to be developed by communities, how should the ontology development life cycle be understood within this context?". This chapter presents the proposed methodology, describes each step, highlights its novel components, and compares the methodology with alternatives. The methodology presented here is the product of experiences gathered from the scenarios reported in chapters 3, 4, and 5.

Not only is this methodology based upon real cases but also, and more importantly, the steps, methods and techniques described here have been extensively tested. This is the first methodology engineered for decentralised communities of practice in which the designers of the technology and its users may be the same group. The use of concept maps throughout the development process, the importance of argumentative structures, and the usefulness of narratives and text-mining techniques are among the methods and techniques described here. Subsequent chapters present the experiences that allowed the author not only to test and extend the methodology but also to validate it.

The author engineered the methodology and defined the steps, methods, and techniques involved. The investigation that gathered all the information and data supporting this methodology was conducted entirely by Alex Garcia; the involvement of communities of practice, as well as the identification of those areas in which there could be interest in supporting this research, were also activities carried out by Alex Garcia. Manuscripts as well as the corresponding journal and conference publications were written by Alex Garcia.

2 Chapter II - The melting point, a methodology for developing ontologies within decentralised settings

"…it is extremely difficult to judge the value of a methodology in an objective way. Experimentation is of course the proper way to do it, but it is hardly feasible because there are too many conditions that cannot be controlled… Introducing a toy problem will violate the basic assumption behind the need for a methodology: a complex development process."
De Hoog R. 1988

2.1 INTRODUCTION

As presented in the previous chapter, building ontologies has been more of an ad hoc process than a well-engineered practice. It has been argued by several authors that to this day there is no agreed-upon standard methodology for building ontologies [1-3]. Nonetheless, there exist generic components fundamental to the ontology-building process, present in most or all ontology development efforts even if they are not explicitly identified. A detailed study of methodologies and those generic components was presented in Chapter 1. In the present chapter, "The melting point, a methodology for developing ontologies within decentralised environments", those generic components are orchestrated in a manner coherent not only with the way communities build ontologies but also with the life cycle of these ontologies. The description of features and interrelationships is based upon experimentation and observation that took place during developments in real scenarios. It was possible not only to have direct access to domain experts but also to monitor the evolution and intended use of the ontology. Moreover, it was possible to study the processes by which the community became involved in the development of the ontology.

Those previously proposed methodologies have been engineered for centralised settings, in which the ontology is developed and deployed on a one-off basis. The maintenance, as well as the evolution, of the ontology is left to the knowledge engineer and a
reduced group of domain experts. This same situation is also true for the whole process in
which a reduced group of domain experts work together with the knowledge engineer during
the development of the ontology; the community is not widely involved.

Within the Semantic Web (SW), as well as within the biological domain, the
involvement of communities of practice is crucial not only for the development, but also for
the maintenance and evolution of ontologies. Domain experts in biological sciences are rarely
in one place; they tend to form virtual organisations in which experts with different but
complementary skills collaborate in building an ontology for a specific purpose. The structure
of the collaboration does not necessarily have a central control; different domain experts join
and leave the network at any time and decide on the scope of their contribution to the joint
effort. Biological ontologies are constantly evolving; new classes, properties, and instances
may be added at any time, and new uses for the ontology may be identified [2]. The rapid
evolution of biological ontologies is due in part to the fact that ontology builders are also
those who will ultimately use the ontology [4].

This chapter presents the methodology inferred from those scenarios in which it was possible to conduct experiments, which allowed the author to understand the importance and impact of the community, as well as its effect on the structure and orchestration of the fundamental components of the ontology-building process. The initial section of this chapter presents a brief introduction stressing the important points that will be elaborated throughout the chapter; some terminological considerations are presented in the second section. This is followed by the presentation of the methodology and related information; methods, techniques, activities and tasks are also presented in section three. Section four presents the incremental evolutionary spiral model of tasks, activities and processes consistent with the life cycle. Sections five and six present the discussion and conclusions.

2.2 TERMINOLOGICAL CONSIDERATIONS

Some of the common points across the previously proposed methodologies have been adapted for the present work. An important contribution to this methodology comes from observations made by Gomez-Perez et al. [5, 6], Fernandez et al. [7], Pinto et al. [8, 9], and Garcia et al. [2] (see chapters 3, 4, and 5 for more information on Garcia's observations). Both Fernandez et al. and Gomez-Perez et al. emphasise the importance of complying with the Institute of Electrical and Electronics Engineers (IEEE) standards, more specifically with the "IEEE standard for software quality assurance plans" [10]. In the context of the conclusions drawn in the previous chapter, such concern is understandable; not only does standards compliance ensure careful and systematic planning of the development, but it also ensures the applicability of the methodology to a broad range of problems.

Also, from the previous chapter it became clear that methodologies bring together
techniques and methods in an orchestrated way so that the work can be done. A method is
“an orderly process or procedure used in the engineering of a product or performing a service” [11]. A
technique is defined as a “technical and managerial procedure used to achieve a given objective” [10].
Figure 1 illustrates in a more comprehensive manner these relationships.

Chapter 2 - Figure 1. Terminological relationships.


Greenwood [12] as well as Gomez-Perez et al. [13] present these terminological relationships in a simple way: "a method is a general procedure while a technique is the specific application
of a method and the way in which the method is executed” [13]. According to the IEEE [14] a process
is a “function that must be performed in the software life cycle. A process is composed by activities”. The
same set of standards defines an activity as “a constituent task of a process” [14]. A task is the
atomic unit of work that may be monitored, evaluated and/or measured; a task is “a well
defined work assignment for one or more project member. Related tasks are usually grouped to form activities”
[14].

2.3 THE METHODOLOGY AND THE LIFE CYCLE

For the purpose of the proposed methodology it was decided that the work involved would be framed within processes and activities, as illustrated in Figure 1; this conception is promoted by METHONTOLOGY [7] for centralised settings. As these activities were not conceived for decentralised settings, their scope has been redefined so that they better fit the life cycle of ontologies developed by communities. The methodology presented here emphasises decentralised settings and community involvement. It also stresses the importance of the life cycle these ontologies follow, and provides activities, methods and techniques coherently embedded within this life cycle.

The methodology and the life cycle are illustrated in Figure 2. The overall process starts
with documentation and management processes; the development process immediately
follows. Managerial activities happen throughout the whole life cycle, as the interaction
amongst domain experts ensures not only the quality of the ontology, but also that those
predefined control activities take place. The development process has four main activities:
specification, conceptualisation, formalisation and implementation, and evaluation. Different
prototypes of the ontology are thus constantly being deployed. Initially these prototypes may
be unstable, as the classes and properties may drastically change. In spite of this, the process
evolves rapidly, achieving a stability that facilitates the use of the ontology; changes become
more focused on the inclusion of classes and instances, rather than on the redefinition of the
class hierarchy.

Chapter 2 - Figure 2. Life cycle, processes, activities, and view of the methodology.

2.3.1 Documentation processes

Documentation is a continuous process throughout the entire development of the ontology. This documentation should make it possible for new communities of practice to become involved in the development of the ontology.

2.3.1.1 Activities for documenting the management processes

• Scheduling: Gantt charts are useful when scheduling processes; simple spreadsheets or Word documents may also be used.
• Control: flowcharts allow for a simple view of the process and of those points at which a control activity is needed.
Although there are several software suites that assist in project management, some of
them offering workgroup capabilities (see for instance http://www.mindtools.com/), large
biological ontology projects use simpler solutions, such as the facilities Google offers for networking. Scheduling and control activities can be carried out with Google Calendar (http://www.google.com/calendar); by the same token, sharing documents is facilitated by Google Documents (http://docs.google.com). For establishing communication and exchanging information, email, wiki pages and voice-over-Internet-Protocol (IP) systems have proven to be useful in projects such as the Ontology for Biomedical Investigations (OBI) [15] and the Microarray Ontology (MO) [16, 17]. A more detailed description of the involvement of communities by high-tech means was published by Garcia et al. [2]. For both scheduling and controlling, the software tool(s) should in principle:

• help to plan the activities and tasks that need to be completed,
• give a basis for scheduling when these tasks will be carried out,
• facilitate planning the allocation of resources needed to complete the project,
• help to work out the critical path for a project where one must complete it by a particular date,
• facilitate the interaction amongst participants, and
• provide participants with simple means for exchanging information.

2.3.1.2 Documenting classes and properties

Although documentation happens naturally, as discussions often take place over email, it is often difficult to follow the argumentative thread. Even so, the information contained in mailing lists is useful and should, whenever possible, be related to classes and properties. Use cases, in the form of examples in which the use of a term is well illustrated, should also be part of the documentation of classes and properties. Ontology editors allow domain experts to comment on the ontology; this kind of documentation is useful, as it reflects the understanding of the domain expert. For classes and properties there are three main sources of documentation:

• Mailing lists: discussions about why a class should be part of the ontology, why it should be part of a particular branch, how it is being used by the community, how a property relates two classes, and in general all discussions relevant to the ontology happen over email.

• On-the-ontology comments: when domain experts are familiar with the ontology editor, they usually comment on classes and properties.
• Use cases: these should be the main source of structured documentation provided by domain experts. However, gathering use cases is often difficult and time-consuming. The use cases should illustrate how a term is being used in a particular context, how the term is related to other terms, and the different uses or meanings a term may have. Guidance is available for the construction of use cases when developing software; however, such direction is not available when building ontologies. From those experiences in which the author participated some general guidance can be drawn, for instance:
o use cases should be brief,
o they should be based upon real-life examples,
o knowledge engineers have to be familiar with the terminology as well as with the domain of knowledge, because use cases are usually provided in the form of narratives describing processes,
o graphical illustrations should be part of the use case, and
o whenever possible concept maps, or other related KA artefacts, should be used.

2.3.2 Management processes

These start as soon as there is a decision to develop the ontology, and continue throughout the whole ontology development process. Managerial processes aim to ensure the successful development of the ontology by providing domain experts with all that is needed. Managerial processes also define general policies that allow the orchestration of the whole development. Some of the activities involved in the managerial processes are:

2.3.2.1 Scheduling

Scheduling identifies tasks, time and resources needed.

2.3.2.2 Control

Control ensures that the planned tasks are completed.

2.3.2.3 Inbound-interaction

Inbound-interaction specifies how the interaction amongst domain experts will take place, for instance by phone calls, mailing lists, wiki pages and web publications.

2.3.2.4 Outbound-interaction

As different communities should in principle be allowed to participate, there has to be an inclusion policy that specifies how a new community may collaborate and engage with the ongoing development.

2.3.2.5 Quality assurance

This activity defines minimal standards for the outputs of each and every process, activity or task carried out within the development of the ontology.

For both inbound and outbound interactions, there are some key questions that should
be addressed:

• Which communities of practice are involved in this development?
• Are there going to be branches for the intended ontology?
• What is the relationship between communities of practice and the branches of the ontology?
• Are there going to be editors for the different branches of the ontology?

2.3.3 Development-oriented processes

2.3.3.1 Feasibility study and milestones

Feasibility study: this first activity involves addressing straightforward questions such as: What is the ontology going to be used for? How is the ontology ultimately going to be used by the software implementation? What do we want the ontology to be aware of, and what is the scope of the knowledge we want to have in the ontology?

The milestones for this activity are: competency questions, scenarios in which it is foreseeable that the ontology will be used, and a decision on whether there is a "go" for the ontology. Competency questions can later be rephrased as queries that the finished ontology must be able to answer; a sketch follows.
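
As an illustration only (not drawn from the original work), a competency question such as "which kinds of treatment does the ontology describe?" can be expressed as a query over the eventual ontology. The sketch below uses the Python rdflib library; the file name and the Treatment IRI are hypothetical placeholders.

    import rdflib

    # Load a hypothetical ontology file produced during conceptualisation;
    # rdflib guesses the RDF/XML serialisation from the .owl suffix.
    graph = rdflib.Graph()
    graph.parse("nutrigenomics.owl")

    # The competency question expressed as a SPARQL query over the
    # subclass hierarchy (SPARQL 1.1 property path).
    COMPETENCY_QUERY = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cls WHERE {
        ?cls rdfs:subClassOf* <http://example.org/onto#Treatment> .
    }
    """

    for row in graph.query(COMPETENCY_QUERY):
        print(row.cls)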

2.3.3.2 Activities for the conceptualisation

• Domain Analysis (DA) and Knowledge Acquisition (KA): knowledge elicitation (KE) is the process of collecting, from a human source of knowledge, information that is relevant to that knowledge [18]. Knowledge acquisition includes the
elicitation, collection, analysis, modelling and validation of knowledge for knowledge


engineering and knowledge management projects. The notion for both KA and KE
comes from the development of knowledge bases; for the purposes of developing
ontologies, KA and KE can be considered as transposable terms. Domain analysis is
the process by which a domain of knowledge is analysed in order to find common
and variable components that best describe that domain. KA and DA are
interchangeable and complementary activities by which the information used in a
particular domain is identified, captured and organised for the purpose of making it
available in an ontology [19]. Both DA and KA are also part of the formalisation
and implementation activities, the difference being the maturity level of the expected
outcomes as well as the involved activities. When analyzing and acquiring
knowledge, activities and tasks are more oriented to produce a baseline ontology, for
instance the identification of recyclable and reusable ontologies, as well as of basic
terminology that describes the domain. Identifying available sources of knowledge is
also important; by doing so it is possible to better scope the ontology. More detailed
information about reusable and recyclable ontologies may be found in chapter 2,
section 3.1.2.2.2, and also in [2]. Reusing ontologies is not always instantly possible; however, it is important to identify how to extend and adapt existing ontologies so that collaboration with other groups developing ontologies becomes more fruitful.
Baseline ontologies tend to lack formal definitions, whole/part-of relationships, and
a stable is-a structure; those activities related to DA and KA focus more on
capturing and representing knowledge in a more immediate manner and not
necessarily on having logical expressions as part of the models; whereas when
formalising and evaluating an ontology, activities and tasks are more oriented to include logical constraints and expressions. DA and KA may be seen as the 'art of questioning', since ultimately all relevant knowledge is either directly or indirectly in the heads of domain experts. This activity involves the definition of the terminology, i.e. the linguistic phase. It starts with the identification of reusable ontologies and terminates with the baseline ontology, i.e. a draft version containing few but seminal elements of an ontology. The following criteria are important during knowledge acquisition [2]:
o Accuracy in the definition of terms. The linguistic part of the ontology
development is also meant to support the sharing of information/knowledge.
The availability of context as part of the definition is useful when sharing
knowledge.
o Coherence: as concept maps (CMs) are enriched, it is important to ensure the coherence of the story being captured. Domain experts are asked to use the CMs as a means of telling a story; consistency within the narration is therefore crucial.
o Extensibility: this approach may be seen as an aggregation problem; CMs are
constantly gaining information, which is always part of a bigger narration.
Extending the conceptual model is not only about adding more detail to the
existing CMs, nor is it just about generating new CMs; it is also about grouping
concepts into higher-level abstractions and validating these with domain experts.
Scaling the models involves the participation of both domain experts and the
knowledge engineer. It is mostly done by direct interview and confrontation with
the models from different perspectives. The participation of new “fresh” domain
experts, as well as the intervention of experts from allied domains, allows the models to be analysed from different angles. This participatory process allows the models to be re-factorised at increasing levels of abstraction.
Throughout these activities Gruber's design principles [20], such as those mentioned below, have to be considered.

• First design principle: “The conceptualization should be specified at the knowledge level without
depending on a particular symbol-level encoding.”
• Second design principle: “Since ontological commitment is based on the consistent use of the
vocabulary, ontological commitment can be minimised by specifying the weakest theory and defining
only those terms that are essential to the communication of knowledge consistent with the theory.”
• Third design principle: “An ontology should communicate effectively the intended meaning of
defined terms. Definitions should be objective. Definitions can be stated on formal axioms, and a
complete definition (defined by necessary and sufficient conditions) is preferred over a partial
definition. All definitions should be documented with natural language.”
For the purpose of DA and KA it is critical to elicit and represent knowledge from
domain experts. They do not, however, have to be aware of knowledge representation
languages; this makes it important that the elicited knowledge is represented in a language-
independent manner. Researchers participating in knowledge elicitation sessions are not
always aware of the importance of the session; however they are aware of their own
operational knowledge. This is consistent with the first of Gruber’s design principles.

Regardless of the syntactic format in which the information is encoded, domain experts have to communicate and exchange information. For this matter it is usually the case that wide general theories, principles, and broad-scope problem specifications are more useful when engaging domain experts in discussions, as these tend to contain only essential basic terms, known across the community and causing the minimal number of discrepancies (see the second design principle). As the community engages in the development process and the ontology grows, it becomes more important to have definitions that are usable by both computer systems and humans (see the third design principle).

2.3.3.2.1.1 MILESTONES, TECHNIQUES AND TASKS FOR THE KA AND DA ACTIVITIES

The milestones, techniques and tasks identified for DA and KA related activities are:

• Tasks: focal groups, limited-information and constrained-processing tasks, protocol analysis, direct one-to-one interviews, terminology extraction, and inspection of existing ontologies.
• Techniques: concept mapping, sorting techniques, automatic or semi-automatic terminology extraction (a sketch follows this list), informal modelling, and the Ontology Lookup Service (OLS)1 [21].
• Milestones: baseline ontology, knowledge sources, basic terminology, reusable ontologies.
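
As a rough illustration of automatic terminology extraction (a task that tools such as Text2ONTO perform with far more sophistication), the following Python sketch ranks candidate terms by raw frequency; the corpus file name and the stop-word list are hypothetical.

    import re
    from collections import Counter

    STOPWORDS = {"the", "of", "and", "that", "with", "for", "are", "this"}

    def candidate_terms(text, top_n=20):
        # Keep alphabetic tokens of three letters or more, drop stop words.
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if len(t) > 2 and t not in STOPWORDS]
        # Rank by raw frequency; real tools weight candidates with measures
        # such as TF-IDF or C-value rather than simple counts.
        return Counter(tokens).most_common(top_n)

    with open("domain_corpus.txt") as f:  # hypothetical domain corpus
        for term, frequency in candidate_terms(f.read()):
            print(term, frequency)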

2.3.3.3 Iterative Building of Ontology Models (IBOM).

Iterative building of informal ontology models helps to expand the glossary of terms and relations, their definitions or meanings, and additional information such as examples to clarify the meaning where appropriate. Different models are built and validated with the domain experts. There is a fine boundary between the baseline ontology and the refined ontology; both are works in progress, but the community involved has agreed upon the refined ontology.

2.3.3.3.1 METHODS, TECHNIQUES AND MILESTONES FOR THE IBOM

The milestones, techniques and tasks identified for IBOM related activities are:

• Methods: focal groups.
• Techniques: concept mapping, informal modelling with an ontology editor.
• Milestones: refined ontology.

1 The Ontology Lookup Service (OLS) provides a user-friendly single entry point for querying publicly available ontologies in the Open Biomedical Ontology (OBO) format. By means of the OLS it is possible to verify whether an ontology term has already been defined, and in which ontology it is available.
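
By way of illustration, the sketch below queries the present-day REST successor of the OLS from Python; the endpoint and the JSON field names are assumptions about the current EBI service rather than part of the service described above.

    import requests

    # Search the EBI Ontology Lookup Service for a term (endpoint assumed).
    response = requests.get("https://www.ebi.ac.uk/ols4/api/search",
                            params={"q": "apoptosis"})
    response.raise_for_status()

    # Print a few hits; the field names follow the assumed response layout.
    for doc in response.json().get("response", {}).get("docs", [])[:5]:
        print(doc.get("label"), doc.get("ontology_name"), doc.get("iri"))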

2.3.3.4 Formalisation

Formalisation of the ontology is the activity during which the classes are constrained and instances are attached to their corresponding classes; for example, "a male is constrained to be an animal with a y-chromosome". During the formalisation, domain experts and knowledge engineers work with an ontology editor. When building iterative models and formalising the ontology, the model grows in complexity: instances, classes and properties are added, and logical expressions are built in order to have definitions with necessary and sufficient conditions (a sketch of such a definition follows the guidelines below). For both formalisation and IBOM, Gruber's fourth design principle is applicable, as well as Noy and McGuinness's guidelines [22].

• Fourth principle: "An ontology should be coherent: that is, it should sanction inferences that are consistent with the definitions. […] If a sentence that can be inferred from the axioms contradicts a definition or example given informally, then the ontology is inconsistent."
• Noy and McGuinness's first guideline: "The ontology should not contain all the possible information about the domain: you do not need to specialise (or generalise) more than you need for your application."
• Noy and McGuinness's second guideline: "Subconcepts of a concept usually i) have additional relations that the superconcept does not have, or ii) restrictions different from those of the superconcept, or iii) participate in different relationships than superconcepts. In other words, we introduce a new concept in the hierarchy usually only when there is something that we can say about this concept that we cannot say about the superconcept. As an exception, concepts in terminological hierarchies do not have to introduce new relations."
• Noy and McGuinness's third guideline: "If a distinction is important in the domain and we think of the objects with different values for the distinction as different kinds of objects, then we should create a new concept for the distinction."
• Noy and McGuinness's fourth guideline: "A concept to which an individual instance belongs should not change often."
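
A minimal sketch of such a formalisation is given below, using the owlready2 Python library (which postdates the work described here); the ontology IRI and names are illustrative only, encoding the "male" example above as a necessary-and-sufficient definition.

    from owlready2 import Thing, ObjectProperty, get_ontology

    onto = get_ontology("http://example.org/demo.owl")  # hypothetical IRI

    with onto:
        class Animal(Thing): pass
        class YChromosome(Thing): pass
        class has_chromosome(ObjectProperty):
            domain = [Animal]
        class Male(Animal):
            # equivalent_to states a necessary AND sufficient condition:
            # any Animal with some YChromosome will be classified as Male.
            equivalent_to = [Animal & has_chromosome.some(YChromosome)]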

2.3.3.5 Evaluation

There is no unified framework for evaluating ontologies, and this remains an active field of research [23]. When developing ontologies on a community basis, three main evaluation activities have been identified:

2.3.3.5.1 APPLICATION-DEPENDENT EVALUATION

It is considered that ontologies should be evaluated according to their fitness for purpose; i.e. an ontology developed for annotation purposes should be evaluated by the quality of the annotation and the usability of the annotation software [2]. The community carries out this type of evaluation in an interactive manner; as the ontology is being used for several purposes, constant feedback is generated. This makes it possible for the community to effectively guarantee the usability and the quality of the ontology. By the same token, the recall and precision of the data, and the usability of the conceptual query builder, should form the basis of the evaluation of an ontology designed to enable data retrieval.

2.3.3.5.2 TERMINOLOGY EVALUATION

This activity was proposed by Gomez-Perez et al. [24]. The goal of the evaluation is to determine what the ontology defines and how accurate these definitions are. Gomez-Perez et al. provide the following criteria for the evaluation (a small heuristic sketch for the conciseness criterion follows the list):

• Consistency: a given definition is consistent if, and only if, no contradictory knowledge may be inferred from other definitions and axioms in the ontology.
• Completeness: ontologies are assumed to be in principle incomplete [23, 24]; however, it should be possible to evaluate completeness within the context in which the ontology will be used. An ontology is complete if and only if all that is supposed to be in the ontology is explicitly stated, or can be inferred.
• Conciseness: an ontology is concise if it does not store unnecessary knowledge, and the redundancy in the set of definitions has been properly removed.
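
A crude heuristic for the conciseness criterion, offered only as an illustration, is to look for duplicated labels, which often signal redundant definitions; the sketch below uses the rdflib Python library and a hypothetical file name.

    from collections import Counter

    import rdflib
    from rdflib.namespace import RDFS

    graph = rdflib.Graph()
    graph.parse("demo.owl")  # hypothetical ontology file

    # Count how often each rdfs:label value occurs across the ontology.
    labels = Counter(str(o) for _, _, o in graph.triples((None, RDFS.label, None)))
    for label, n in labels.items():
        if n > 1:
            print("possible redundancy: '%s' used %d times" % (label, n))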

2.3.3.5.3 TAXONOMY EVALUATION

This evaluation is usually carried out by means of reasoners such as RACER [25] and Pellet [26]. The knowledge engineer checks for inconsistencies in the taxonomy; these may be due to errors in the logical expressions that are part of the axioms, as sketched below.
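
A sketch of such a check is given below with the owlready2 Python library, which bundles the HermiT and Pellet reasoners rather than RACER; the file path is hypothetical, but the check itself is analogous.

    from owlready2 import get_ontology, sync_reasoner, default_world

    onto = get_ontology("file:///path/to/demo.owl").load()  # hypothetical file

    with onto:
        # Classify the taxonomy; an exception is raised if the ontology
        # as a whole is logically inconsistent.
        sync_reasoner()

    # Classes whose axioms contradict each other become equivalent to Nothing;
    # listing them points the knowledge engineer to faulty logical expressions.
    print(list(default_world.inconsistent_classes()))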

2.3.3.6 A summary of the process.

Development process:
• Feasibility study: defining the scope of the ontology.
• Conceptualisation (KA & DA). Tasks: concept maps (CMAP Tools); informal ontology models (CMAP Tools, Protégé); focal groups (CMAP Tools); direct one-to-one interviews (CMAP Tools); gathering lists of terms (Text2ONTO).
• IBOM. Task: refined ontology model (Protégé).
• Formalisation. Tasks: competency questions (Protégé); community agreement (Protégé & RACER).
• Evaluation. Tasks: competency questions; community evaluation; terminology evaluation; taxonomy evaluation.

Techniques: concept mapping; automatic or semi-automatic terminology extraction; ontology modelling with an ontology editor.

Chapter 2 - Table 1. A summary of the development process.

2.4 AN INCREMENTAL EVOLUTIONARY SPIRAL MODEL OF TASKS, ACTIVITIES AND PROCESSES

Ontologies, like software, evolve over time; specifications often change as the development proceeds, making a straightforward path to the ontology unrealistic. Different software process models have been proposed. For instance, linear sequential models, also known as waterfall models [27, 28], are designed for straight-line development. The linear sequential model suggests a systematic, sequential approach in which the complete system will be delivered once the linear sequence is completed [28]. The role of domain experts is passive, as end-users of technology; they are placed in a reacting role in order to give feedback to designers about the product. The software or knowledge engineer leads the process and controls the interaction amongst domain experts.

The prototyping model is more flexible, as prototypes are constantly being built. Prototypes are built as a means for defining requirements [28]; this allows for a more active role for domain experts. A quick design is often obtained in a short period of time. The model grows as prototypes are released [23]; engineers and domain experts work on these quick designs. They focus on representational aspects of the ontology, while the main development of the ontology (building the models, defining what is important, documenting, etc.) is left to the knowledge engineer. A high-speed adaptation of the linear sequential model is the Rapid Application Development (RAD) model [29, 30]. This emphasises short development cycles in which it is possible to add new software components as they are needed. RAD also strongly suggests reusing existing program components, or creating reusable ones [28].

The evolutionary nature of software is not considered in either of the aforementioned models. From the software engineering perspective, evolutionary models are iterative and allow engineers to develop increasingly complex versions of the software [28, 31, 32]. Ontologies are, in this sense, not any different from other software components, for which process models have evolved from "linear thinking" into evolutionary process models that recognise that uncertainty dominates most projects, that timelines are often impossibly short, and that iteration provides the ability to deliver a partial but extendible solution even when a complete product is not possible within the time allotted. Evolutionary models emphasise the need for incremental work products, risk analysis, planning followed by plan revision, and customer (domain expert) feedback [28].

KA = Knowledge Acquisition, DA = Domain Analysis, IBOM = Iterative Building of Ontology Models, F = Formalisation, EVAL = Evaluation

Chapter 2 - Figure 3. An incremental evolutionary spiral model of tasks, activities and processes.
Figure 3 illustrates how tasks and activities are incremental in the spiral and how the process is constantly evolving. Activities such as Knowledge Acquisition (KA), Domain Analysis (DA), Iterative Building of Ontology Models (IBOM), Formalisation (F), and Evaluation (EVAL) take place within the spiral, not necessarily in a strict order. Initially those processes related to management occur. As soon as there is a "GO" for the ontology development process, these activities start with KA, DA, and IBOM. Once the first prototype of the ontology has been modelled, activities, tasks and processes can coexist simultaneously at some level of detail within the spiral. The process allows for its own incremental growth by facilitating the incorporation of other activities and/or processes such as evaluation and formalisation.

2.5 DISCUSSION

As discussed in the previous chapter, METHONTOLOGY is the only methodology that rigorously complies with IEEE standards; this facilitates the applicability and extendibility of the methodology. Other methodologies, such as those studied in chapter 1, do not intentionally meet the terms posed by the IEEE; however, some of the activities they propose may be framed within IEEE standards. Table 2 illustrates this comparison; in the same vein, Table 1 in chapter 1 presents the proposed methodology within the comparison framework proposed in that chapter. The methodology proposed here reuses and adapts many components from METHONTOLOGY and other methodologies within the context of decentralised settings and participatory design. It also follows Sure's [33] work, as it considers throughout the whole process the importance of the software applications that will ultimately use the ontology. The work done by Sure is complementary to the one presented in this thesis, as both works study different edges of the same process: developing knowledge-based software.

METHONTOLOGY allows for a controlled development and evolution of the ontology, placing special emphasis on quality assurance (QA) throughout the process. Although QA is considered, the authors do not propose any methods for this specific task. Management, development and support activities are carried out in a centralised manner: a limited group of domain experts interact with the knowledge engineer to conceptualise and prototype the ontology; successive prototypes are then built, and the ontology gains more formality (e.g. logical constraints are introduced) until it is decided that the ontology may be deployed. Once the ontology has been deployed, a maintenance process takes place. Neither the development nor the evolution of the ontology involves a decentralised community; the process does not assume a constant incremental growth of the ontology, as has been observed and reported in [2]. QA is also considered to be a centralised activity, contrasting with the way decentralised ontologies promote the participation of the community, in part to ensure the quality of the delivered ontology.

Framework columns: project management process; pre-development process; development process (requirements, design, implementation); post-development process; integral process.

• Uschold and King: project management N/A; pre-development N/A; requirements proposed; design N/A; implementation proposed; post-development N/A; integral: activities not identified for training, environment study, and configuration management.
• Gruninger and Fox: project management N/A; pre-development N/A; requirements proposed; design proposed; implementation proposed; post-development N/A; integral: activities not identified for training, environment study, and configuration management.
• Bernaras: project management N/A; pre-development N/A; requirements proposed; design proposed; implementation proposed; post-development N/A; integral N/A.
• Fernadez: project management N/A; pre-development N/A; requirements proposed; design proposed; implementation proposed; integral: activities not identified for training, environment study, and configuration management.
• Swartout: project management N/A; pre-development N/A; requirements proposed; design N/A; implementation proposed; post-development N/A; integral N/A.
• Pinto: project management distributed across members of the community; pre-development N/A; requirements proposed, relying on community involvement; design N/A; implementation proposed; post-development: assumes a SW post-development process in which the community maintains the ontology; integral: some outline for configuration management is given.
• Garcia: project management distributed across members of the community; pre-development proposed, relying on community involvement; requirements proposed, relying on community involvement; design proposed; implementation proposed; post-development: assumes a SW post-development process in which the community maintains the ontology; integral: training activities identified, with an outline for configuration management and environment study also given.

N/A: not available.

Chapter 2 - Table 2. Methodology compliance with IEEE 730-1998 [10]. Reproduced and extended with permission from Fernandez, M. [34].
As the required ontologies grow in complexity, so does the process by which they are obtained. A quick inspection of previously proposed methodologies shows how the involvement of communities has become a predominant requirement, not yet fully addressed by most methodologies. Methods, techniques, activities and tasks become more
group-oriented, making it necessary to re-evaluate the whole process as well as the way in which it is described. The IEEE proposes a set of concepts that should in principle facilitate the description of a methodology; however, these guidelines should be better scoped for decentralised environments.

Activities within decentralised ontology developments are highly interrelated. However, the maturity of the product allows engineers and domain experts to determine boundaries, thereby establishing milestones for each and every activity and task. Although managerial activities are interrelated and have a high-level impact on the development processes, it is advisable not to have rigid management structures. For instance, control and inbound-outbound activities usually coexist with some development activities when a new term needs to be added. This interaction requires the orchestration of all the activities to ensure the evolution of the ontology. An illustration of this situation and a feasible course of action are presented in Figure 4.

Chapter 2 - Figure 4. Adding a term.


When communities are developing ontologies the life cycle varies. The ontology is not
deployed on a one-off basis; there is thus no definitive final version of the ontology. The
involvement of the community allows for rapid evolution, as well as for very high quality standards; errors are identified and discussed, and corrections are made available within short time frames.

The model upon which this proposed methodology is based brings together ideas from linear sequential modelling [28, 35], prototyping, spiral [36], incremental [37, 38] and evolutionary models [28, 39]. Due to the dynamic nature of the interaction when developing ontologies on a community basis, the model grows rapidly and continuously. As this happens, prototypes are delivered, documentation is constantly generated, and evaluation takes place at all times, as the growth of the model is driven by the argumentation amongst domain experts. The development process is incremental, as new activities may happen without disrupting the evolution of the collaboration. The model is therefore an incremental evolutionary spiral in which tasks and activities can coexist simultaneously at some level of detail. As the process moves forward, activities and/or tasks are applied recursively depending on the needs. The evolution of the model is dynamic, and the interaction amongst domain experts and with the model happens all the time. Figure 3 illustrates the model as well as how processes, activities and tasks are consistent with it.

2.6 CONCLUSIONS

The methodology proposed in this chapter reuses some components that various authors have identified as part of their methodologies. This thesis has investigated how to use these components within decentralised settings such as the biomedical domain. The proposed methodology is consistent with the challenges posed by the ontologies needed for the SW. The importance of this chapter lies in its detailed description of methods, techniques, activities, and tasks that can be used for developing community-based ontologies. Furthermore, this chapter has also explained the model for the life cycle of these ontologies. Both the methodology and the life cycle are consistent with the proposed processes. The fundamental contribution of this chapter is the involvement of communities as both domain
experts and subjects of study. This allowed the author to base his results on real-life cases. Successive chapters present engineering experiments through which the components presented in this chapter were studied.

2.7 ACKNOWLEDGEMENTS

The author especially thanks Oscar Corcho and Mariano Fernandez for their extremely helpful suggestions.

2.8 REFERENCES

1. Corcho O, Fernadez-Lopez M, Gomez-Perez A: Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 2003, 46(1):41-64.
2. Garcia CA, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S:
The use of concept maps during knowledge elicitation in ontology
development processes - the nutrigenomics use case. BMC Bioinformatics 2006,
7:267.
3. Mirzaee V: An Ontological Approach to Representing Historical Knowledge.
MSc Thesis. Vancouver: Department of Electrical and Computer Engineering,
University of British Columbia; 2004.
4. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M,
Lewis S: A short study on the success of the GeneOntology. Journal of Web
Semantics 2004, 1:235-240.
5. Perez AG: Some Ideas and Examples to Evaluate Ontologies. In: Knowledge
Systems, AI Laboratory. Stanford: Stanford University; 1994a.
6. Perez AG, Lopez MF, De Vicente A: Towards a Method to Conceptualize
Domain Ontologies. In: Workshop on ontological Engineering ECAI'96: 1996; Budapest,
Hungary; 1996: 41-51.
7. Fernandez M, Gomez-Perez A, Juristo N: METHONTOLOGY: From
Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering
Spring Symposium Series AAAI97: 1997; Stanford; 1997.

8. Pinto HS, Martins JP: Ontologies: How can they be built? Knowledge and Information Systems 2004, 6:441-463.
9. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for
Distributed, Loosely-controlled and evolving engineering of ontologies. In:
European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
10. IEEE: IEEE standard for software quality assurance plans. In. Edited by IEEE,
vol. 730-1998: IEEE Computer Society; 1998.
11. IEEE: IEEE Standard Glossary of Software Engineering Terminology. In:
IEEE Standards vol. IEEE Std 610.12-1990: IEEE; 1991.
12. Greenwood E: Metodologia de la investigacion social. Buenos Aires: Paidos;
1973.
13. Gomez-Perez A, Fernandez-Lopez M, Corcho O: Ontological Engineering.
London: Springer-Verlag; 2004.
14. IEEE: IEEE Standard for Developing Software Life Cycle Processes. In. Edited
by IEEE, vol. IEEE Std 1074-1995: IEEE Computer Society; 1996.
15. OBI, Ontology for Biological Investigations [http://obi.sourceforge.net/]
16. Microarray Gene Expression Data [http://www.mged.org/]
17. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing
functional genomics experiments. Comparative and Functional Genomics 2003, 4:127-
132.
18. Cooke N: Varieties of Knowledge Elicitation Techniques. International Journal of
Human-Computer Studies 1994, 41:801-849.
19. Gaines BR, Shaw MLG: Knowledge acquisition tools based on personal
construct psychology. The Knowledge Engineering Review 1993, 8(1):49-85.
20. Gruber TR: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In: International Workshop on Formal Ontology: 1993; Padova, Italy;
1993.
21. Cote R, Jones P, Apweiler R, Hermjakob H: The Ontology Lookup Service, a
lightweight cross-platform tool for controlled vocabulary queries. BMC
Bioinformatics 2006, 7:97.
22. Noy NF, McGuinness DL: Ontology Development 101: A Guide to Creating Your First
Ontology. In: Protege Documentation. Stanford, CA: Stanford University; 2001.
23. Perez AG, Fernandez-Lopez M, Corcho O: Ontological Engineering: Springer;
2004.

24. Perez AG, Juristo N, Pazos J: Evaluation and assessment of knowledge sharing technology. In: Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing (KBKS95): 1995; Amsterdam, The Netherlands: IOS Press; 1995: 289-296.
25. Haarslev V, Möller R: Racer: A Core Inference Engine for the Semantic Web. In:
Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools
(EON2003): October 20 2003; Sanibel Island, Florida, USA; 2003: 27-36.
26. Sirin E, Parsia B, Cuenca-Grau B, Kalyanpur A, Katz Y: Pellet: A practical OWL-
DL reasoner. Journal of Web Semantics 2007, 5(2).
27. Eden HA, Hirshfeld Y: Principles in formal specification of object oriented
design and architecture. In: Proceedings of the 2001 conference of the Centre for Advanced
Studies on Collaborative research: 2001; Toronto, Canada: IBM Press; 2001.
28. Pressman RS: Software Engineering: A Practitioner's Approach, Fifth edn: McGraw-Hill; 2001.
29. Kerr J, Hunter R: Inside RAD: McGraw-Hill; 1994.
30. Martin J: Rapid Application Development: Prentice-Hall; 1991.
31. Gilb T: Principles of Software Engineering Management: Addison-Wesley
Longman; 1988.
32. Gilb T: Evolutionary Project Management: Multiple Performance, Quality and
Cost Metrics for Early and Continuous Stakeholder Value Delivery. In:
International Conference on Enterprise Information Systems: April 14-17 2004; Porto, Portugal;
2004.
33. Sure Y: Methodology, Tools & Case Studies for Ontology based Knowledge Management. Karlsruhe: Universitat Fridericiana zu Karlsruhe; 2003.
34. Fernandez M: Overview Of Methodologies For Building Ontologies. In: Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5): 1999; Stockholm, Sweden; 1999.
35. Dagnino A: Coordination of hardware manufacturing and software
development lifecycles for integrated systems development. In: IEEE
International Conference on Systems, Man, and Cybernetics: 2001; 2001: 1850-1855.
36. Boehm B: A spiral model of software development and enhancement. ACM
SIGSOFT Software Engineering Notes 1986, 11(4):14-24.
37. McDermid J, Rook P: Software Development Process Models. In: Software
Engineer's Reference Book. CRC Press; 1993: 15-28.
38. Larman C, Basili VR: Iterative and Incremental Development: A Brief History. Computer 2003, 36(6):47-56.
39. May EL, Zimmer BA: The Evolutionary Development Model for Software. HP Journal 1996: http://www.hpl.hp.com/hpjournal/96aug/aug96a94.htm.

The use of concept maps during knowledge elicitation in ontology development processes

A critical assessment of the state of the art of methodologies for developing ontologies
was presented earlier in this thesis work. It was subsequently followed by the presentation of the proposed methodology; this methodology is the product of several experiments and analyses addressing some of the previously identified key issues when developing ontologies in
communities of practice such as the biological domain.

This chapter is divided into two sections; the first presents the process, results, and conclusions for one of the experiments upon which the proposed methodology relies. Some specific issues were addressed when conducting this experiment; for instance, how could the knowledge elicitation process be supported throughout the entire development? How could domain experts be engaged in such a way that interaction could be facilitated? Which parts of previously proposed methodologies could be applied within this setting? Important information was gathered from this experience: not only were methodological aspects identified, but the importance of concept maps was also documented and well established as part of the development process. The second part of this chapter presents another scenario (an ontology for a genealogy management system) for which those identified steps were also evaluated.

The contributions of this chapter are the thorough description of the suggested steps when building an ontology, example use of concept maps, consideration of applicability to the development of lower-level ontologies, and application to decentralised environments. Other authors had previously used concept maps when eliciting knowledge, but this was the first reported experience of the use of concept maps with the specific aim of developing ontologies. It was also found that, within the specific scenario presented, concept maps played an important role in the development process. Another important outcome from this experience was the evidence supporting the importance of communities and how these were
interacting when building ontologies.

The author investigated and identified the reusable steps from other methodologies applicable to this specific environment; Alex Garcia also identified and conceptualised the use of concept maps when developing ontologies, as well as the different stages within the development process in which concept maps could play a role. As the knowledge engineer in charge of this experiment, Alex Garcia was also able to explore and document the roles of both domain experts and knowledge engineers. The manuscripts leading to the papers published from this chapter were written by Alex Garcia.

AUTHORS' CONTRIBUTIONS

Susanna Sansone conceived of and coordinated the project. Alex Garcia Castro was a
knowledge engineer during his 11-month student project at EBI. Philippe Rocca-Serra
coordinated the nutrigenomics community within MGED RSBI, and organised and
participated in the knowledge elicitation exercises. Karim Nashar contributed to the
knowledge elicitation exercises. Robert Stevens assisted Alex Garcia Castro in conceptualising
the methodology, Susanna Sansone and Philippe Rocca-Serra supervised the knowledge
elicitation exercises and, with Chris Taylor, the associated meetings. Alex Garcia Castro wrote
the initial version of the manuscript; contributions and critical reviews by the other authors, in
particular Susanna Sansone and Robert Stevens, delivered the final manuscript.

PUBLISHED PAPER ARISING FROM THIS CHAPTER – FIRST SECTION

Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S:


The use of concept maps during knowledge elicitation in ontology development
processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.

PUBLISHED PAPER ARISING FROM THIS CHAPTER – SECOND SECTION

Garcia Castro A, Sansone S, Rocca-Serra P, Taylor C, Ragan MA: The use of concept
maps for two ontology developments: nutrigenomics, and a management system for
genealogies. In: 8th Intl Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.

3 Chapter III - The use of concept maps during knowledge elicitation in ontology development processes

3.1 THE USE OF CONCEPT MAPS DURING KNOWLEDGE ELICITATION IN ONTOLOGY DEVELOPMENT PROCESSES – THE NUTRIGENOMICS USE CASE

Abstract. Incorporation of ontologies into annotations has enabled ‘semantic integration’ of complex data,
making explicit the knowledge within a certain field. One of the major bottlenecks in developing bio-
ontologies is the lack of a unified methodology. Different methodologies have been proposed for different
scenarios, but there is no agreed-upon standard methodology for building ontologies. The involvement of
geographically distributed domain experts, the need for domain experts to lead the design process, the
application of the ontologies and the life cycles of bio-ontologies are amongst the features not considered by
previously proposed methodologies. Here, we present a methodology for developing ontologies within the
biological domain. We describe our scenario, competency questions, results and milestones for each
methodological stage. We introduce the use of concept maps during knowledge acquisition phases as a
feasible transition between domain expert and knowledge engineer. The contributions of this paper are the
thorough description of the steps we suggest when building an ontology, example use of concept maps,
consideration of applicability to the development of lower-level ontologies and application to decentralised
environments. We have found that within our scenario concept maps played an important role in the
development process.

3.1.1 Background

In the field of biological research, recent advances in functional genomics technologies have provided the opportunity to carry out complex and possibly high-throughput investigations. Consequently, the storage, management, exchange and description of data in this domain present challenges to biologists and bioinformaticians. It is widely recognised that capturing descriptions of investigations at a high level of granularity is necessary to enable efficient data sharing and meaningful data mining [1, 2]. However, this information is often captured in diverse formats, mostly as free text, and is commonly subject to typographical errors. The increased cost of interpreting the experimental procedures and exploring data has encouraged several scientific communities to develop and adopt ontology-based knowledge representations to extend the power of their computational approaches [3].

Application of an ontologically based approach should be more powerful than simple keyword-based methods for information retrieval. Not only can semantic queries be formed, but axioms that specify relations among concepts can also be provided, making it possible for a user to derive information that has been specified only implicitly. In this way, relevant entries and text can be found even if none of the query words is present (e.g. a query for "furry quadrupeds" might retrieve pages about bears) [4].
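To make this mechanism concrete, the following minimal sketch (using the rdflib library and a hypothetical zoo namespace; an illustration, not part of the published work) shows how a single subclass axiom lets a query about furry quadrupeds retrieve an individual asserted only to be a bear:

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/zoo#")  # hypothetical namespace for the sketch
g = Graph()

# One axiom: every Bear is a FurryQuadruped.
g.add((EX.Bear, RDFS.subClassOf, EX.FurryQuadruped))
# One fact: Baloo is asserted only as a Bear.
g.add((EX.Baloo, RDF.type, EX.Bear))

# The query follows rdf:type and then any chain of rdfs:subClassOf links,
# so Baloo is retrieved although "FurryQuadruped" never annotates him directly.
query = """
SELECT ?animal WHERE {
    ?animal rdf:type/rdfs:subClassOf* ex:FurryQuadruped .
}
"""
for row in g.query(query, initNs={"ex": EX, "rdf": RDF, "rdfs": RDFS}):
    print(row.animal)  # -> http://example.org/zoo#Baloo
```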

Many methodologies for building ontologies have been described [5] and seminal work
in the field of anatomy provides insights into how to build a successful ontology [6, 7].
Extensive work on the nature of the relations that can be used also provides solid grounds for the consistent development of ontologies [8]. However, despite these efforts, bio-
ontologies still tend to be built on an ad hoc basis rather than by following a well-defined
engineering process. To this day, no standard methodology for building ontologies has been
agreed upon. Usually terminology is gathered and organised into a taxonomy, from which key
concepts are identified and related to create a concrete ontology. Case studies have been
described for the development of ontologies in diverse domains, although surprisingly only
one of these has been reported to have been applied in a domain allied to bioscience – the
chemical ontology [9] – and none in bioscience per se. Most of the literature focuses on issues
such as the suitability of particular tools and languages for building ontologies, with little
attention being given to how it should be done. This is almost certainly because the main
interest has been in reporting content and use, rather than engineering methodology.
Nevertheless, it is apparent that most ontologies are built with the ontological equivalent of
“hacking”.

A particular lack in these methodologies is support for the continued involvement of domain experts scattered around the world. Biological sciences pose a scenario in which domain experts are geographically distributed, the structure of the ontology is constantly evolving, and the role of the knowledge engineer is not that of a leader but rather of a facilitator who promotes collaboration and communication among domain experts. Bioinformatics has
demonstrated a need for bio-ontologies and several characteristics highlight the lack of
support for these requirements:

• The volatility of knowledge in the domain – biologists’ understanding of the domain is in continual flux;
• The domain is large and complex, and cannot, therefore, be modelled in one single effort; the knowledge holders are distributed and will not be brought together for frequent knowledge elicitation exercises.
To support these requirements, our methodology pays particular attention to the
knowledge elicitation stage of the process of building an ontology. This is the stage where the
person managing the development of the ontology gathers, in the form of concepts and
relationships between concepts, what the domain expert understands to exist in that domain.
To do this, we used concept maps (CMs), a simple graphical representation in which instances
and classes are presented as nodes, and relationships between them are shown as arcs [10].
CMs have a simple semantics that appears to be an intuitive form by which domain experts
can convey their understanding of a domain. We exploit this feature in order to perform the
informal modelling stage of building an ontology.

In support of this argument, we first present a survey of ontology development methodologies, and then report our experience, with particular focus on the how of the initial
stages of building an ontology using CMs. We have studied and evaluated the key
methodologies and have adapted parts of several of them to produce an overall method,
which we describe here as a set of detailed stages that, we argue, can be applied to other
domains within the biological sciences. The major contributions of this paper are the
thorough description of our methodology for building an ontology (including an examination
of the utility of CMs), the consideration of its applicability to the development of ontologies,
and the assessment of its suitability for use in decentralised settings. Finally, we discuss the
issues raised and draw conclusions.

3.1.1.1 A survey of methodologies

We investigated five methodologies: Enterprise Methodology [11], TOVE (Toronto Virtual Enterprise) [12, 13], the Unified Methodology [14, 15], Diligent [16] and
Methontology [17]. Table 1 presents a summary of our comparison. We analysed these approaches according to the following criteria:

• Accuracy in the description of the stages: We were interested in knowing if the stages were sufficiently described so they could be easily followed.
• Terminology extraction: We wanted to study how terminology extraction could
assist knowledge engineers and domain experts when building ontologies. We were
interested in those methodologies that could offer some level of support for
identifying terms.
• Generality: We needed to know how dependent the investigated methodologies are on a particular intended use. This point was of particular interest to us since our ontology was intended to serve a particular task. This parameter may be assimilated to the ability of the method to be applied to a different scenario, or to a different use of the ontology itself.
• Ontology evaluation: We needed to know how we could evaluate the completeness
of our ontology. This point was interesting for us since we were working with
agreements within the community, and domain experts could therefore agree upon
errors in the models.
• Distributed and decentralised: We were interested in those methodologies that could
offer support for communities such as ours in which domain experts were not only
geographically distributed but also organised in an atypical manner (i.e. not fully
hierarchical).
• Usability: We had a particular interest in those methodologies for which real
examples had been reported. Had the methodology been applied to building a real
ontology?
• Supporting software: We were interested in knowing whether the methodology was
independent from particular software.
We found that only Diligent offered community support for building ontologies and
none of them had detailed descriptions about knowledge elicitation, nor did they have details
on the different steps that had to be undertaken. The methodologies mentioned above have
been applied mostly in controlled environments where the ontology is deployed on a one-off
basis. Tools, languages and methodologies for building ontologies have been the main research
goal for many computer scientists; whereas for the bioinformatics community, it is just one
step in the process of developing software to support tasks such as annotation and text
mining.

| Criterion | Enterprise Methodology | TOVE | Unified Methodology | Methontology | Diligent |
| --- | --- | --- | --- | --- | --- |
| Description of stages | High-level description of stages | Detail is provided for those ontologies developed with this methodology | High-level description of stages | Stages are described; more detail is provided for specific developments (chemical and legal ontology) | High-level description |
| Terminology extraction | N/A | N/A | N/A | N/A | N/A |
| Generality | Not domain specific | Not domain specific | Not domain specific | Not domain specific | Not domain specific |
| Ontology evaluation | Competency questions | Competency questions and formal axioms | No evaluation method is provided | An informal evaluation method is used for the chemical ontology | The community evaluates the ontology; agreement process |
| Distributed / decentralised | No | No | No | No | Yes |
| Usability | N/A | Business and foundational ontologies | N/A | Chemical ontology, legal ontology | N/A |
| Supporting software | N/A | N/A | N/A | WebODE | N/A |

Chapter 3 - Table 1. Comparison of methodologies.
Unfortunately, none of the methodologies investigated was designed for the
requirements of bioinformatics, nor has any of them been standardised and stabilised long
enough to have a significant user community (i.e. large enough for the ontology to have an
impact on the community) [18]. Theoretically, the methodologies are independent of the domain and intended use. However, none of the methodologies has been used long enough to provide evidence of its generality. They had been developed in order to address a specific problem, or as an end in itself. The evaluation of the ontology remains a difficult
issue to address; there is a lack of criteria for evaluating ontologies. Within our particular
scenario, the models were being built upon agreements between domain experts. Evaluation
was therefore based upon their knowledge and thus could contain “settled” errors. We
studied those knowledge elicitation methods described by [19] such as observation,
interviews, process tracing, conceptual methods, and card sorting. Unfortunately, none of
them was described within the context of ontology development in a decentralised setting.

We drew parallels between the biological domain and the Semantic Web (SW). This is a vision in which the current, largely human-accessible Web is annotated using ontologies, such that the vast content of the Web is available to machine processing [20]. Pinto and co-workers
[21] define these scenarios as distributed, loosely controlled and evolving. Domain experts in
biological sciences are rarely in one place; they tend to form virtual organisations where
experts with different but complementary skills collaborate in building an ontology for a
specific purpose. The structure of the collaboration does not necessarily have a central control
and different domain experts join and leave the network at any time and decide on the scope
of their contribution to the joint effort. Biological ontologies are constantly evolving, not only
as new instances are added, but also as new whole/part-of properties are identified as new
uses of the ontology are investigated. The rapid evolution of biological ontologies is due in
part to the fact that ontology builders are also those who will ultimately use the ontology [22].

Some of the differences between classic proposals from Knowledge Engineering (KE)
and the requirements of the SW, have been presented by Pinto and co-workers [21], who
summarise these differences in four key points:

1. Distributed information processing with ontologies: within the SW scenario, ontologies are developed by geographically distributed domain experts willing to
collaborate, whereas KE deals with centrally-developed ontologies.

2. Domain expert-centric design: within the SW scenario, domain experts guide the
effort while the knowledge engineer assists them. There is a clear and dynamic
separation between the domain of knowledge and the operational domain. In
contrast, traditional KE approaches relegate the expert to the role of an informant to the knowledge engineer.

3. Ontologies are in constant evolution in the SW, whereas in KE scenarios, ontologies are simply developed and deployed.

4. Additionally, within the SW scenario, fine-grained guidance should be provided by the knowledge engineer to the domain experts.

We consider these four points to be applicable within biological domains, where domain experts have crafted ontologies, taken care of their evolution, and defined their
ultimate use. Our proposed methodology takes into account all the considerations reported
by Pinto and co-workers [21], as well as those previously studied by the knowledge
representation community.

3.1.2 Methods

3.1.2.1 General view of our methodology

A key feature of our methodology is the use of CMs throughout our knowledge
elicitation process. CMs are graphs consisting of nodes representing concepts, connected by
arcs representing the relationships between those nodes [23]. Nodes are labelled with text
describing the concept that they represent, and the arcs are labelled (sometimes only
implicitly) with a relationship type. Within our development, CMs proved useful both for sharing and capturing activities and in the formalisation of use cases. Figure 1 illustrates a
CM.

Our methodology strongly emphasises: (i) capturing knowledge, (ii) sharing knowledge,
(iii) supporting needs with well-structured use cases, and (iv) supporting collaboration in
distributed (decentralised) environments. Figure 2 presents those steps and milestones that we
envisage to occur during our ontology development process.

Chapter 3 - Figure 1. View of a concept map.
Adapted with permission from: http://cmap.coginst.uwf.edu/info/

Chapter 3 - Figure 2. Steps (1-6) and milestones (boxes).


Step 1: The first step involves addressing straightforward questions such as: what is the ontology going to be used for? How is the ontology ultimately going to be used by the
software implementation? What do we want the ontology to be aware of, and what is the
scope of the knowledge we want to have in the ontology?

Step 2: When identifying reusable ontologies, it is important to focus on what any particular concept is used for, how it impacts on and relates to other concepts, how it is
embedded within the process to which it is relevant, and how domain experts understand it. It
is not important to identify exact linguistic matches. By recyclability of different ontologies, we
do not imply that we can indicate which other ontology should be used in a particular area or
problem; instead, we mean conceptually how and when one can extrapolate from one context
to another. Extrapolating from one context to another largely depends on the agreement of
the community, and specific conditions of the contexts involved. Indicating where another
ontology should be used to harmonise the representation at hand – for example, between
geographical ontologies and the NCBI (National Center for Biotechnology Information)
taxonomy – is a different issue that we refer to as reusability.

Step 3: Domain analysis and knowledge acquisition are processes by which the
information used in a particular domain is identified, captured and organised for the purpose
of making it available in an ontology. This step may be seen as the ‘art of questioning’, since
ultimately all relevant knowledge is either directly or indirectly in the heads of domain experts.
This step involves the definition of the terminology, i.e. the linguistic phase. It starts with the identification of reusable ontologies and terminates with the baseline ontology, i.e. a draft version containing few but seminal elements of an ontology. We found it important to
maintain the following criteria during knowledge acquisition:

• Accuracy in the definition of terms. The linguistic part of our development was also
meant to support the sharing of information/knowledge. Table 2 presents the
structure of our linguistic definitions. The availability of context as part of the
definition proved to be useful when sharing knowledge.
• Coherence: as CMs were being enriched it was important to ensure the coherence of
the story we were capturing. Domain experts were asked to use the CMs as a means
to tell a story; consistency within the narration was therefore crucial.
• Extensibility: Our approach may be seen as an aggregation problem; CMs were
constantly gaining information, which was always part of a bigger narration.
Extending the conceptual model was not only about adding more details to the
existing CMs, nor was it just about generating new CMs; it was also about
grouping concepts into higher-level abstractions and validating these with domain
experts. Scaling the models involved the participation of both domain experts and
the knowledge engineer. It was mostly done by direct interview and confrontation
with the models from different perspectives. The participation of new “fresh”
domain experts as well as the intervention of experts from allied domains allowed us
to analyse the models from different angles. This participatory process allowed us to
re-factorise the models by increasing the level of abstraction.

Word: Investigation
Verb/Noun: Noun
Definition: An Investigation is a set, a collection of related studies and assays; a self-contained unit of scientific enquiry.
Context: Evaluating the effect of an ingredient in a diet traditionally relies on one or more related studies, for example where the subjects receive different concentrations of the ingredient. The concept of investigation provides a container that allows us to group these studies together.
Notes: When can we consider an investigation completed? Ongoing discussion. For instance, according to the Minimal Information About a Microarray Experiment (MIAME), an Experiment is a set of hybridisations that are in some way related (e.g. related to the same publication). In the case of the Investigation, we do not want to tie this concept to a publication, a deposition to a database, or a submission to a regulatory authority. The decision should be left to the individual investigator.
Chapter 3 - Table 2. Example of the structure of linguistic definitions.
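Purely as an illustration (the record type and field names below are hypothetical, not part of the actual development), such a dictionary entry can be captured as a small structured record, which keeps definitions uniform and easy to share:

```python
from dataclasses import dataclass

@dataclass
class LinguisticDefinition:
    """One entry of the explanatory dictionary; fields mirror Table 2."""
    word: str
    part_of_speech: str   # e.g. "Noun"
    definition: str
    context: str          # how the term is used within the domain
    notes: str = ""       # open issues still under discussion

entry = LinguisticDefinition(
    word="Investigation",
    part_of_speech="Noun",
    definition=("A set, a collection of related studies and assays; "
                "a self-contained unit of scientific enquiry."),
    context=("Provides a container grouping, e.g., several studies in which "
             "subjects receive different concentrations of an ingredient."),
    notes="When can we consider an investigation completed? Ongoing discussion.",
)
print(entry.word, "-", entry.definition)
```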
The goal determines the complexity of the process. Creating an ontology intended only
to provide a basic understanding of a domain may require less effort than creating one
intended to support formal logical arguments and proofs in a domain. We must answer
questions such as: Why are we building this ontology? What do we want to use it for? How is
it going to be used by the software layer? The subsections 'Identification of purpose, scope, competency questions and scenarios' through 'Iterative building of informal ontology models' explain these steps in detail.

Step 4: Iterative building of informal ontology models helped to expand our glossary of
terms, relations, their definition or meaning, and additional information such as examples to
clarify the meaning where appropriate. Different models were built and validated with the
domain experts.

Step 5: Formalisation of the ontology was the step during which the classes were
constrained, and instances were attached to their corresponding classes. For example: “a male
is constrained to be an animal with a Y-chromosome”. This step involves the use of an
ontology editor.
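A minimal sketch of this formalisation step follows; it uses the owlready2 library as a stand-in for the Protégé editor used in our development, and the ontology IRI and property names are hypothetical:

```python
from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology("http://example.org/formalisation.owl")  # hypothetical IRI

with onto:
    class Animal(Thing): pass
    class YChromosome(Thing): pass
    class has_part(ObjectProperty): pass

    # "A male is constrained to be an animal with a Y-chromosome",
    # encoded as a necessary-and-sufficient (equivalent class) condition.
    class Male(Animal):
        equivalent_to = [Animal & has_part.some(YChromosome)]

    adam = Male("adam")   # attaching an instance to its class

print(adam.is_a)          # -> [formalisation.Male]
```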

Step 6: There is no unified framework to evaluate ontologies, and this remains an active
field of research. We consider that ontologies should be evaluated according to their fitness
for purpose, i.e. an ontology developed for annotation purposes should be evaluated by the
quality of the annotation and the usability of the annotation software. By the same token, the
recall and precision of the data, and the usability of the conceptual query builder, should form
the basis of the evaluation of an ontology designed to enable data retrieval.

3.1.2.2 Scenarios and ontology development process

The methodology we report herein has been applied during the knowledge elicitation
phase with the European nutrigenomics community (NuGO) [24]. Nutrigenomics is the
study of the response of a genome to nutrients, using “omics” technologies such as genomic-
scale mRNA expression (transcriptomics), cell and tissue-wide protein expression
(proteomics), and metabolite profiling (metabolomics) in combination with conventional
methods. NuGO includes twenty-two partner organisations from ten European countries,
and aims to develop and integrate all facets of resources, thereby making future nutrigenomics
research easier. An ontology for nutrigenomics investigations would be one of these
resources, designed to provide semantics for those descriptors relevant to the interpretation
and analysis of the data. When developing an ontology involving geographically distributed
domain experts, as in our case, the domain analysis and knowledge acquisition phases may
become a bottleneck due to difficulties in establishing a formal means of communication (i.e.
in sharing knowledge).

Additionally, the NuGO participants collaborate with international toxicogenomics and


environmental genomics communities under the RSBI (Reporting Structure for Biological
Investigations) [25], a working group of the Microarray Gene Expression Data (MGED)
Society. One of the objectives of RSBI is the development of a common high-level abstraction defining the semantic and syntactic scaffold of a record/document that describes
an investigation in these diverse biological domains. The RSBI groups will validate the high-
level abstraction against complex use cases from their domain communities, ultimately
contributing to the Functional Genomics Ontology (FuGO), a large international
collaborative development project [26].

Application of our methodology in this context, with geographically distributed groups, has allowed us to examine its applicability and understand the suitability of some of the tools
currently available for collaborative ontology development.

3.1.2.2.1 IDENTIFICATION OF PURPOSE, SCOPE, COMPETENCY QUESTIONS AND SCENARIOS

Whilst the high-level framework of the nutrigenomics ontology will be built as a collaborative effort with the other MGED RSBI groups, the lower-level framework aims to provide semantics for those descriptors specific to the nutritional domain.

Having defined the scope of the ontology, we discussed the competency questions with
our nutrigenomics researchers (henceforth our domain experts); these were used at a later
stage in order to help evaluate our model. Examples of those competency questions are
presented in Table 3.

Which investigations were done with a high-fat-diet study?
Which study employs microarray in combination with metabolomics technologies?
List those studies in which the fasting phase had a duration of one day.
Chapter 3 - Table 3. Examples of competency questions.
Competency questions are understood here as those questions for which we want the ontology to be able to provide support for reasoning and inference processes. We consider that ontologies do not answer questions; rather, they provide support for reasoning processes. Domain experts should express the competency questions in natural language without any constraint.
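As a sketch of where such questions eventually lead, the third question of Table 3 could later be posed as a structured query once the relevant classes and relations exist; the namespace and property names below are hypothetical:

```python
from rdflib import Graph, Literal, Namespace, RDF

NG = Namespace("http://example.org/nutri#")  # hypothetical namespace
g = Graph()

# Toy data: one study whose fasting phase lasted one day.
g.add((NG.study42, RDF.type, NG.Study))
g.add((NG.study42, NG.has_phase, NG.fasting42))
g.add((NG.fasting42, RDF.type, NG.FastingPhase))
g.add((NG.fasting42, NG.duration_in_days, Literal(1)))

# "List those studies in which the fasting phase had a duration of one day."
query = """
SELECT ?study WHERE {
    ?study a ng:Study ;
           ng:has_phase ?phase .
    ?phase a ng:FastingPhase ;
           ng:duration_in_days 1 .
}
"""
for row in g.query(query, initNs={"ng": NG}):
    print(row.study)  # -> http://example.org/nutri#study42
```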

3.1.2.2.2 IDENTIFICATION OF REUSABLE AND RECYCLABLE ONTOLOGIES

For our particular purposes, we followed a ‘top-down’ approach where experts in the
biological domain work together to identify key concepts, then postulate and capture an initial
high-level ontology. We decided to follow this approach because of the availability of high-
level domain experts who could provide a more general picture. We identified for example
the Microarray Gene Expression Data (MGED) Ontology (henceforth, MO) [27] as a
possible ontology from which we could recycle - extrapolate from one context to another-
some terms and/or structure for investigation employing other omics technologies in addition
to expression microarrays. The Open Biomedical Ontologies project (OBO) [28, 29] was an
invaluable source of information for the identification of possible orthogonal ontologies.
Domain experts and the knowledge engineer worked together in this task; in our scenario, it
was a process where we focused on those high-level concepts that were part of MO and
relevant for the description of a complete investigation. We also studied the structure that
MO proposes, and by doing so came to appreciate that some concepts could be linguistically
different but in essence mean very similar things. This is an iterative process currently done as
part of the FuGO project. FuGO will expand the scope of MO, drawing in large numbers of
experimentalists and developers, and will draw upon the domain-specific knowledge of a wide
range of biological and technical experts.

3.1.2.2.3 DOMAIN ANALYSIS AND KNOWLEDGE ACQUISITION

We hosted a series of meetings during which the domain experts discussed the
terminology and structure used to describe nutrigenomics investigations. For us, domain
analysis is an iterative process that must take place at every stage of the development process.
We focused our discussions on specific descriptions about what the ontology should support,
and sketched the planned area in which the ontology would be applied. Our goal was also to
guide the knowledge engineer and involve that person in a more direct manner.

An important outcome from this phase was an initial consensus reached on those terms
that could potentially have a meaning for our intended users. The main aim of these informal
linguistic models was to build an explanatory dictionary; some basic relations were also
established between concepts. We decided to use two separate tools (Protégé [30] and
CMAP-tools [10]) because none of the existing Protégé plug-ins provided direct manipulation
capabilities over the concepts and the relations among them the way CMAP-tools does.
Additionally, we studied different elicitation experiences with CMs, such as [31, 32]. Our knowledge formalism was Description Logic (DL); we used the Protégé OWL plug-in.

CMs were used in two stages of our process: capturing knowledge, and testing the
representation. Initially we started to work with informal CMs; although they are not
computationally enabled, for a human they appear to have greater utility than other forms of
knowledge representation such as spreadsheets or word processor tables. As the model gained
semantic richness, by formalising ‘is-a’ and ‘whole/part-of’ relationships between the concepts
the CMs evolved and became more complex. Using CMs, our domain experts were able to
identify and represent concepts, and declare relations among them. We used CMAP-tools
version 3.8 [10] as a CM editor.

3.1.2.2.3.1 ATTRIBUTES OF THE DOMAIN EXPERTS

Experts should of course be highly knowledgeable in their respective areas. We identified two kinds of nutrigenomics experts: high-level experts, scientists at a project
coordination level involved in interdisciplinary efforts, and domain-specific experts, with
extensive hands-on experience, experimentalists at a more technical level. When developing
an ontology, it is also important to have experts with broad vision, so the flow of information
could be captured and specific controlled vocabularies properly identified.

3.1.2.2.3.2 THE KNOWLEDGE ELICITATION SESSIONS

The goal of these sessions was to identify both the high-level and low-level domain
concepts, why these concepts were needed, and how they could be related. A secondary goal
was to identify reusable ontologies where possible.

In the first sessions, it was important to see clearly the ‘what went where’, as well as the
structure of the relationships that ‘glued’ the information together. We were basically working
with informal artefacts (CMs, word processor documents, spreadsheets and drawings); it was
only at a later stage that we achieved some formalisation.

Some sessions took place by teleconference; these were supported by iterative use of
WEBEX (web, video, and teleconferencing software) [33] and Protégé. CMs were also used
to present structural aspects of the concepts. We found it important to set specific goals for
each teleconference, with these goals ideally specified as questions that are distributed prior to
the meeting. In our case, most of the teleconferences focused on specific concepts, with
questions of the form “how does A relate to B?”, “why do we need A here instead of B?”, and “how does
A impact on B?”. Cardinality issues were also discussed.

3.1.2.2.3.3 REPRESENTING CONCEPTUAL QUERIES

We also used CMs to represent conceptual queries. We observed that domain experts
are used to querying information systems using keywords, rather than building structured
queries. In formalising the conceptual queries, CMs provided the domain experts with a tool
that allowed them to go from an instance to the appropriate class/concept, at the same time
identifying the relationships. For example, within the nutrigenomics domain some
investigations study the health status of human volunteers looking at the level of zinc in their
hair. These investigations may take place in different research institutes, but all the
information may be stored in just one central repository. In order to correlate all those
investigations, the researcher should be able to formulate a simple query: “what is the zinc
concentration in hair across three different ethnic groups”. Figure 3 illustrates this query. Conceptually
this query relates compounds, health function and ethnicity. The concept of compound implies a
measurement; by the same token the concept of health function implies a particular part of
the organism.

Chapter 3 - Figure 3. CMs as means to structure a conceptual query.


Conceptual queries are based on high-level abstractions, relationships between
concepts, concept-instances and logical operators; the selection of high-level abstraction
allows the class to be instantiated. Conceptual queries provide a level of interaction between
the user and the external sources, removing the need for the user to be aware of the schema.
We do not want only to guide the user by allowing him/her to select concepts; we would also like to prompt the user in a consistent and coherent way, so that the user can constrain the query before execution takes place, and/or navigate intelligently across terms. This is why we ultimately need an ontology, and not simply a controlled vocabulary or a dictionary of terms. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes [4].

The collected competency questions could be used as a starting point for building the
conceptual queries. Competency questions are informal, whereas conceptual queries are used
to identify the ‘class-relation-instance’ and thus improve the understanding of how users may
ultimately query the system. Conceptual queries may be understood as a formalisation of
competency questions.
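The 'class-relation-instance' structure behind the zinc query of Figure 3 can be sketched as follows (rdflib, with a hypothetical namespace, toy data and illustrative relation names, not those of the actual ontology):

```python
from rdflib import Graph, Literal, Namespace

NG = Namespace("http://example.org/nutri#")  # hypothetical namespace
g = Graph()

# Toy repository content: one zinc-in-hair measurement for one volunteer.
g.add((NG.m1, NG.measures_compound, NG.zinc))
g.add((NG.m1, NG.measured_in, NG.hair))
g.add((NG.m1, NG.has_value, Literal(187.5)))
g.add((NG.m1, NG.subject_of, NG.volunteer7))
g.add((NG.volunteer7, NG.member_of_group, NG.groupA))

# The user picks instances (zinc, hair); the scaffold of classes and
# relations (compound -> measurement -> organism part -> ethnic group)
# comes from the ontology, so no knowledge of the schema is required.
query = """
SELECT ?group ?concentration WHERE {
    ?m ng:measures_compound ng:zinc ;
       ng:measured_in       ng:hair ;
       ng:has_value         ?concentration ;
       ng:subject_of        ?v .
    ?v ng:member_of_group   ?group .
}
"""
for row in g.query(query, initNs={"ng": NG}):
    print(row.group, row.concentration)
```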

3.1.2.2.4 ITERATIVE BUILDING OF INFORMAL ONTOLOGY MODELS

Domain experts represented their knowledge in different CMs that they were
generating. Their representation was very specific; they were providing instances and relating
these instances with very detailed whole/part-of relations. Figure 4 presents an example from
the nutrigenomics domain that illustrates how we used the CMs in order to move from
instances to classes, to identify is_a relations, and to define the whole/part-of relationships more precisely.

Chapter 3 - Figure 4. Elicitation of Is_a, whole/part-of, and classes.


Initially, domain experts represented specific cases with instances rather than classes.
The specificity of the use cases made it easy to identify a subject-predicate structure where
subjects could be assimilated to instances. Predicates, in most of the cases, carried relations and/or information pointing to other ontologies that were needed. Subjects were
understood as those entities that perform an action or who receive the action, whereas the
predicate contains whatever may be said about the subject.

By gathering use cases in the form of CMs, we could identify the classes and subclasses,
for example: beverage is_a food, juice is_a non-alcoholic beverage. The has_attribute/is_attribute_of
property attached to the instance was also discussed. Moving from instances to classes was an
iterative process in which domain experts were representing their knowledge by providing a
narration full of instances, specific properties, and relationships. The knowledge engineer
analysed all the material. By doing so, different levels of abstractions that could be used in
order to group those instances were identified; ultimately domain experts validated this
analysis.
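The beverage example can be sketched in a few lines (owlready2, hypothetical IRI), showing the class scaffold recovered from the instance-rich CMs and one of the original instances attached back to it:

```python
from owlready2 import get_ontology, Thing

onto = get_ontology("http://example.org/food.owl")  # hypothetical IRI

with onto:
    # Classes recovered from the instance-rich CMs:
    class Food(Thing): pass
    class Beverage(Food): pass                  # beverage is_a food
    class NonAlcoholicBeverage(Beverage): pass
    class Juice(NonAlcoholicBeverage): pass     # juice is_a non-alcoholic beverage

    # The kind of instance a domain expert actually drew in a CM:
    orange_juice = Juice("orange_juice")

print(orange_juice.is_a)   # -> [food.Juice]
print(Juice.ancestors())   # includes Beverage, Food and Thing
```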

3.1.3 Future work

As the nutrigenomics work contributes to the development of FuGO, the final steps (formalisation and evaluation) will be possible only at a later stage, after our results (e.g. new
concepts and/or structures) are evaluated and integrated into the structure of the functional
genomics investigation ontology. However, we will continue to evaluate our framework with
our nutrigenomics users and the other RSBI groups, to see if it accurately captures the
information we need, and if our terminology and definitions are sufficiently clear to assist the
annotation process.

3.1.3.1 Formalisation

Moving from informal models to formal models with accurate is-a and whole/part-of
relationships will be done using Protégé. FuGO will also be developed in Protégé because it
has strong community support, multiple visualisation facilities, and can export the
ontology in different formats (e.g. OWL, RDF, XML, and HTML). Partly because Protégé
and CMAP-tools are not currently integrated and partly because they aim to assist different
stages during the process of developing an ontology, this has to be done, mostly, by hand. We
envisage that integration of these two tools may help knowledge engineers in this process;
semi-automated translation from CMs into OWL structures through the provision of
assistance, in order to allow developers to formally encode bio-ontologies, would be desirable.

Hayes and co-workers [34] addressed the problem of moving from CMs into OWL
models. They extended CMAP-tools so that it supports import and export of machine-interpretable
knowledge formats such as OWL. Their approach assumes that the construction of the
ontology starts from the CM and that the CM evolves naturally into the ontology. This makes
it difficult for large ontologies where several CMs shape only a part of the whole ontology.
Furthermore, adding asserted conditions (such as necessary, necessary and sufficient) was not
possible; formalisation involves the encoding of the CM into a valid OWL structure by
identifying and properly declaring classes and properties. Based on those experiences in which
we have used CMs, we are designing a tool that supports such transition.

Difficulties arise from the divergence of syntactic formats between CMs and OWL models; CMs do not have logical constraints, whereas OWL structures are partially supported by them. The lack of connection between concepts as understood in CMs and OWL classes should also be noted. During the elicitation process, the information gathered by means of CMs was usually incomplete, in the sense that it tended to be too narrow, meaningful only within the context of a particular researcher. Moreover, CMs initially pictured processes; at later stages, as they gained specificity, the identification of terms and relationships was enriched. All of this adds to the difference between the information one can gather in a CM and in an OWL model, and it also emphasises the complementary relationship between the two. The node-arc-node structure of a CM may be assimilated to an RDF representation, as well as to an embryonic OWL model. The proximity between CMs and OWL models allows the arrangement of a CM directly into the syntactic structure of an OWL file, thereby avoiding some of the inconveniences of translation between unrelated models. The transition from a CM model to an OWL model may be made easier by allowing domain experts to develop parts of the ontology with the assistance of knowledge engineers.
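A minimal sketch of this correspondence follows (rdflib; the arc labels and the small mapping heuristic are illustrative assumptions, not the tool we are designing). Each node-arc-node triple of a CM becomes an RDF triple, with taxonomic arcs promoted to subclass axioms and mereological arcs to a part-of property:

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/cm#")  # hypothetical namespace

# A concept map exported as (node, arc label, node) triples.
cm_arcs = [
    ("Juice", "is a", "Beverage"),
    ("FastingPhase", "part of", "Study"),
    ("Study", "part of", "Investigation"),
]

g = Graph()
for source, label, target in cm_arcs:
    s, t = EX[source], EX[target]
    if label == "is a":
        g.add((s, RDFS.subClassOf, t))      # taxonomic arc -> subclass axiom
    elif label == "part of":
        g.add((s, EX.part_of, t))           # mereological arc -> part-of property
    else:                                   # any other arc label, kept verbatim
        g.add((s, EX[label.replace(" ", "_")], t))

print(g.serialize(format="turtle"))
```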

The assistance of the knowledge engineer should focus on the consistency of the
whole/part-of properties in order to ensure orthogonality. Domain experts express in their
CMs their different views of the world; the fragmentation of the domain of knowledge is
mostly done by means of is-a relationship and whole/part-of properties. Once these
properties and relationships are properly defined, combining complementary CMs may be
much easier; also by doing so, the consistency of the OWL model may be assured.

Integrating CM functionality into Protégé will not, by itself, ensure that the knowledge acquisition process is better supported or that the formalisation/encoding of ontologies is achieved more rapidly. It is also important to harmonise both CMs and OWL
models syntactically and semantically. The construction of the class hierarchy should be done
in parallel with the definition of its properties. This will allow us to identify potential
redundancies and inconsistencies in the ontology. Domain analysis will thus be present
throughout the whole development process.

3.1.3.2 Evaluation

Before putting the ontology into use, we will need to evaluate how accurately it could
answer our competency questions and conceptual queries. To accomplish this, we will use
CMs as well as some functionalities included in Protégé.

Because our CMs represent the conceptual scaffold of the knowledge we are
representing, we will use them to evaluate how this discourse may be mapped into the
concepts and relationships we have captured. The rationale behind this is simple: the concepts
and relationships, if accurate, may then be mapped into the actual discourse. By doing this we
hope to identify:

• Where the concepts are not linguistically clear.


• Whether any redundancies are present.
• Whether the process has been accurately represented both syntactically and
semantically.
We envisage a simple structure for our validation sessions: domain experts will be
presented with the CM, and asked to map their narration into that CM. Minimal or no help
should then be given to the domain expert. The use of CMs as a narrative tool for evaluation
of ontologies has not to our knowledge been reported previously. Further research into this
particular application of CMs may be valuable.

Ultimately the ontology may also be evaluated by using the PAL (Protégé Axiom
Language) plug-in provided by Protégé. PAL allows the construction of more-sophisticated
queries. Among the methods described in [35], we checked consistency using only RACER [36].
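For illustration only, a comparable consistency check can be sketched with the owlready2 library and its bundled HermiT reasoner (hypothetical ontology; RACER itself is accessed through its own interface, and a Java runtime is assumed):

```python
from owlready2 import get_ontology, Thing, AllDisjoint, sync_reasoner

onto = get_ontology("http://example.org/check.owl")  # hypothetical IRI

with onto:
    class Animal(Thing): pass
    class Plant(Thing): pass
    AllDisjoint([Animal, Plant])          # Animal and Plant share no members
    class Confused(Animal, Plant): pass   # deliberately unsatisfiable class

with onto:
    sync_reasoner()                       # runs the bundled HermiT reasoner (requires Java)

# Unsatisfiable classes are reclassified under owl:Nothing:
print(list(onto.world.inconsistent_classes()))   # -> [check.Confused]
```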

3.1.4 Discussion

Building ontologies is a non-trivial task that depends heavily on domain experts. The
methodology presented in this paper may be used in different domains with scenarios similar
to ours. We used concept maps at different stages during this process, and in different ways.
The beauty of CMs is that they are informal artefacts; introducing formal semantics into them
remains a matter for further investigation. The translation from CMs to OWL remains
manual, and we acknowledge that some information may be lost, or even created, in this step
despite the constant participation of domain experts. An ideal ontology development tool
would assist users not only during knowledge elicitation, as CMAP-tools does well, but also
during the formalisation process, so that everything could be done within one software tool.

On the ‘art of questioning’: When to ask? How to ask? How to intervene in a discussion without taking sides? These are some of the considerations the elicitor must bear
in mind during the sessions. When to ask? Basically he/she should ask only when the
discussion is not heading in the direction of answering the stated question. How to ask? The
question may be stated as a direct question, or as a hypothesis in the form of, ‘if A happens then
what happens to B?’, ‘what is the relationship between A and B?’, ‘what are the implications A may have over
B?’. The knowledge engineer should ideally intervene in discussions as little as possible. The
experts are presented with an initial scenario or question, after which their discussion takes
place so knowledge can start to be elicited. CMs proved to be a very powerful tool for
constraining the discussions in a consistent way.

Unfortunately, too little attention has been paid in the bio-ontological literature to the
nature of such relations and of the relata that they join together [8]. This is especially true for
ontologies about processes. OBO provides a set of guidelines for structuring the
relationships, as well as for building the actual ontology. We are considering these guidelines and will follow them in our future development. We will also consider the issue of
orthogonality very carefully, as we have always thought about those ontologies that could, at a
later stage, be integrated into our proposed structure.

Currently, knowledge is commonly exchanged via email, WIKI pages and teleconferences. While this may still work for closely related groups or when working within a well-defined domain, we have demonstrated in this paper that CMs could effectively assist
both domain experts and the knowledge engineer, and provide a basis for properly visualising
the argument and its follow-ups. Tempich and co-workers addressed some of these issues by
proposing an argumentation ontology for distributed, loosely-controlled and evolving
engineering processes [16, 37].

The development of an ontology for Genealogy Management Systems (GMS) was another scenario in which our methodology was applied during the knowledge elicitation
process [38]. This was a slightly different scenario because our domain experts were mostly in
one place. The GMS ontology is meant to partially support annotation of germplasm
throughout the entire transformation process that takes place in several research institutes.
CMs were here initially used in order to represent those different transformation processes,
and at a later stage CMs, in combination with semi-automatic terminology extraction
algorithms, were also used in order to capture and organise vocabulary. The combination of
CMs and these semi-automatic methods for terminology extraction proved to be quite useful;
initially domain experts were presented with lists of terms, and were later requested to
organise them using CMs.

During the development of the GMS ontology, a narrative approach was also investigated in conjunction with semi-automatic text extraction methods. The approach taken was simple: domain experts were asked to build stories as they provided vocabulary. Empirical evidence from this experience suggests that CMs may provide us with a framework for larger terminology extraction and validation efforts. A paper describing these experiences is in preparation. Despite the differences between those domains, CMs proved useful for capturing and sharing knowledge, both as an external representation of the topic being discussed and as an organisational method for knowledge elicitation. It should be noted, however, that only time will tell how well this methodology transposes to other domains.


3.1.5 Conclusions

We have focused our efforts on knowledge elicitation within the nutrigenomics community. We present a methodology for building ontologies and report our experiences during the knowledge elicitation phase in particular. An informal evaluation of the knowledge elicitation sessions suggests strong commonalities with the argumentative structure proposed by several authors [21, 16, 37]. We identify the need for further research on how to manage this arrangement. For instance, it would be desirable to track discussions in a more structured and conceptual manner rather than by browsing through a vast set of emails. The structure of discussions over ontologies may follow a pattern. We consider that structuring discussions requires technology able to provide some cognitive support to users, not only to post their comments but also to follow and search the threads. Having provided evidence for the applicability of our methodology, it would be interesting to see how it can be extended and better supported by software tools such as Protégé.

General-purpose collaborative development environments focus more on technical aspects, such as consistency and version control, than on the actual act of collaboration. Collaborative environments such as WIKIs or version-control software (e.g. configuration management software) do not support ontology development in any special way. Recent developments of Protégé, such as those proposed by [39] and [19], are an interesting step in the right direction; however, too little attention has been paid to the actual process of collaboration when building ontologies within decentralised environments. Diaz and co-workers [39] have developed a tool that provides some extended multi-user capability, sessions, and a versioning control system. Building ontologies in which domain experts are informants and, at the same time, leaders of the process is, however, a more complex undertaking that requires more than just a tool in which different users may edit and work on the same file. Hayes and collaborators [19] provide an extension to CmapTools in which CMs may be saved as an OWL file. However, it proved difficult to read these files in Protégé due to some inconsistencies in the generated OWL structure; unfortunately this extension does not provide a way to fully exploit DL.


Both Hayes [34] and Diaz [39] propose interesting solutions. However, we consider that collaboration emerges naturally when domain experts are provided with tools that allow them to represent and share their knowledge in such a way that it is easy to promote and support discussion and to concentrate on concepts and constraints. There is a need to support collaborative work from the perspective of allowing users to make use of a virtual working place; cognitive support is therefore needed. The design and development of such a collaborative environment, together with an accompanying CM plug-in for Protégé that supports both the knowledge acquisition phase and the translation from CMs to an OWL structure, are clearly desirable. The development of this plug-in, as well as of a more comprehensive collaborative environment, is currently in progress.

Ontologies are constantly evolving, and their conceptual structures should be flexible enough to accommodate this dynamic. It is important to report methodological issues (or simply "methodology") as part of papers presenting ontologies, in a section analogous to the "methods and materials" sections required in experimental papers. The added clarity and rigour that such presentation would bring would help the community extend and better adapt existing methodologies, including the one we describe here.

3.1.6 Acknowledgements

We gratefully acknowledge our early discussions with Jennifer Fostel and Norman Morrison, leaders of the toxicogenomics and environmental genomics communities within MGED RSBI. We thank Ruan Elliot (Institute of Food Research) and Anne-Marie Minihane (Reading University) for their expertise in nutritional science. We also acknowledge Mark Wilkinson, Oscar Corcho, Benjamin Good, and Sue Robathan for their comments. Finally, we thank Mark Green (EBI) for his constant support. This work was partly supported by the student exchange grants of the EU Network of Excellence NuGO (NoE 503630) to SAS, the EU Network of Excellence Semantic Interoperability and Data Mining in Biomedicine (NoE 507505) to RS, and an Australian Research Council grant (CE0348221) to MAR.


3.1.7 References

1. Quackenbush J: Data standards for 'omic' science. Nature Biotechnology 2004, 22:613-614.
2. Field D, Sansone SA: A special issue on data standards. OMICS: A Journal of Integrative Biology 2006 (in press).
3. Blake J: Bio-ontologies—fast and furious. Nature Biotechnology 2004, 22:773-774.
4. Garcia Castro A, Chen YP, Ragan MA: Information integration in molecular bioscience: a review. Applied Bioinformatics 2005, 4(3):157-173.
5. Corcho O, Fernandez-Lopez M, Gomez-Perez A: Methodologies, tools, and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering 2002, 46(1):41-64.
6. Smith B, Rosse C: The Role of Foundational Relations in the Alignment of Biomedical Ontologies. Amsterdam: IOS Press; 2004.
7. Rosse C, Kumar A, Mejino J, Cook D, Detwiler L, Smith B: A Strategy for Improving and Integrating Biomedical Ontologies. In: American Medical Informatics Association 2005 Symposium: 2005; Washington DC; 2005: 639-643.
8. Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C: Relations in Biomedical Ontologies. Genome Biology 2005, 6(5):R46.
9. Lopez F, Perez G, Sierra J, Pazos S: Building a Chemical Ontology Using Methontology and the Ontology Design Environment. IEEE Intelligent Systems & Their Applications 1999, 14(1):37-46.
10. CmapTools [http://cmap.ihmc.us/]
11. Uschold M, King M: Towards Methodology for Building Ontologies. In: Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995; Cambridge, UK; 1995.
12. Fox M: The TOVE Project: A Common-sense Model of the Enterprise Systems. In: Industrial and Engineering Applications of Artificial Intelligence and Expert Systems: 1992: Springer-Verlag; 1992: 25-34.
13. Gruninger M, Fox MS: The Design and Evaluation of Ontologies for Enterprise Modelling. In: Workshop on Implemented Ontologies, European Workshop on Artificial Intelligence: 1994; Amsterdam, NL; 1994.
14. Uschold M: Building Ontologies: Towards a Unified Methodology. In: 16th Annual Conf of British Computer Society Specialist Group on Expert Systems: 1996; Cambridge, UK; 1996.
15. Uschold M, Gruninger M: Ontologies: Principles, methods and applications. Knowledge Engineering Review 1996, 11(2):93-136.
16. Vrandecic D, Pinto H, Sure Y, Tempich C: The DILIGENT Knowledge Processes. Journal of Knowledge Management 2005, 9(5):85-96.
17. Fernandéz M, Gómez-Pérez A, Juristo N: METHONTOLOGY: From Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering, Spring Symposium Series, AAAI97: 1997; Stanford; 1997.
18. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for Ontologies. The Agricultural Ontology Service (UN FAO) 2003.
19. Hayes P, Eskridge CT, Saavedra R, Reichherzer T, Mehrotra M, Bobrovnikoff D: Collaborative Knowledge Capture in Ontologies. In: K-CAP 05: 2005; Banff, Canada; 2005.
20. Berners-Lee T: Weaving the Web: HarperCollins; 1999.
21. Pinto H, Staab S, Tempich C: Diligent: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
22. Bada M, Stevens R, Goble C, Gil Y, Ashburner M, Blake J, Cherry J, Harris M, Lewis S: A short study on the success of the Gene Ontology. Journal of Web Semantics 2004, 1:235-240.
23. Canas A, Leake DB, Wilson DC: Managing, Mapping and Manipulating Conceptual Knowledge. In: AAAI Workshop Technical Report WS-99-10: Exploring the Synergies of Knowledge Management & Case-Based Reasoning. Menlo Park, California: AAAI Press; 1999.
24. European Nutrigenomics Organisation [http://www.nugo.org]
25. Sansone SA, Rocca-Serra P, Tong W, Fostel J, Morrison N: A strategy capitalizing on synergies - The Reporting Structure for Biological Investigation (RSBI) working group. OMICS: A Journal of Integrative Biology 2006 (in press).
26. Whetzel P, Brinkman RR, Causton HC, Fan L, Fostel J, Fragoso G, Heiskanen M, Hernandez-Boussard T, Morrison N, Parkinson H, Rocca-Serra P, Sansone SA, Schober D, Smith B, Stevens R, Stoeckert C, Taylor C, White J, and members of the communities collaborating in the FuGO project: Development of FuGO: an Ontology for Functional Genomics Investigations. OMICS: A Journal of Integrative Biology 2006 (in press).
27. Whetzel P, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra P, Sansone SA, Taylor C, White J, Stoeckert CJ Jr: The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 2006, 22(7):866-873.
28. Open Biomedical Ontologies [http://obo.sourceforge.net/]
29. Rubin D, Lewis S, Mungall C, Misra S, Westerfield M, Ashburner M, Sim I, Chute C, Solbrig H, Storey M, Smith B, Day-Richter J, Noy NF, Musen M: The National Center for Biomedical Ontology: Advancing Biomedicine through Structured Organization of Scientific Knowledge. OMICS: A Journal of Integrative Biology 2006 (in press).
30. Noy N, Fergerson R, Musen M: The knowledge model of Protege-2000: Combining interoperability and flexibility. In: 2nd International Conference on Knowledge Engineering and Knowledge Management (EKAW'2000): 2000; Juan-les-Pins, France; 2000.
31. Briggs G, Shamma DA, Cañas AJ, Carff R, Scargle J, Novak JD: Concept Maps Applied to Mars Exploration Public Outreach. In: Proceedings of the First International Conference on Concept Mapping: 2004; Pamplona, Spain; 2004.
32. Leake D, Maguitman A, Reichherzer T, Cañas A, Carvalho M, Arguedas M, Brenes S, Eskridge T: Aiding Knowledge Capture by Searching for Extensions of Knowledge Models. In: Proceedings of K-CAP: 2003; Sanibel Island, Florida, USA; 2003.
33. WEBEX [http://www.webex.com/]
34. Hayes P, Saavedra R, Reichherzer T: A collaborative development environment for ontologies. In: Semantic Integration Workshop: 2003; Sanibel Island, Florida, USA; 2003.
35. Seipel D, Baumeister J: Declarative Methods for the Evaluation of Ontologies. Künstliche Intelligenz 2004:51-57.
36. Haarslev V, Möller R: Racer: A Core Inference Engine for the Semantic Web. In: Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003): October 20, 2003; Sanibel Island, Florida, USA; 2003: 27-36.
37. Tempich C, Pinto H, Sure Y, Staab S: An Argumentation Ontology for DIstributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT). In: Second European Semantic Web Conference: 2005; Greece; 2005: 241-256.
38. GMS Ontology [http://cropwiki.irri.org/icis/index.php/Germplasm_Ontology]
39. Diaz A, Baldo G: Co-Protege: A Groupware Tool for Supporting Collaborative Ontology Design with Divergence. In: 8th International Protege Conference: 2005; Madrid, Spain; 2005: 32-32.


3.2 THE USE OF CONCEPT MAPS FOR TWO ONTOLOGY DEVELOPMENTS: NUTRIGENOMICS, AND A MANAGEMENT SYSTEM FOR GENEALOGIES

Abstract. We briefly describe the methodology we have adopted in order to develop ontologies. Because our scenarios involved domain experts distributed geographically, the domain analysis and knowledge acquisition phases used different, independent technologies that were not always integrated into the Protégé suite; groupware capabilities were thus achieved by other means. From these experiences we identify conceptual maps (CMs) as an important collaborative and knowledge acquisition tool for the development of ontologies. The direct manipulation and collaborative facilities that currently exist in Protégé can be improved with the lessons learnt from this and similar experiences. Here we describe our scenario, competency questions, results and milestones for each methodological stage, our use of CMs, and our vision for a collaborative environment for ontology development. This presentation is based on two different sets of experiences, one within nutrigenomics and the other in plant genealogy management systems.

3.2.1 Introduction

When developing an ontology involving geographically distributed domain experts, the domain analysis and knowledge acquisition phases may become a bottleneck due to difficulties in establishing a formal means of communication (i.e. in sharing knowledge). Conceptual maps (CMs) have been demonstrated to be an effective means of representing and communicating knowledge [1].

Traditionally, ontologies have been built by highly trained knowledge engineers with the assistance of domain specialists. It is a time-consuming and laborious task. Ontology tools are available to support this work, but their use requires training in knowledge representation and predicate logic [2]. Bio-ontologies are developed primarily by biologists. Domain experts are rarely available in one place, so the development of bio-ontologies is usually a distributed effort in which teleconferences, email, commentary-tracking systems, and videoconferences are used at all stages. During our ontology building efforts, we identified as a major bottleneck the lack of an integrated environment in which at least some of these technologies come together to facilitate both knowledge representation and sharing. CMs may help to overcome these issues.


Conceptual maps are graphs that consist of nodes, with connecting arcs that represent relationships between nodes [3]. The nodes are labelled with descriptive text representing the "concept", and the arcs are labelled (sometimes only implicitly) with a relationship type. We used CMs in two stages of our process: the capture of knowledge, and testing the structure of the ontology. Initially we worked with informal CMs; although they are not computationally enabled, for humans they appear to have greater "computational efficiency" than other forms of knowledge representation, e.g. EXCEL™ spreadsheets or Microsoft Word™ tables. As our models gained semantic richness, the CMs evolved and became more complex, progressively formalising the knowledge in our ontologies.
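To make this concrete: a concept map stripped to its scaffold is a set of labelled nodes joined by labelled arcs, i.e. (concept, linking phrase, concept) triples. A minimal sketch in Python follows; the map content is illustrative only, not taken from our ontologies.

# A concept map reduced to its scaffold: labelled nodes joined by
# labelled arcs, held as (concept, linking phrase, concept) triples.
cm_triples = [
    ("Study", "has part", "Dietary intervention"),
    ("Dietary intervention", "applied to", "Subject group"),
    ("Subject group", "characterised by", "Phenotype"),
]

# Node and arc labels can be recovered from the triples alone.
concepts = {c for s, _, o in cm_triples for c in (s, o)}
relations = {r for _, r, _ in cm_triples}
print(sorted(concepts), sorted(relations))

Because the linking phrases carry no formal semantics, such triples are exactly the informal artefact described above; attaching OWL semantics to them is the separate, manual step discussed later in this section.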

We found that CMs made it possible for domain experts to identify and represent concepts, and to declare relations among them. More importantly, they helped clarify the difference between the ontological model, entity-relationship (ER) models, and the possible object model (OM). For biologists, ontologies have a concrete representation in dictionaries, whereas they view object models as being more related to implementation. Implementation details were thus separated from ontologically related issues. We used CmapTools (http://cmap.ihmc.us/) [1] as a CM editor.

The ontologies we are developing are asymmetric and complementary. With one we want to ease the process of accurately capturing nutrigenomics data via web forms, whereas with the other we want to facilitate the building of queries over large genealogy databases (http://cropwiki.irri.org/icis/index.php/Germplasm_Ontology). These are two different experiences with similar problems and a common bottleneck: knowledge acquisition. From both ontologies we identified the importance of cognitive support over the groupware facility.

This paper is organised as follows. In Section 2 we present our methodology, and describe how we used CMs not only to capture knowledge, but also to share it in a distributed environment. Section 3 presents the development of a CM plug-in for Protégé. Brief discussions, conclusions, and an outline of our future work are presented in Section 4.


3.2.2 Methodology

For our particular purposes we decided to adapt some previously reported methodologies in order to enable communication among domain experts and with the ontologist, to effectively reuse other ontologies, and to provide, to the extent possible, a high-level conceptual scaffold into which other ontologies could later be integrated. We extended the methodology proposed by Mirzaee et al. [4]. Figure 5 schematises the methodology we followed.

Chapter 3 - Figure 5. Methodology, milestones, and phases.

Domain analysis is a process in which information used in a particular domain is identified, captured, and organised for the purpose of making it reusable. We hosted a series of meetings during which domain experts agreed on terminology, and on how to structure the reporting of an investigation. We view domain analysis as an iterative process, taking place at every stage. We focused our discussions on specific descriptions of what the ontology should support, and sketched the intended area of application that the ontology was to capture. Our goal was also to guide an ontology engineer, and involve him or her in a more direct manner; so we also made decisions about inclusion, exclusion and the first draft of the hierarchical structure of concepts in the ontology.

An important outcome of this phase was the consensus that we reached on terms that could potentially have a meaning for our intended users. The main aim of these informal linguistic models was to build an explanatory dictionary; some basic relations between concepts were established as well.

We built different models throughout our analyses of the available knowledge sources and of the information gathered in previous steps. First a "baseline ontology" was assembled, i.e. a draft version containing few but seminal elements of an ontology. Typically, the most important concepts and relations were identified somewhat informally. This "baseline ontology" can be assimilated to a taxonomy, in the sense of a structure of categories and classifications. We consider a taxonomy to be "a controlled vocabulary which is arranged in a concept hierarchy", and an ontology to be "a taxonomy where the meaning of each concept is defined by specifying properties, relations to other concepts, and axioms narrowing down the interpretation". As the process of domain analysis and knowledge acquisition evolves, the taxonomy takes the shape of an ontology. During this step the ontologist worked primarily with only a very few of the domain experts; the others were involved in weekly meetings. In this phase the ontologist sought to provide the means by which the domain experts he or she was working with could express their knowledge. Some deficiencies in the available technology were identified, and for the most part were overcome by our use of CMs.
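The taxonomy/ontology distinction quoted above can be illustrated in code. A minimal sketch using the owlready2 library (the class and property names are our own illustrative inventions): the first two class declarations alone form a taxonomy, while the property and the final axiom narrow down the interpretation, turning it into an ontology in the quoted sense.

from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/demo.owl")
with onto:
    # Taxonomy: a controlled vocabulary arranged in a concept hierarchy.
    class Germplasm(Thing): pass
    class Landrace(Germplasm): pass

    # Ontology: properties and axioms narrow down the interpretation.
    class GrowingSite(Thing): pass
    class collected_at(ObjectProperty):
        domain = [Germplasm]
        range = [GrowingSite]

    # Axiom: every Landrace was collected at some growing site.
    Landrace.is_a.append(collected_at.some(GrowingSite))

onto.save(file="demo.owl")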

For subsequent steps (i.e. formalisation and evaluation), different needs may be
identified.

3.2.3 CM plug-in for Protégé

Our knowledge acquisition phase took place in different stages, during some of which the domain experts were not together. CMs proved very useful in facilitating visualisation and discussion, and in providing domain experts with a tool that could be used to declare the primary elements of their knowledge. OWLviz [5] was initially tested to support domain experts in this task, but this plug-in did not provide direct manipulation (DM) capabilities over the concepts and the relations among them. We also tested Jambalaya [6] before deciding to use two separate tools (i.e. Protégé [7] and CmapTools). Since CMs support the declaration of nodes and relationships, it was easy to assimilate these to classes and properties. The conversion was a straightforward, albeit manual, process.
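Although we performed this conversion by hand, its core is mechanical enough to sketch. The snippet below assumes a CM exported as (concept, linking phrase, concept) triples, treating "is a" arcs as subclass axioms and every other linking phrase as an object property with an existential restriction; this reading of a CM arc is a modelling choice on our part, not the only possible one.

import types
from owlready2 import Thing, ObjectProperty, get_ontology

def cm_to_owl(triples, iri="http://example.org/cm.owl"):
    """Translate (concept, linking phrase, concept) triples into OWL."""
    onto = get_ontology(iri)
    classes, props = {}, {}
    with onto:
        def cls(name):
            if name not in classes:
                classes[name] = types.new_class(name, (Thing,))
            return classes[name]
        for subj, rel, obj in triples:
            s, o = cls(subj), cls(obj)
            if rel == "is a":            # hierarchy arc -> subclass axiom
                s.is_a.append(o)
            else:                        # any other arc -> object property
                if rel not in props:
                    props[rel] = types.new_class(
                        rel.replace(" ", "_"), (ObjectProperty,))
                s.is_a.append(props[rel].some(o))
    return onto

# The pizza example used elsewhere in this thesis, as a toy input.
onto = cm_to_owl([("Margherita", "is a", "Pizza"),
                  ("Pizza", "has topping", "Topping")])
onto.save(file="cm.owl")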

The main feature we identified from our work with CMs was the DM capability provided by the software. This functionality had several advantages, which we list below. Interestingly, all of these advantages had previously been identified by Shneiderman:

• Novices can learn basic functionality quickly, usually through a demonstration by a more experienced user.
• Experts can work extremely rapidly to carry out a wide range of tasks, even defining new functions and features.
• Knowledgeable intermittent users can retain operational concepts.
• Error messages are rarely needed.
• Users can see immediately whether their actions are furthering their goals; if not, they can simply change the direction of their activity.
• Users have reduced anxiety because the system is comprehensible and because actions are so easily reversible.

We are currently starting to develop the CM plug-in. Basically, it facilitates the declaration of properties and classes, writing them to the OWL file. Some of the formal requirements we have identified for our plug-in are:

• Graphic manipulation of classes and properties via contextual menus.
• Direct publication over the web of the CMs we generate.
• Drag-and-drop capabilities.
• Relationships between concepts and their concrete representations, and annotation features (e.g. text, colors, graphics, and even files).
• Manipulation of the same file by different users, with a mechanism to track changes.
• Availability of a chat window.
• Possibility for moderated or un-moderated sessions. This is particularly important for situations in which more than four people are working online on the same file.
• The user interface should be non-intrusive.
• The user should be presented with an empty canvas on which concepts, linking phrases and properties can be declared by a direct click.

3.2.4 Conclusions and future work

Since our methodology involves participatory design activities, it is important for the tool to support this range of activities. We consider that CMs may play a crucial role in assisting users in these activities. Our development inherits many of the features already available in CmapTools; we are extending it so that users may additionally "discuss" online while simultaneously manipulating the OWL file. We are thus extending the capabilities currently available in Protégé, not just to enhance browsing but, more deeply, to promote a collaborative environment for the development of ontologies. Since Protégé was mainly developed as a desktop tool, its web implementation lacks some groupware features. In order to implement an integrated web-based ontology development environment, Human-Computer Interaction studies need to be conducted.

3.2.5 Acknowledgements

The authors would like to thank Robert Stevens and Karim Nashar for useful discussions and collaboration. A. Garcia is supported by the Institute for Molecular Bioscience, the Australian Centre for Plant Functional Genomics, the ARC Centre in Bioinformatics and the EMBL-EBI. SA Sansone is supported by the ILSI-HESI Genomics Committee, and Philippe Rocca-Serra by the European Commission NuGO project.

3.2.6 References

1. Cañas AJ, et al.: CmapTools: A knowledge modeling and sharing environment. In: Concept Maps: Theory, Methodology, Technology. Pamplona, Spain: Universidad Pública de Navarra; 2004.
2. Seongwook Y, et al.: Survey about ontology development tools for ontology-based knowledge management. 2003.
3. Lambiotte JG, et al.: Multi-relational semantic maps. Educational Psychology Review 1989, 1(4):331-367.
4. Mirzaee V, Iverson L, Hamidzadeh B: Towards ontological modelling of historical documents. In: The 16th International Conference on Software Engineering and Knowledge Engineering (SEKE): 2004.
5. Knublauch H: OWLviz: a visualisation plugin for the Protégé OWL plugin. [http://www.co-ode.org/downloads/owlviz/]
6. Storey M-A, et al.: Jambalaya: Interactive visualization to enhance ontology authoring and knowledge acquisition in Protégé. In: Workshop on Interactive Tools for Knowledge Capture, K-CAP-2001: 2001; Victoria, BC, Canada.
7. Gennari JH, et al.: The evolution of Protégé: An environment for knowledge-based systems development. International Journal of Human Computer Studies 2003, 58(1):89-123.


Cognitive support for an argumentative structure during the ontology development process

The importance of conceptual maps, as well as their use, was studied at length in the experiences reported in Chapters 3 and 5. Although the benefits of concept maps were well understood, it was also clear that, in order to better facilitate the communication amongst domain experts and with the knowledge engineer, it was important to have an argumentative structure. Interaction amongst domain experts generates large amounts of data and information, not always usable or understandable by the knowledge engineer; conceptual maps could be used for this purpose. This chapter addresses the problem of supporting the argumentative structure that results from the interaction amongst domain experts; it also studies the argumentative structure within the context of developing ontologies in decentralised settings.

The main contribution of this paper is not only to present a practical use for argumentative structures, but also to support this structure by means of conceptual maps. In this chapter the use of concept maps is proposed as a means to support and scaffold an argumentative structure during the development of ontologies within loosely centralised communities. This use of conceptual maps had not previously been studied.

The author conceived and coordinated the project. The proposed use of conceptual maps, as well as the extensions to the argumentative structure, was the product of the analysis the author carried out during the experiences reported in this thesis. Alex Garcia wrote the published paper based on this chapter.


AUTHORS' CONTRIBUTIONS

Alex Garcia Castro conceived and coordinated the project; he also wrote the manuscript for this paper. Angela Noreña and Andrés Betancourt were domain experts in the knowledge elicitation exercises and also assisted Alex Garcia Castro in the implementation of the first version of the plug-in. Mark A. Ragan supervised the project, and assisted Alex Garcia Castro in the preparation of the final manuscript.

PUBLISHED PAPER ARISING FROM THIS CHAPTER

Garcia Castro A: Cognitive support for an argumentative structure during the ontology
development process. In: 9th Intl Protégé Conference: July, 2006; Stanford, CA, USA; 2006.


4 Chapter IV - Cognitive support for an argumentative structure during the ontology development process

Abstract: Structuring and supporting the argumentative process that takes place within knowledge elicitation is a major problem when developing ontologies. Knowledge elicitation relies heavily on the argumentative process amongst domain experts. The involvement of geographically distributed domain experts, and the need for domain experts to lead the design process, add an interesting layer of complexity to the whole process. We consider that the argumentative structure should facilitate the elicitation process and serve as documentation for the whole process; it should also facilitate the evolution and contextualisation of the ontology. We propose the use of concept maps as a means to support and scaffold an argumentative structure during the development of ontologies within loosely centralised communities.

4.1 INTRODUCTION

The applications of knowledge engineering are growing larger and more systematic, now encompassing more ambitious ontologies; sizes in the hundreds of thousands of concepts will not be uncommon [1]. Furthermore, the development of those ontologies is usually a participatory exercise in which different experts interact via virtual means, thereby resembling a loosely centralised community. We believe the requirements of the Semantic Web (SW) bring with them an associated need for enhanced cognitive support in the tools we use.

Cognitive support leverages innate human abilities, such as visual information processing, to increase human understanding and cognition of challenging problems [2]. Developing ontologies in loosely centralised environments such as those described by Pinto et al. [3] poses challenges not previously considered by most existing methodologies. This user-centric design relies heavily on the ability of domain experts to interact with each other and with the knowledge engineer; it is by doing so that the ontology evolves. Mailing lists, web forums, and WIKI pages usually support this interaction. Despite this combination of tools (none of them an ontology editor per se, nor a knowledge engineering tool), information is lost, documentation is poorly structured, and the process is not always easy to follow. This results in decreased participation by the domain experts.
c u-tr a c k c u-tr a c k

One of the key components in the development of ontologies in loosely centralised environments is the discussion related to each and every term and relationship/property. Pinto et al. and Tempich et al. [3, 4] have proposed an argumentative structure to support and facilitate the discussion within the process of developing ontologies in loosely centralised environments. Both Garcia et al. and Hayes et al. [5, 6] have studied the use of CMs during the elicitation process when developing ontologies in distributed environments. However, it is not clear how to support the proposed structure, nor what the role of the argumentative process is within the development of the ontology. The knowledge elicitation process, part of the whole ontology development, is a major bottleneck, particularly within those communities in which domain experts are geographically distributed. In order to assist the elicitation process and improve the interaction, we propose the use of CMs as a means to scaffold the argumentative structure.

This paper is organised as follows. First we provide some background information and present our approach to the problem of supporting argumentative structures. In Section 2 we explain what an argumentative structure is within the context of ontology development; we also present in this section the relationship between a CM and the argumentative structure proposed by Tempich et al. [4]. In Section 3 we present our CM plug-in for Protégé and elaborate further on how our plug-in supports, assists and facilitates the argumentative process. We present a brief discussion and conclusions in Section 4.

4.2 ARGUMENTATIVE STRUCTURE AND CMS

Central to ontology development is the process by which domain experts and the
knowledge engineer argue about terms/types and relationships. This collaborative interaction
generates threads of arguments [3, 4, 7], and there is a need to support the evolution and
maintenance of this argumentative process in a way that makes it easy to follow and, more
importantly, links to evidence and provides room for conflicting points of view. Figure 1
presents the argumentative structure proposed by [4].


Chapter 4 - Figure 1. The major concepts of the argumentation ontology and their relations.
Reproduced with permission from [4]
CMs are semantically valid artefacts without OWL constraints; concepts and relationships are the main scaffold of a CM. At any given point during the argumentative process one has a concept/class and a relationship/property. As the discussions evolve, the amount of information attached to the concept or relationship increases; the argumentative structure is enriched as domain experts provide arguments and base them upon evidence, which may be a paper, a commentary, or more generally a file of any kind (i.e. an information source). The different views of the world can be represented with a CM, and the evidence may be attached to the particular concept/class or relationship/property at hand. This graphic representation facilitates the continuous exchange of information amongst domain experts, i.e. the sharing of knowledge. Following the threads of the discussions is not always easy for domain experts. The information exchanged is usually structured as an email-based chat. The knowledge engineer has to follow these text-based discussions, in which the knowledge is mostly verbal, filter them, and at some point "formalise" that implicit knowledge. Moving from verbal knowledge to formalised, shared knowledge is difficult: some information is usually lost, the evidence supporting the different positions is not always provided by domain experts, and, most importantly, keeping domain experts engaged throughout the entire process is not always possible. Cognitive support is thus required so that we may facilitate the useful flow and exchange of information and at the same time record the entire process.

4.3 ARGUMENTATION VIA CMS

Concepts and relationships resemble the two key components of an argumentative structure: arguments and positions. During the development process we argue in relation to a concept and/or a relationship. Positions are supported by evidence, and the simple argumentative structure is by itself a particular view of the world that is being modelled. Figure 2 illustrates the basics behind the relationship between CMs and an argumentative structure.


Chapter 4 - Figure 2. A simplification of the argumentative structure presented by Tempich et al. The pizza example (http://www.co-ode.org) is used in order to illustrate our simplified argumentative structure.
For any given issue there is an argument that is elaborated by presenting the conflicting positions. The elaboration provides instances, i.e. concrete examples. For any issue there is a concertation² process that presents argument-elaborated conflicting positions. Once a consensus is reached there is a position on the issue initially at hand. The issue is well focused and specific; the same is true for the argument, which supports a position in simple and few words, whereas the elaboration of the argument tends to be larger, and supported by different files (e.g. pdf, ppt, doc, xls). Although there may be more than one argument for any given issue, there is only one elaboration for each argument. The dispute-resolution process (also known as the conciliatory process) produces a position on the particular issue; within this process the knowledge engineer acts as a facilitator. Discussions over terminology, and over conceptual models, tend to address one issue at a time; this is highly dependent on the knowledge engineer.
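To make these elements tangible, the sketch below renders them as plain data records. The field names and the example content are ours, chosen to mirror the terms used above; this is a minimal stand-in, not Tempich et al.'s formal argumentation ontology.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Elaboration:
    text: str                                          # the longer exposition
    evidence: List[str] = field(default_factory=list)  # e.g. pdf, ppt, doc, xls

@dataclass
class Argument:
    author: str
    summary: str               # a position supported with simple and few words
    elaboration: Elaboration   # exactly one elaboration per argument

@dataclass
class Issue:
    statement: str             # well focused and specific
    arguments: List[Argument] = field(default_factory=list)
    position: Optional[str] = None  # set once concertation yields consensus

issue = Issue("Does every Pizza require at least one Topping?")
issue.arguments.append(Argument(
    author="domain expert 1",
    summary="Yes: a base without toppings is not a pizza",
    elaboration=Elaboration("Recipes and menus consistently list toppings...",
                            evidence=["menu_survey.pdf"])))
issue.position = "Pizza has_topping some Topping"

In practice each such record would hang off the CM node or arc under discussion, so that the evidence stays attached to the relevant piece of the ontology.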

As the ontology grows, so does the complexity of the information available for each and every component of the ontology (e.g. classes, properties, instances). Although having an ontology that represents the structure of the argumentative process helps the knowledge engineer in classifying the information, for the evidentiary material to be useful it needs to be attached to the relevant piece of the ontology. For instance, when discussing "biomaterial" within the development of a laboratory information management system for functional plant genomics, one feasible starting point for the discussion would be to adopt the same understanding of biomaterial as is available in the microarray ontology.

² Concertation: from the French concertation, a conciliatory process by which two parties reach an agreement.


class BioMaterial
definition:
Description of the processing state of the biomaterial for use in the microarray hybridisation.
superclasses:
BioMaterialPackage
known subclasses:
BioSample
BioSource
LabeledExtract
properties:
unique_identifier MO_226
class_role abstract
class_source mage
constraints:
restriction: has_type has-class MaterialType
restriction: has_biomaterial_characteristics has-class BioMaterialCharacteristics.

Chapter 4 - Figure 3. Biomaterial from MGED, as defined by http://mged.sourceforge.net/ontologies/MGEDontology.php#BioMaterial.
When discussing this term, domain experts considered that there was a need for a more general meaning. Domain experts proposed not only different meanings, but also identified properties and instances while providing the knowledge engineer with competency questions and relevant scenarios in which they considered the term was going to be used. Furthermore, domain experts were discussing the relationship between biomaterial and biosample. Conceptual maps proved to be very useful for gathering all this information in a structured and usable manner: not only were domain experts able to follow the argumentative structure without even being aware of it, but the process was also being documented in a way that made it both easy for the domain experts to exchange information and easy for the knowledge engineer to assist them in the process.
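For reference, the two constraints listed in Figure 3 can be rendered in OWL. A minimal sketch with the owlready2 library; note that reading MO's "has-class" as an existential (someValuesFrom) restriction is our assumption, since the flattened entry does not say which OWL construct is intended.

from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/mo-excerpt.owl")
with onto:
    class BioMaterialPackage(Thing): pass
    class BioMaterial(BioMaterialPackage): pass
    class MaterialType(Thing): pass
    class BioMaterialCharacteristics(Thing): pass
    class has_type(ObjectProperty): pass
    class has_biomaterial_characteristics(ObjectProperty): pass

    # The two restrictions from the MO entry, read as someValuesFrom.
    BioMaterial.is_a.append(has_type.some(MaterialType))
    BioMaterial.is_a.append(
        has_biomaterial_characteristics.some(BioMaterialCharacteristics))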

A very important part of the whole process is the management of the history: tracing the argumentation process back from the position_on_issue to the elaboration of a particular argument, and being able to "see" the argumentative structure in order to "stand" on a particular place. The history should also allow us to go back and take an alternative route; we thus see the history not just as a simple "undo" but as a more complex feature.
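Such a history is naturally a tree rather than a linear undo stack: going back to an earlier state and taking an alternative route creates a branch instead of discarding what followed. A minimal sketch, with names of our own choosing:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HistoryNode:
    event: str                                  # e.g. "argument A elaborated"
    parent: Optional["HistoryNode"] = None
    children: List["HistoryNode"] = field(default_factory=list)

    def record(self, event: str) -> "HistoryNode":
        """Append a new state; calling this on an older node creates a branch."""
        child = HistoryNode(event, parent=self)
        self.children.append(child)
        return child

root = HistoryNode("issue raised")
a = root.record("argument A elaborated")
a.record("position_on_issue reached")
# Going back and taking an alternative route branches the history
# instead of erasing it, as a simple undo would.
b = root.record("alternative argument B elaborated")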


4.4 DISCUSSION AND CONCLUSIONS

For any given issue there is an argument that is elaborated by presenting the conflicting positions. The elaboration provides instances, i.e. concrete examples. For any issue there is a process that presents argument-elaborated conflicting positions. Once a consensus is reached there is a position on the issue initially at hand. The issue is well-focused and specific; the same is true for the argument, which supports a position in few and simple words, whereas the elaboration of the argument tends to be larger and supported by different files (e.g. pdf, ppt, doc, xls). Although there may be more than one argument for any given issue, there is only one elaboration for each argument. The dispute-resolution process (also known as the conciliatory process) produces a position on the particular issue; within this process the knowledge engineer acts as a facilitator. Discussions over terminology, and over conceptual models, tend to address one issue at a time, and this is highly dependent on the knowledge engineer. Within this context conceptual maps provided a scaffold upon which the argumentative ontology may be instantiated.

4.5 REFERENCES

1. Ernst NA, Storey M-A, Allen P: Cognitive support for ontology modeling. Int J Human-Computer Studies 2005, 62:553-577.
2. Walenstein A: Cognitive support in software engineering tools: a distributed cognition framework. Simon Fraser University; 2002.
3. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.
4. Tempich C, Pinto HS, Sure Y, Staab S: An Argumentation Ontology for DIstributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT). In: Second European Semantic Web Conference: May 2005; Greece; 2005: 241-256.
5. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone SA: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.
6. Hayes P, Eskridge CT, Saavedra R, Reichherzer T, Mehrotra M, Bobrovnikoff D: Collaborative Knowledge Capture in Ontologies. In: K-CAP 05: 2005; Banff, Canada; 2005.
7. Garcia Castro A, Sansone SA, Rocca-Serra P, Taylor C, Ragan MA: The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. In: 8th Intl Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.


Narratives and biological investigations

This chapter has two sections. The first, "The use of narratives and text-mining extraction techniques to support the knowledge elicitation process during two ontology developments", presents a study of the narratives that were gathered during the knowledge elicitation process.

Conceptual maps could be used to support the argumentative structure; they were also quite useful when eliciting knowledge. However, eliciting knowledge was not always a straightforward question-and-answer process. Very often domain experts built narratives in order to explain their scenarios to the knowledge engineer in a more illustrative manner. Moreover, it was observed that once a baseline ontology was built, domain experts tended to support their discussions on these narratives; empirically, they were using conceptual maps as they were "drawing" their ideas. Although some instances were gathered from these narratives, it was important to better frame the elicitation exercises. How could the narratives and the elicitation exercises be better framed, and how could these narratives be better used and supported when eliciting knowledge? These are the two main issues this section addresses.

The second section of this chapter, "A proposed semantic framework for reporting OMICS investigations", addresses the issue of describing biological investigations: how to provide semantics for upper-level elements relevant to the representation and interpretation of omics-based investigations? This section presents an upper-level ontology for the representation of biological investigations. The experience reported here was valuable, as it was important for the author to test the proposed methodology with domain experts from different disciplines (nutrigenomics, toxicogenomics, environmental genomics); it was equally important to study how these domain experts reached their consensuses after debating their conceptual models. Chapter 7 follows up on the issue of describing biological investigations, not from the semantic perspective but by studying practical issues in describing these investigations. The ontology described in this section is the product of the work between the author and domain experts from the MGED RSBI working group.

Alex Garcia conceived and coordinated the work presented in this chapter. He identified the need to make better use of the narratives as they were being gathered. Alex Garcia also investigated how to use text-mining techniques in order to support the development of ontologies. The author conducted several meetings with members of the MGED RSBI working group in order to develop the presented ontology.

AUTHORS' CONTRIBUTIONS

Alex Garcia Castro conceived and coordinated both projects; he also wrote the manuscripts for this paper. Susanna Sansone provided useful discussion and assisted Alex Garcia in the preparation of the submitted manuscripts. Philippe Rocca-Serra and Chris Taylor provided useful discussion.

PUBLISHED PAPER ARISING FROM THIS CHAPTER

Garcia Castro A, Sansone AS, Taylor CF, Rocca-Serra P: A conceptual framework for
describing biological investigations. In: NETTAB: 2005; Naples, Italy; 2005.


5 Chapter V - Narratives and biological investigations

5.1 THE USE OF CONCEPT MAPS AND AUTOMATIC TERMINOLOGY EXTRACTION DURING THE DEVELOPMENT OF A DOMAIN ONTOLOGY. LESSONS LEARNT.

Abstract. Extracting terminology is not always an integral part of methodologies for building ontologies. Moreover, the use of terms extracted from literature relevant to the domain of knowledge for which the ontology is being built has not been extensively studied within the context of knowledge elicitation. We present here some extensions to the methodology proposed by Garcia et al. (BMC Bioinformatics 7:267, 2006); two important advances on the initially proposed methodology are the use of extracted terminology to frame the building of concept maps, and the use of narratives during the knowledge elicitation phases.

5.1.1 Introduction

At a glance, an ontology represents some kind of world view, with a set of concepts and relations amongst them, all of these defined with respect to the domain of interest. Some scholars define the term in an effort to capture an absolute view of the world. For instance, Sowa [1] defines ontology as "the study of existence, of all kinds of things (abstract and concrete) that make up the world". A more pragmatic definition is given by Neches et al. [2], who consider that an ontology "defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary". For practical reasons we agree with this definition, as the main aim of the GMS (Genealogy Management System) ontology is to define a set of basic terms that may accurately describe germplasm within the context of crop information systems, more specifically within the International Crop Information System (ICIS) [3].

In this paper we present our early ontology for GMS, as well as the methodology we followed. Our scenario involved the development of an ontology with direct physical access to domain experts at the Australian Centre for Plant Functional Genomics (ACPFG) and the International Center for Tropical Agriculture (CIAT). We thus decided to adapt and extend parts of different previously proposed methodologies for ontology development. We adapted the methodology reported by [4] (henceforth GM) by proposing an alternative use for concept maps (CMs) [5] from the one described by [6]. We also investigated some semi-automated techniques for extracting terms from texts, as well as some other aspects of ontology development that were not clearly illustrated in the GM methodology.

This paper is organised as follows: Section 5.1.1 presents an introduction and some background information, along with a brief description of our scenario. A survey of the methodologies we investigated is given in Section 5.1.2. Section 5.1.3 presents the extensions to the methodology we used; descriptions of the steps we took are also given in this section. We place special emphasis on text extraction and conceptual mapping during the elicitation process. Results (e.g. our ontology) are presented in Section 5.1.5. Our discussion and conclusions are presented in Section 5.1.6.

5.1.2 Survey of methodologies

A range of methods and techniques for building ontologies has been reported in the literature. However, there is an ongoing argument amongst those in the ontology community about the best method to build them [7, 9].

Most ontology-building methodologies are inspired by work done in the field of knowledge-based engineering to create methodologies for developing Knowledge-Based Systems (KBS). For instance, the Enterprise Methodology [10], like most KBS development methodologies, distinguishes between the informal and formal phases of ontology development. METHONTOLOGY [11] adapts work done in the area of knowledge-base evaluation for the ontology evaluation phase. The "Distributed, Loosely-controlled and evolving engineering of ontologies" (DILIGENT) methodology [12] offers a set of considerations and steps suitable for loosely centralised environments where domain experts are geographically distributed. Table 1 presents a summary of our comparison. GM (the Graph-based Methodology [4]) provided us with some detail for the knowledge elicitation process; however, our scenario was not entirely one in which domain experts were geographically distributed, and thus some of the techniques described by GM could not be directly applied to our case.

We analysed these approaches according to the criteria proposed by Mirzaee [13]. Most of the methodologies do not provide details as to how one actually goes about building the ontology. Although GM reported the use of concept maps as well as details of knowledge elicitation, it is not entirely clear how, within the process of eliciting knowledge, the narrative provided by different but complementary CMs may be reached or used. Nor is there any illustration of the relationship between CMs and terminology extraction; empirically we could see how these two techniques complement each other.

Chapter 5 - Table 1. Comparison of methodologies (adapted from [4]).

Description of stages:
Enterprise Methodology: high-level description of stages.
TOVE: detail is provided for those ontologies developed with this methodology.
Unified Methodology: high-level description of stages.
Methontology: stages are described for the chemical ontology.
Diligent: high-level description.
GM: high-level description, as well as detailed information for each step.

Terminology extraction:
Not addressed (N/A) by any of the six methodologies.

Generality:
None of the six methodologies is domain-specific.

Ontology evaluation:
Enterprise Methodology: competency questions.
TOVE: competency questions and formal axioms.
Unified Methodology: no evaluation method is provided.
Methontology: an informal evaluation method is used for the chemical ontology.
Diligent: the community evaluates the ontology (agreement process).
GM: no evaluation method is provided.

Distributed/decentralised:
Enterprise Methodology, TOVE, Unified Methodology and Methontology: no. Diligent and GM: yes.

Usability:
TOVE: business and foundational ontologies. Methontology: chemical ontology. Others: N/A.

Supporting software:
Methontology: WebODE. GM: Protégé and CmapTools. Others: N/A.
There is a gap between the existing software and the available methodologies. Although WebODE [14] was designed to support a particular methodology, it is general enough to support other methods. We studied Protégé [15], HOZO [16], and pOWL [17] as software tools for developing ontologies. None of them supports a particular methodology in any special way. Moreover, none of these software packages provides support for terminology extraction or conceptual mapping. All of these methods and techniques are still determined to some extent by the particular circumstances in which they are applied. We must note that, in any given circumstance, there might be no available guideline for deciding which techniques and methods to apply [18].

5.1.3 General view of our methodology

Since none of the reported methodologies could be fully applied to our particular scenario and needs, we decided to adapt and reuse some steps described in the methodologies we investigated. The modifications we introduced to the methodology proposed by Garcia et al. were mostly due to the close relationship our domain experts had with the implemented software, ICIS. This familiarity brought in some situations not fully addressed by Garcia et al., such as:

• Confusion between database schemata and ontology: domain experts were not fully aware of the difference between the conceptual and the relational model.
• Difficulties with the extracted terms.
• Domain experts were at the same time users, designers, developers, and policy makers of a particular kind of GMS; their vision was too broad on the process but at the same time too narrow on the software.

Since most of the steps we took have been described by Garcia et al., we will only present details of the variations we introduced. A schematic representation of our process is given in Figure 1.

Chapter 5 - Figure 1. A schematic representation of our process, extending GM.
We carefully followed, wherever possible, the GM methodology; our competency
questions were formulated in natural language by domain experts, who were also leading the
process and working closely with the knowledge engineer.

Since, when building ontologies, it is equally important to gather not only classes but also instances, we decided to investigate how we could better support our process by means of terminology extraction. Initially we only wanted to have classes and instances within our ontological corpus; however, terminology extraction also proved to be useful during knowledge elicitation, more specifically when combined with conceptual mapping. We used Text2Onto [19] as our terminology extraction tool because it allowed us to use documents in their original format (PDF, XLS, DOC, TXT, etc.) as the main source of information. Text2Onto also facilitated the process of constraining the terminology by
allowing the domain experts and the knowledge engineer to inspect the models inferred from the extracted terminology. In parallel to our terminology extraction exercises, our domain experts were building informal ontology models. By informal we mean basic representations of their particular view of the world, with no logical constraints; a "free-drawing" exercise that helped to establish communication between the knowledge engineer and the domain experts.

Text2Onto is a text-mining tool that produces a Probabilistic Ontology Model (POM) that represents the results of the system and an assigned probability for each structure mined. Text2Onto is built on top of an ontology management infrastructure named KAON [20, 21]. Text2Onto captures instances and relationships from text; by doing this, the ontological structure grows. As Text2Onto mines different structures, it attaches a value that represents how certain an algorithm is about an instance of a modeling primitive, and for each modeling primitive there are various algorithms that can calculate this probability. We did not use Text2Onto's whole range of capabilities: as we were not using KAON to build the ontology, we were only interested in using the extracted words to frame the elicitation of the concept maps. Appendix 1 presents the extracted terms.

The terms were extracted using the TermExtractor component of the TextToOnto
ontology-engineering workbench. The TermExtractor uses the C-Value method to identify
and to estimate confidence in candidate multi-word terms in a corpus [22]. It utilises linguistic
methods to identify the candidate terms and then uses statistical methods to provide each
term with a "C-value" indicating confidence in its "termhood". This C-value is derived using a
combination of "the total frequency of occurrence of the candidate string in the corpus, the
frequency of the candidate string as part of other longer candidate terms, the number of these
longer candidate terms, and the length of the candidate string (in number of words)" [22]. For
additional details about the algorithm see [22] and for the implementation see [20].
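To make the C-value computation concrete, the following fragment sketches the core of the formula in Python. It is a minimal sketch based on the published formula [22], not the TermExtractor implementation itself, and the candidate terms and frequencies used at the end are invented for illustration.

    import math
    from collections import defaultdict

    def is_nested(shorter, longer):
        """True if `shorter` occurs as a contiguous word sequence in `longer`."""
        n = len(shorter)
        return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    def c_value(candidates):
        """Compute C-values for candidate multi-word terms.

        `candidates` maps each candidate (a tuple of words) to its corpus
        frequency. Longer candidates that contain a term discount its score,
        following the formula of Frantzi et al. [22].
        """
        # For every candidate, collect the longer candidates that nest it.
        nested_in = defaultdict(list)
        for longer in candidates:
            for shorter in candidates:
                if shorter != longer and is_nested(shorter, longer):
                    nested_in[shorter].append(longer)

        scores = {}
        for term, freq in candidates.items():
            weight = math.log2(len(term))  # length of the candidate, in words
            containers = nested_in[term]
            if not containers:
                scores[term] = weight * freq
            else:
                # Discount occurrences explained by longer candidate terms.
                avg = sum(candidates[t] for t in containers) / len(containers)
                scores[term] = weight * (freq - avg)
        return scores

    # Invented example: frequencies of three candidate terms in a corpus.
    freqs = {("germplasm", "bank"): 14,
             ("national", "germplasm", "bank"): 6,
             ("breeding", "history"): 9}
    print(c_value(freqs))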

5.1.4 Our scenario and development process

The main goal of the GMS ontology is to describe the breeding history of germplasm; phenotypic and genotypic aspects of the germplasm are not to be considered by this ontology. The function of GMS in ICIS is to provide a unique identifier for all packets of seed for a given germplasm. It should be noted here that although almost all progenitors of present germplasm no longer exist, they must be known in order to trace pedigrees. GMS also manages all the names attached to the packet of seed: homonyms, synonyms, and abbreviations. Most importantly, GMS provides a breeding history for the germplasm so that questions such as those listed below may be easily answered; a query sketch over such an ontology follows the list.

• Does the germplasm belong to an out-breeding, in-breeding or vegetatively reproduced species?
• Is the germplasm homozygous or heterozygous?
• Is the germplasm homogeneous or heterogeneous?
• What type of cultivar (fixed lines, hybrid, clone, etc.) is formed?
• How has the germplasm been stored?
• Where did this germplasm come from (e.g. how did I get it)?
• What are its parents, grandparents, ancestors, descendants, and relatives?
• What probability do they have of having genes in common?
• What proportion of genes is expected to come from a list of ancestors?
• What parents do they have in common?
• Given an allele of a gene, from which ancestor did it come?
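As an illustration of how such competency questions could eventually be posed against an OWL version of the ontology, the sketch below asks the ancestor question with a SPARQL property path via the rdflib Python library. The namespace, the name and hasParent properties, and the germplasm name are hypothetical stand-ins; the actual ontology may use different identifiers.

    from rdflib import Graph, Namespace

    # Hypothetical namespace; the published OWL file may differ.
    GMS = Namespace("http://cropwiki.irri.org/icis/gms#")

    g = Graph()
    g.parse("genealogy_07_01_05_a.owl", format="xml")  # local copy

    # "What are its parents, grandparents and ancestors?" expressed as a
    # SPARQL 1.1 property path: one or more hasParent steps.
    query = """
        SELECT ?ancestor WHERE {
            ?germplasm gms:name "IR64" .
            ?germplasm gms:hasParent+ ?ancestor .
        }
    """
    for row in g.query(query, initNs={"gms": GMS}):
        print(row.ancestor)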
Our domain experts in these initial phases were not geographically distributed; we could gather them in one place so that they, along with the KE, could build an informal ontology model. We considered the subsequent involvement of geographically distributed domain experts in a loosely centralised environment at a later stage. We generated a baseline ontology; different domain experts participated during the iterative building of informal ontology models. Interaction was supported by email, web-based ontology browsers, and direct interviews. In order to better support the future interaction amongst domain experts, we consider there is a need for collaborative ontology environments that promote collaboration from the cognitive perspective, as opposed to simple file-sharing systems.
5.1.5 Results: GMS baseline ontology

By extracting terminology we could gather, without distinction, instances and possible classes; the knowledge engineer together with the domain experts then analysed all this information. Some instances gathered in this way are illustrated in Figure 2. Classes, relationships and instances are part of this conceptual map, which was generated by our domain experts after confronting the extracted terminology with some of the initial ontological models. By using the informal models previously built, along with the extracted terms, we could reorganise our conceptual structure. This task resembled in many ways the card-sorting technique [23], but also a story-telling participatory exercise. Once the re-shuffling was complete, and the narratives analysed, our baseline ontology (i.e. one containing only the seminal elements of an ontology) was ready, along with a set of instances.

Chapter 5 - Figure 2. Classes, instances, and relationships gathered by bringing together extracted
terms and previously built ontological models.
The result of the elicitation stage within the functional plant genomics context is illustrated in Figure 3. We gathered ten different, yet related, concept maps from two domain
experts; this graphic represents the consensus. The main aim of the process modeled here is to improve the corresponding plant material; traditionally this improvement has dealt with specific phenotypic features such as yield, abiotic and biotic stresses, nutritional quality, and market preferences. From the elicitation sessions we could identify several orthogonal ontologies that were highly needed in order to represent the processes that form part of the narrative we were working with. For instance, ontologies to describe "stress" and "plant yield" were needed to complement the model.
In order to assist the knowledge engineer in the harmonisation of those concept maps
gathered, domain experts were required to tell a unified story that could bring together those
different concept maps. As a guide, domain experts had access to the list of extracted
terminology. Interestingly, the story had a direct relationship with the main aim of the
laboratory process; some of the GMS ontology terms were used, but the narrative was not
limited to genealogies. A broader picture could thus be produced.

Chapter 5 - Figure 3. Narrative, as seen from those concept maps and ontology models domain
experts were building.
Our baseline ontology has classes, instances, and relationships; initially domain experts
organised the classes with no consideration for time and space. For them it was important to
have a coherent is-a structure they could relate to and consequently use in order to describe
the genealogy of a given germplasm. Figure 4 illustrates the structure of our baseline ontology.

Chapter 5 - Figure 4. Baseline ontology.
Our baseline ontology has 59 classes and 10 properties; we have not yet started to populate the ontology with instances, as domain experts are still gathering individuals from the relevant ICIS databases. The corresponding OWL and PPRJ files for this ontology are available at http://cropwiki.irri.org/icis/genealogy_07_01_05_a.owl; Appendices 3 and 4 present the OWL (Web Ontology Language) files containing two versions of this ontology.
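For readers who wish to verify such counts programmatically, a few lines along the following pattern would list the named classes and properties in the OWL file. This is a sketch assuming the rdflib library and a local copy of the published file.

    from rdflib import Graph, RDF, OWL

    g = Graph()
    g.parse("genealogy_07_01_05_a.owl", format="xml")  # local copy

    # Count the named classes and properties declared in the ontology.
    classes = set(g.subjects(RDF.type, OWL.Class))
    properties = set(g.subjects(RDF.type, OWL.ObjectProperty)) | \
                 set(g.subjects(RDF.type, OWL.DatatypeProperty))
    print(len(classes), "classes,", len(properties), "properties")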

5.1.6 Discussion and conclusions

By showing one feasible use of text mining when building ontologies, we not only extended the methodology proposed by Garcia et al. [4], but also developed a deeper understanding of how concept maps and text mining can be used together to build narratives that can later be used in the construction of ontologies. These narratives were used not only (by us) to ease the understanding of this particular domain but also, at a later stage, by the knowledge engineer to assess the ontological corpus gathered in the different models provided by domain experts.

Domain experts were requested to match some of the provided narratives against the concept maps. By doing this exercise it was possible not only to extend our lexicon but also to evaluate the informal models. Engaging domain experts in the process of building the
ontology was also simplified by the use of the narratives. Domain experts were telling a story
in a structured manner, and this allowed them to better understand the is-a relationship within
the class hierarchy.

In the experimental method for building an ontology to describe genealogies within the context of plant breeding, we constructed several ontological models by combining terms and relationships from the mined texts. Orthogonal ontologies were easily identified as domain experts represented their narratives as CMs. For instance, developmental stages described in the Plant Ontology were present in some of the CMs, as were anatomical parts of the plant. This helped us to see more clearly how to better describe germplasm within the context of an information system that was tightly coupled to a Laboratory Information Management System (LIMS).

At the time of writing this chapter, our approach was also being used by the
International Center for Tropical Agriculture (CIAT) as part of their methodology for
building their LIMS, paying particular attention to the identification of those orthogonal
ontologies needed by that system. An important feature within narratives is the use of more
than one elemental vocabulary to describe complex terms. The result of this is the creation of
a relationship between the combinatorial vocabulary and each of the vocabularies that was
used in its construction.

The rationale behind this approach is that a plant’s anatomical vocabulary should
completely describe the anatomy of the plant, and a developmental process vocabulary should
completely describe all of the general biological processes involved in development.
Therefore, we should be able to combine the concepts from the two vocabularies to describe
all of the processes involved in the development of all of the anatomical parts of the plant.
The structures are represented in CMs as well as in the baseline ontologies gathered. Initially those models contained a myriad of relationships; as the process evolved and the hierarchy became better structured, the "whole/part of" relationships between structures and substructures were better defined, and in this the narratives proved to be very useful.

From this experience we could also identify the gap between two ontological models built by two different software packages, Protégé and KAON. As KAON serves as the "platform" on top of which Text2Onto runs, the models it produces are not readable by Protégé. It was not possible for us to exploit all the functionalities of KAON, due mostly to incompatibility problems between Protégé and KAON.

5.1.7 References

1. Sowa JF: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, CA: Brooks Cole Publishing Co; 2000.

2. Neches R, Fikes RE, Finin T, Gruber TR, Patil R, Senator T, Swartout WR: Enabling Technology for Knowledge Sharing. AI Magazine 1991, 12(3):36-56.

3. International Crop Information System [http://icis.cgiar.org:8080]

4. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone S: The use of concept maps during knowledge elicitation in ontology development processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.

5. Cañas AJ, Hill G, Carff R, Suri N, Lott J, Eskridge T, Gómez G, Arroyo M, Carvajal R: CmapTools: A Knowledge Modeling and Sharing Environment. In: Proceedings of the First International Conference on Concept Mapping: 2004; Pamplona, Spain; 2004.

6. Garcia Castro A, Sansone SA, Rocca-Serra P, Taylor C, Ragan MA: The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. In: 8th International Protégé Conference: 2005; Madrid, Spain; 2005: 59-62.

7. Noy NF, Hafner CD: The state of the art in ontology design - A survey and comparative review. AI Magazine 1997, 18(3):53-74.

8. Lopez MF, Perez AG: Overview and Analysis of Methodologies for Building Ontologies. Knowledge Engineering Review 2002, 17(2):129-156.

9. Beck H, Pinto HS: Overview of Approach, Methodologies, Standards, and Tools for Ontologies. The Agricultural Ontology Service (UN FAO); 2003.

10. Uschold M, King M: Towards a Methodology for Building Ontologies. In: Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95: 1995; Cambridge, UK; 1995.

11. Fernández M, Gómez-Pérez A, Juristo N: METHONTOLOGY: From Ontological Art to Ontological Engineering. In: Workshop on Ontological Engineering, Spring Symposium Series, AAAI-97: 1997; Stanford; 1997.

12. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for Distributed, Loosely-controlled and evolving engineering of ontologies. In: European Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.

13. Mirzaee V: An Ontological Approach to Representing Historical Knowledge. PhD Thesis. Vancouver: Department of Electrical and Computer Engineering, University of British Columbia; 2004.

14. Arpírez JC, Corcho O, Fernández-López M, Gómez-Pérez A: WebODE in a nutshell. AI Magazine 2003, 24(3):37-47.

15. Noy NF, Fergerson RW, Musen MA: The knowledge model of Protege-2000: Combining interoperability and flexibility. In: 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW'2000): 2000; Juan-les-Pins, France; 2000.

16. Kozaki K, Kitamura Y, Ikeda M, Mizoguchi R: Hozo: An Environment for Building/Using Ontologies Based on a Fundamental Consideration of "Role" and "Relationship". In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW2002): October 1-4 2002; Siguenza, Spain; 2002: 213-218.

17. pOWL [http://powl.sourceforge.net]

18. Uschold M: Building Ontologies: Toward a Unified Methodology. In: 16th Annual Conf of British Computer Society Specialist Group on Expert Systems: 1996; Cambridge, UK; 1996.

19. Cimiano P, Völker J: Text2Onto - A framework for Ontology Learning and Data-driven Change Discovery. In: International Conference on Applications of Natural Language to Information Systems (NLDB): 2005; Alicante, Spain: Springer; 2005: 227-238.

20. Volz R, Oberle D, Staab S, Motik B: KAON SERVER - A Semantic Web Management System. In: Alternate Track Proceedings of the Twelfth International World Wide Web Conference, WWW2003: May 2003; Budapest, Hungary: ACM; 2003: 20-24.

21. Oberle D, Eberhart A, Staab S, Volz R: Developing and Managing Software Components in an Ontology-based Application Server. In: 5th International Middleware Conference: 2004; Toronto, Ontario, Canada: Springer; 2004: 459-478.

22. Frantzi K, Ananiadou S, Mima H: Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries 2000, 3(2):115-130.

23. Card sorting to Discover the Users' Model of the Information Space [http://www.useit.com/papers/sun/cardsort.html]

5.2 A PROPOSED SEMANTIC FRAMEWORK FOR REPORTING OMICS INVESTIGATIONS.

Abstract. The current science landscape is rapidly evolving and is increasingly driven by computational tasks. The deluge of data unleashed by omics technologies such as transcriptomics, proteomics and metabolomics requires systematic approaches for reporting and storing the data and the experimental processes in a standard format, relating the biological information and the technology involved. Ontology-based knowledge representations have proved to be successful in providing the semantics for standardised annotation, integration and exchange of data. The framework proposed by the MGED RSBI working group would provide semantics for upper-level elements relevant to the representation and interpretation of omics-based investigations.

5.2.1 Introduction

When the first microarray experiments were published, it became apparent that the lack of robust quality-control procedures and of capture of adequate biological metadata impeded the exchange and reporting of array-based transcriptomics experiments. The MIAME checklist (Brazma et al. [1]) was written in response to this lack by a group of biologists, computer scientists and data analysts, and aims to define the minimum information required to interpret unambiguously, and potentially reproduce and verify, a microarray experiment. This group then went on to make its composition official and founded the Microarray Gene Expression Data (MGED) Society. The response from the scientific community has been extremely positive, and currently most of the major scientific journals and funding agencies require publications describing microarray experiments to comply with the MIAME standard. The adoption of this standard by public and community databases, Laboratory Information Management Systems (LIMS) and several microarray informatics tools has greatly improved the interpretation of microarray experiments described in a structured manner.

The MIAME model has been adopted by other communities (reviewed by Quackenbush [2]) and as microarrays are incorporated into other complex biological investigations (including toxicogenomics, nutrigenomics and environmental genomics), it has
become apparent that analogous minimal descriptors should be identified for these applications. There have been several extensions to MIAME. MIAME/Tox is an array-based toxicogenomics standard developed by the EBI in collaboration with the ILSI Health and Environmental Sciences Institute (HESI), the National Institute of Environmental Health Sciences (NIEHS), the National Center for Toxicogenomics, and the FDA National Center for Toxicological Research (NCTR). MIAME/Env has been developed by the Natural Environment Research Council (NERC) Data Centre to fulfill the diverse needs of those working in functional genomics of ecosystems, invertebrates and vertebrates, which are not covered by the model-organism community. MIAME/Tox and MIAME/Env have initiated several discussions in academic settings as well as in the industrial and regulatory arenas (OECD Toxicogenomics Guidelines [3]).

However, it has become evident that as other omics technologies are used in combination with microarrays, these MIAME-based checklists will soon be insufficient to serve the scope of experimenters' needs. The toxicogenomics, nutrigenomics and environmental genomics communities soon recognised the need for a strategy that capitalises on synergy, forming the Reporting Structure for Biological Investigations (RSBI [4]) working group under the MGED [5] umbrella. The RSBI working group considers it very important to agree on a single source of basic conceptual information relating to the reporting process of complex biological investigations employing omics technologies. This unified approach to describing the upper-level elements relevant to the representation and interpretation of these investigations should encompass any specific application. The possibility of enabling 'semantic integration' of complex data, facilitating data mining and information retrieval, is the rationale for developing an ontologically grounded conceptual framework. Ultimately, the effort by the RSBI working group aims to constitute the foundation of a standard reporting structure in publications and submissions to public repositories and knowledge bases. The need for information on which to base the evaluation and interpretation of results underlies the objective of presenting sufficient details to readers and/or reviewers.

The information in complex biological investigations is highly nested, and formalising this knowledge to facilitate data representation is not a trivial task. To tackle this issue, the RSBI working group has established links with several standardisation efforts in their biological domains (as reviewed by Sansone et al. [6]) and is working closely with the MGED Ontology working group, the HUPO [7] Proteomics Standards Initiative (PSI), and the Standard Metabolic Reporting Structure (SMRS [8]) group. These groups can clearly draw in large numbers of experimentalists and developers, and feed in the domain-specific knowledge of a wide range of biological and technical experts.

This chapter is organised as follows. In Section 5.2.2 we briefly describe the methodology we followed for developing an ontologically grounded conceptual framework; in Section 5.2.3 we present the proposed upper-level ontology; and Section 5.2.4 includes conclusions and future directions.

5.2.2 Methodology

Our scenario involves communities that are geographically distributed, and for the domain analysis and knowledge acquisition phases the group has used different independent technologies that were not always integrated into the Protégé suite (Noy et al. [9]). From these experiences, members of RSBI are also working with others on a collaborative knowledge acquisition tool for the development of ontologies, integrated in Protégé (Garcia et al. [10]).

Figure 5 schematises the methodology we followed. We built different models throughout our analyses of available knowledge sources and of information gathered in previous steps. Firstly, a "baseline ontology" was gathered, i.e. a draft version containing few but seminal elements of an ontology. Typically, the most important concepts and relations were identified somewhat informally. We could assimilate this "baseline ontology" to a taxonomy, in the sense of a structure of categories and classifications. We consider a taxonomy to be "a controlled vocabulary which is arranged in a concept hierarchy", and an ontology to be "a taxonomy where the meaning of each concept is defined by specifying properties, relations to other concepts, and axioms narrowing down the interpretation." As
the process of domain analysis and knowledge acquisition evolves, the taxonomy takes the shape of an ontology. During this step, the ontologist worked primarily with only a very few of the domain experts; the others were involved in weekly meetings. In this phase the ontologist sought to provide the means by which the domain experts he or she was working with could express their knowledge. Some deficiencies in the available technology were identified, and for the most part these were overcome by our use of concept maps (CMs).

Chapter 5 - Figure 5. Our methodology.

5.2.3 The RSBI Semantic Framework

Our approach is one of an upper ontology that would provide high-level semantics for the representation of omics-based investigations, serving as a conceptual scaffold onto which other ontologies may be hooked. An example of the latter could be an ontology specific to the microarray technology, such as the MGED Ontology, and/or specific to an application, such as toxicology. In order to describe the interaction of different technologies during the course of a scientific endeavour, we considered there was a need for a high-level container in which to place the information relevant to the biology as well as that relevant to the different assays. Our high-level concept is an Investigation, a self-contained unit of scientific enquiry, containing information for Study(-ies) and Assay(s). We consider a Study to
be the set of steps and descriptions performed on the Subject(s). In cases where the Subject is a piece of tissue and no steps have been performed, but just an Assay has been carried out, the Study contains only the descriptors of the Subject (e.g. provenance, treatments, storage, etc.). We consider an Assay to be the container for the test(s) performed and the data produced, for computational purposes. There are different AssayType(s), and the different omics technologies fall within this category. A view of the RSBI upper ontology is shown in Figure 6 and the ontology is available from the RSBI webpage (http://www.mged.org/Workgroups/rsbi/rsbi.html).
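To make the containment relationships concrete, the sketch below renders the Investigation-Study-Assay scaffold as plain Python data classes. This is only an illustration of the conceptual structure; the attribute names (subjects, assay_type, data_files, etc.) are hypothetical and are not part of the RSBI ontology.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Assay:
        # Container for the test(s) performed and the data produced.
        assay_type: str                       # e.g. "transcriptomics"
        data_files: List[str] = field(default_factory=list)

    @dataclass
    class Study:
        # Steps and descriptions performed on the subject(s); with no
        # steps, it holds only the subject descriptors.
        subjects: List[str] = field(default_factory=list)
        steps: List[str] = field(default_factory=list)

    @dataclass
    class Investigation:
        # Self-contained unit of scientific enquiry: the biology
        # (studies) plus the technology (assays).
        title: str
        studies: List[Study] = field(default_factory=list)
        assays: List[Assay] = field(default_factory=list)

    inv = Investigation(
        title="Example nutrigenomics investigation",
        studies=[Study(subjects=["liver tissue, subject 1"])],
        assays=[Assay(assay_type="transcriptomics", data_files=["chip_01.cel"])],
    )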

Chapter 5 - Figure 6. A view of a section of the RSBI ontology.

The corresponding OWL (Web Ontology Language) file representing the ontology is presented in Appendix 1.

5.2.4 Conclusions and Future Directions

Since our framework will allow the use of different ontologies, the definition of whole/part relationships should be consistent across those different ontologies. However, there are currently no standards or guidance for defining whole/part relationships, which adds another layer of complexity when developing an upper-level ontology.

Upper-level, or top-level, ontologies describe very general concepts like space, time and event, which are independent of a particular problem domain. Such unified top-level ontologies aim at serving large communities [11, 12]. For instance, the Standard Upper Ontology (SUO) [13] provides definitions for general-purpose terms, and it acts as a
foundation for more specific domain ontologies. A general-purpose ontology such as the RSBI provides a more specific semantic framework into which it is, in principle, possible to integrate other biological ontologies. As the RSBI aims to facilitate the annotation of biological investigations, the generality of its concepts is constrained to the three predefined domains of knowledge for which it was created (toxicogenomics, nutrigenomics and environmental genomics). The principles recommended by Niles and Pease [13] for developing upper-level ontologies were considered during the development of the RSBI ontology; however, as the RSBI ontology aims to facilitate the description of biological investigations, some practical considerations were also taken into account.

Ultimately, the RSBI upper-level ontology should be able to answer a few questions and position almost anything approximately in the right place, even where no specific ontology yet exists for that spot. The relationship between Study and Assay defines an Investigation; different things participate in different processes, and by the same token some things retain their form over time. Study and Assay contain information about those processes. It is particularly important to maintain minimal commitment when developing upper-level ontologies: only those concepts providing a common scaffold should be considered.

5.2.5 References

1. Brazma, A., Hingamp, P., Quackenbush, J. et al. 2001. Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat Genet. 29(4): 365-71.

2. Quackenbush, J. 2004. Data standards for 'omic' science. Nat Biotechnol. 22: 613-614.

3. OECD Toxicogenomics Guidelines: http://www.oecd.org/document/29/0,2340,en_2649_34377_34704669_1_1_1_1,00.html

4. MGED RSBI: http://www.mged.org/Workgroups/rsbi

5. MGED Ontology: http://mged.sourceforge.net/ontologies/index.php

6. Sansone, S.A., Morrison, N., Rocca-Serra, P., Fostel, J. 2005. Standardization initiatives in the (eco)toxicogenomics domain: a review. Comp. Funct. Genomics. 8: 633-641.

7. HUPO PSI: http://psidev.sourceforge.net

8. SMRS: http://www.smrsgroup.org

9. Noy, N.F., Crubezy, M., Fergerson, R.W. et al. 2003. Protege-2000: an open-source ontology-development and knowledge-acquisition environment. AMIA Annu Symp Proc, 953.

10. Garcia Castro, A., Sansone, S.A., Rocca-Serra, P., Taylor, C., Ragan, M.A. 2005. The use of conceptual maps for two ontology developments: nutrigenomics, and a management system for genealogies. Proceedings of the 8th International Protege Conference. (Accepted for publication)

11. Sure, Y. 2003. Methodology, Tools & Case Studies for Ontology based Knowledge Management. Karlsruhe: Universität Fridericiana zu Karlsruhe.

12. Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L. 2002. Sweetening ontologies with DOLCE. In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, Ontologies and the Semantic Web. Springer-Verlag. 166-181.

13. Niles, I., Pease, A. 2001. Towards a standard upper ontology. Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, 2-9. ACM Press, New York, NY, USA.

Information integration in molecular bioscience

As has previously been pointed out by this research, ontologies in biosciences are to be
used by software applications that ultimately seek to facilitate the integration of information in
molecular biosciences. Previous chapters of this research have explored how to develop
ontologies within a highly decentralised environment such as the bio-community. However,
as the involvement of the community does not just end at the time the ontology is deployed,
it is also important to consider those scenarios in which ontologies are to be used by software
that supports some of the activities carried out by biologists. This chapter introduces the reader to some of the problems encountered when integrating information in the biosciences: not only technical issues concerning the integration of heterogeneous data sources and the corresponding semantic implications, but also the integration of analytical results. Within the broad range of strategies for the integration of data and information, platforms and developments are here distinguished.

The main contribution of this chapter is to present a view of the state of the art in data
and information integration in molecular biology that is general and comprehensive, yet based
on specific examples. The perspective this review gives to the reader is critical, and offers
insights and categorisations not previously considered by other authors. This chapter
concludes with the identification of some open issues for data and information integration in
the molecular biosciences domain and argues that with a wider application of ontologies and
semantic web technologies some of these issues can be overcome.

This chapter contains an original critical assessment made entirely by the author, who conceived its structure, organisation and scope. The manuscripts that led to the published paper were written by Alex Garcia; the analysis and classification presented, as well as the critical insights, were worked out by Alex Garcia.

PUBLISHED PAPER ARISING FROM THIS CHAPTER

Garcia Castro A, Chen Y-PP, Ragan MA: Information integration in molecular bioscience: a review. Applied Bioinformatics 2005, 4(3):157-173.

AUTHORS' CONTRIBUTIONS

Alex Garcia Castro conceived the project and wrote the manuscripts for this paper. Yi-
Ping Phoebe Chen provided useful discussion. Mark Ragan supervised the project, provided
useful discussion and assisted Alex Garcia Castro in the preparation of the final manuscript.

6 Chapter VI - Information integration in molecular bioscience

Abstract. Integrating information in the molecular biosciences involves more than the cross-referencing of
sequences or structures. Experimental protocols, results of computational analyses, annotations and links to
relevant literature form integral parts of this information, and impart meaning to sequence or structure. In this
review, we examine some existing approaches to integrating information in the molecular biosciences. We
consider not only technical issues concerning the integration of heterogeneous data sources and the
corresponding semantic implications, but also the integration of analytical results. Within the broad range of
strategies for integration of data and information, we distinguish between platforms and developments. We discuss
two current platforms and six current developments, and identify what we believe to be their strengths and
limitations. We identify key unsolved problems in integrating information in the molecular biosciences, and
discuss possible strategies for addressing them including semantic integration using ontologies, XML as a data
model, and Graphical User Interfaces (GUIs) as integrative environments.

Molecular bioscience databases (MBDBs) are an essential resource for modern biological research. Researchers typically use these databases in a decentralised manner,
mobilising data from multiple sources to address a given question. A researcher (user) thus
builds up and integrates information from multiple MBDBs. From an early point in the
development of online databases (the early 1990s), different approaches have been explored
to bring about this integration [1]. At the same time, appreciation has grown that not only
data, but information more broadly, must be integrated if the full potential of MBDBs is to be
realised. Data integration per se can be the least difficult part of this undertaking; indeed, data
integration sometimes means little more than achieving the "interactive hypertext flavor" of
database interoperation [2]. In contrast, integrating information requires a conceptual model
in which MBDBs are described in context [2] and links become meaningful relationships. The
extraordinary degree and diversity of interconnectedness among biological information, and
the lack of network models that capture these connections, have so far made it very difficult
to achieve satisfactory integration of information in the molecular biosciences.

MBDBs are essentially collections of entries that contain descriptors of sequences.
Autonomous databases (often more than one) typically exist for each broad type of information (e.g. nucleotide sequence data). Most are poorly interconnected and differ in
scope, organisation [3, 4] and functionality. Both the databases themselves, and individual
entries within them, may be incomplete.

Sequences and their descriptors are meant to inform us about organisms and life
processes. In this sense, a sequence is not merely an isolated entity, but, on the contrary, is
part of a highly interconnected network. Initially at least, MBDBs were intended merely as
data repositories. Later, some databases were developed to facilitate the retrieval of connected
information that goes beyond sequences – for example, metabolic pathway databases
(MPDBs), in which sequences are nodes of networks (subgraphs) linked by edges that
represent biochemical reactions. In such a context, the meaning of a sequence is given by the
way it relates to other sequences, correlating data beyond sequences and reactions. Only by
understanding this context can a user formulate an intelligent query.

We believe that the integration of both data and information in molecular bioscience
should embody more-holistic views: how do molecules, pathways and networks interact to
build functional cells, tissues or organisms? How can health, development or diseases be
modeled so all relevant information is accessible in meaningful context? Developing
computational solutions that allow biologists to query multiple data sources in meaningful
ways is a fundamental challenge in modern bioinformatics [5], one with important
implications for the success of biomedicine, agriculture, environmental biotechnology and
most other areas of bioscience.

There have been many diverse approaches to integrating information in bioinformatics,
and it is not feasible for us to review them all comprehensively. In this review, we focus
primarily on those we consider to represent the evolution of this field toward more
semantically driven integrative approaches. Much of this evolution has been driven by ad hoc
responses to specific needs. We attempt to provide a conceptual framework in which to
analyze these and to envision the potential of emerging semantically based technologies to
contribute to this domain in the immediate future.

This review is organised as follows. In the first section we present an overview of issues and technologies relevant to integration of information in the molecular biosciences, and
distinguish platforms from developments. Next, we focus on data integration (Section 6.2),
describing some existing platforms and developments and considering the extent to which
they can be considered to integrate information. In the third section (6.3) we address deeper
issues of semantic integration of molecular-biological information, highlighting the role of
ontologies. Section 6.4 presents XML not only as a format for data exchanges, but also as a
technology for introducing and managing semantic content. In Section 6.5 we describe how
Graphical User Interfaces (GUIs) can provide integrative frameworks for data and analysis. In
Section 6.6 we further detail metabolic pathway databases as a special case of integration –
one in which data become more valuable in the context of other data, and in which a formal
description of information relatedness and flow helps shape the description of biological
processes. Section 6.7 summarises and concludes our analysis and presents what we consider
to be key unsolved problems.

6.1 OVERVIEW OF ISSUES AND TECHNOLOGIES

It is a defensible philosophical position that integration – of observations with each
other, with previous knowledge, or with a concept or hypothesis – is a necessary part of
understanding. If so, most if not all scholarly or intellectual activities are integrative, as are
many research technologies and methodologies. Although such a broad conceptualisation of
integration is not by itself particularly powerful, it does help us appreciate that integration of
information can be, and has been, approached in many ways from diverse perspectives.

We have already distinguished data integration from integration of information, and in Sections 6.2.1 and 6.2.2 we will discuss both general and more-specific approaches that provide
different (partial) solutions to data integration in the molecular biosciences. These interact to
greater or lesser extents with generic issues of data management and sharing, some of which
(e.g. version control, persistency, security) can be adequately addressed within commercial (or,
to a lesser extent, open-source) database management systems, while other issues (including
data availability, control of data quality and standardisation of formats) also have domain-
specific features.

6.1.1 Data availability

The scope and coordination of public databases such as those organised for the
research community by the U.S. National Center for Biotechnology Information (NCBI), the
European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ) are a
characteristic feature of molecular bioscience. However, extensive data are also held privately,
some proprietarily within commercial enterprises (e.g. pharmaceutical or agrichemical
companies) and others available on a subscription basis. Public data can be integrated into
private databases, but (although the technical issues are presumably no different than for
integration among public databases) because of access policies the reverse does not happen.
Initiatives such as the International Crop Information System (http://www.icis.cgiar.org) and
the Global Biodiversity Information Facility (http://www.gbif.org) will likewise come up
against boundaries between public and private information.

6.1.2 Data quality

The major public sequence databases have instituted measures to ensure that the data
they provide to the research community are of high quality. These include standard formats
and software for data submission, automated processing of submissions, and the availability
of human assistance. Nonetheless, as open, comprehensive repositories, these databases
necessarily contain instances of incomplete or poor-quality data, missing fields and legacy
formats that can only create problems for data and information integration. These problems
should be largely absent from databases that are expert-curated (e.g. Mouse Genome Database
[http://www.informatics.jax.org/], UniProt/SwissProt [6]) or derived from a curated
database (e.g. ProDom [http://protein.toulouse.inra.fr/prodom/current/html/home.php]),
based around one or a few large projects (e.g. Ensembl [7], FlyBase
[http://flybase.bio.indiana.edu/]), or otherwise narrowly focused (e.g. Protein Kinase
Resource [http://pkr.sdsc.edu/html/index.shtml], Snake Neurotoxin Database
[http://research.i2r.a-star.edu.sg/Templar/DB?snake_neurotoxin/]). On the other hand, a
proliferation of boutique databases could create physical, networking and other obstacles to
integration.

6.1.3 Standardisation

International efforts through the 1990s, in part through the Committee on Data for Science and Technology (CODATA; http://www.codata.org) of the International Council for Science, led to highly coherent data formats for molecular sequence data at EBI, NCBI and
DDBJ. This is despite the evolution, during that decade, of sequencing technology from
manual slab gels with autoradiographic detection to automated slab gels with fluorescent
detection, to today’s capillary-based technologies. However, data standardisation remains a
major issue in fields where it may be less obvious what experimental conditions are relevant
to interpretation and where alternative technologies may be intrinsically less compatible. The
MIAME/MGED (minimum information about a microarray experiment/Microarray Gene
Expression Data Society; http://www.mged.org) and MAGE [8] (microarray and gene
expression) initiatives among the expression microarray community, and the Proteomics
Standards Initiative (PSI; http://psidev.sourceforge.net), exemplify the efforts being
undertaken to establish data standards for newer types of molecular data.

Technological issues cut across the integration of information in diverse ways, many of
which are discussed, in greater or lesser detail, in the sections that follow. Two others bear
further mention here: language and access.

6.1.4 Language

Integrative frameworks or developments can be built using either general purpose or
specialised languages. Thus, for example, the annotation pipeline PRECIS [9] is based on
Awk and Perl, while MAGPIE [10] uses Prolog, C, Perl and (for the GUI) Java™/JavaScript.
By contrast, the data source-wrapping functions of Sequence Retrieval System (SRS) [11, 12]
are implemented in a unique language ICARUS, while Kleisli [13, 14] makes use of a high-
level query language called sSQL (Semi-Structured Query Language). Open Bioinformatics
Foundation projects such as Bioperl (http://www.bioperl.org), BioJava (http://www.biojava.org) and the like provide modules, scripts and sometimes ready-to-use code for many different tasks in bioinformatics. Open Bioinformatics Foundation projects are a community effort to provide repositories of useful tools: stand-alone applications for end users, and Perl modules that developers can later use in different in-house applications.

6.1.5 Access

“Grid” initiatives refer to a vision of the future in which data, resources and services
will be seamlessly accessible through the Internet, in the same sense that the electricity
delivered to our homes and offices is generated and transmitted via diverse power plants,
transmission lines, substations and the like that electric-power users rarely have to think
about. “The grid” will actually be multiple grids (data grid, computation grid, services grid)
and will be useful only to the extent that relevant components communicate with each other.
Many existing grid initiatives are coordinated through the Global Grid Forum
(http://www.ggf.org) in which the bioinformatics community is actively represented. It is
envisioned that the computational grid will be implemented using a standard “toolkit” of
reference software (http://www.globus.org). Other initiatives focus on how data can be most
efficiently shared across a data grid. The Life Science Identifier (http://lsid.sourceforge.net),
for example, has been proposed as a uniform resource name (URN) for any biologically
relevant resource. It is being offered as a formal standard that would be served on top of, not
as a replacement for, existing formats or schemata.

A third interrelated set of perspectives on information integration relates to the interface with humans. This is necessarily a broad area, and many issues relate more to physiology, psychology or sociology than to bioinformatics per se. The most typical interface with individual humans (users) is the GUI, discussed further in Section 6.5.

6.2 STRATEGIES FOR DATA INTEGRATION

6.2.1 Platforms

We recognise two broad strategies for integration of data and information in molecular
biology: (i) provision of a general platform (framework, backbone) and (ii) addressing a specific
problem via a specific development. A platform for data integration offers a technological
framework within which it is possible to develop point solutions; usually platforms provide
non-proprietary languages, data models and data exchange/exporting systems, and are highly
customizable. By contrast, a development may not provide technology applicable to other
problems even in the same domain. A platform is meant to be a deeper layer over which
several heterogeneous solutions may share a common background, and in this way some
degree of interoperability can be achieved. Kleisli and DiscoveryLink® [15] are examples of platforms over which heterogeneous data can be integrated. Platforms provide a data model, query optimisation procedures, and a general query language, as well as flexible data exchange mechanisms.

We consider developments to be proposed solutions that are either built on top of platforms for data integration or make extensive use of wrappers or parsers. They do not provide a significant environment, and their integrative context is limited. Davidson et al. [16] define three different classes of integrative strategies: (i) link-driven federations, (ii) view integration and (iii) warehousing. Proposed solutions in which users must navigate through predefined links (provided by the system) among different data sources in order to extract information are called link-driven federation solutions. SRS and GeneCards® [17] are examples of this category.

In the view integration approach, the schemata of a collection of underlying data sources are merged to form a global schema under some common model. Users query this global schema using a high-level query language such as CPL (Collection Programming Language), SQL (Structured Query Language) or OQL (Object Query Language). The system determines what portion of the query can be answered by which data source, ships local
queries to the appropriate data source and then combines the answers from the various data
sources to produce an answer to the global query. Kleisli is one example of this type of
integrative strategy.
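As a toy illustration of this idea, the sketch below shows a mediator that answers a global query by shipping sub-queries to two wrapped sources and combining the partial answers. All names and the two in-memory "databases" are invented for the example; real systems such as Kleisli add a data model, a query language and query optimisation on top of this basic pattern.

    # Two heterogeneous "data sources", each behind a wrapper that returns
    # records in a common representation (here: plain dicts).
    SEQUENCE_DB = {"P12345": {"id": "P12345", "sequence": "MKLV..."}}
    ANNOTATION_DB = {"P12345": {"id": "P12345", "function": "kinase"}}

    def wrap_sequences(accession):
        # Local query shipped to the sequence source.
        return SEQUENCE_DB.get(accession, {})

    def wrap_annotations(accession):
        # Local query shipped to the annotation source.
        return ANNOTATION_DB.get(accession, {})

    def global_query(accession):
        """Mediator: decompose the global query, ship local queries to
        the appropriate sources, then combine the partial answers."""
        combined = {}
        combined.update(wrap_sequences(accession))
        combined.update(wrap_annotations(accession))
        return combined

    print(global_query("P12345"))
    # {'id': 'P12345', 'sequence': 'MKLV...', 'function': 'kinase'}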

The warehousing approach is distinguished by holding information in a common
location (the warehouse) instead of in multiple locations. Data warehousing requires a unified
data model that can accommodate the information in the component data sources; it also
needs a series of programs that fetch the information and transform it into the unified data
model [18]. Data warehouses are not easy to implement and administer, and keeping the
information up to date is further complicated if changes in the original data sources require
structural changes in the warehouse.
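The warehousing alternative can be sketched in the same toy style: fetch-and-transform programs copy each component source into a single local store that is then queried directly. The sources and field names below are invented for the example.

    # A toy warehouse: one unified local store, filled by a
    # fetch-and-transform program per component source.
    warehouse = {}

    def load(records, transform):
        """Transform source records into the unified data model and
        merge them into the warehouse."""
        for record in records:
            unified = transform(record)
            entry = warehouse.setdefault(unified["id"], {})
            entry.update(unified)

    # Source A uses 'acc'/'seq'; source B uses 'accession'/'desc'.
    load([{"acc": "P12345", "seq": "MKLV..."}],
         lambda r: {"id": r["acc"], "sequence": r["seq"]})
    load([{"accession": "P12345", "desc": "protein kinase"}],
         lambda r: {"id": r["accession"], "description": r["desc"]})

    print(warehouse["P12345"])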

Kleisli and DiscoveryLink® can be considered platforms for data integration. Although
not fully integrated strategies, they address data integration under a broader perspective than
does an individual development.

Kleisli: Kleisli is a mediator system, encompassing a nested relational data model, a high-level query language, and a query optimiser. It provides a high-level query language, sSQL, which can be used to express complicated transformations across multiple data sources. The sSQL module can be replaced with other high-level query languages. It is possible to add new data types if an appropriate wrapper is available or can be added. Kleisli does not have its own Database Management System (DBMS); instead, it has functionality to convert many types of database systems into its nested relational data model. Kleisli does not require schemata; its relational data model and its data exchange format can be translated by external databases [14]. Kleisli is thus a backbone that is not limited in application to the biological domain.

DiscoveryLink®: DiscoveryLink® is also a mediator system. This product (and its emerging successor, WebSphere® Information Integrator [http://www-306.ibm.com/software/data/integration/partners.html]) addresses the problem of integrating information from a much broader perspective, via a technological platform that enables the user to query a broad range of file types that are part of a federation. DiscoveryLink® exhibits
several advantages over other proposed solutions, basically because it relies on a de facto data model for most commercial DBMSs. For example, users are able to carry out post-query manipulations, and can use explicit SQL statements to define their queries; query optimisation is also a feature of DiscoveryLink®. However, adding new data sources or analysis tools into the system is not a straightforward process. DiscoveryLink® wrappers are written in C++, which is not necessarily the most suitable programming language for wrappers [14]. Extensive knowledge of SQL and of relational database technology is needed. DiscoveryLink® is built over IBM's DB2® technology, which is a commercial product. Although new data sources can often readily be incorporated within the DiscoveryLink® federation, it may be much more difficult to integrate DiscoveryLink® per se with non-DiscoveryLink® environments.

Kleisli and DiscoveryLink® support queries across a heterogeneous federation using a
standard language. Both systems allow users to manipulate BLAST® results via SQL
statements, integrate flat files into the federation via a wrapper and query federated flat files
using SQL, either in a preformatted graphical way or via a command line. It is possible to
have Microsoft® Word and Excel® and also text files as part of the federation. XML files
and the PubMed database can also be accessed. Users typically need extensive knowledge of
the schemata that represent the information and a high-level understanding of the system.
Neither system is powerful in the hands of a naive end user.

Wrappers, such as those used in both Kleisli and DiscoveryLink®, mediate between the query system and a specific data source (or type of data source). Systems that wrap multiple heterogeneous data sources thus translate their data into a common integrated data representation [19]. Wrappers thereby provide a kind of lingua franca through which two different databases communicate and produce a result. Retrieval components in wrappers map queries onto common gateway interface (CGI) calls. Changes in data sources make wrappers difficult to maintain.
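
A minimal sketch of what such a wrapper does is given below (Python; the URL, field names and flat-file layout are hypothetical). Note that the fixed, source-specific parsing rules are exactly what breaks when the data source changes:

```python
from urllib.parse import urlencode

class FlatFileWrapper:
    """Hypothetical wrapper: maps a field-based query onto the CGI
    interface of one data source and translates the reply into a common
    record format, so the mediator never sees source-specific syntax."""

    def __init__(self, base_url):
        self.base_url = base_url

    def build_request(self, **fields):
        # Retrieval component: query -> CGI call (URL built, not fetched).
        return self.base_url + "?" + urlencode(fields)

    def parse(self, raw_entry):
        # Source-specific "XX   value" lines -> common key/value form.
        # The fixed column offsets are what make wrappers brittle: any
        # change to the source format silently breaks this method.
        record = {}
        for line in raw_entry.splitlines():
            if len(line) > 5:
                record.setdefault(line[:2], []).append(line[5:].strip())
        return record

w = FlatFileWrapper("http://example.org/cgi-bin/fetch")  # placeholder URL
print(w.build_request(db="embl", id="X56734"))
print(w.parse("ID   X56734\nDE   Trifolium repens mRNA\n"))
```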

6.2.2 Developments

There have been many developments, some of which offer a framework over which the integration of analytical tools is possible, whereas others were designed to provide GUI capabilities only for very specific algorithms. Commercial packages such as VectorNTI® (http://www.invitrogen.com), Lasergene® (http://www.dnastar.com) and SeqWeb® (http://www.accelrys.com/products/gcg/seqweb.html) basically offer additional functionality (e.g. access to databases of plasmid sequences, tools for immediate visualisation of 3-dimensional structure, etc.) and facilities for direct manipulation of specialised hardware devices.

In this section we present some of the existing developments currently available; a summary is given in Table 1. We do not attempt to cover all of the existing developments; rather, we present some representative ones. A more detailed list of online tools is available in the supplementary material at http://130.102.113.135/sup_material.html.

6.2.2.1 Sequence Retrieval System (SRS)

SRS is an information indexing and retrieval system designed for libraries with a flat-file format, such as the EMBL Nucleotide Sequence Database [20], the Swiss-Prot protein sequence databank [6] or the PROSITE library of protein subsequence consensus patterns [21]. SRS was intended to be a retrieval tool that allows the user to access as many different biological data sources as possible via a common GUI. It is relatively easy to integrate new data into SRS, which wraps data sources via a specialised, built-in wrapping programming language called ICARUS. It can be argued, however, that parsers should be written in a general-purpose language, rather than a language being built around a parser [22].
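
For illustration, the following Python sketch performs the kind of work an ICARUS wrapper automates, here written in a general-purpose language: indexing a flat-file library so that entries can be fetched by accession number. The entry layout is a simplified, invented EMBL-like format:

```python
import re

# Scan a flat-file library and build an index from accession numbers to
# byte offsets, so entries can later be retrieved directly.
LIBRARY = (
    "ID   ENTRY1\nAC   P12345;\nDE   putative kinase\n//\n"
    "ID   ENTRY2\nAC   Q67890;\nDE   unknown protein\n//\n"
)

def build_index(text):
    index, offset = {}, 0
    for entry in text.split("//\n"):
        m = re.search(r"^AC\s+(\w+);", entry, re.MULTILINE)
        if m:
            index[m.group(1)] = offset
        offset += len(entry) + len("//\n")
    return index

print(build_index(LIBRARY))  # {'P12345': 0, 'Q67890': 49}
```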

| Feature | SRS | GeneCards | Entrez | Ensembl | BioMOBY |
| --- | --- | --- | --- | --- | --- |
| Integration of new data sources | Allowed | Possible but requires recoding | Yes, by central administration | Possible | Via the registry service |
| Query language | Some query language capacities via ICARUS | NA | API | SQL/Perl | API |
| Query optimisation | NA | NA | NA | NA | NA |
| Data model | No true data model | NA | NA | Oriented to solving a specific problem | NA |
| Data exchange | Low | Low | Low | Medium | Low |
| Technology involved | Perl, JavaScript, ICARUS | Perl / Glimpse | NA | MySQL® / Perl | XML / SOAP / Perl / Java™ |
| Graphical User Interface (GUI) | Intuitive, functional | Intuitive, functional | Intuitive, functional, navigable, friendly | Intuitive, functional, navigable | Early stages of development |
| Level of link-driven solution | High | High | High | Medium | Medium |
| Analysis tools available | Available | NA | Available | Available | Available |
| Licensing | Free for academic use, local installation | Free for academic use | Access over the Web | Free for academic use, local installation | Free, download from the MOBY website |

API = application program interface; NA = not available; SOAP = Simple Object Access Protocol; SQL = Structured Query Language; SRS = Sequence Retrieval System.
Chapter 6 - Table 1. Some existing developments in database integration in molecular biology
SRS provides integration capacities by allowing the user to navigate through
information contained in different heterogeneous data sources, and to capture entries. As
such, it is a good example of a link-driven federation. With SRS, the user still has to know the
schema of each database, and formulate suitable and valid queries. Further operations or
transformations are not easy in SRS. To some extent, SRS may be seen as a GUI integration
approach; not only deeper integration but even the capacity for interoperation is limited.

As a relatively popular attempt at a unified GUI to heterogeneous data sources, SRS may
provide important information about Human-Computer Interaction (HCI) in the
bioinformatics field.

Neither Kleisli nor DiscoveryLink® is deeply comparable with SRS. This is because SRS focuses more on the integration of molecular data, whereas DiscoveryLink® and Kleisli were, from their beginnings, designed to allow the biomedical community to access data in a wide variety of file formats.

6.2.2.2 GeneCards®

GeneCards® is a compendium of information on individual human genes. It has been built using Perl, with Glimpse (Global IMPlicit Search; http://webglimpse.net) as an indexing system. GeneCards® may be seen as a classical collection of human gene-related data, implemented in a single information space and with some query capacities. It does not provide a clear integration perspective, and should not be seen as a solution beyond its initial purposes. Adding new data is not a straightforward process.

6.2.2.3 Entrez

Entrez [23, 24] is an integral part of the NCBI portal system, and as such is an
integrative solution within the NCBI problem framework. It provides a single portal for
access to most existing genomes, along with some analysis tools and database querying
capacities for genomic, protein and bibliographic information about specific genes. Graphical
displays for chromosomes, contig maps and integrated genetic and physical maps are also
available. Entrez also links each data element to neighbours of the same type [23].

6.2.2.4 Ensembl

Ensembl is not by itself a data integration effort, but rather an automatic annotation tool. The task of annotation involves integrating information from different data sources, at different levels and using different methods in concert. Ensembl provides general visualisation tools and the ability to work with different data sources. Ensembl relies on open-source projects such as BioJava and BioPerl. Raw data are loaded into the MySQL®-based (http://www.mysql.com) internal schema of Ensembl and processed through its annotation pipeline; results can be visualised using, for example, the Apollo [25] genome browser. Query capacities in Ensembl are limited, but more flexible capacities may be achieved by addition of Perl or Python scripts. Ensembl does not inherently provide alternative data models (or the flexibility to supply alternative models), data exchange formats or substantial data exchange capacities. Thus, Ensembl is not a data integration solution, but rather one that addresses a specific problem (genome browsing and automatic annotation).

6.2.2.5 BioMOBY

BioMOBY (http://www.biomoby.org) proposes an integrative solution based on a web client/service/registry architecture. In BioMOBY, data repositories are treated as web services. The BioMOBY project was established to address the problem of discovering and retrieving related instances of biological data from multiple hosts and services via a standardised query and retrieval interface that uses consensus object models. BioMOBY allows users to relate information by using a priori knowledge of what has previously been considered to be relevant information. The idea behind MOBY is simple: a registry acts as a server in which knowledge of the different services is stored; different clients make use of this central facility, thereby allowing the formulation of queries, and a client interacts with different data sources regardless of the underlying schema. BioMOBY provides a common format for the representation of retrieved data regardless of their origin, and in this way eliminates the need for endless cutting and pasting [26]. Relying on the paradigm of Universal Description, Discovery and Integration (UDDI), BioMOBY integrates data by integrating the different services in which these data are stored. At this point in its evolution, BioMOBY addresses a specific type of problem, and does not provide a real framework over which other solutions can be built.
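
The interaction pattern, reduced to its essentials, might be sketched as follows (Python; this is not the BioMOBY API, and the registry content, datatype and service names are invented):

```python
# A registry maps (input datatype, task) pairs to service endpoints;
# clients discover a service first and only then invoke it.
REGISTRY = {
    ("DNASequence", "getGenBankRecord"): "http://example.org/services/genbank",
}  # hypothetical registry content

def discover(datatype, task):
    """Ask the central registry which endpoint can perform `task` on `datatype`."""
    return REGISTRY.get((datatype, task))

def call_service(endpoint, payload):
    # In BioMOBY the payload is a lightweight XML object passed over the
    # wire; here we only show the shape of the exchange.
    return {"endpoint": endpoint, "result_type": "GenBankRecord", "data": payload}

endpoint = discover("DNASequence", "getGenBankRecord")
if endpoint:
    print(call_service(endpoint, "ATGGCGT"))
```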

6.2.2.6 myGrid

myGrid [27] is a prototype grid environment for bioinformatics, aiming to integrate web facilities in a grid of bioinformatics services. It addresses the problem of integration by identifying the main sources of information: literature sources, analysis methods and databanks. myGrid provides high-level services for data and application integration, such as resource discovery, workflow enactment and distributed query processing, and combines analysis and query processes into concrete, executable workflows. It provides an integrated ontology in which data and analysis sources are described. myGrid offers a high-level middleware solution to support personalised in silico experiments over a service grid; the project anticipates a service grid in which data sources and analysis tools are orchestrated into reusable components. myGrid recognises three different categories of services: (i) services for forming experiments, (ii) services for discovery and metadata management, and (iii) services for supporting e-science. While it is true that workflows in bioinformatics interleave query and analysis processes, these are not a single unified process, and administering such descriptions over the service grid has proven to be difficult.

6.2.2.7 Others

Other projects also provide different functional capabilities and integrate information
from heterogeneous sources for a particular purpose or for a specific community. FlyBase
and WormBase (http://www.wormbase.org/) are examples of such integrative efforts that
aim to provide ‘all’ the available information related to a particular organism.

6.3 SEMANTIC INTEGRATION OF INFORMATION IN MOLECULAR BIOSCIENCE

Syntactic integration basically deals with heterogeneity of form: the structure, but not the meaning. Semantic integration, on the other hand, fundamentally deals with the meaning of information and how it relates within a specific field. It addresses the problem of identifying semantically related objects in different databases, then resolving the schematic (schema-related) differences among them [28]. A simple scenario where semantic implications matter is one in which a protein may be identified in a particular databank with a certain accession number, but may have a different identifier, or even a different annotation, in another database (i.e. may appear non-synonymous). In the case of bioinformatics, semantic integration could (and, we argue, should) be seen to encompass not only the taxonomy of terms (controlled vocabulary) or the resolution of semantic disagreements (e.g. between synonyms in different databases), but also the discovery of services (databases and/or analysis algorithms).

Semantic integration of MBDBs thus focuses, at some level, on how a database entry
can be related to other information sources in a meaningful way. Our previous descriptions of
database integration (in Section 6.2) addressed the problem of querying, and extracting data
from, multiple heterogeneous data sources. If this can be done via a single query, the data
sources involved are considered interoperable. We have not so far considered how a
particular biological entity might be meaningfully related to others; only location and
accessibility have been at issue. In the same way, the complexity of a query would be largely a
function of how many different databases must be queried, and, from these, how many
internal subqueries must be formed and exchanged for the desired information to be
extracted. If a deeper layer that embeds semantic awareness were added, it is probable that
query capacities would be improved. Not only could different data sources be queried, but
(more importantly) interoperability would then arise naturally as a consequence of semantic
awareness. At the same time, it should be possible to automatically identify and map the
various entries that constitute the knowledge relationship, empowering the user to visualise a
more descriptive landscape.

A richer example of why semantics matters may be seen with the word ‘gene’, a term that has different meanings in different databases. In GenBank® [29], a gene is a “region of biological interest with a name that carries a genetic trait or phenotype”, and includes non-structural coding DNA regions such as introns, promoters and enhancers. In the Genome Database [30], a gene is just a “DNA fragment that can be transcribed and translated into a protein”. The RIKEN Mouse Full-length cDNA Encyclopedia [31], which focuses on full-length transcripts, refers to the transcriptional unit instead of the gene. Queries involving databases that, among themselves, present such semantic issues necessarily have limited operability. Ontology can provide a guiding framework within which the user can restrict a query to the context in which it makes sense, and can navigate intelligently across terms. Semantic approaches thus depend heavily on ontology.

What is ontology? Notions of what ontology is, and how it should be implemented, differ, but include the following: (i) a system of categories accounting for a particular vision of the world [32]; (ii) a specification of conceptualisations [33]; (iii) a concise and unambiguous description of the principal relevant entities, with their potential, valid relations to each other [34]; and (iv) a means of capturing knowledge about a domain such that it can be used by both humans and computers [35]. In the molecular biosciences, an ontology should capture, in axioms, the relations among concepts. These axioms might then be used to extract implicit knowledge, such as the transitive closure of relations (if an enzyme is a type of protein and a protein a type of polypeptide, then an enzyme is a type of polypeptide) [36].
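
Such an inference is mechanically simple, as the following Python sketch of a transitive closure over is-a assertions shows (the axioms are those of the example above, plus one extra):

```python
# Explicit is-a axioms; the closure makes the implicit fact
# "an enzyme is a polypeptide" explicit.
IS_A = {"enzyme": {"protein"}, "protein": {"polypeptide"},
        "polypeptide": {"molecule"}}

def ancestors(term):
    seen, stack = set(), [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(ancestors("enzyme"))  # {'protein', 'polypeptide', 'molecule'}
```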

Ontology may also provide a framework for describing living systems in terms of
information. For example, metabolic pathways describe many different chains of reactions
that relate different biological entities. These complex networks reflect a deep layer of
concepts that describe the system and, if represented appropriately, could support
visualisation, querying, and implementation of further analyses.

Thus, we see that an ontology is not simply a controlled vocabulary, nor merely a dictionary of terms. Controlled vocabularies per se describe neither relations among entities nor relations among concepts, and consequently cannot support inference processes. Database schemata describe categories, and provide an organisational model of a system, but do not necessarily represent relations among entities. Database schemata can be derived from ontologies, but the reverse step is not so straightforward. An ontology might better be considered as a type of knowledge base in which concepts and relations are stored, and which makes inference capacities available.

Ontologies, and metadata based on ontologies, are sometimes presented as means to support the sharing and reuse of knowledge [37]. Application of an ontologically based approach should be more powerful than simple keyword-based methods for information retrieval. Not only can semantic queries be formed, but axioms that specify relations among concepts can also be provided, making it possible for a user to derive information that has been specified only implicitly [38]. In this way, relevant entries and text can be found even if none of the query words is present (e.g. a query for “furry quadrupeds” might retrieve pages about bears).
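
A toy version of this retrieval behaviour can be sketched in a few lines of Python (the ontology and documents are invented): the match is made through concept membership supplied by the ontology, not through the words of the text itself:

```python
# Each known word is mapped to the concepts it instantiates; a document
# matches when some word in it covers all concepts in the query.
ONTOLOGY = {"bear": {"quadruped", "furry"}, "trout": {"fish"}}
DOCUMENTS = {1: "the bear hibernates in winter", 2: "trout swim upstream"}

def semantic_search(query_concepts):
    hits = []
    for doc_id, text in DOCUMENTS.items():
        for word in text.split():
            if set(query_concepts) <= ONTOLOGY.get(word, set()):
                hits.append(doc_id)
                break
    return hits

# Neither "furry" nor "quadruped" occurs in any document text.
print(semantic_search({"furry", "quadruped"}))  # [1]
```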

Data integration in MBDBs raises problems of syntactic and semantic heterogeneity. Semantic integration is a difficult task. In the special case where a top-level concept matches another concept in a different ontology, merging these two branches would require all derived concepts to be checked; as yet, this cannot be automated. An ontology to be applied across multiple databases might best be placed centrally, so each database can be mapped to it directly, not via other databases. General inference algorithms might then identify identical or similar concepts in other databases [39].

Can the use of ontology improve query capacities? We believe it can, but much
optimisation of query logic will be required if full benefits are to be won. An ontological
veneer over existing databases will achieve little. With the current state of MBDBs, ontology
might be helpful primarily as a flexible guidance system, supporting the user in building
queries by relating concepts. To avoid the philosophically difficult question of what
constitutes a related concept, we prefer to think not in terms of related concepts in general,
but rather about restricting relations to a defined context.

Molecular biology has an emerging de facto standard ontology. The Gene Ontology™ (GO) consortium [40], established in 1998, provides a structured, precisely defined, common controlled vocabulary for describing the roles of genes and gene products in eukaryotic cells. GO embodies three different views of these roles: as functions, as processes and as cellular components. Although developed as a taxonomy of concepts and their attributes for eukaryotic cells, the GO framework is, in principle, extensible to other biological systems. GO is organised as a directed acyclic graph: a child can have many parents. Although GO is not the only ontology relevant to the molecular biosciences, it has been used to annotate genes across multiple organisms. Insofar as it embodies the definition of a standard for the annotation of gene products, GO integrates genomic data and protein annotation from many sources [41]. Ideally, we should be able to use high-quality annotation to explore a variety of hypotheses about evolutionary relations and other comparative aspects of genomics.

Although the number of ontologies formally used in bioinformatics applications is still small, where they have been used they span a wide range of purposes, subject areas and representation styles. So far, ontologies have been used in three distinct areas: database schema definition (e.g. EcoCyc [42]), annotation and communication (e.g. GO), and query formulation (e.g. TAMBIS [43]).

TAMBIS is a special case of a point development. TAMBIS (Figure 1) is modelled as a knowledge base, and provides single, unified access to multiple data sources. It is a high-level, ontology-centric query facilitator in which queries are formulated against a canonical representation of biomedical concepts, and in which maps of terms relating this representation are linked with external data sources [44, 45].

TAMBIS therefore provides a level of interaction between the user and the external sources that removes the need for the user to be aware of the schema. It is based on a three-layer mediator/wrapper architecture [46] and uses Kleisli as a backend database. TAMBIS is intended to improve query capacities in MBDBs via its supplied conceptual model, a knowledge-driven user interface and a flexible representation of biological knowledge that supports inference processes over the relations among concepts. The representation is implemented using the GRAIL description logic (http://www.openclinical.org/dld_galenGrail.html). With TAMBIS, the user is guided over an ontologically informed map of concepts related to a given query. This is done by exposing the user to the terminological model, and by providing a guided query formulation system implemented in a graphical tool.

Queries are written into a form-based interface positioned over a source-independent ontology of bioinformatics concepts [46]. The user is therefore not navigating through links, but moving in a conceptual space. TAMBIS builds queries and Kleisli executes them.

Chapter 6 - Figure 1. Schematic representation of the architecture of TAMBIS.
CPL = Collection Programming Language
Genotype-phenotype relations are modeled in PharmGKB, the Pharmacogenetics and Pharmacogenomics Knowledge Base (http://pharmgkb.org) [47]. PharmGKB is an Internet-based resource that integrates complex biological, pharmacological and clinical data; it contains genomic, phenotypic and clinical information collected from ongoing pharmacogenetic studies. PharmGKB is organised as a knowledge base, with an ontology that contains a network of genetic, clinical and cellular phenotype knowledge interconnected by relations and organised by levels of abstraction [48].

The relevance of semantics in bioinformatics extends beyond issues of the representation of information stored in databases. As biological research moves increasingly in silico, and becomes increasingly difficult to distinguish from computational and information science, semantically based web technologies are likely to contribute to the unification of grid and workflow environments. myGrid is intended to provide an environment in which defined workflows are reusable and shared by means of semantic layers. Semantic description helps make the knowledge behind a workflow explicit and, thus, shareable.

6.4 XML AS A DESCRIPTION OF DATA AND INFORMATION

XML is a language for describing data. It is becoming a standard document format in areas well beyond the biosciences [49]. By itself, XML provides neither a description of the data so formatted, nor an integrative solution. But because XML is becoming a de facto data exchange format in the molecular biosciences (i.e. the standard data format for exchange purposes), we include a discussion of XML and the way it is being used.

Data migration between programming languages is a problem in bioinformatics. XML offers a way to serve and describe data in a uniform and automatically parsable format. Powerful XML-based query languages for semi-structured data have been developed (e.g. XPath [XML Path Language; http://www.w3.org/TR/xpath20], XQuery [http://www.w3.org/TR/xquery] and XQL [XML Query Language; http://www.w3.org/TandS/QL/QL98/pp/flab.txt and http://www.ibiblio.org/xql/]). As the relational data model does not by itself adequately represent information in the molecular biosciences, it may be useful to build around XML a framework for integrating tools in molecular biology, for querying and for transforming results for analysis by appropriate agents [50]. A robust, stable XML data integration and warehousing system does not yet exist. However, once high-performance data stores become available [49], perhaps in a grid
environment, XML and XML-based tools may mature into an alternative data integration
platform comparable with Kleisli and DiscoveryLink®.
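
The style of querying these languages support can be illustrated with the XPath subset built into Python's standard library (the document below is an invented exchange format, not a real MBDB schema):

```python
import xml.etree.ElementTree as ET

# A small, semi-structured document; path expressions navigate it
# without any prior schema knowledge beyond the tag names used.
doc = ET.fromstring("""
<entries>
  <protein ac="P00001"><name>putative kinase</name></protein>
  <protein ac="P00002"><name>unknown protein</name></protein>
</entries>""")

# Path query: the names of all protein elements.
for name in doc.findall("./protein/name"):
    print(name.text)

# Predicate on an attribute: the entry with a given accession.
print(doc.find("./protein[@ac='P00001']/name").text)
```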

Today, researchers in the biosciences rely mostly on string matching and link analysis
for database searching; computer functionality (operations relevant to different data types) in
MBDBs is very limited. However, if computers were able to understand the semantic
implications not only of data but also of queries, then more functionality and accuracy might
be added to database searching operations. XML provides a general framework for such
tasks. Although not a descriptive language, XML provides a representational framework for
semantic values to be introduced in the description of content.

XML promotes interoperability between agents that agree on some standards. Conversely, disparity in vocabulary undermines the potential use of XML in database integration. Application of ontology, however, could provide a complete and conceptually adequate way of describing the semantics of XML documents. By deriving document type definitions (DTDs) from an ontology, document structure can be grounded on a true semantic basis, making XML documents an adequate input for semantically based processing [38]. In this way, conceptual terms can be used to retrieve facts.

An example of a software development making extensive use of XML is LabBook (http://www.labbook.com). LabBook is a genomic XML viewer and browser based on BSML™ (Bioinformatics Sequence Markup Language; http://www.bsml.org), an open XML data representation and interchange format. Both the viewer and the browser provide an intuitive graphical environment for visualising genomic information.

XML has been used extensively in bioinformatics as an exchange format, but complete XML-based integrative solutions have not yet been developed. We believe that XML should be understood as a powerful data model, since it allows flexible definition of sets of tags as well as hierarchical nesting of tags. BioXML came about as an effort to develop standard biological XML schemata and DTDs. It was intended to become a central repository, part of the Open Bioinformatics Foundation (http://www.open-bio.org); however, the BioXML project appears to be inactive. BioMOBY inherited some of the desired features of BioXML;
within MOBY, lightweight XML documents comprise a set of descriptor objects that are passed from MOBY Central to the clients.

6.5 GRAPHICAL USER INTERFACES (GUIS) AS INTEGRATIVE ENVIRONMENTS

Querying databases is only one part of the research process in molecular biosciences.
Once relevant data have been retrieved, analysis must be undertaken. Very often the process
is not clear in advance, and the user must iteratively query, retrieve, analyse and compare until
the desired endpoint is attained. These steps are most easily carried out in an integrated
environment within which the functionality of MBDBs is brought together with appropriate
analysis tools, allowing the user to specify and carry out computational experiments, record
intermediate and final data, and annotate experiments.

Analysis tools in molecular bioscience are likewise heterogeneous, and may typically include remote webtools, locally installed executables, and scripts in, for example, Perl, Python and/or SQL. Even within, and especially among, application fields, interoperability tends to be limited or nonexistent. Instead, the output of one program must usually be reformatted for input into the next; this piping is mostly done using purpose-written Perl parsers, of which there is no central library or listing.

The GCG® [51] and EMBOSS [52] suites are two well-known sequence analysis
software packages that group many methods commonly used in molecular biology. They are
both command-line driven, which requires users to have at least a basic familiarity with
UNIX® command-line syntax. Therefore, in recent years some groups have developed GUI
systems [53, 54] that suppress the syntactic complexity of UNIX® commands, thereby
promoting the coordinated use of the programs in these packages. A list of some of the
existing GUIs for EMBOSS [52] and GCG® is given in Table 2.

| Feature | Jemboss | W2H | Pise | Perl2EMBOSS | SeWeR |
| --- | --- | --- | --- | --- | --- |
| Program that the GUI is for | EMBOSS | EMBOSS / GCG® | EMBOSS | EMBOSS | General GUI for web resources |
| Scripting language used | Java™ | Perl | XML/Perl | Perl | HTML, JavaScript |
| Functionality / remarks | Provides GUI, has some data management utilities | Provides data management tools. Highly functional, new resources can be added | Highly functional, highly customisable. Easy to add new methods | Provides GUI. No data management utilities | GUI allows access to different analytical tools; not GCG®/EMBOSS focused. Limited integration with Pise is possible |
| Process automation | NA | W3H; limited capability | Perl API | NA | NA |

API = application program interface; NA = not available.
Chapter 6 - Table 2. Some of the most commonly used Graphical User Interfaces (GUIs) for EMBOSS and GCG®
Pise [54] is a particularly important example. Its applicability is based around Perl modules generated from the XML description of a targeted program. Each module contains all the information needed to generate the corresponding HTML form and CGI script. Pise has some scripting and macro capabilities, and supports the specification of pipelines. An example of a Pise macro is ClustalW - DNADIST - NEIGHBOR: the output of one program (ClustalW) becomes the input of another (DNADIST, from the PHYLIP package), and the output from DNADIST is in turn passed to and used as the input file for the NEIGHBOR program.
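
The dataflow behind such a macro is simply the chaining of processes, as in the Python sketch below; sort and uniq stand in for ClustalW, DNADIST and NEIGHBOR so that the example runs anywhere POSIX tools are present:

```python
import subprocess

# Each step's output becomes the next step's input, exactly the piping
# that Pise macros describe declaratively.
with open("step0.txt", "w") as f:
    f.write("b\na\na\n")

pipeline = [["sort", "step0.txt"], ["uniq"]]

data = None
for cmd in pipeline:
    result = subprocess.run(cmd, input=data, capture_output=True,
                            text=True, check=True)
    data = result.stdout  # piped into the next step
print(data, end="")       # a, then b
```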

The graph of all possible paths may be seen at the Pise website at http://www-alt.pasteur.fr/~letondal/Pise/gensoft-map.html. Pise provides two different ways by which a
macro can be customised: either a form (supplied within Pise) can be filled out with parameters supplied by the user, or the user can save an entire procedure as a macro. The user is presented with a ‘Register the whole procedure’ button that builds the scripts, and allows users to repeat their actions [54]. G-PIPE (http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html) is a development built on top of Pise. It provides an automatic workflow generator using Pise as a GUI framework. The workflow descriptions are stored as XML documents that can later be loaded and run across different G-PIPE servers. This automation is possible thanks to a set of Perl modules that check the syntactic consistency of the different files, in order to evaluate them as possible input files for the different steps in a given workflow.

W2H [53] is one of the oldest and most powerful GUIs in bioinformatics. In a sense, it
has evolved from a GUI into an environment, as it provides not only GUI capacities but also
some functionality for file handling. W2H was developed making extensive use of the
metadata files that describe the applications available in the former GCG® package. In W2H,
these files are used to generate on-the-fly HTML documents with forms for entering values
of command-line parameters [53]. W2H embodies a classical tool-oriented approach;
combinations of tools were not initially supported [55]. W2H now provides some problem-
oriented tools (a task framework), allowing users to define data workflows. For this, W2H
again makes use of metadata, as well as descriptions of workflow and dataflow. Workflow in
this context refers to the sequence of tasks (methods, programs) that are part of a user’s
analysis chain. Dataflow is basically the parsing of one output into the subsequent task.

Using the existing W2H, the dataflow description is used by the web interface to
dynamically create HTML input forms for the task data. With the given metadata, the web
interface can collect input from the user, determine if all minimum requirements are fulfilled,
and provide the data to the task system. The name given to this task framework is W3H [55].
W3H reduces the amount of programming skill required; however, the definition of tasks is not an automatic process. Sharing of tasks with other GUI environments is not possible under W3H. This feature was considered from the beginning in the design of
Pise/G-PIPE, where the whole task or workflow can be exported as an XML file (which may later be loaded into the same system as Pise/G-PIPE) or as a Perl script that can be customised by the user. Both W3H and Pise make extensive use of BioPerl. W3H is immersed within the HUSAR (Heidelberg Unix Sequence Analysis Resources) environment, and has pre-built parsers that enable connectivity with different datasets available on the local HUSAR installation of SRS, GeneCards® and other databases/facilities.

PATH (Phylogenetic Analysis Task in HUSAR) was developed within the framework provided by W3H [56]. Dependencies among applications, descriptions of program flow, and merging of individual outputs (or parts thereof) into a common output report are provided to the system. cDNA2Genome [57] is another task developed under the W3H framework. It allows high-throughput mapping and characterisation of cDNAs. cDNA2Genome combines three main categories of analysis: database homology searches, gene finders and sequence feature predictors (e.g. start/stop codons, open reading frames). cDNA2Genome is available at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar/.

Perl2EMBOSS (http://bioinfo.pbi.nrc.ca/~lukem/EMBOSS/) is an example of a simple Perl-based GUI to EMBOSS. Unlike W2H and Pise, Perl2EMBOSS cannot be used as it stands for GCG® or any other package, and does not provide task definition capacities or an application program interface (API). Perl2EMBOSS is easy to install and provides forms (Perl scripts) for the building of input files. As its source code is well documented and well structured, this GUI is easy to administer and simple for the user. It may be considered a ‘light’ solution.

SeWeR [58] (SEquence analysis using WEb Resources) is a GUI for a different scenario from that of EMBOSS or GCG®. It was designed to make extensive use of JavaScript and dynamic HTML (DHTML), and thus provides a very lightweight solution. It presents a
uniform interface to most common services in bioinformatics, including polymerase chain
reaction-related analyses, sequence alignment, database searching, protein structure prediction
and sequence assembly. It provides for several levels of customisation to the interface and is
highly amenable for batch processing and automation.

Jemboss [59] is yet another GUI for use with EMBOSS. In this case, a web launch tool (Java Web Start) must be installed on the client’s computer. The user is presented with an intuitive window that gives access to his or her assigned area on the server on which EMBOSS is running. Jemboss uses SOAP (Simple Object Access Protocol; http://www.w3.org/TR/soap/), reducing security risks by allowing the user to access the EMBOSS application as a client. The display area gives the user complete control over the environment; analyses are run on the defined EMBOSS server. Via a job manager, it is possible to run and monitor batch processes [59].

All of the GUIs described above provide graphical access to a specific set of analysis tools; they do not provide integration with retrieval systems such as GeneCards® or SRS. The coded functionality available in W3H in this respect is very limited; sequences are identified and users can get intermediate access to databank entries. However, this limited integration is not enough, as even simple operations, such as automatic presentation of analysis options over a set of previously identified sequences, are not available. Such integration (context menus embedded within the GUI) would define an environment within which query capacities and analytical tools coexist in a single, unified working area. The selection of one of these GUIs over another depends entirely on the problem at hand. All provide in essence the same features, and source code is available for each. The definition of analysis pipelines remains limited among these solutions.

6.6 METABOLIC PATHWAY DATABASES AS AN EXAMPLE OF INTEGRATION

A pathway can be defined as a linked set of biochemical reactions, such that the product of one reaction is a reactant of, or an enzyme that catalyses, a subsequent reaction [60]. An MPDB is a bioinformatics development that describes biochemical pathways and their reactions, components, associated experimental conditions and related relationships. To the extent that it is sufficiently comprehensive, an MPDB can be seen as a description of an
organism at the metabolic level. In the same way, other aspects of an organism can be
described in gene regulation databases, protein-protein interaction databases, signal
transduction databases and so forth.

The techniques used for building metabolic pathways range from manual analysis to
automated computational methods. The resulting databases differ in the types of information
they contain and in the software tools they make available for queries, visualisation and
analysis [42]. Quality is assured by following combinations of manual and automatic curation
processes. This is the case for BRENDA® [61] (the Braunschweig Enzyme Database); this manually curated database contains information for all molecules that have been
assigned an Enzyme Commission (EC) number. By querying the database it is possible to
retrieve information about an enzyme for all the organisms in which it is present.
BRENDA® is rich in literature references; these are parsed for relevant key phrases directly
from PubMed and are then associated with the corresponding enzymes.

Another example of an MPDB is the KEGG [62] (Kyoto Encyclopedia of Genes and
Genomes) pathway database. This database aims to link genomic information with higher
order functional information by computerisation of current knowledge on cellular processes
and by standardising gene annotations [63]. Within KEGG, genomic information is stored in
the GENES database (a collection of gene catalogues), while higher order functional
information is stored in the PATHWAY database. The WIT [64] (What Is There) database is another example of an MPDB. WIT has been designed to extract functional content from genome sequences and organise it into a coherent system. It supports comparative analysis of sequenced genomes, and generates metabolic reconstructions based on chromosomal sequences and metabolic modules from the Enzymes and Metabolic Pathways Database (EMP)/Metabolic Pathways Database (MPW) family of databases. WIT provides a set of tools for the characterisation of gene structures and functions. After genes have been assigned initial functions, they are then ‘attached’ to pathways by choosing templates from the metabolic database (MPW) that best incorporate all observed functions. When this basic model has been created, a (human) curator evaluates this model against biochemical data and
phenotypes known from the literature. Textual and graphical representations are fully linked with the underlying data.

The Pathway Tools software [65, 66] constitutes an environment for creating a
metabolic pathway database for a given organism or genome. Pathway Tools has three
components: PathoLogic, which facilitates the creation of new pathway/genome databases
from GenBank® entries; Pathway/Genome Navigator, for query, visualisation and analysis;
and Pathway/Genome Editor, which provides interactive editing capabilities. Some of the computationally derived pathway/genome databases available today are AgroCyc (Agrobacterium tumefaciens; http://biocyc.org/AGRO/organism-summary?object=AGRO), MpneuCyc (Mycoplasma pneumoniae; http://biocyc.org/MPNEU/organism-summary?object=MPNEU) and HumanCyc (Homo sapiens; http://humancyc.org/); a more detailed list can be found at http://www.biocyc.org.

Pathway data are becoming more abundant as a consequence of genomic sequencing, the spread of high-throughput analytical methods and the growth of systems biology. There is therefore a need to organise these data in a rational and formalised way (i.e. to model our knowledge of metabolic data). The first step necessarily relates to storage and recovery of information. The complexity of this type of data, and in particular the fact that some information is held in the relationships between biological entities rather than in the entities themselves, complicates their selection and recovery. Further complication is added by the need to model higher-level interactions; the model needs to be clearly delimited in advance. Moreover, our knowledge is often incomplete: elements may be missing, or pathways may be different or totally unknown in a newly sequenced organism (or indeed in unknown organisms, e.g. from environmental genomics projects or surveys). Constructing databases that can provide inference mechanisms to assist the discovery process, based on incomplete information of this sort, remains a significant challenge.

Chapter 6 - Figure 2. Valine biosynthetic pathway in Escherichia coli, illustrating the relationships among biological data types (reproduced from the EcoCyc website [68] with permission). The position of the transcription start site of ilvC has more recently been modified to reflect an updated version of the E. coli genome sequence, U00096.2.

Ideally, MPDBs should integrate information about the genome and the metabolic networks of a particular organism. The metabolic network can be described in terms of four bio-object types: the pathways that compose the network, the reactions that compose the pathways, metabolic compounds, and the enzymes that catalyse the reactions [42]. Literature citations are typically provided for most of the information units. However, this information is very often incomplete; extraction of information from GenBank® and PubMed in order to assist systematic annotation of gene functions is not a trivial process. Figure 2 exemplifies the relationships among these four biological data types.

6.7 SUMMARY, CONCLUSIONS AND UNSOLVED PROBLEMS

In this review we have argued that the integration of information in molecular bioscience (and, by extension, in other technical fields) is a deeper issue than access to a particular type of data (sequences, structures) or record (GenBank® accession number). It requires technical (IT-level) integration among heterogeneous, probably federated, data sources, for which platforms such as Kleisli and DiscoveryLink® have been designed. For specific problems, point solutions such as SRS and GeneCards® provide sufficient connectivity to deliver a product to the end user. Graphical interfaces and related developments such as W2H and Pise support the use of specific tools, sometimes in combinations or pipelines. TAMBIS adds a conceptual model, a flexible logic set and an ontological frame of reference. XML is currently a formatting standard, but has the potential to be used in deeper ways. Metabolic pathway databases exemplify the role of higher-order relationships as biologically relevant information. Ontologies have been used as conceptual models for query formulation in TAMBIS; myGrid makes more extensive use of existing ontology, not only as a unification factor but also as a means to establish automatic discovery of bioinformatics services.

Semantic and syntactic issues are equally important, but defining the boundaries between them is not a simple task. Projects such as myGrid and BioMOBY require
well-defined ontologies in order to support automatic discovery of services in bioinformatics. Capturing the semantics of biological data types is difficult, in part because they are highly contextual, but also because of the lack of expressivity of knowledge representation languages such as the OWL Web Ontology Language (http://www.w3.org/TR/owl-features). There is as yet no solution to the old, common question “What can I do with this piece of information?”, since, for a given data type, there may be more than a hundred services capable of dealing with it. Which one to use, and how to define workflows over it, depend on the context. By providing accurate descriptions, and with more detailed knowledge of the business logic behind research operations, semi-autonomous agents with some intelligence may increasingly assist users in defining programmatic structures for their in silico experiments. Ontologies in bioinformatics have focused on descriptions of ‘things’, but very few of them actually describe research processes; moreover, no ontologies describe the relationship between the processes for studying the ‘thing’ and the thing itself. Such ontologies may facilitate the implementation of the vision of the semantic web in bioinformatics.

Information in the molecular biosciences is fragmented and dispersed, yet highly interconnected, semantically rich and very contextual. Queries on these data can be highly conceptual, and the context is user-dependent and often highly subjective. All of these factors make data representation a substantial challenge. It is not obvious that concepts of, for example, retrieval efficiency developed in domains such as commerce or finance can be usefully applied to data in molecular bioscience. We are heading towards a semantic web in bioinformatics, but the role of ontologies and agent technologies in practical implementations has not yet been properly defined. The vision of the semantic web in bioinformatics remains fragmentary; current technology is far from providing real semantic capabilities even in domains such as word processing. For two words such as ‘purpose’ and ‘propose’, Microsoft® Word (for example) advises on syntactic issues, but no guidance is given about the context of the words. Semantic issues in bioinformatics workflows are more complex, and it is not clear whether existing technology can effectively overcome these problems.

A relevant discussion within the bioinformatics community has addressed the disjunction between semi-structured and fully structured data. Because of the nature of scientific data, and the processes the community undertakes to produce them, systems relying on semi-structured frameworks seem to be the better option. XML then emerges as a clear option, because it provides mechanisms for self-description of the information it represents. By providing semantically valid tags, XML documents make it easy to identify functional operations depending on the data type. Certain analytical methods may be automatically identified and presented to users directly from the XML file, and identification of processes relevant to the different data types may be encompassed within automatic discovery of services. Ontologically grounded XML schemata, and complete XML-based solutions, are not yet available within this community.

A sequence or structure with no descriptors can be only an isolated and probably meaningless unit of information. It is through annotation that we capture the way in which the role of a biological entity is to be understood in a given context. Complex queries go beyond the simple fact of mounting SQL queries over heterogeneous data sources; instead, they involve context-dependent concepts and relations. Annotation plays a key role in representing and modeling this information precisely because it captures the way in which that role is to be understood in a given context. Representing how genes and proteins relate to the cell cycle, cell death, diseases, health status or, more generally, to any type of biological process requires not only functional annotation per se, but also the integration of relevant literature annotations. At this point, querying MBDBs, and mining or categorising literature, are distinct operations; no software tools are openly available that combine the two steps (e.g. by allowing users to retrieve relevant literature for a given sequence or structure query beyond that which happens to be cited within the entries retrieved).

Finally, analytical tools have not yet been fully integrated with indexing, data and workflow management systems. As discussed in Section 6.5, GUIs are available for a wide variety of implementations of diverse analytical methods, but we are far from having access to a unified, platform-independent analytical environment. One bioinformatics company, LION
bioscience, has taken some steps in this direction with its SRS version 6.0. We continue to believe, however, that real information integration in molecular bioscience requires a unified analytical and data-handling environment for users. A relevant analogy is the diversity of operations available for a particular data type within the Windows® operating system environment: all of the possible operations are presented to the user via a contextual menu displayed when the user right-clicks on the icon of interest. In the same way, operations over biological data types should be identified in advance, presented to the user and then executed; currently all these operations are done either by coding them or by copy/paste procedures. Simplification of coding operations should be enabled within GUI frameworks (e.g. direct manipulation interfaces). We think that concepts from projects such as Haystack [67] should be more carefully considered in bioinformatics.

Automation of data handling and knowledge extraction, along with tools that support the interpretation of extracted knowledge, are likewise not yet available to bioinformaticians. Such a set of tools should support the selection and planning of ‘wet’ experiments in the laboratory. Computer models are increasingly used to complement laboratory experiments, and tools that extract and integrate knowledge would be powerful adjuncts to these models at all stages of their implementation and use. Biological knowledge is spread not only over many databases but also (and in a more complicated way) across thousands of papers, patents and technical reports. It is in these latter documents that facts are described in the context in which the underlying biological entities have been studied; real integration, therefore, should consider conceptual queries over fully integrated views of relevant data sources.

Ideally, future biological information systems (BISs) will require neither frequent (and difficult) data and software updates nor local data integration (warehouses); they should allow semantically based data integration through ontologies (improving data integration) and should support monitoring of the evolution of information sources. Future BISs should also allow each researcher to ask questions within the context of his or her own problem domain (and between domains), unconstrained by local or external data repositories. They should proactively inform the user about new, relevant information, based on individual needs, and
support collaboration by matching researchers who have relevant expertise and/or interests.
Achieving this level of integration – as data in the molecular biosciences continue inexorably
to increase and diversify – will continue to provide challenges on many levels.

6.8 ACKNOWLEDGMENTS

We thank Dr. Limsoon Wong and the reviewers for extremely helpful suggestions.
Financial support for ARC Discovery Project DP0342987 and the ARC Centre in
Bioinformatics CE0348221 is acknowledged.

The authors have no conflicts of interest that are directly relevant to the content of this
review.

6.9 REFERENCES

1. Sirotkin, K., NCBI: Integrated Data for Molecular Biology Research. 1999, Norwell,
MA: Kluwer Academic Publishers.

2. Karp, P.D. and S. Paley, Integrated access to metabolic and genomic data. Journal
of Computational Biology, 1996. 3(1): p. 191-212.

3. Keen, G., et al., The Genome Sequence DataBase (GSDB): Meeting the challenge
of genomic sequencing. Nucleic Acids Research, 1996. 24(1): p. 13-16.

4. Benson, D.A., et al., GenBank. Nucleic Acids Research, 1997. 25(1): p. 1-6.

5. Brooksbank, C., et al., The European Bioinformatics Institute's data resources. Nucleic Acids Research, 2003. 31(1): p. 43-50.

6. Bairoch, A. and R. Apweiler, The SWISS-PROT protein sequence database: Its relevance to human molecular medical research. Journal of Molecular Medicine, 1997. 75(5): p. 312-316.

7. Hubbard, T., et al., The Ensembl genome database project. Nucleic Acids Research, 2002. 30(1): p. 38-41.

8. Spellman, P., et al., Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology, 2002. 3(9): p. 0046.1-0046.9.

9. Lord, P., et al. PRECIS: An automated pipeline for producing concise reports about proteins. in IEEE International Symposium on Bio-informatics and Biomedical Engineering. 2001. Washington: IEEE Press.

10. Gaasterland, T. and C.W. Sensen, MAGPIE: Automated genome interpretation. Trends in Genetics, 1996. 12(2): p. 76-78.

11. Etzold, T. and P. Argos, Transforming a Set of Biological Flat File Libraries to a Fast
Access Network. Computer Applications in the Biosciences, 1993. 9(1): p. 59-64.

12. Zdobnov, E.M., et al., The EBI SRS server - recent developments. Bioinformatics,
2002. 18(2): p. 368-373.

13. Davidson, S., et al., BioKleisli: A digital library for biomedical researchers. International Journal of Digital Libraries, 1997. 1: p. 36-53.

14. Wong, L., Kleisli, a Functional Query System. Journal of Functional Programming,
2000. 10(1): p. 19-56.

15. Haas, L., et al., DiscoveryLink: A system for integrated access to life sciences data
sources. IBM Systems Journal, 2001. 40: p. 489-511.

16. Davidson, S.B., et al., K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal, 2001. 40(2): p. 512-531.

17. Rebhan, M., et al., GeneCards: a novel functional genomics compendium with
automated data mining and query reformulation support. Bioinformatics, 1998.
14(8): p. 656-664.

18. Stein, L.D., Integrating biological databases. Nature Reviews Genetics, 2003. 4(5): p.
337-345.

19. Lacroix, Z., Biological Data Integration: Wrapping Data and Tools. IEEE Transactions on Information Technology in Biomedicine, 2002. 6(2): p. 123-128.

20. Stoesser, G., et al., The EMBL Nucleotide Sequence Database: major new
developments. Nucleic Acids Research, 2003. 31(1): p. 17-22.

21. Sigrist, C., et al., PROSITE: A documented database using patterns and profiles as
motif descriptors. Briefings in Bioinformatics, 2002. 3: p. 265-274.

22. Chenna, R., SIR: a simple indexing and retrieval system for biological flat file
databases. Bioinformatics, 2001. 17(8): p. 756-758.

23. Macauley, J., H.J. Wang, and N. Goodman, A model system for studying the
integration of molecular biology databases. Bioinformatics, 1998. 14(7): p. 575-582.

24. Tatusova, T.A., I. Karsch-Mizrachi, and J.A. Ostell, Complete genomes in WWW
Entrez: data representation and analysis. Bioinformatics, 1999. 15(7-8): p. 536-543.

25. Lewis, S.E., et al., Apollo: a sequence annotation editor. Genome Biology, 2002.
3(12): p. 1-14.

26. Willkinson, M. and M. Links, BioMOBY: An Open Source Biological Web Services
Proposal. Briefings in Bioinformatics, 2002. 3: p. 331 - 341.

27. Stevens, R., J. Robinson, and C. Goble, myGrid: personalised bioinformatics on the
information grid. Bioinformatics, 2003. 19: p. 302 - 304.

28. Kashyap, V. and A. Sheth, Semantic similarities between objects in multiple


databases. 1999, San Francisco: Morgan Kaufmann.

29. Benson, D.A., et al., GenBank. Nucleic Acids Research, 1999. 27(1): p. 12-17.

30. Attwood, T.K. and C.J. Miller, Which craft is best in bioinformatics? Computers &
Chemistry, 2001. 25(4): p. 329-339.

31. Okazaki, Y., et al., Analysis of the mouse transcriptome based on functional
annotation of 60,770 full-length cDNAs. Nature, 2002. 420(6915): p. 563-573.

32. Guarino, N. Some Ontological Principles for Designing Upper Level Lexical
Resources. in the First International Conference on Language Resources and
Evaluation. 1998. Granada, Spain.

33. Gruber, T.R., A Translation Approach to Portable Ontology Specifications.


Knowledge Acquisition, 1993. 5(2): p. 199-220.

189
H H
F-XC A N GE F-XC A N GE
PD PD

!
W

W
O

O
N

N
y

y
bu

bu
to

to
k

k
lic

lic
C

C
w

w
m

m
w w
w

w
o

o
.d o .c .d o .c
c u-tr a c k c u-tr a c k

34. Schulze-Kremer, S. OntoCell: An ontology of Cellular Biology. in Third Pacific


Symposium on Biocomputing. 1998: AAAI Press.

35. Stevens, R., bio-ontology page, in Bio-ontology, R. Stevens, Editor. 2005,


http://www.cs.man.ac.uk/~stevensr/ontology.html: Manchester, UK.

36. Blaschke, C., L. Hirschman, and A. Valencia, Information extraction in Molecular


Biology. Briefings in Bioinformatics, 2002. 3: p. 154-165.

37. Friedman, N.N. and C.D. Hafner, The State of the Art in Ontology Design: A Survey
and Comparative Review. AI Magazine, 1997. 18: p. 53-74.

38. Erdmann, M. and R. Studer. Ontologies as Conceptual Models for XML Documents.
in 12th Workshop on Knowledge Acquisition, Modeling and Management (KAW-99).
1999. Banff, Canada.

39. Köhler, J. and S. Schulze-Kremer, The Semantic Metadatabase (SEMEDA): Ontology


Based Integration of Federated Molecular Biological Data Sources. In Silico
Biology, 2002. 2: p. 0021.

40. Ashburner, M., et al., Gene Ontology: tool for the unification of biology. Nature
Genetics, 2000. 25(1): p. 25-29.

41. Yeh, I., et al., Knowledge acquisition, consistency checking and concurrency
control for Gene Ontology (GO). Bioinformatics, 2003. 19(2): p. 241-248.

42. Karp, P.D., EcoCyc:The Resource and the Lessons Learned. Bioinformatics
Databases and Systems, 1999: p. 47-62.

43. Stevens, R., et al., TAMBIS: Transparent access to multiple bioinformatics


information sources. Bioinformatics, 2000. 16(2): p. 184-185.

44. Baker, P.G., et al. TAMBIS: Transparent Access to Multiple Bioinformatics


Information Sources. in Sixth Conference on Intelligent Systems for Molecular Biology,
ISMB98. 1998. Montreal, Canada: ISMB.

45. Wiederhold, G., Integration of Knowledge and Data Representation. IIIE Computers,
1992. 21: p. 38-50.

46. Paton, N.W., et al. Query Processing in the TAMBIS Bioinformatics Source
Integration System. in 1th Int. Conf. on Scientific and Statistical Database
Management (SSDBM). 1999: IEEE Press.

190
H H
F-XC A N GE F-XC A N GE
PD PD

!
W

W
O

O
N

N
y

y
bu

bu
to

to
k

k
lic

lic
C

C
w

w
m

m
w w
w

w
o

o
.d o .c .d o .c
c u-tr a c k c u-tr a c k

47. Klein TE, Chang JT, Cho MK, et al. Integrating genotype and phenotype
information: an overview of the PharmGKB project. Pharmacogenomics J 2001;
1:167-70

48. Rubin DL, Farhad S, Oliver DE, et al. Representing genetic sequence data for
pharmacogenomics: an evolutionary approach using ontological and relational
models. Bioinformatics 2002; 18: 207-15

49. Wong, L., Technologies for Integrating Biological Data. Briefings in Bioinformatics,
2002. 3(4): p. 389-404.

50. Bry, F. and P. Kröger, A Computational Biology Database Digest: Data, Data
Analysis, and Data Management. International Journal of Distributed and Parallel
Databases, 2003. 13: p. 7 - 42.

51. Devereux, J., P. Haeberli, and O. Smithies, A Comprehensive Set of Sequence-


Analysis Programs for the Vax. Nucleic Acids Research, 1984. 12(1): p. 387-395.

52. Rice, P., I. Longden, and A. Bleasby, EMBOSS: The European molecular biology
open software suite. Trends in Genetics, 2000. 16(6): p. 276-277.

53. Senger, M., et al., W2H: WWW interface to the GCG sequence analysis package.
Bioinformatics, 1998. 14(5): p. 452-457.

54. Letondal, C., A Web interface generator for molecular biology programs in Unix.
Bioinformatics, 2001. 17(1): p. 73-82.

55. Ernst, P., K.H. Glatting, and S. Suhai, A task framework for the web interface W2H.
Bioinformatics, 2003. 19(2): p. 278-282.

56. Del Val, C., et al., PATH: a task for the inference of phylogenies. Bioinformatics,
2002. 18(4): p. 646-647.

57. Del Val, C., K.H. Glatting, and S. Suhai, cDNA2Genome: A tool for mapping and
annotating cDNAs. BMC Bioinformatics, 2003. 4: p. 39.

58. Malay, K.B., SeWeR: a customizable and integrated dynamic HTML interface to
bioinformatics services. Bioinformatics, 2001. 17: p. 577-578.

59. Carver, T.J. and L.J. Mullan, Website update: A new graphical user interface to
EMBOSS. Comparative and Functional Genomics, 2002. 3(1): p. 75-78.

191
H H
F-XC A N GE F-XC A N GE
PD PD

!
W

W
O

O
N

N
y

y
bu

bu
to

to
k

k
lic

lic
C

C
w

w
m

m
w w
w

w
o

o
.d o .c .d o .c
c u-tr a c k c u-tr a c k

60. Karp, P.D., Pathway databases: A case study in computational symbolic theories.
Science, 2001. 293(5537): p. 2040-2044.

61. Schomburg, I., A. Chang, and D. Schomburg, BRENDA, enzyme data and metabolic
information. Nucleic Acids Research, 2002. 30(1): p. 47-49.

62. Ogata, H., et al., KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids
Research, 1999. 27(1): p. 29-34.

63. Kanehisa, M. and S. Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes.
Nucleic Acids Research, 2000. 28(1): p. 27-30.

64. Overbeek, R., et al., WIT: integrated system for high-throughput genome sequence
analysis and metabolic reconstruction. Nucleic Acids Research, 2000. 28(1): p. 123-
125.

65. Karp, P.D., et al., The MetaCyc database. Nucleic Acids Research, 2002. 30(1): p. 59-
61.

66. Karp, P.D., S. Paley, and P. Romero, The Pathway Tools Software. Bioinformatics,
2002. 18: p. 225-232.

67. Quan, D., D. Huynh, and D.R. Karger. Haystack: A Platform for Authoring End User
Semantic Web Applications. in 2nd International Semantic Web Conference. 2003.
Sanibel Island, Florida: Springer-Verlag, Heidelberg.

68. EcoCyc. E. coli K-12 pathway: valine biosynthesis [online]. Available from URL:
http://biocyc.org/ECOLI/new-image?type=PATHWAY&object=VALSYN PWY [Accessed
2005 Sep 30]

192
H H
F-XC A N GE F-XC A N GE
PD PD

!
W

W
O

O
N

N
y

y
bu

bu
to

to
k

k
lic

lic
C

C
w

w
m

m
w w
w

w
o

o
.d o .c .d o .c
c u-tr a c k c u-tr a c k

Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator

In the previous chapter it was argued that a wider application of ontologies
and Semantic Web technologies in bioinformatics would make it possible to overcome some issues
in integrating information. Workflows are identified as a fundamental component when
integrating information in molecular biosciences, as researchers need to interleave
information access and algorithm execution in a problem-specific workflow. Within this
problem-specific workflow there are syntactic issues as well as semantic ones. Allowing the
concrete execution of the workflow is a syntactic problem; describing this in silico experiment
is, however, a semantic one, for which an ontology with characteristics similar to those presented
in chapter 5, section 5.2, is required. Having well-defined syntax and
semantics not only eases some technical aspects, but also allows for better reusability of the
workflow in a larger context: a community of users.

This chapter addresses the problem of workflows in bioinformatics, more specifically
supporting workflows for the Pasteur Institute Software Environment (PISE). Both syntactic
and semantic aspects are investigated. From this meta-analysis, syntactic structures and algebraic
operators common to many workflows in bioinformatics were identified. The workflow
components and algebraic operators can be assimilated into re-usable software components.
Semantic issues were also investigated; the MGED-RSBI ontology was adapted for this
specific set of biological investigations. Other semantic aspects of developing workflow
systems were explored. G-PIPE, a prototype implementation of this framework, provides a
GUI builder to facilitate the generation of workflows and integration of heterogeneous
analytical tools.

Alex Garcia was responsible for the conceptualisation, initial investigation and
finalisation of the research described in this chapter. Alex Garcia conceived the workflow
generator, graphical user interface, and semantic structures. He also participated in the
development of the tool, and wrote the corresponding papers.

AUTHORS' CONTRIBUTIONS

Alex Garcia Castro was responsible for design and conceptualisation, took part in
implementation, and wrote a first draft of the manuscript. Samuel Thoraval was the main
developer of G-PIPE. Leyla Jael Garcia Castro assisted with server issues and FCA. Mark A.
Ragan supervised the project and participated in writing the manuscript.

PUBLISHED PAPERS ARISING FROM THIS CHAPTER

Garcia Castro A, Thoraval S, Garcia LJ, Chen Y-PP, Ragan MA: Bioinformatics
workflows: G-PIPE as an implementation. In: Network Tools and Applications in Biology
(NETTAB), 5-7 October 2005, Naples, Italy, pages 61-64

Garcia Castro A, Thoraval S, Garcia LJ, Ragan MA: Workflows in bioinformatics:
meta-analysis and prototype implementation of a workflow generator. BMC Bioinformatics 2005, 6:87.


7 Chapter VII - Workflows in bioinformatics: meta-analysis and prototype implementation of a workflow generator

Abstract. Computational methods for problem solving need to interleave information access and algorithm
execution in a problem-specific workflow. The structures of these workflows are defined by a scaffold of
syntactic, semantic and algebraic objects capable of representing them. Despite the proliferation of GUIs
(Graphical User Interfaces) in bioinformatics, only some of them provide workflow capabilities; surprisingly, no
meta-analysis of workflow operators and components in bioinformatics has been reported. We present a set
of syntactic components and algebraic operators capable of representing analytical workflows in
bioinformatics. Iteration, recursion, the use of conditional statements, and management of suspend/resume
tasks have traditionally been implemented on an ad hoc basis and hard-coded; by having these operators
properly defined it is possible to use and parameterise them as generic re-usable components. To illustrate
how these operations can be orchestrated, we present G-PIPE, a prototype graphic pipeline generator for
PISE that allows the definition of a pipeline, parameterisation of its component methods, and storage of
metadata in XML formats. This implementation goes beyond the macro capacities currently in PISE. As the
entire analysis protocol is defined in XML, a complete bioinformatics experiment (linked sets of methods,
parameters and results) can be reproduced or shared among users. Availability:
http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html (interactive),
ftp://ftp.pasteur.fr/pub/GenSoft/unix/misc/Pise/ (download). From our meta-analysis we have identified
syntactic structures and algebraic operators common to many workflows in bioinformatics. The workflow
components and algebraic operators can be assimilated into re-usable software components. G-PIPE, a
prototype implementation of this framework, provides a GUI builder to facilitate the generation of workflows
and integration of heterogeneous analytical tools.

7.1 BACKGROUND

Computational approaches to problem solving need to interleave information access


and algorithm execution in a problem-specific workflow. In complex domains like molecular
biosciences, workflows usually involve iterative steps of querying, analysis and optimisation.
Bioinformatics experiments are often workflows; they link analytical methods that typically
accept an input file, compute a result, and present an output file. Most tool-driven integration
approaches have so far addressed the problem of providing a single GUI for a set of
analytical methods. Combining methods into a flexible framework is usually not considered.
Analytical workflows provide a path to discovering information beyond the capacities of simple
query statements, but are much harder to implement within a common environment.


Workflow management systems (WFMS) are basically systems that control the
sequence of activities in a given process [1]. In molecular bioscience, these activities can be
divided among those that address query formulation, and those that focus more on analysis.
At this abstract level, WFMS could serve to control the execution of both query and analytical
procedures. All of these procedures involve the execution of activities, some of them manual,
some automatic. Dependency relationships among them can be complex, making the
synchronisation of their execution a difficult problem.

One dimension of the complexity of workflows in molecular biosciences is given by the
various transformations performed on the data. Syntactic (operational) interoperability
establishes the possibility for data to be piped from one method into another. Semantic issues
(another dimension) arise from the fact that we need to separate domain knowledge from
operational knowledge. We should be able to describe a task of configuring a workflow from
its primary components according to a required specification, and implement a program that
realises this configuration independently of the workflow and components themselves.

Biologists provide rich descriptions of their experiments (materials and methods) so
they can be easily replicated. Once techniques have been standardised, usually this knowledge
is encapsulated in the form of an analytical protocol. With in silico experiments as well,
analytical protocols make it possible for experiments to be replicated and shared, and (via
meta-information) for the knowledge behind these workflows to be captured. These
protocols should be reproducible, ontology-driven, internally accurate, and annotated
externally.

Systems such as W2H/W3H [2] and PISE [3] provide some tools that allow methods
to be combined. W3H is a task framework that allows the methods available under W2H [4]
to be integrated; however, those tasks have to be hardcoded. In the case of PISE, the user can
either define a macro using Bioperl http://www.bioperl.org, or use the interface provided and
register the resulting macro. In either case, it is assumed that the user can program, or script
in Perl. Macros cannot be exchanged between PISE and W2H, although these two systems
provide GUIs for more or less the same set of methods (EMBOSS [5]). Indeed, macros
cannot be easily shared even among PISE users. Biopipe http://www.biopipe.org, on the
other hand, provides integration for some analytical tools via the Bioperl API (Application
Programming Interface), using MySQL to store results as well as the workflow definition; in
this way, users can store results in MySQL and monitor the execution of the pre-defined
workflow.

The TAVERNA project http://taverna.sourceforge.net provides capabilities similar to
those offered by G-PIPE. However, inclusion of new analytical methods is not currently
possible since no GUI generator is provided; moreover, as TAVERNA is part of myGrid [6],
it follows a different integrative approach (web services). Pegasys [7] is a similar approach,
going beyond analytical requirements and providing database capacities.

G-PIPE provides a real capacity for users to define and share complete analytical
workflows (methods, parameters, and meta-information), substantially mitigating the syntactic
complexity that this process involves. Our approach addresses overall collaborative issues as
well as the physical integration of tools. Unlike TAVERNA, G-PIPE provides an
implementation that builds on a flexible syntactic structure and a set of algebraic operations
for analytical workflows. The definition of operators as part of the workflow description
allows a flexible set-up when executing it; operators also facilitate the reproducibility of the
workflow as they allow researchers to share experimental conditions in the form of
parameters.

Although G-PIPE was not conceived as an environment for testing usability aspects in
the design of bioinformatics tools, empirical observations allowed us to see how the
disposition of the functional objects in the interface (e.g. interfaces to algorithms and the
workflow representation) was simpler and easier for researchers than in the one provided by
TAVERNA. An important issue that was raised from these observations was the high-level of
complexity involved in the parameterisation, as researchers usually run algorithms with default
settings. Unlike G-PIPE, TAVERNA assumes users have an understanding of web services,
part of the necessary steps when defining a workflow in TAVERNA involves the selection of
the algorithm as a web service. Another interesting aspect we could observe was the
importance of having a tool in which fewer steps were involved in the definition and
execution of the workflow. TAVERNA requires too many details and involves too many
steps when defining and executing the workflow; some of the required information is
technical and thus more related to the operational domain than to the domain of knowledge;
this places unnecessary stress on the researcher. Surprisingly, there are no
usability methods for bioinformatics, nor are usability studies performed throughout the
software development process in bioinformatics; the application of usability engineering
could potentially benefit the development of bioinformatics tools by bringing them closer to
the needs of end-users. The facility provided by G-PIPE for the generation of the workflow
aims to hide the complexity of the workflow by allowing researchers to concentrate on the
minimal necessary procedural details of the workflow (e.g. input files, parameters, where to
pipe).

For testing purposes we provide a simple example of a workflow (inference of a


phylogeny of rodents) that involves piping among three methods. Although here their
execution takes place on a common server, it is equally possible to distribute the process over
a grid using G-PIPE. The workflow definition and the corresponding input files
are available in Appendix 5.

7.2 RESULTS

Our workflow follows a task-flow model; in bioinformatics, tasks can be understood as
analytical methods. If workflow models are represented as a directed acyclic graph (DAG),
analytical methods then appear as nodes, and state information is represented as conditions
attached to the edges. Our syntactic structure and algebraic operators can be used to represent
a large number of analytical workflows in bioinformatics; surprisingly, there are no other
algebraic operators reported in the literature capable of symbolising the different operations
required for analytical workflows in bioinformatics (or, indeed, more broadly in e-science,
although they are widely used in the analysis of business processes). Different groups have
developed a great diversity of GUIs for EMBOSS and GCG, but a meta-analysis of the
processes within which these analytical implementations are immersed is not yet fully
available. Some of the existing GUIs have been developed to make use of grammatical
descriptions of the analytical methods, but there exists no standard meta-data framework for
GUI and workflow representation in bioinformatics.
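
To make the task-flow model concrete, the sketch below encodes a small workflow as a DAG in plain Perl: transformers appear as nodes, and each edge carries a condition evaluated on the upstream output before the next method is executed. The method names and conditions are hypothetical placeholders, not G-PIPE code.

#!/usr/bin/perl
# Illustrative sketch only: a task-flow DAG in plain Perl. The method
# names and the edge conditions are hypothetical placeholders.
use strict;
use warnings;

my %transformer = (                    # nodes: analytical methods
    clustalw => sub { "alignment($_[0])" },
    protpars => sub { "trees($_[0])" },
    consense => sub { "consensus($_[0])" },
);

my %edges = (                          # edges carry state conditions
    clustalw => [ { to => 'protpars', when => sub { defined $_[0] } } ],
    protpars => [ { to => 'consense', when => sub { 1 } } ],
    consense => [],
);

# Walk the DAG from a start node, piping each output into the next stage.
sub run_from {
    my ($node, $input) = @_;
    my $output = $transformer{$node}->($input);
    print "$node -> $output\n";
    for my $edge ( @{ $edges{$node} } ) {
        run_from( $edge->{to}, $output ) if $edge->{when}->($output);
    }
}

run_from( 'clustalw', 'rodent sequences' );
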

Chapter 7 - Figure 1. Syntactic components describing bioinformatics analysis workflows.

7.2.1 Syntactic and algebraic components

Our workflow conceptualisation (Figure 1) closely follows those of Lei and Singh [8]
and Stevens et al. [9]. We have adapted these meta-models to processes in bioinformatics
analysis. We consider an input/output data object as a collection of input/output data. For us
a transformer is the atomic work item in a workflow. In analytical workflows, it is an
implementation of an analytical algorithm (analytical method). A pipe component is the entity
that contains the required input-output relation (e.g. information about the previous and
subsequent tasks); it assures syntactic coherence. Our workflow representation has tasks,
stages, and experimental conditions (parameters). In our view, protocols are sets of
information that describe an experiment. A protocol contains workflows, annotations, and
information about the raw data; therefore we understand a workflow to be a group of stages
with interdependencies. It is a process bound to a particular resource that fulfils the analytical
requirements.

199
H H
F-XC A N GE F-XC A N GE
PD PD

!
W

W
O

O
N

N
y

y
bu

bu
to

to
k

k
lic

lic
C

C
w

w
m

m
w w
w

w
o

o
.d o .c .d o .c
c u-tr a c k c u-tr a c k

We identify needs common to analytical workflows in bioinformatics:

• Flexibility in structuring and modelling (open-ended, sometimes ad hoc workflow
definition, allowing decision-making whilst a workflow is being executed).
• Support for workflows with a complex (or nested) inner structure of individual steps
(such that multi-level modelling becomes appropriate). Biological workflows may be
complex not simply because of the discrete number of steps, but due to the highly
nested structure of iteration, recursion and conditional statements that, moreover,
may involve interaction with non-workflow systems.
• Distribution of workflow execution over grid environments.
• Management of failures. This particular requirement is related to conditional
statements: where the service will be executed should be evaluated based on
considerations of availability and efficiency made prior to the execution of the
workflow. In situations where a failure halts the process, the system should either
recover it, or dispatch it somewhere else without requiring intervention by the user.
• System functionality features such as browsing and visualisation, documentation, or
coupling with external tools, e.g. for analysis.
• A semantic layer for collaborative purposes. This semantic layer has many other
features, and may be the foundation for intelligent agents that facilitate collaborative
research.

Chapter 7 - Figure 2. Syntactic components and algebraic operators.


Operator: Iteration (I): I[Transformer](CC1, CC2, …, CCn)
Operator: Recursion (R): R[Transformer: Parameter](Parm_Space)
Operator: Condition (C): C[Functional Condition (true: PATH; false: PATH; value: PATH)]
Operator: Suspension/Resumption (S): S[re-take jobs: Execution]

Chapter 7 - Table 1. Algebraic operators.
Executing these bioinformatics workflows further requires:

• Support for long-running activities with or without user interaction.


• Application-dependent correctness criteria for execution of individual and
concurrent workflows.
• Integration with other systems (e.g. file managers, database management systems,
product data managers) that have their own execution/correctness requirements.
• Reliability and recoverability with respect to data.
• Reliable communication between workflow components and processing entities.
Among these types of requirements, we focus our analysis only on those closely related
to workflow design issues, more specifically (a) the piping of data, (b) the availability of
conditional statements, (c) the need to iterate one method over a set of different inputs, (d)
the possibility of recursion over a parameter space for a method, and (e) the need for
stop/break management. Algebraic operators can accurately capture the meaning of these
functional requirements. To describe an analytical workflow, it is necessary to consider both
algebraic operators and syntactic components. In Table 2 we present the definition of those
algebraic operators we propose and in Figure 2 we illustrate how these operators and syntactic
elements together can describe an analytical workflow.

Iteration is the operator that enables processes in which one transformer is applied
over multiple inputs. A special case for this operator occurs when it is applied over a
blank transformer; this case results in replicates of the input collection. Consider an analytical
method, or a workflow, in which the same input is to be used several times; the first step
would be to use as many replicates of the input collection as needed. The recursion operation
takes place when one transformer is applied with parameters defined not as a single value, but
as a range or as a set of values. The conditional operator has to do with the conditioned
execution of transformers. This operation can be attached to a function evaluated over the
application of a recursion, or of an iteration; if the stated condition is true, then the workflow
executes a certain path. Conditional statements may also be applied to cases where an
argument is evaluated on the input; the result affects not a path, but the parameter space of
the next stage. The suspension/resumption operation stands for the capacity of the workflow
to stop and re-capture the jobs.

Formal Concept Analysis (FCA) is a mathematical theory based on ordered sets and
complete lattices. Numerous investigations have shown the usefulness of concept lattices for
information retrieval combining query and navigation, learning and data-mining, visual
constructors and visual programming [10]. FCA helps one to define valid objects, and identify
behaviours for them. We are currently working on a complete FCA for biological data types
and operations (database and analytical). Here we define operators in terms of pre- and post-
conditions, as a step toward eventual logical formalisation. We focus on those components of
the discovery process not directly related to database operations; a good integration system
will "hide" the underlying heterogeneity, so that one can query using a simple language (which
views all data as if they are already in the same memory space). Selection of the query
language depends only on the data model. For the XML "data model", XML-QL, XQL, and
other XML query languages are available. For the nested relational model there are nested
relational calculi and nested relational algebras. For the relational model SQL, relational
algebras and so on are available. For database operations, the issues that arise are lower-level
(e.g. expression of disk layout, latency cost, etc. in the context of query optimisation), and it is
not clear that any particular algebra offers a significant advantage.
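
As a minimal illustration of the FCA machinery referred to here, the sketch below implements the derivation operator over a toy formal context of biological data types and the operations they admit; the context itself is a made-up example, not the FCA under development.

#!/usr/bin/perl
# Toy sketch of the FCA derivation operator: for a set of objects,
# compute the intent, i.e. the attributes they all share. The formal
# context below is a made-up example.
use strict;
use warnings;

my %context = (                # object => { attribute => 1 }
    dna_sequence     => { align => 1, translate => 1 },
    protein_sequence => { align => 1, scan_motifs => 1 },
);

# derive(@objects) returns the attributes common to all given objects.
sub derive {
    my @objects = @_;
    my %count;
    for my $obj (@objects) {
        $count{$_}++ for keys %{ $context{$obj} };
    }
    return grep { $count{$_} == @objects } sort keys %count;
}

print join( ', ', derive('dna_sequence', 'protein_sequence') ), "\n";  # align
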

A more-detailed example involves the inference of molecular phylogenetic trees by
executing software that implements three main phylogenetic inference methods: distance,
parsimony and maximum likelihood. Figure 3 illustrates how our algebraic operators and
syntactic components define the structure of this workflow.


Operator: Iteration (I):
I[Transformer, (CC1, CC2, …, CCn)]: (CC1’, CC2’, …, CCn’)
Pre-condition:
T = Transformer, T ≠ blank
C = {CC1, CC2, …, CCn} such that CCi ∈ {Biological data types}
Post-condition:
C’ = {CC1’, CC2’, …, CCn’} such that CCi’ = T(CCi), 1 ≤ i ≤ n

Operator: Iteration (I):
I[blank: num, (CC1, CC2, …, CCn)]: (CC1, CC2, …, CCn)1, (CC1, CC2, …, CCn)2, …, (CC1, CC2, …, CCn)num
Pre-condition:
num ∈ ℕ, num = number of replicates
C = {CC1, CC2, …, CCn} such that CCi ∈ {Biological data types}
Post-condition:
(C)j = C for 1 ≤ j ≤ num, i.e. num replicates of the input collection

Operator: Recursion (R):
R[Transformer: Parameter, (Parm_Space)]: Parm_Space’
Pre-condition:
P = Parameter such that P ∈ Parm_Space (Parm_Space = {Parm_Values})
T = Transformer
Post-condition:
Parm_Space’ = T(Parm_Space)

Operator: Condition (C):
C[Functional_Condition]: PATH
Pre-condition:
FC = Functional_Condition
Post-condition:
PATH = the path associated with the outcome of FC (true: PATH, false: PATH or value: PATH)

Operator: Suspension/Resumption (S):
S[re-take, jobs]: Execution
Pre-condition:
(re-take = true) ∨ (re-take = false ∧ jobs = set of jobs which should be suspended)
Post-condition:
(re-take = true ∧ ((Execution = true ∧ previously suspended jobs are re-taken) ∨ (Execution = true ∧ there were no suspended jobs))) ∨ (re-take = false ∧ (Execution = true ∧ ∀ j such that j ∈ jobs, j is suspended))

Chapter 7 - Table 2. Operator specifications.
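
To show how these specifications translate into executable form, the following sketch renders the iteration, recursion and condition operators as higher-order Perl functions; the transformers and the gap-penalty parameter are hypothetical stand-ins, not G-PIPE source.

#!/usr/bin/perl
# Sketch of three operators as higher-order functions; the transformers
# and the gap-penalty parameter below are hypothetical stand-ins.
use strict;
use warnings;

# Iteration: I[T](CC1, ..., CCn) applies one transformer over a
# collection of inputs and returns the transformed collection.
sub iterate {
    my ($transformer, @collection) = @_;
    return map { $transformer->($_) } @collection;
}

# Recursion: R[T: P](Parm_Space) fixes the input and applies the
# transformer once per point of the parameter space.
sub recurse {
    my ($transformer, $input, @parm_space) = @_;
    return map { $transformer->($input, $_) } @parm_space;
}

# Condition: C[FC] selects the path to follow from the value of the
# functional condition.
sub condition {
    my ($fc, $true_path, $false_path) = @_;
    return $fc->() ? $true_path : $false_path;
}

my @upper = iterate( sub { uc $_[0] }, qw(acgt acct aggt) );
print "@upper\n";                        # ACGT ACCT AGGT

my @sweep = recurse( sub { "align($_[0], gap=$_[1])" }, 'acgt', 1 .. 3 );
print "@sweep\n";                        # one run per gap-penalty value

print condition( sub { @upper == 3 }, 'continue', 'halt' ), "\n";
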


Chapter 7 - Figure 3. Phylogenetic analysis workflow


In collaboration with CIAT (International Center for Tropical Agriculture, Cali,
Colombia) we have implemented an annotation workflow using standard technology (G-
PIPE/PISE) and web services (TAVERNA). Our case workflow is detailed in Figure 4.

Implementation of both of these workflows was a manual process. GUI generation was
facilitated by using PISE as our GUI generator, and this simplified the inclusion of new
analytical methods as needed. Database calls had to be manually coded in both cases.
Choreographing the execution of the workflow was not simple, as neither has a real workflow
engine. It proved easier to give users the ability to manipulate parameters and data with
PISE/G-PIPE, partly due to the wider range of methods within BioPerl and partly because algebraic
operators were readily available as part of PISE/G-PIPE. From this experience we have
concluded that, due to the immaturity of current available web service engines, it is still most
practical to implement simple XML workflows that allow users to manipulate parameters, use
conditional operators, and carry out write and read operations over databases. This balance
will, of course, presumably shift as web services mature in the bioinformatics applications
domain.

Chapter 7 - Figure 4. Case workflow

7.2.2 Workflow generation, an implementation

We have developed G-PIPE, a flexible workflow generator for PISE. G-PIPE extends
the capabilities of PISE to allow the creation and sharing of customised, reusable
analytical workflows. So far we have implemented and tested G-PIPE over only the
EMBOSS package, although extension to other algorithmic implementations is possible
where there is an XML file describing the command-line user interface.

Workflows automate business procedures in which information or tasks are passed
between conforming entities according to a defined set of rules; some of these business rules
are defined by the user, and in our implementation are managed via G-PIPE. For our
purposes, the conforming entities are analytical methods (Clustal, Protpars, etc.). Syntactic
rules drive the interaction between these entities (e.g. to ensure syntactic coherence between
heterogeneous file formats). G-PIPE also assures the execution of the workflow, and makes it
possible to distribute different jobs over a grid of servers. G-PIPE addresses these
requirements using mostly Bioperl.

In G-PIPE, each analysis protocol (including any annotations, i.e. meta-data) is defined
within an XML file. A Java applet provides the user with an exploratory tool for browsing and
displaying methods and protocols. Synchronisation is maintained between client-side display
and server-side storage using Javascript. Server-side persistence is maintained through
serialised Perl objects that manage the workflow execution. G-PIPE supports independent
branched tasks in parallel, and reports errors and results into an HTML file. The user selects
the methods, sets parameters, defines the chaining of different methods, and selects the
server(s) on which these will be executed. G-PIPE creates an XML file and a Perl script, each
of which describes the experiment. The Perl file may later be used on a command-line basis,
and customised to address specific needs. The user can monitor the status of workflow
execution, and access intermediary results. A workflow built with G-PIPE can distribute its
analyses onto different, geographically dispersed G-PIPE/PISE servers.
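
A hypothetical fragment of such a protocol file might look as follows; the element and attribute names are illustrative only, since the actual G-PIPE schema is given in Appendix 5.

<!-- Hypothetical sketch of a G-PIPE-style protocol; the element and
     attribute names are illustrative, not the actual schema. -->
<protocol name="rodent_phylogeny">
  <annotation>Infer a phylogeny of rodents from aligned sequences</annotation>
  <stage id="1" method="clustalw" server="serverA">
    <param name="infile">rodents.fasta</param>
  </stage>
  <stage id="2" method="protpars" server="serverB">
    <pipe from="1"/>
  </stage>
</protocol>
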

7.3 ARCHITECTURAL DETAILS

The overall architecture of G-PIPE is shown in Figure 5. A Java applet provides the
user with an exploratory tool for browsing and displaying methods and protocols. The user
interacts with the HTML forms to define a protocol. Synchronisation is maintained between
client-side display and server-side storage using Javascript. Server-side persistency is
maintained through serialised Perl objects that describe the experiment. The object is
translated into two user-accessible files: an XML file to share and reload protocols, and a Perl
script. A new lightweight PISE/Bioperl module, PiseWorkflow, allows workflows to be built
and run atop PiseApplication instances. This module supports independent branched tasks in
parallel, and reports errors and results into an HTML file.
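
Server-side persistence through serialised Perl objects can be pictured with the core Storable module, as in the minimal sketch below; the state layout is hypothetical and not the actual PiseWorkflow internals.

#!/usr/bin/perl
# Minimal sketch of persistence via serialised Perl objects, using the
# core Storable module; the state layout here is hypothetical.
use strict;
use warnings;
use Storable qw(store retrieve);

my $state = {
    protocol => 'rodent_phylogeny',
    stages   => [ 'clustalw', 'protpars', 'consense' ],
    current  => 1,                        # index of the running stage
};

store $state, 'gpipe_session.stor';       # persist between HTTP requests

my $restored = retrieve 'gpipe_session.stor';   # reload on the next request
print "resuming at stage $restored->{stages}[$restored->{current}]\n";
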


Chapter 7 - Figure 5. G-PIPE Architecture.

7.4 SEMANTIC AND SYNTACTIC ISSUES

The representation of the workflow, as illustrated in Appendix 5, is grounded in both
semantic and syntactic elements. The semantics encoded within the proposed XML makes it
easier for developers to understand the meaning of each element in the XML file. This
chapter proposes a set of valid workflow constructs for bioinformatics. These are technically
valid not only because they have been identified as common to many bioinformatics
workflows, but also because they are structured in an XML file with a well-encoded
semantics. This facilitates the incorporation of the constructs into larger efforts such as
RSBI (see chapter 5, section 2).

Within the context of a biological investigation, for which bioinformatics might
facilitate the design of SNP (Single Nucleotide Polymorphism) primers, there is an associated
computational workflow, as illustrated in Figure 6. In this case, syntactic elements have been
defined by a clear semantics that allows developers to manipulate the constructs depending
on the needs of the application; thus there is a semantic scaffold from which syntactic
aspects make sense. For the case of G-PIPE it is enough to allow users to manipulate
parameters, transformers, pipe components, and data collections. However, when annotating complete
biological investigations, the design of SNPs, or any computational method involved, is just a
small part of a larger effort. For these cases the annotation is not only
about the identified constructs; the workflow is part of the whole, so the workflow
constructs have to be annotated within the new context.

Chapter 7 - Figure 6. Designing SNPs.


The RSBI ontology, in principle, allows this integration. The collection component as
understood by Garcia et al [1], and discussed in section 7.2, can be assimilated to the
concept of Biomaterial. The transformer can be assimilated to the assay. Figure 7 illustrates
how, for a particular segment of the workflow presented in Figure 6, the RSBI ontology
together with the workflow constructs represents the use of TBLAST in a meaningful way. It
is important to notice that the larger the effort, the more complex the annotation. Biomaterial
is an elusive concept: for every assay, be it computational, in vivo, or in vitro, there is the
potential to fragment or even mutate (transform) the biomaterial; however, there is always
the need to trace the sample back to its original source, allowing researchers to inspect the
process at different levels of detail.
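
The mapping just described can be made concrete with a small sketch; the hash layout below is hypothetical, echoing only the correspondence collection → Biomaterial and transformer → assay for the TBLAST stage of Figure 6.

#!/usr/bin/perl
# Hypothetical sketch of annotating one workflow stage with RSBI-style
# terms; only the collection -> Biomaterial and transformer -> Assay
# correspondences come from the text, the rest is illustrative.
use strict;
use warnings;

my %rsbi_for = ( collection => 'Biomaterial', transformer => 'Assay' );

my %stage = (
    transformer => 'tblast',             # the TBLAST step of Figure 6
    collection  => 'candidate_contigs',  # its input data collection
);

for my $construct ( sort keys %stage ) {
    print "$stage{$construct} is annotated as an instance of ",
          "$rsbi_for{$construct}\n";
}
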

Chapter 7 - Figure 7. Mapping the RSBI


7.5 DISCUSSION

The syntactic and algebraic components we introduce above make it possible to
describe analytical workflows in bioinformatics precisely yet flexibly. Detailed algebraic
representations for these kinds of processes have not previously been used in this domain,
although they are commonly used to represent business processes. Since open projects such
as Bioperl or Biopipe contain the rules and logic for bioinformatics tasks, we believe that
having an algebraic representation could contribute importantly to the development of a
biological "language" that allows developers to avoid the tedious parsing of data and
analytical methods so common in bioinformatics. The schematic representation for
workflows in bioinformatics that we present here could evolve to cover other tool-driven
integrative approaches such as those based on web services. Workflows in which concrete
executions take place over a grid of web services involve basically the same syntactic structure
and algebraic operators; however, a clear business logic needs to be defined beforehand for
those web services in order to deepen the integration beyond simply the fact of remote
execution. A higher level of sophistication for the pipe component as well as for the
conditional operator may be needed, since remote execution requires (for example)
assessment and availability of the service for the job to be successfully dispatched and
processed. For our implementation we use two agents, one on the client side and the other on
the server side, with the queue handled by PBS (Portable Batch System). It is possible to add a
semantic layer, thereby allowing conceptual selection of the transformers; clear separation
between the operational domain and the knowledge domain would then be achieved
naturally.

Semantic issues are particularly important with these kinds of workflows. An example
may be derived from Figure 3, where three different phylogenetic analysis workflows are
executed. These may be grouped as equivalent, but are syntactically different. Selection should
be left in the hands of the user, but the system should at least inform the user of this similarity.


Despite agreement on the importance of semantic layers for integrative systems, such a
level of sophistication is far from being achieved. Lack of awareness of the practical
applications of such technologies is well illustrated with a traditional and well-studied product:
Microsoft Word®. With Word, syntactic verification can take place as the user composes text,
but no semantic corroboration is done. For two words like "purpose" and "propose", Word
advises on syntactic issues, but gives no guidance concerning the context of the words.
Semantic issues in bioinformatics workflows are more complex, and it is not clear if existing
technologies can effectively overcome these problems.

Transformers and grid components are intrinsically related because the services are de
facto linked to a grid component. It has been demonstrated that the use of ontologies
facilitates interoperability and the deployment of software agents [11]; correspondingly, we
envision semantically supported agents forming the foundation of future
workflow systems in bioinformatics. The semantic layer should make the agents more aware
of the information.

More and more GUIs are available in bioinformatics; this can be seen in the number of
GUIs for EMBOSS and GCG alone. Some of them incorporate a degree of workflow
capability, more typically a simple chaining of analytical methods rather than flexible
workflow operations. A unified metadata model for GUI generation is lacking in the
bioinformatics domain. Web services are relatively easy to implement, and are becoming
increasingly available as GUI systems are published as web services. However, web services
were initially developed to support processes for which the business logic is widely agreed
upon, well-defined and properly structured, and the extension of this paradigm to
bioinformatics may not be straightforward.

Automatic service discovery is an intrinsic feature of web services. The accuracy of the
discovery process necessarily depends on the ontology supporting this service. Systems such
as BioMoby and TAVERNA make extensive use of service discovery; however, due to the
difficulty in describing biological data types, service discovery is not yet accurate. It is not yet
clear whether languages such as OWL can be developed to describe relations between
biological concepts with the required accuracy. Integrating information is as much a syntactic
as a semantic problem, and in bioinformatics these boundaries are particularly ill defined.

Semantic and syntactic problems were also identified from the case workflow described
in Figure 3. There, we saw that to support the extraction of meaningful information and its
presentation to the user, formats should be ontology-based and machine-readable, e.g. in
XML format. Lack of these functional features makes manipulation of the output a difficult
task that is usually addressed by use of parsers specific to each individual case. For workflow
development, human readability can be just as important. Consider, for example, a ClustalW
output where valid elements could be identified by the machine and presented to the user
together with contextual menus including different options over the different data types. In
this way the user would be able to decide what to do next, where to split a workflow, and
over which part of the output to continue or extend the analysis. Inclusion of this
functionality would allow the workflow to become more concretely defined as it is used.

Failure management is an area in which we can see a clear difference between the
business world and bioinformatics. In the former, processes rarely take longer than an hour
and are not so computationally intensive, whereas in bioinformatics, processes tend to be
computationally intensive and may take weeks or months to complete. How failures can be
managed to minimise losses will clearly differ between the two domains. Due to the
immaturity of both web services and workflows in bioinformatics, it is still in most cases
more practical to hard-code analytical processes. Improved failure management is one of the
domain-specific challenges that face the application of workflows in bioinformatics.


Chapter 7 - Figure 8. G-PIPE.


http://if-web1.imb.uq.edu.au/Pise/5.a/gpipe.html
http://gene3.ciat.cgiar.org/Pise/5.a/gpipe.html
So far we have intentionally referred to GUIs and workflows as more-or-less
independent. A glimpse into the corresponding metadata reveals that GUIs are themselves
components of workflow systems. In the bioinformatics domain this relationship is
particularly attractive, since algebraic operations are usually highly nested. The interface
system should therefore provide a programming environment for non-programmers. The
language as such is not complex, but makes extensive use of statements such as while...do,
if...then...else, and for...each. The representation should be natural to the researcher,
separating the knowledge domain from the operational domain.


7.6 CONCLUSION

We have developed G-PIPE, a flexible workflow generator that makes it possible to
export workflow definitions either as XML or Perl files (which can later be handled via the
Bioperl API). Our XML workflow representation is reusable; execution and editing of the
generated workflows is possible either via the BioPerl API or the provided GUI.
analysis is configurable, as users are presented with options to manipulate all available
parameters supported by the underlying algorithms. Integration of new algorithms, and Grid
execution of workflows, are also possible. Most available integrative environments rely on
parsers or syntactic objects, making it difficult to integrate new analytical methods into
workflow systems. We are planning to develop a more wide-ranging algebra that includes
query operations over biological databases as well as different ontological layers that facilitate
data interoperability and integration of information where possible for the user. We do not
envision G-PIPE to be a complete virtual laboratory environment; future releases will provide
a content management system for bioinformatics with workbench capacities developed on
top of ZOPE http://www.zope.org. We have tested our implementation over SUSE and
Debian Linux, and over Solaris 8.

7.7 ACKNOWLEDGEMENTS

We gratefully acknowledge the collaboration of Dr Fernando Rodrigues (CIAT) in
developing the case study outlined in Figure 4, and Dr Lindsay Hood (IMB) for valuable
discussions. ST thanks Université Montpellier II for travel support. This work was supported
by ARC grants DP0344488 and CE0348221.


7.8 REFERENCES

1. Hollingsworth D: The workflow reference model. [http://www.wfmc.org/standards/docs/tc003v11.pdf].

2. Ernst P, Glatting K-H, Suhai S: A task framework for the web interface W2H. Bioinformatics 2003, 19:278-282.

3. Letondal C: A Web interface generator for molecular biology programs in Unix. Bioinformatics 2001, 17:73-82.

4. Senger M, Flores T, Glatting K-H, Ernst P, Hotz-Wagenblatt A, Suhai S: W2H: WWW interface to the GCG sequence analysis package. Bioinformatics 1998, 14:452-457.

5. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16:276-277.

6. Stevens R, Robinson AJ, Goble C: myGrid: personalised bioinformatics on the information grid. Bioinformatics 2003, 19:i302-i304.

7. Shah SP, He DYM, Sawkins JN, Druce JC, Quon G, Lett D, Zheng GXY, Xu T, Ouellette BFF: Pegasys: software for executing and integrating analyses of biological sequences. BMC Bioinformatics 2004, 5:40.

8. Lei K, Singh M: A comparison of workflow meta-models. In Workshop on Behavioural Modelling and Design Transformations: Issues and Opportunities in Conceptual Modelling (ER'97), Los Angeles, 6-7 November 1997.

9. Stevens R, Goble C, Baker P, Brass A: A classification of tasks in bioinformatics. Bioinformatics 2001, 17:180-188.

10. Ganter B, Kuznetsov SO: Formalizing hypotheses with concepts. In 8th International Conference on Conceptual Structures, ICCS 2000: Logical, Linguistic, and Computational Issues, Darmstadt, Germany, 14-18 August 2000. Lecture Notes in Computer Science 1867. Edited by Mineau G, Ganter B. Springer-Verlag; 2000:342-356.

11. Sowa JF: Top-level ontological categories. International Journal of Human-Computer Studies 1995, 43:669-685.


8 Conclusions and Discussion

8.1 SUMMARY

This thesis has primarily dealt with how to actually build ontologies, and how to gain
the involvement of the community. This research presents a detailed description of three
different but complementary ontology developments, along with the issues surrounding
their development. As ontologies are living systems, constantly
evolving, the maintenance and life cycle of the ontology have also been investigated in order to
arrive at a consistent methodology. It has been largely accepted by the biological community that
ontologies play a prominent role when integrating information; however, very few studies have
focused on the relationship between the syntactic structure and the semantic scaffold. This
thesis has also explored this relationship.

The introductory chapters have investigated methodological aspects of building
ontologies; these have ranged from the role of the domain expert to the similarity between the
biological and the semantic web scenarios. Within this context the role of concept maps during
knowledge elicitation when building conceptual models has been established. Other aspects
related to the use of concept maps have also been reported (e.g. argumentative structure,
maintenance).

As integration of information has many facets, the present work has also covered the
workflow nature of bioinformatics. Within this context a syntactic structure was proposed in
order to allow in silico experiments to be replicable and reproducible. Also, and more
importantly, from this experience it was possible to study the relationship between syntax
and semantics. This research is based upon real cases in which researchers were involved; this
allowed the author to benefit from a direct relationship not only with the subject of study but
also with the context in which solutions were expected to play a role.


The discussion and conclusions are organised as follows: initially a summary of the
thesis is presented. In Sections 8.2 and 8.3 the similarity between the semantic web and that of
biology is illustrated within the context of information systems. Issues related to the
construction of biological ontologies are discussed in Section 8.4 and, finally, references are
given in Section 8.5.

8.2 BIOLOGICAL INFORMATION SYSTEMS AND ONTOLOGIES

In this investigation we have argued that the integration of information in molecular
bioscience (and, by extension, in other technical fields) is a deeper issue than access to a
particular type of data (sequences, structures) or record (GenBank® accession number).
Integration of information in bioinformatics has to support research endeavours in such a way
that it facilitates the formulation and testing of biological hypotheses. For instance, a biological
hypothesis may state that “genes from the Yellow Stripe Like (YSL) family may be used for the
fortification of rice grains as they are responsible for the uptake and long-distance
transportation of iron-chelates in rice” within a bio-fortification project. In this context not
only information about genes, proteins and metabolic pathways is needed. Researchers also
need to correlate all the information they have and can access through the Internet. In this
way a relationship between YSL genes and the concentration of iron in rice grains may be
found, and consequently tested in a laboratory. The connection between Laboratory
Information Management Systems (LIMS) and external information is thus critical.

LIMS are a special kind of biological information system as they in principle organise
the information produced by laboratories. Once this information has been organised the
analysis process takes place, and discovering relations becomes more and more important. Within
the plant context, plant-related descriptors such as those provided by Plant Ontology (PO) [1]
and Gramene [2] are being consumed by object models in a variety of software systems such
as LIMS in order to support annotation of sequences and experiments. These object models
are meant to support an integrative approach; the use of orthogonal ontologies is therefore
essential.

Relating phenotypic information to its corresponding genotypes and vice versa should,
in principle, be possible. For instance, with a saline stress related query one should be able to
retrieve not only sequences but experimental designs and conditions, location, morphological
features of the plants involved, etc. Common biological descriptors should be identified and
ontologies addressing specific needs for the plant community need to be developed.
Molecular information may be described independently from the domain, as with the
Gene Ontology (GO) [3]; phenotypic information, however, is highly specific to the type of
organism being described.

Ideally LIMSs should consume core and domain-specific terminology in order to allow
for the annotation of experiments; these vocabularies should be shared across the community
so exchanging information might be a simpler task. In order for information to be shared the
vocabulary used should be independent from the LIMS; different LIMS should be able to
share a standard vocabulary. This ensures the independence between both the conceptual and
the functional model – researchers may use different LIMS but still name things with a
consistent vocabulary. In the same vein, this may allow experiments to be shared in the form of
customisable “templates”.

Accurate annotation of experimental processes and their results with well-structured
ontologies allows for semantic integration and querying across disparate data sets and data
types. This sort of large-scale data integration can be achieved by the use of a data integration
engine based on graph theory. Furthermore, reasoning engines can be constructed to perform
automated reasoning over the data annotated with these types of ontologies. The result will be
a better understanding of the meaning of the results of a wide variety of experiments and an
increased ability to develop further hypotheses in silico [14].
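
A toy version of such an engine, in plain Python with invented record identifiers, makes the
idea concrete: annotated records become nodes, shared annotations become edges, and
answering a cross-dataset query amounts to a graph traversal:

    # Sketch: annotated records as a graph; a traversal answers a cross-dataset
    # query. All identifiers are invented for illustration.
    from collections import deque

    graph = {
        "gene:YSL2": ["term:iron-chelate transport", "pub:PMID-0000001"],
        "exp:grain-assay-7": ["term:iron-chelate transport", "term:rice grain"],
        "term:iron-chelate transport": ["gene:YSL2", "exp:grain-assay-7"],
        "term:rice grain": ["exp:grain-assay-7"],
        "pub:PMID-0000001": ["gene:YSL2"],
    }

    def related(start, max_hops=2):
        """Breadth-first traversal: everything reachable within max_hops edges."""
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, hops + 1))
        return seen - {start}

    # From a gene, reach the experiments and terms that contextualise it.
    print(related("gene:YSL2"))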

Some attempts have been made in order to define what an investigation is, what the
difference between a test and an assay is, how we can classify experiments and how to annotate
research endeavours in order to facilitate contextualised information retrieval. One of the first
ontologies addressing the problem of describing experiments was the MGED Ontology
(henceforth MO) [4]; it was developed as a collaborative effort by members of the MGED
Ontology working group in order to provide those descriptors required to interpret
microarray experiments. Although these concepts were derived from the MicroArray and
Gene Expression Object Model (MAGE-OM), which is a framework to represent gene
expression data and relevant annotations [5], in principle any software capable of consuming
the ontology can use these descriptors. There is thus a separation between the functional
and the declarative (ontological) models.

Throughout this thesis, the need to support integrative approaches with rich and useful
graphical environments has been clearly stated. Designing these environments is a research
topic not sufficiently studied within the context of bioinformatics. There have been very few
Human-Computer Interaction (HCI) evaluations of bioinformatics tools; moreover, HCI
and cognitive aspects are rarely considered when designing biological information systems.
Information foraging, which refers to activities associated with assessing, seeking, and
handling information sources [6], has also not been considered in bioinformatics. Such search
is adaptive to the extent that it makes optimal use of knowledge about the expected value of
the information, and the expected costs of accessing and extracting it. Humans seeking
information adopt different strategies when gathering information and extracting knowledge
from the results of their searches. A relationship between the user and the information is then
built. The relation is easy if the data are presented to the user in a clear way, if the information
provides extraction tools, and especially if the information is understandable, structured, and
immersed in the right context. Value and relevance are not intrinsic properties of
information-bearing representations, but can be assessed only in relation to the environment
in which the task is embedded.

Graphical User Interfaces (GUIs) should facilitate managing and accessing the
information. Graphical environments should relieve users of steep learning curves and of the
difficulties of command-line-based interfaces. There is a need to establish a clear
separation between the operational domain and the domain of knowledge. For a researcher,
finding a protein defined by some specific features along with all the relevant bibliographic
references should not be a daunting task. Integrative approaches should therefore be integral.
Ontologies may help in creating coherent visual environments, as has already been shown by
Stevens et al. with the TAMBIS [7] project.

8.3 TOWARDS A SEMANTIC WEB IN BIOLOGY

The field of bio-ontology development has been surprisingly active in recent years;
partly because of the premise that it will encourage and enable knowledge sharing and reuse,
but also because the biological community is gradually adopting a holistic approach for which
context is critical - a shifting paradigm, some would say. In order to achieve this "holistic view",
it is indispensable to develop ontologies that accurately describe the reality of the world.
Different groups will develop this ontological corpus –as is currently happening. Those
efforts already in place are independent from one another, and made in response mostly to ad
hoc necessities. Ironically for the biological community, we may be re-writing an old story:
database integration in molecular biology has long been a problem, partly due to the fact that
most approaches to data integration have been driven by necessity. By the same token
biological ontologies have been developed as a momentary response to a particular need. This
has led the bio-communities to describe their worlds from their particular perspective, not
taking into account that at a later stage these ontologies are needed to describe the “big
picture”. This approach also carries negative implications for the maintenance and evolution
of the ontologies.

This situation should change within the coming years, not only because of those lessons
learned, but also because the “big picture” will drive biology more and more, making it
necessary to have articulated descriptions by using well-harmonised ontologies. More
importantly, ontologies are being, slowly but firmly, separated from object models. This
independence should allow ontologies to be used across a wide range of applications. For
instance, any Laboratory Information Management System should be able to use the same
descriptors for those processes for which it was designed, thereby enabling data sharing and
to some extent knowledge sharing. Full experiments could then be easily replicated.
Ontologies should be independent from computer realisations.

Are we heading towards a semantic web (SW) in bioinformatics? It was Tim Berners-Lee
who initially presented a vision of a unified data source where, as a consequence of highly
integrated systems, complex queries could be formulated [8]. It has been a long time since this
vision was presented, and many different approaches have been developed in order to make it
operative; it is still hard to define what the semantic web really means. The SW may be seen as
a knowledge base where semantic layers allow reasoning and discovery of hidden relations,
contextualising the information, thereby delivering personalised services. In the development
of the semantic web there is, thus, a pivotal role for ontologies to play, since they provide a
representation of a shared conceptualisation of a particular domain that can be communicated
between people and applications.

In the particular field of bioinformatics, interoperability and integration of information
has been at issue since some of the first databanks started to be publicly accessible. Most
previous attempts in database integration have addressed the problem of querying, and
extracting data from, multiple heterogeneous data sources from the syntactic perspective. If
the process can be done via a single query, the data sources involved are considered
interoperable and integrated. These approaches do not consider how a particular biological
entity might be meaningfully related to others, how it is immersed in different processes, or
how it is related to relevant literature sources within the context of a given research; only
location and accessibility have been at issue.

In the same way, the complexity of a query would be largely a function of how many
different databases must be queried, and from these how many internal sub-queries must be
formed and exchanged, for the desired information to be extracted. If a deeper layer that
embeds semantic awareness were added, it is probable that query capacities would be
improved. This can be envisioned as the provision, within an ontological layer, of just enough
connective tissue to allow semi-intelligent agents or search engines to execute simplified
queries against hundreds of sites (McEntire, 2002). Not only could different data sources be
queried, but also (more importantly) interoperability would then arise naturally as a
consequence of semantic awareness. At the same time, it should be possible to automatically
identify and map the various entries that constitute the knowledge relationship, empowering
the user to visualise a more descriptive landscape. What is semantic integration of molecular
biology databases? What does it mean to have a semantic web for the biological domain?
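
To make the idea of an ontological "connective tissue" concrete, the following sketch shows
how a thin semantic layer could expand one query into the sub-queries each site actually
understands. The three-term hierarchy and the two stub data sources are invented for the
example:

    # Sketch: a thin ontological layer expanding a query before federating it.
    SUBCLASSES = {
        "abiotic stress": ["saline stress", "drought stress"],
        "saline stress": [],
        "drought stress": [],
    }

    def expand(term):
        """Return a term plus all of its descendants in the hierarchy."""
        terms = [term]
        for child in SUBCLASSES.get(term, []):
            terms.extend(expand(child))
        return terms

    def query_source(source, terms):
        """Stub for a per-site sub-query; a real system would call a service."""
        return [record for record in source if record["annotation"] in terms]

    sequences = [{"id": "seq1", "annotation": "saline stress"}]
    assays = [{"id": "assay9", "annotation": "drought stress"}]

    terms = expand("abiotic stress")
    print(query_source(sequences, terms) + query_source(assays, terms))

One query against the semantic layer thus yields semantically related answers from several
sources, which is where interoperability begins to arise naturally.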

Not so surprisingly, as has been already mentioned, as well as discussed in more detail
in Chapter 2, the biological community is heading towards a semantic web. However, this is not
new, as the biological community has been facing all the problems the syntactic web has always
had. The semantic web in biology poses an interesting, not so well-known, challenge to the
semantic web community: that of knowledge representation within communities of practices.
Representing and formalising knowledge for semantic web purposes has usually been studied
within closed, complete contexts - Amazon (http://www.amazon.com), insurance companies,
administrative environments for which a business logic is not only known in advance, but also
for which the communities are more prone to follow rules. The biological community is
different, and these idiosyncratic factors must be taken into account. Moreover, it is not clear
what constitutes
knowledge in a broad sense for the biological community. One could say that a database entry
may be considered to be data; however, as the database entry is annotated with meaningful
information that places it within a valid context for the researcher, the boundaries between
data, information and knowledge become difficult to see.

As we are heading towards a semantic web in bioinformatics it is important to have the
community fully involved. Policies from those consortia gathered to promote the
development of bio-ontologies should facilitate this engagement. These consortia, in close
collaboration with computer and cognitive scientists, should ideally also address the
technological component such an engagement has. A more insightful description of some of
the situations this lack of understanding has generated has been presented in Chapter 7.

Simple guidance and criteria, such as how best to define the class structure within the
bio-domain, may make a huge difference. Names should not matter much, as they will
proliferate; ontology classes, on the other hand, should offer a more enduring structure; ideally
the class structure should follow axes based on time and space, continuants and occurrents.
There is thus the need to disentangle the meanings from the names; in this way we may
achieve a modular, accurate description of the world, based on facts and evidence rather than
perceptions. Quoting Barry Smith (FuGO mailing list, http://fugo.sourceforge.net/lists/list.php):

“As Leibniz pointed out several centuries back in his criticism of John Locke's statement (roughly
summarised): Since names are arbitrary and our understanding of the world is based on the names we give to
things, our understanding of the world is arbitrary. Leibniz agreed names are arbitrary but our description of
the real world is based on our best effort to describe facts as we see them - not on names. It's these aspects of
Leibniz epistemology that have been used to great effect by the evo-devo researchers who have developed the
concept of modularity/complementarity when describing the constraints on evolution - to wit - the possibilities -
the search space in which evolution functions - are not limitless, but are in fact constrained by the limits of
POSSIBLE interactions amongst the many constituent entities.”
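
A minimal sketch of this discipline, separating enduring identifiers from proliferating names,
with continuant/occurrent as the upper axis; every identifier and label below is invented:

    # Sketch: classes keyed by opaque, stable identifiers; names are mutable
    # labels. Identifiers and labels are invented for illustration.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class OntologyClass:
        identifier: str                 # stable: never reused, never renamed
        parent: Optional[str]           # placement on the class axes
        labels: list = field(default_factory=list)

    classes = {
        "X:0000001": OntologyClass("X:0000001", None, ["continuant"]),
        "X:0000002": OntologyClass("X:0000002", None, ["occurrent"]),
        "X:0000010": OntologyClass("X:0000010", "X:0000001", ["seed"]),
        "X:0000020": OntologyClass("X:0000020", "X:0000002", ["germination"]),
    }

    def add_synonym(identifier, label):
        """Names may proliferate; no identifier changes, no annotation breaks."""
        classes[identifier].labels.append(label)

    add_synonym("X:0000010", "semilla")   # the meaning endures, the names vary
    print(classes["X:0000010"].labels)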

Evidence is difficult to gather and represent when developing bio-ontologies. Evidence
is often part of the discussion amongst domain experts; it is available as unstructured text, not
always related to a particular class or property. This makes it hard for knowledge engineers to
properly manage evidence. Ideally, an Integrated Community Development Environment for
Ontologies (ICDEO) should also offer support for maintaining the evolution of the
ontologies, and part of that task is to preserve evidence in a reusable way. Some of these
issues are addressed in Chapters 2, 3, 4 and 5.

As the need to have integral descriptions for research endeavours grows, so does the
effort to cope with such a task. FuGO [9] has started to address issues related to modularity
and ontology integration. Different groups such as FuGO within the functional genomics
context and the Generation Challenge Program (GCP) [10] within the plant world are
currently evaluating some existing standards for ontologies and metadata. Dublin Core [11],
SKOS [12], ISO/IEC [13] and others have been part of this assessment. Some guidance will
be available for bio-communities so there is a unified criterion to define classes, properties
and metadata in general. This is certainly a step in the right direction, but it is as yet too soon
to predict future outcomes.

8.4 DEVELOPING BIO-ONTOLOGIES AS A COMMUNITY EFFORT

This thesis has demonstrated the role of communities of domain experts when
developing ontologies; as ontologies imply the contribution and agreement of a community,
they may be understood as “social agreements” for describing the reality of a particular
domain of knowledge. The whole process resembles in many ways an exercise of participatory
design, and even more interestingly it follows the main precept of user-centric design that
states that designs should always focus on users' perceptions. Chapters 2 and 4 not only
present methodological aspects about building community ontologies, but also important
details of those processes in which these methods were applied.

Within the context of designing technology for biological researchers, what is the role
of the domain expert? An interesting parallel may be drawn from the field of designing
children’s technology. Three main methodologies have been applied: User Centric Design
(UCD) [14], Participatory Design (PD) [15], and Informant Design (ID) [16]. They all focus
on describing the kind of relation between children and designers, which affects the input
obtained. Interestingly, the relationships described by these authors, as well as the dynamics
that emerge from the relationship between the children and the designers, proved to be
applicable when designing technology for the biological community.

The UCD approach involves children in the design process as testers. This is the
traditional role of children as end-users of technology; where they are placed in a reacting role
in order to give feedback to designers about the product [17]. In this approach the designers
define what is suitable for children; they get to an advanced point in the design process
before getting input from the users.

The fundamental assumption in the PD approach is that users and designers can view
each other as equals. Both therefore take on active roles in the design [17]. Following the
same line of thinking, Druin and Solomon [18] have proposed to have children as part of the
design team, particularly suggesting metaphors for the designer, and sharing, in some way, in
responsibilities and decision making.

On the other hand, the ID perspective considers children’s input to play a fundamental
role in the design process, thus seeing children not just as testers of technology. The
participation of children in this process is defined according to the different phases of design
and their goals. This approach is placed somewhere between UCD and PD; children are
informants but cannot be considered as co-designers [17].

When developing technology within the biological domain, the predominant approach
has been to use the domain expert as an informant on requirements, as well as a tester of the
end product. Research to determine what the role of the domain expert should be when
developing his/her technology is therefore sorely needed, as the current approach has
proven not to be very successful. From our experiences, as reported in Chapters 2 and 4, the
constant input and interaction of the domain experts is crucial for the success of the
information system. Domain experts should be involved throughout the entire process. This
involvement is not only needed when developing ontologies within biological communities,
but also during software development. Participatory design is thus the most suitable
methodology as the control is shared by all of the design team members, and their research
agendas are open to changes and redefinitions. The position of the designers is that of
someone who is interested in knowing about domain experts, someone who is willing to re-
shape his/her own ideas. This perspective supports a closer relationship, where everyone is
learning. Designers working within an ID approach assume a position mediated by the goals
of the different stages of the process. The research agenda is defined according to the
informants' input across the process, hence in those stages where domain experts take part as
informants, the relationship resembles that promoted by PD; designers want to know
facts they do not know about domain experts. The "control" should therefore be shared
whenever possible; domain experts should "lead" the whole process by establishing an
egalitarian relationship.

8.5 REFERENCES

1. Jaiswal P, Avraham S, Ilic K, et al.: Plant Ontology (PO): a controlled vocabulary of plant
structures and growth stages. Comparative and Functional Genomics 2005, 6(7-8):388-397.

2. Jaiswal P, Ware D, Ni J, Chang K, Zhao W, Schmidt S, Pan X, Clark K, Teytelman L,
Cartinhour S, Stein L et al: Gramene: development and integration of trait and gene
ontologies for rice. Comparative and Functional Genomics 2002, 3(2):132-136.

3. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight
S, Eppig J et al: Gene ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nature Genetics 2000, 25(1):25-29.

4. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional
genomics experiments. Comparative and Functional Genomics 2003, 4:127-132.

5. Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock
G, Ball C, Lepage M, Swiatek M et al: Design and implementation of microarray gene
expression markup language (MAGE-ML). Genome Biology 2002.

6. Pirolli P, Card SK: Report on Information Foraging. Palo Alto: Palo Alto Research
Center; 2006.

7. Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A:
TAMBIS: Transparent access to multiple bioinformatics information sources.
Bioinformatics 2000, 16(2):184-185.

8. Berners-Lee T: Weaving the Web: HarperCollins; 1999.

9. Functional Genomics Investigation Ontology [http://fugo.sourceforge.net/]

10. The Generation Challenge Program [http://generationcp.org]

11. The Dublin Core [http://dublincore.org/]

12. WD-swbp-skos-core-guide [http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20050510/]

13. ISO/IEC Metadata Standards [http://metadata-standards.org/]

14. Norman D, Draper S: User centered system design: new perspectives on Human-
Computer Interaction. New Jersey: Lawrence Erlbaum Associates; 1986.

15. Schuler D, Namioka A: Participatory design: principles and practices. New Jersey:
Lawrence Erlbaum Associates; 1993.

16. Scaife M, Rogers Y, Aldrich F, Davies M: Designing for or designing with? Informant
design for interactive learning environments. In: Conference on human factors in
computing systems: 1997; Atlanta, Georgia, USA: ACM; 1997.

17. Scaife M, Rogers Y: Kids as informants: telling us what we didn't know or confirming
what we knew already. The Design of Children's Technology 1999:28-49.

18. Druin A, Solomon C: Designing multimedia environments for children. New York:
John Wiley; 1996.

9 Future work

9.1 BIO-ONTOLOGIES: THE MONTAGUES AND THE CAPULETS, ACT TWO, SCENE TWO: FROM VERONA TO MACONDO VIA LA MANCHA

9.1.1 Introduction

As the need for integrated biological research grows, ontologies become more and
more important within the life sciences. The biological community has a need not only for
controlled vocabularies but also for guidance systems for annotating experiments, better and
more reliable literature mining tools, and most of all a consistent, shared understanding of
what the information means. Ontologies, thus, should be understood as a starting point, not
as an end in themselves. Although several efforts to provide biological communities with
these required ontologies are currently in progress, some of them have thus far proven to be
too slow, too expensive, and too error-prone to meet the demand.

The difficulties in these developments are due not only to the ambiguity of natural
language and the fact that biology is a highly fragmented domain of knowledge, but also
to the lack of consistent methodologies for ontology building in loosely centralised
environments such as the biological domain [1, 2]. Biologists need methodologies and tools in
the same vein that computer scientists need real life problems to work on. Collaboration
would thus be the easiest way to move forward. However, such interaction has proven
difficult as the two “houses” of Biology and Computer Science continue to fight each other in
the field of windmills where Don Quixote is pointing a way towards the horizon.

Our three houses have been accurately described by Goble and Wroe [3]. Firstly “The
Montagues”: “one, comforted by its logic’s rigour/Claims ontology for the realm of pure”. Goble and
Wroe define this house as the one of computer science, knowledge management, and
Artificial Intelligence (AI). This community essentially works with well-scoped and behaved
problems; they work with generalisations, and expect to have broadly applicable results. Their
interest in ontologies lies in testing the boundaries of knowledge representation,
expressiveness of languages, and the suitability of reasoning engines. For this community
building ontologies is a matter of research, which is led by research interests. The community
involvement in these developments is minimal and domain experts are mainly informants;
the role of the knowledge engineer is predominant. Their ontologies are mostly deployed on a
once-off basis.

Our second house, “The Capulets”: “The other, with blessed scientist’s vigour/acts hastily on
models that endure.” As Goble and Wroe defined it, this is the house of Life Sciences. Within
this community the purpose of bioinformatics is to support their research endeavours. This is
a community with a pragmatic and lead-by-need vision of computer science, with a strong
application pull. Ontologies for the Capulets are basically controlled vocabularies, taxonomies
that allow them to classify things, very much in accordance with a very old tradition in this
domain, one that started with the likes of Aristotle and Linnaeus. Within this house, the role
of the knowledge engineer is that of someone who promotes collaboration in a loosely
centralised environment. Biologists are thus not only leading the process but also designing
the ontology and the software that will ultimately utilise the ontology. Their ontologies are
living entities, constantly evolving.

Following Goble and Wroe’s analogy (henceforth Act 1, Prologue) we also have a
third house: The Philosophers. For narrative purposes it has been decided to name this as the
house of Don Quixote. For this house the essence of the "things" is important, as they seek a
single model of truth itself. Some tangible contributions of this house are those studies of the
part/whole relationship, how to model time, criteria for distinguishing among mutations,
transformations, perdurance, and endurance. Thanks to its heavy emphasis on theory, this
house has provided us with a conceptual corpus for understanding ontologies.

The same story will be used as a baseline for the remainder of this section. Although
their houses endure, we may be shifting acts and scenarios. As the Montagues and Capulets
dig deeper into their discrepancy, are we moving from Verona, via La Mancha, to Macondo,
where we all may face a hundred years of solitude? This literary analogy thus introduces a
possible ending point, Macondo, where we all may find the land of endogenous agreements.
A brief history of this drama is presented in Section 9.1.2; Section 9.1.3 presents some of the
duels between the two main houses of our narrative. The disjunctive, marriage or poison, is
discussed in Section 9.1.4. We argue in this section the potential danger of heading towards
Macondo versus remaining in Verona and finally living happily ever after.

9.1.2 Some background information

Act 1, Scene 1: “Verona. A public place.” This is indeed true for both of our dignified
households. Computer scientists and biologists actively promote open source initiatives. The
Capulets have a long-standing tradition where sharing code is an everyday activity. The
OpenBio initiatives are a clear example of this fact; however, these initiatives are not resources
for workbench biologists; they are meant to support bio-programmers and bioinformaticians.
The Montagues also have an interesting record of collaborative efforts, the development of
the Linux kernel and KDE (K Desktop Environment) to name two. Sharing code is,
however, different from sharing knowledge. Unfortunately little attention has been paid as to
exactly “how” these communities have carried out the process of knowledge management in
their corresponding projects [4].

Act 1, Scene 2: “Halls and rooms in our households’ houses.” During the last several
years The Capulets have been developing different ontologies: The Gene Ontology (GO), the
Microarray Gene Expression Data (MGED) Ontology (henceforth, MO) [5]; a
comprehensive list is provided by OBO [6]. By the same token The Montagues have several
ontological initiatives such as OpenCyc [7], a general knowledge base, and SUO
(Standard Upper Ontology) [8]. The SUO WG (working group) is developing a standard,
which aims to specify an upper ontology to support computer applications such as data
interoperability, information search and retrieval, automated inference, and natural language
processing.

Act 1, Scene 3: “A lane by the wall of our household’s orchard.” While the Capulets
focus on standardising the words and their meanings, our Montagues embrace relationships,

230
H H
F-XC A N GE F-XC A N GE
PD PD

!
W

W
O

O
N

N
y

y
bu

bu
to

to
k

k
lic

lic
C

C
w

w
m

m
w w
w

w
o

o
.d o .c .d o .c
c u-tr a c k c u-tr a c k

logical descriptions and agent technology, and give serious consideration to some of the insights
from the House of Don Quixote such as Time, Matter, Substance, Mutability and many other
essential properties of a concept. The Montagues and the Quixotes tend to see the Capulets’
ontologies as dictionaries, and are prone to point out those deficiencies with accuracy [9, 10].
The Capulets defend their efforts with passion; a valid point in their favor is the lack of
knowledge that Montagues and Quixotes have about the biological community.

9.1.3 The duels and the duets

Act 2, Scene 2: "What's in a name? That which we call a rose by any other name would
smell as sweet." It was over two years ago that Hunter [11] responded to Brenner's [12]
comment in Genome Biology. The interesting issue in those discussions was that the
fundamental question both were trying to address was never stated explicitly: What is the role
of ontologies in the life sciences?

Since ontologies may also be understood as social agreements, the way Hunter responds to
Brenner - arguing that they are for programs and not for people - is completely descriptive
of their purpose. It is also true that Brenner misses the point by portraying ontologies as
solely taxonomies of words. Conceptual objects have concrete representations; they are in a
way tangible objects, and therefore should enable computational tasks.

Act 2, Scene 2, part 2: "A fertile and dangerous playground". Recently Soldatova et al.
[10] published a series of shortcomings related to MO. It did not take long for
Stoeckert et al. to respond [13]. One interesting issue in this scene is that neither party actually
addressed a key point about an ontology that should provide the conceptual scaffold for
describing microarray experiments. For instance, such an ontology should provide not just
those minimal descriptors but also the logical constraints so that inference will be possible.
Ontologies are not just controlled vocabularies; they should also provide support for
reasoning processes. In order to describe a biological investigation it is necessary to use many
different ontologies; how can we integrate these orthogonal ontologies so the final narrative
makes sense?
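
To make concrete the distinction drawn above between a controlled vocabulary and an
ontology that supports inference, consider the toy constraint below. The rule is invented, in
the spirit of MO rather than taken from it: a minimal descriptor alone would pass, whereas
the logical constraint catches an incomplete description:

    # Sketch: a logical constraint over experiment descriptions. The rule is
    # invented: every hybridisation must reference exactly one labelled extract.
    def violations(description):
        problems = []
        if description.get("type") == "hybridisation":
            extracts = description.get("labelled_extracts", [])
            if len(extracts) != 1:
                problems.append("a hybridisation requires exactly one labelled extract")
        return problems

    ok = {"type": "hybridisation", "labelled_extracts": ["extract-42"]}
    bad = {"type": "hybridisation", "labelled_extracts": []}

    print(violations(ok))     # [] - consistent with the constraint
    print(violations(bad))    # a controlled vocabulary alone would never flag this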

So far ontologies have focused on describing things and processes. However, the
relationship between the entity in question and the process by which it is studied has not yet
been fully explored. A “thing” is immersed in a context in which it is informative. The
fragmentation of processes and things should be consistent (i.e., the definition of
whole/part-of relations) so agents are able to "cut through" the myriad of information
annotated with ontologies. How shall we encompass those apparently unrelated descriptions?
The other side of the coin deals with the context, the broader picture within which the
"thing" is of interest.
For instance, when studying a disease we need to gather information not only about those
experimental processes, but also the different responses of the system to the alterations we
have caused. We need to describe not only the "thing" we are studying but also the context
in which it is being studied. A disease may be seen as an alteration of one or more metabolic
pathways, with the subsequent molecular implications. It may be described as a series of
objects with individual states, individual disease instances, and with relationships between
particular objects. Disease representation requires capturing individual object states as well as
the relationships between different objects. For example, one can use the GO term
GO:0005021 (vascular endothelial growth factor receptor) as a partial descriptor of the gene
FLT3 (fms-related tyrosine kinase). This also allows an ATP-binding activity to be imputed to
this gene, as is understood from [14].

This, however, says nothing about the circumstances of the gene/protein product in a
disease state, or in an individual disease instance. The same can be said for disease objects,
which can also be effectively described by ontologies, but without state or relationship
provision. Is it possible with existing ontologies to accurately describe a disease from both
phenotypic and genotypic perspectives? Since ontologies offer what Brenner defines as
dictionaries, such a representation is not yet possible. To some extent, the solution seems to
be a linguistic exercise, utilising curated data sources and the biomedical text to first define the
relevant objects as they are both officially and commonly expressed, and then to both define
and determine the syntactic and semantic relationships between objects. To us, the existing
and emerging ontologies play a key role in tethering the objects to an objective structure.
However, the object states and relationships are what truly represent disease states and
instances. In our view, the dynamic nature of individualised disease states requires a more
flexible conceptual model, which encompasses the bridging of separate ontologies through
relationships.
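
As a toy sketch of such bridging (all identifiers and relation names below are invented),
object states and typed relationships can be represented explicitly, over and above the term
annotations that tether each object to its ontology:

    # Sketch: typed relationships bridging separate ontologies; annotations
    # tether objects to terms, while states and relations carry the disease model.
    annotations = {
        "gene:FLT3": ["GO:0005021"],          # term from one ontology
        "disease:case-12": ["DZ:0000123"],    # term from another (invented)
    }

    states = {
        "gene:FLT3": {"expression": "overexpressed"},
    }

    relations = [
        ("gene:FLT3", "implicated_in", "disease:case-12"),
    ]

    def describe(entity):
        """Assemble a description spanning terms, states and relations."""
        parts = ["annotated with " + ", ".join(annotations.get(entity, []))]
        if entity in states:
            parts.append("state: " + str(states[entity]))
        for subject, relation, obj in relations:
            if subject == entity:
                parts.append(relation + " " + obj)
        return entity + ": " + "; ".join(parts)

    print(describe("gene:FLT3"))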

Act 2, Scene 3: “Lingua Franca: a sanctorum by the orchard: He discovered then that he could
understand written English and that between parchments he had gone from the first page to the last of the six
volumes of the encyclopedia as if it were a novel.” As Brenner states explicitly in his comment, we
need to be fluent in our own language - but what does it mean to be fluent in one’s own
language? If someone learned a few phrases so that they could read menus in restaurants and
ask for directions on the street, would you consider them fluent in the language? Certainly
not. That type of phrase-book knowledge is equivalent to the way most people use computers
today. Is such knowledge useful? Yes. But it is not fluency. To be truly fluent in a foreign
language, you must be able to articulate a complex idea or tell an engaging story; in other
words, you must be able to “make things” with language. Analogously, being digitally fluent
involves not only knowing how to use technological tools, but also knowing how to construct
things of significance with those tools [15]. Learning how to use Protégé does not make you
an ontologist; by the same token knowing GO does not make you a biologist. Respect and
understanding for others' motivations, contributions and needs are fundamental for a
successful marriage.

9.1.4 Marriage, Poison, and Macondo

“The world was so recent that many things lacked names, and in order to indicate them it was necessary
to point”. Verona and Macondo are an apt metaphor for Bio-ontologies today. On one hand
Verona represents the possible starting point from which we may all do business and thus
engage in win-win situations, as described by Stein [16]. Alternatively Macondo represents the
undesired possible arrival point, a magical realism where man's astonishment before the
wonders of the real world is expressed in isolation. Since the real world encompasses
different views, codes of practice, rules, values, and areas of interest, we should focus on our
common point: fostering interdisciplinary collaboration and communication and thus
engaging in business. GONG (Gene Ontology Next Generation) [17] and FuGO (Functional
Genomics Investigation Ontology) [18] may illustrate how to work together. They may
eventually show us important lessons not only from the ontological perspective but also from
the community perspective. However, it is as yet too soon to fairly evaluate those lessons
learnt. Some practical realism would consequently come in quite handy if we are all to avoid a
hundred years of solitude.

9.1.5 References

1. Pinto HS, Staab S, Tempich C: Diligent: towards a fine-grained methodology for
Distributed, Loosely-controlled and evolving engineering of ontologies. In: European
Conference on Artificial Intelligence: 2004; Valencia, Spain; 2004: 393-397.

2. Garcia Castro A, Rocca-Serra P, Stevens R, Taylor C, Nashar K, Ragan MA, Sansone
S: The use of concept maps during knowledge elicitation in ontology development
processes - the nutrigenomics use case. BMC Bioinformatics 2006, 7:267.

3. Goble C, Wroe C: The Montagues and the Capulets. Comparative and Functional
Genomics 2004, 5:623-632.

4. Hemetsberger A, Reinhardt C: Sharing and Creating Knowledge in Open-Source
Communities: The case of KDE. In: Fifth European Conference on Organizational
Knowledge, Learning and Capabilities: 2004; Innsbruck, Austria; 2004.

5. Stoeckert CJ, Parkinson H: The MGED ontology: a framework for describing functional
genomics experiments. Comparative and Functional Genomics 2003, 4:127-132.

6. OBO [http://obo.sourceforge.net/]

7. OpenCyc [http://www.opencyc.org/]

8. IEEE Standard Upper Ontology (SUO) [http://suo.ieee.org/]

9. Smith B, Williams J, Schulze-Kremer S: The Ontology of the Gene Ontology. In: AMIA
Symposium; 2003.

10. Soldatova LN, King RD: Are the current ontologies in biology good ontologies?
Nature Biotechnology 2005, 23:1095-1098.

11. Hunter L: Ontologies for programs, not people. Genome Biology 2002, 3(6).

12. Brenner S: Life sentences: Ontology recapitulates philology. Genome Biology 2002,
3(4).

13. Stoeckert CJ, Ball C, Brazma A, Brinkman R, Causton H, Fan L, Fostel J: Wrestling
with SUMO and bio-ontologies. Nature Biotechnology 2006, 24:21-22.

14. fms-related tyrosine kinase 3 [http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=summary&list_uids=2322]

15. Resnick M: Rethinking Learning in the Digital Age. In The Global Information
Technology Report: Readiness for the Networked World: Oxford University Press;
2002.

16. Stein L: Creating a bioinformatics nation. Nature 2002, 417:119-120.

17. Gene Ontology Next Generation [http://gong.man.ac.uk/]

18. Functional Genomics Investigation Ontology [http://fugo.sourceforge.net/]

APPENDIXES

GLOSSARY

Activity: A constituent task of a process.

Ad hoc developments: Solutions addressing a specific problem via specific software
development.

Application ontologies: Application ontologies are specialisations of domain and task
ontologies, as they form a base for implementing applications with a concrete domain and
scope.

Communities of practice: Communities of practice are the basic building blocks of a social
learning system because they are the social ‘containers’ of the competences that make up such
a system. Communities of practice define competence by combining three elements. First,
members are bound together by their collectively developed understanding of what their
community is about and they hold each other accountable to this sense of joint enterprise. To
be competent is to understand the enterprise well enough to be able to contribute to it.
Second, members build their community through mutual engagement. They interact with one
another, establishing norms and relationships of mutuality that reflect these interactions. To
be competent is to be able to engage with the community and be trusted as a partner in these
interactions. Third, communities of practice have produced a shared repertoire of communal
resources—language, routines, sensibilities, artefacts, tools, stories, styles, etc. To be
competent is to have access to this repertoire and be able to use it appropriately.

Competency questions: Understood here as those questions for which we want the
ontology to be able to provide support for reasoning and inference processes.

Concept maps: A concept map is a diagram showing the relationships among concepts.
Concepts are connected with labelled arrows, in a downward-branching hierarchical structure.
The relationship between concepts is articulated in linking phrases, e.g., "gives rise to",
"results in", "is required by," or "contributes to".

Domain analysis: The process by which a domain of knowledge is analysed in order to find
common and variable components that best describe that domain.

Domain expert: A domain expert or subject matter expert (SME) is a person with
special knowledge or skills in a particular area. Domain experts are individuals who are both
knowledgeable and extremely experienced with application domains.

Domain ontologies: Ontologies that describe the vocabulary specific to a given domain.

GCG: Formally known as the GCG Wisconsin Package, the GCG contains over 140
programs and utilities covering the cross-disciplinary needs of today’s research environment.

KAON: KAON is an open-source ontology management infrastructure targeted for
business applications. It includes a comprehensive tool suite allowing easy ontology creation
and management and provides a framework for building ontology-based applications.

Knowledge: Knowledge is a mix of framed experience, values, contextual information,
expert insight and grounded intuition that provides an environment and framework for
evaluating and incorporating new experiences and information. It originates and is applied in
the minds of knowers. In organisations, it often becomes embedded not only in documents
or repositories but also in organisational routines, processes, practices and norms.

Knowledge elicitation: The process of collecting, from a human source of knowledge,
information that is relevant to that knowledge.

Life cycle: A structure imposed on the development of a software product.

MGED: Microarray and Gene Expression Data (MGED). This is an international
organisation of biologists, computer scientists, and data analysts that aims to facilitate the
sharing of data generated using microarray and other functional genomics technologies
for a variety of applications including expression profiling.

Method: An orderly process or procedure used in the engineering of a product or
performing a service.

Methodology: A comprehensive, integrated series of techniques or methods creating a
general system theory of how a class of thought-intensive work ought to be performed.

MOBY: The MOBY system for interoperability between biological data hosts and
analytical services.

Ontology: An ontology is a not-necessarily complete, formal classification of types of
information, structured by relationships defined by the vocabulary of the domain of
knowledge and by the canonical formulations of its theories.

Platform: A general, non-purpose-specific solution. A platform for data integration
offers a technological framework within which it is possible to develop point solutions;
usually platforms provide non-proprietary languages, data models and data
exchange/exporting systems, and are highly customizable.

Process: Function that must be performed in the software life cycle.

Protégé: Ontology editor.

Relevant scenarios: Scenarios in which the term under consideration was expected to be used.

Semantic Web (SW): The semantic web is an evolving extension of the World Wide
Web in which web content can be expressed not only in natural language, but also in a format
that can be read and used by software agents, thus permitting them to find, share and
integrate information more easily.

Task: The atomic unit of work that may be monitored, evaluated and/or measured. A
task is a well-defined work assignment for one or more project members. Related tasks are
usually grouped to form activities.

Task ontologies: Those ontologies that describe vocabulary related to tasks, processes,
or activities.

TAVERNA: The Taverna project aims to provide a language and software tools to
facilitate easy use of workflow and distributed compute technology within the Science
community.

Technique: Technical and managerial procedure used to achieve a given objective.

Terminology extraction: Terminology extraction, term extraction, or glossary extraction, is a
subtask of information extraction. The goal of terminology extraction is to automatically
extract relevant terms from a given corpus.

Text mining: Text mining, sometimes alternately referred to as text data mining, refers
generally to the process of deriving high quality information from text.

Text2ONTO: Software that allows the extraction of terminology.

UNIX: A computer operating system.

Workflow: Workflow is a reliably repeatable pattern of activity enabled by a systematic
organization of resources, defined roles and mass, energy and information flows, into a work
process that can be documented and learned. Workflows are always designed to achieve
processing intents of some sort, such as physical transformation, service provision, or
information processing.

W2H: W2H is a free WWW interface to sequence analysis software tools like the GCG
Package (Genetics Computer Group), EMBOSS (European Molecular Biology Open
Software Suite) or to derived services (such as HUSAR, Heidelberg Unix Sequence Analysis
Resources).

W3H: A workflow system for W2H.

ACRONYMS

BLAST: Basic Local Alignment Search Tool.

BRENDA: Braunschweiger Enzyme Database.

BSML: Bioinformatics Sequence Markup Language.

CODATA: Committee on Data for Science and Technology.

CPL: Combined Programming Language.

DTD: Document Type Definition.

EMBOSS: European Molecular Biology Open Software Suite.

GBIF: Global Biodiversity Information Facility.

GO: Gene Ontology.

G-PIPE: Graphical Pipe.

GUI: Graphical User Interface.

HTML: HyperText Markup Language.

HUSAR: Heidelberg Unix Sequence Analysis Resources.

ICIS: International Crop Information System.

IEEE: Institute of Electrical and Electronics Engineers.

Jemboss: Java EMBOSS.

KEGG: Kyoto Encyclopedia of Genes and Genomes.

MAGE: MicroArray and Gene Expression.

MGED: Microarray and Gene Expression Data (MGED).

MIAME: Minimum Information About a Microarray Experiment.

MO: MGED Ontology.

OQL: Object Query Language.

PATH: Phylogenetic Analysis Task in HUSAR.

PISE: Pasteur Institute Software Environment.

PO: Plant Ontology.

PSI: Proteomics Standards Initiative.

RSBI: Reporting Structure for Biological Investigations.

SOAP: Simple Object Access Protocol.

SNP: Single Nucleotide Polymorphism.

SQL: Structured Query Language.

SRS: Sequence Retrieval System.

SW: Semantic Web.

TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources.

XML: Extensible Markup Language.

XPath: XML Path Language.

XQL: XML Query Language.

APPENDIX 1 – RSBI ONTOLOGY

This version of the RSBI ontology represents high-level concepts usually found in the
description of biological investigations. Protégé version 3.1 was the ontology editor software
used during the development of this ontology.

Appendix 1 - Figure 1. Identified properties for the RSBI ontology.

Appendix 1 - Figure 2. RSBI ontology

Appendix 1 - Figure 3. A concept map for RSBI ontology.

APPENDIX 2 – EXTRACTED TERMINOLOGY

This list of terms was gathered by using Text2Onto, part of the KAON ontology
framework. In total, eight documents were scanned with this software.
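
The scores in the first column are C-value-style term weights combining term length and
frequency. The sketch below is a simplified scorer, not Text2Onto's actual implementation;
the two-word scores in the table are consistent with a natural-logarithm variant (e.g.
ln(2) * 31 = 21.49 for "male parent"), while single-word terms evidently receive a different,
negative weighting in the tool itself:

    # Sketch: simplified C-value term scoring (not Text2Onto's actual code).
    # Non-nested terms score ln(|a|) * f(a); nested terms are discounted by the
    # average frequency of the longer candidate terms containing them.
    import math

    def c_value(term_words, freq, containing_freqs=()):
        if containing_freqs:
            freq = freq - sum(containing_freqs) / len(containing_freqs)
        return math.log(len(term_words)) * freq

    print(c_value(("male", "parent"), 31))       # ln(2) * 31 = 21.49
    print(c_value(("recurrent", "parent"), 13))  # ln(2) * 13 = 9.01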

c-value   num-words   occurrences   term   alternate term   alternate term
-14,62642133 1 36 time
-14,62642133 1 31 reference
-12,62642133 1 15 computer computing
-12,62642133 1 12 nucleus
-12,62642133 1 14 foundation
-14,62642133 1 30 tissue
-20,62642133 1 98 change
-12,62642133 1 12 error
-12,62642133 1 17 bank
-12,62642133 1 15 sib
-13,62642133 1 22 country
-16,62642133 1 56 format formation
-12,62642133 1 11 speed
-21,62642133 1 101 input
-14,62642133 1 39 way
-12,62642133 1 13 instln
-12,62642133 1 10 comment
-22,62642133 1 113 function functionality
-12,62642133 1 14 cop
-14,62642133 1 32 note
-13,62642133 1 26 mouse
-12,62642133 1 15 dsn dsns
-13,62642133 1 24 pointer
-15,62642133 1 45 option
-33,62642133 1 228 number
-12,62642133 1 12 guest
-12,62642133 1 12 purdy
-13,62642133 1 23 space spacing
-32,62642133 1 216 gms
-12,62642133 1 18 measure measurement
-25,62642133 1 140 sf
-13,62642133 1 23 f
-12,62642133 1 14 breeder
-12,62642133 1 13 query
-12,62642133 1 18 containing
-14,62642133 1 36 update
-12,62642133 1 10 research researcher
-13,62642133 1 24 launcher
-14,62642133 1 32 selector
-15,62642133 1 47 gms_success
-19,62642133 1 89 attribute
-14,62642133 1 36 year
-12,62642133 1 11 title
-12,62642133 1 13 if
-27,62642133 1 169 s
-14,62642133 1 37 germplasm_id
-14,62642133 1 30 isolation
-28,62642133 1 177 entry
-20,62642133 1 96 string
-17,62642133 1 69 gms_error
-12,62642133 1 11 highlight
-29,62642133 1 185 line
-35,62642133 1 243 seed
-18,62642133 1 73 search searching
-14,62642133 1 31 drb
-18,62642133 1 75 it
-13,62642133 1 23 fopt
-14,62642133 1 39 row
-23,62642133 1 123 species
-12,62642133 1 14 storage
-12,62642133 1 13 r
-12,62642133 1 19 place
-12,62642133 1 18 mode
-14,62642133 1 36 step
-14,62642133 1 33 acquisition
-13,62642133 1 24 press
-12,62642133 1 15 subdirectory
-17,62642133 1 61 code
-15,62642133 1 42 zero
-12,62642133 1 13 user_id
-13,62642133 1 21 store
-12,62642133 1 17 design designation
-27,62642133 1 167 location
-16,62642133 1 55 address
-12,62642133 1 19 variate variation
-12,62642133 1 19 view viewing
-12,62642133 1 18 haploid
-46,62642133 1 351 cross crossing
-12,62642133 1 16 routine
-12,62642133 1 12 ascii
-17,62642133 1 66 n
-12,62642133 1 11 production
-13,62642133 1 28 call
-17,62642133 1 68 crop
-13,62642133 1 20 development
-22,62642133 1 116 cf
-14,62642133 1 39 tree
-14,62642133 1 37 increase
-68,62642133 1 576 germplasm
-12,62642133 1 17 origin
-12,62642133 1 17 genesis
-12,62642133 1 10 polyploid
-24,62642133 1 137 icis
-12,62642133 1 11 evaluation
-13,62642133 1 22 second
-12,62642133 1 12 range
-25,62642133 1 146 gen
-12,62642133 1 12 case
-12,62642133 1 11 parentage
-12,62642133 1 14 see
-15,62642133 1 44 output
-12,62642133 1 15 descent
-33,62642133 1 224 population
-12,62642133 1 13 standardisation standard
-25,62642133 1 146 parent
-12,62642133 1 10 log
-13,62642133 1 26 starting start
-12,62642133 1 11 key
-15,62642133 1 43 command
-12,62642133 1 13 ltype
-12,62642133 1 15 dll
-12,62642133 1 14 display
-16,62642133 1 50 g
-12,62642133 1 10 ntype
-15,62642133 1 45 text
-15,62642133 1 47 gidx
-21,62642133 1 108 c
-12,62642133 1 14 column
-12,62642133 1 10 lgms
-12,62642133 1 12 session
-14,62642133 1 36 male
-17,62642133 1 66 half
-12,62642133 1 11 copy
-36,62642133 1 250 database
-13,62642133 1 24 genealogy genealogies
-12,62642133 1 11 exe exes
-15,62642133 1 49 browse
-12,62642133 1 17 inger
-19,62642133 1 87 derivative derivation
-13,62642133 1 24 mating mate
-16,62642133 1 53 menu
-12,62642133 1 13 instance
-18,62642133 1 74 cultivar
-12,62642133 1 10 aim
-13,62642133 1 20 pollination
-14,62642133 1 33 order
-12,62642133 1 10 factor
-15,62642133 1 41 directory
-12,62642133 1 14 purification
-17,62642133 1 62 dsp
-14,62642133 1 30 link
-13,62642133 1 20 abbreviation
-16,62642133 1 57 syntax
-15,62642133 1 46 general generation generative
-26,62642133 1 155 gid gids
-12,62642133 1 10 k
-13,62642133 1 20 will
-19,62642133 1 82 date
-12,62642133 1 18 meth
-21,62642133 1 102 self selfing selfs
-12,62642133 1 19 variety
-17,62642133 1 65 default
-12,62642133 1 17 gmsinput
-14,62642133 1 34 cv
-12,62642133 1 17 print
-23,62642133 1 123 plant planting
-13,62642133 1 21 end
-23,62642133 1 121 information
-27,62642133 1 163 o
-45,62642133 1 349 list listing
-14,62642133 1 39 variable variability
-25,62642133 1 144 field
-13,62642133 1 23 top
-12,62642133 1 15 gene
-19,62642133 1 87 installation
-17,62642133 1 60 pedigree
-20,62642133 1 97 integer
-15,62642133 1 41 section
-12,62642133 1 15 descriptor
-12,62642133 1 14 tool
-14,62642133 1 36 d
-13,62642133 1 27 history
-12,62642133 1 13 double doubling
-23,62642133 1 127 value
-13,62642133 1 22 release
-12,62642133 1 15 export exporting
-12,62642133 1 10 methn
-13,62642133 1 29 expansion
-18,62642133 1 79 collection
-13,62642133 1 23 point
-32,62642133 1 219 user
-14,62642133 1 30 l
-20,62642133 1 96 structure
-29,62642133 1 182 data
-12,62642133 1 16 gidy
-17,62642133 1 67 diallel
-28,62642133 1 179 source
-12,62642133 1 11 destination
-26,62642133 1 155 ids id
-12,62642133 1 17 help
-15,62642133 1 48 ini
-12,62642133 1 15 mass
-13,62642133 1 21 odbc
-12,62642133 1 13 mixture
-12,62642133 1 13 reason
-12,62642133 1 13 listbox
-12,62642133 1 11 szbuffer
-17,62642133 1 65 character
-12,62642133 1 11 item
-12,62642133 1 16 form
-12,62642133 1 11 h
-16,62642133 1 52 check checking
-31,62642133 1 200 click
-13,62642133 1 27 import
-16,62642133 1 50 tester
-12,62642133 1 19 auto
-13,62642133 1 25 password
-13,62642133 1 25 right
-14,62642133 1 30 convention
-30,62642133 1 197 type
-13,62642133 1 25 site
-13,62642133 1 20 run running
-17,62642133 1 62 bulk bulking
-18,62642133 1 78 man
-16,62642133 1 56 fertilising fertilisation
-14,62642133 1 31 culture
-21,62642133 1 106 figure
-14,62642133 1 33 find_next
-12,62642133 1 13 return
-13,62642133 1 25 element
-12,62642133 1 14 entity
-13,62642133 1 23 identification
-12,62642133 1 14 t
-15,62642133 1 47 administrator administration
-13,62642133 1 23 length
-16,62642133 1 51 use
-13,62642133 1 24 level
-23,62642133 1 128 clone
-12,62642133 1 12 termination terminal terminator
-22,62642133 1 110 group
-13,62642133 1 28 implementation
-12,62642133 1 15 parse parsing
-12,62642133 1 16 day
-12,62642133 1 18 ir
-13,62642133 1 20 target
-12,62642133 1 18 replacement
-12,62642133 1 10 array
-12,62642133 1 12 follows
-12,62642133 1 14 material
-12,62642133 1 10 multiple multiplication
-26,62642133 1 154 access accessing accession
-12,62642133 1 11 initialisation
-13,62642133 1 23 germplsm
-13,62642133 1 28 wheat
-12,62642133 1 10 test testing
-14,62642133 1 31 cytoplasm
-12,62642133 1 13 cd
-12,62642133 1 12 bw bws
-12,62642133 1 17 fieldbook
-32,62642133 1 215 method
-23,62642133 1 128 record recording

-18,62642133 1 74 example
-16,62642133 1 54 system
-27,62642133 1 164 description
-13,62642133 1 25 identifier
-12,62642133 1 18 back
-12,62642133 1 13 cycle
-51,62642133 1 405 table
-12,62642133 1 13 mutation
-12,62642133 1 15 works working work
-17,62642133 1 67 file
-12,62642133 1 18 a
-13,62642133 1 20 size
-13,62642133 1 21 month
-12,62642133 1 10 buffer
-17,62642133 1 62 e
-12,62642133 1 10 download
-13,62642133 1 23 match matching
-12,62642133 1 12 find_first
-12,62642133 1 11 definition
-20,62642133 1 93 argument
-12,62642133 1 12 batch
-13,62642133 1 23 dialog
-13,62642133 1 24 button
-12,62642133 1 19 char
-12,62642133 1 16 term
-23,62642133 1 124 process processing
-13,62642133 1 26 female
-12,62642133 1 19 i
-13,62642133 1 26 status
-12,62642133 1 15 local
-64,62642133 1 537 name naming
-13,62642133 1 24 maintenance
-16,62642133 1 54 management manager
-15,62642133 1 44 application
-21,62642133 1 104 set setting
-16,62642133 1 55 progenitor progenitors
-18,62642133 1 74 programming program
-12,62642133 1 11 part
-12,62642133 1 18 layout
-13,62642133 1 26 box
-30,62642133 1 197 window
-13,62642133 1 24 spp
-13,62642133 1 23 gms_germplasm

-27,62642133 1 165 selection
-16,62642133 1 50 landrace
-18,62642133 1 76 breeding
-12,62642133 1 11 m ms
-12,62642133 1 11 template
-12,62642133 1 10 privilege
-12,62642133 1 18 problem
7,624618986 2 11 location ltype
0 2 12 line cf
9,704060528 2 14 name search name searching
13,16979643 2 19 entry code
6,931471806 2 10 import clone
9,010913347 2 13 recurrent parent
13,86294361 2 20 population sf
11,78350207 2 17 group source
8,317766167 2 12 existing list
16,63553233 2 24 generative process
16,63553233 2 34 half diallel
6,931471806 2 10 cf acquisition
21,4875626 2 31 male parent
13,86294361 2 20 long integer
9,010913347 2 13 list entry
13,86294361 2 20 recurrent selection
11,78350207 2 17 full diallel
8,317766167 2 12 complex top
8,317766167 2 12 foundation seed
0 2 22 plant selection
6,931471806 2 11 type database
2,772588722 2 14 weedy spp
10,39720771 2 15 local installation
22,18070978 2 32 list selector
31,88477031 2 46 gen s
33,96421185 2 49 germplasm table
13,16979643 2 19 group id group ids
11,09035489 2 16 data source
6,931471806 2 11 element name
6,931471806 2 10 ie
11,09035489 2 16 clone s
9,704060528 2 14 man g
13,16979643 2 33 self fertilisation self fertilising
9,704060528 2 14 random bulk
0 2 12 cultivar line
12,47664925 2 18 inbred line

27,03274004 2 53 argument type
10,39720771 2 29 use description
11,78350207 2 17 germplasm id
8,317766167 2 12 user name
7,624618986 2 11 full sib
8,317766167 2 12 certified seed
13,86294361 2 20 table row
6,931471806 2 10 wild spp
7,624618986 2 11 purdy cross
0,01 2 16 plant s
22,87385696 2 33 female parent
45,74771392 2 66 local database
8,317766167 2 13 field name
4,852030264 2 10 structure element
5,545177444 2 20 tester line
7,624618986 2 11 method number
6,931471806 2 24 right click
10,39720771 2 15 landrace population
8,317766167 2 12 location information
9,010913347 2 13 germplasm bank
7,624618986 2 11 preferred abbreviation
23,56700414 2 34 gms_success gms_error
7,624618986 2 11 breeding method
13,86294361 2 20 gms database
7,624618986 2 11 bulk selection
40,89568365 2 59 central database
9,010913347 2 13 string containing
13,86294361 2 29 type use
6,931471806 2 10 long output
8,317766167 2 12 store seed
13,86294361 2 20 number integer
6,238324625 2 26 derivative method
16,63553233 2 24 fertilised species
13,86294361 2 20 cross expansion
0 2 13 terminated string
9,704060528 2 14 collection population
6,931471806 2 10 tester population
8,317766167 2 12 list manager list management
8,317766167 2 22 diallel cross
0 2 14 seed descent
12,47664925 2 18 source germplasm
0 2 12 heterozygous plant
48,52030264 2 70 gen o

17,32867951 2 25 single cross
8,317766167 2 12 derivative process
9,704060528 2 14 different entry
0 2 12 sf seed
24,26015132 2 57 single plant
8,317766167 2 12 accession number
9,010913347 2 25 seed increase
0 2 14 window right
8,317766167 2 24 fertilising species
13,86294361 2 20 population cf
11,09035489 2 16 line sf
9,704060528 2 14 pure seed
1,386294361 2 16 single seed
25,64644568 2 37 local gms
2,772588722 2 14 spp population
0 2 17 unknown derivative
11,09035489 2 16 landrace cultivar
9,010913347 2 13 description type
17,32867951 2 25 browse window
11,09035489 2 16 fertilising crop
6,931471806 2 11 database field
9,010913347 2 23 cross fertilising cross fertilisation
15,24923797 2 22 man o
7,624618986 2 11 dialog box
19,40812106 2 28 germplasm list
11,09035489 2 16 mass selection
30,49847594 2 44 central gms
13,86294361 2 20 man s
9,010913347 2 13 search string
22,87385696 2 33 germplasm record
19,40812106 2 28 tissue culture
13,86294361 2 20 man c
0 2 12 line population
11,09035489 2 16 pure line
12,47664925 2 18 sf acquisition
28,4190344 2 41 preferred name
8,317766167 2 12 bulk sf
9,010913347 2 13 local germplasm
11,09035489 2 16 new germplasm
10,39720771 2 15 selected set
13,16979643 2 19 output address
11,09035489 2 16 cross history
9,010913347 2 13 layout file

28,4190344 2 41 long input
8,317766167 2 12 generative method
6,931471806 2 10 gms_germplasm output
6,931471806 2 13 name table
10,39720771 2 15 user ids user id
6,931471806 2 10 database administrator
10,39720771 2 15 current germplasm
11,09035489 2 16 way cross
25,64644568 2 38 name type
10,39720771 2 15 germplasm data
8,317766167 2 12 sf collection
10,39720771 2 15 naming convention
14,55609079 2 21 random mating
8,317766167 2 12 change record
13,18334746 3 12 heterozygous plant s
9,887510598 3 11 database field name
13,18334746 3 12 sf seed increase
9,887510598 3 11 name type database
6,591673732 3 11 field name table
18,67640891 3 17 unknown derivative method
10,98612289 3 10 half diallel cross
9,887510598 3 11 element name type
10,98612289 3 10 cross fertilising species
15,38057204 3 14 self fertilising species
15,38057204 3 14 window right click
5,493061443 3 10 structure element name
9,887510598 3 11 type database field
13,18334746 3 12 tester line cf
0 3 29 type use description
14,28195975 3 13 null terminated string
15,38057204 3 14 single seed descent
10,98612289 3 10 weedy spp population
24,16947035 3 22 single plant selection
0 3 29 argument type use
13,18334746 3 12 cultivar line population
0 4 11 database field name table
1,386294361 4 11 element name type database
0 4 11 name type database field
0 4 10 structure element name type
0 4 11 type database field name
40,20253647 4 29 argument type use description
16,09437912 5 10 structure element name type database
17,70381704 5 11 name type database field name

17,70381704 5 11 element name type database field
17,70381704 5 11 type database field name table
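
The list above pairs each candidate term with an association score, its length in words, and its frequency in the source text. As a rough illustration of how such a ranking can be produced, the sketch below scores word n-grams by frequency-scaled pointwise mutual information; this statistic, like every name in the code, is an assumption, since the appendix does not state which measure generated these scores.

import math
import re
from collections import Counter

def ngrams(tokens, n):
    # Standard sliding-window n-gram generator.
    return zip(*(tokens[i:] for i in range(n)))

def rank_terms(text, max_n=5, min_freq=10):
    """Rank multi-word candidate terms; an illustrative stand-in for the
    statistic behind the scores listed above."""
    tokens = re.findall(r"[a-z][a-z_]*", text.lower())
    unigrams = Counter(tokens)
    total = len(tokens)
    rows = []
    for n in range(2, max_n + 1):
        for term, freq in Counter(ngrams(tokens, n)).items():
            if freq < min_freq:
                continue
            p_term = freq / (total - n + 1)
            p_indep = math.prod(unigrams[w] / total for w in term)
            score = freq * math.log(p_term / p_indep)
            rows.append((score, n, freq, " ".join(term)))
    return sorted(rows, reverse=True)

# rank_terms(open("gms_manual.txt").read()) would yield rows shaped like
# the table above: (score, words per term, frequency, term).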

APPENDIX 3 – GMS BASELINE ONTOLOGY (VERSION 1)

This version of the GMS ontology corresponds to the work done by Patrick Ward and
Mark Wilkinson; domain experts from the International Centre for Tropical Agriculture
(CIAT) worked with this version at a later stage. Protégé version 3.1 was the ontology
editor used to develop this ontology.
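
As a textual counterpart to the figures below, the following sketch shows how a small slice of this class hierarchy could be expressed programmatically. It uses the owlready2 Python library rather than the Protégé frames model actually employed, the IRI is invented, and the subclass arrangement is inferred from the figure captions, so the whole fragment should be read as an illustrative assumption.

from owlready2 import Thing, get_ontology

# Hypothetical OWL rendering of a fragment of the GMS baseline ontology
# (version 1). The original was a Protege 3.1 frames model; treat this as
# a sketch, not a faithful export.
onto = get_ontology("http://example.org/gms/v1.owl")  # invented IRI

with onto:
    class Germplasm(Thing):                      # cf. Figure 1
        pass
    class GermplasmMethod(Thing):                # cf. Figure 2
        pass
    class GermplasmIdentifier(Thing):            # cf. Figure 3
        pass
    class GenerativeMethod(GermplasmMethod):     # assumed subclass
        pass
    class DerivativeMethod(GermplasmMethod):     # assumed subclass
        pass

onto.save(file="gms_v1_sketch.owl", format="rdfxml")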

Appendix 3 - Figure 1. A portion of the first version of the GMS ontology, Germplasm.

Appendix 3 - Figure 2. The Germplasm Method section of the first version of the GMS ontology.


Appendix 3 - Figure 3. The Germplasm Identifier section of the first version of the GMS ontology.


APPENDIX 4 - GMS BASELINE ONTOLOGY (VERSION 2)

This version of the GMS ontology corresponds mostly to work done together with
domain experts from the Australian Centre for Plant Functional Genomics and the
International Centre for Tropical Agriculture (CIAT). Protégé version 3.1 was again
the ontology editor used.
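
The figures below name classes such as Germplasm Breeding Stock, Plant Breeding Method and PlantPropagationProcesses; the fragment here sketches how such classes might be linked through an object property. The property name is invented for illustration, and owlready2 again stands in for Protégé.

from owlready2 import Thing, ObjectProperty, get_ontology

# Hypothetical fragment of version 2 of the GMS ontology; the class names
# are taken from the figure captions below, the property is an assumption.
onto = get_ontology("http://example.org/gms/v2.owl")  # invented IRI

with onto:
    class GermplasmBreedingStock(Thing):     # cf. Figure 3
        pass
    class PlantBreedingMethod(Thing):        # cf. Figure 5
        pass
    class producedByMethod(ObjectProperty):  # assumed property name
        domain = [GermplasmBreedingStock]
        range = [PlantBreedingMethod]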

Appendix 4 - Figure 1. Identified properties for the RSBI ontology.

Appendix 4 - Figure 2. Genetic Constitution, as understood by the GMS ontology.

Appendix 4 - Figure 3. Germplasm Breeding Stock, a portion of the second version of the GMS
ontology.

Appendix 4 - Figure 4. Naming convention according to the second version of the GMS ontology.

Appendix 4 - Figure 5. Plant Breeding Method according to the second version of the GMS
ontology.

Appendix 4 - Figure 6. PlantPropagationProcesses according to the second version of the GMS
ontology.

Appendix 4 - Figure 7. Some of the parent classes in the RSBI ontology.


APPENDIX 5 – PROTOCOL DEFINITION FILE GENERATED BY G-PIPE


<?xml version="1.0" encoding="UTF-8" ?>
<!-- Protocol definition file generated by G-PIPE -->
<protocol>
  <annotation>A workflow for a rodent phylogeny. Exon 28 of the vWF gene (von
  Willebrand Factor) from different rodents is used for the analysis.</annotation>
  <stage>
    <annotation>Exon 28 from vWF genes is aligned using clustalw</annotation>
    <task id="1" email="a.garcia@imb.uq.edu.au">
      <annotation>progressive multiple sequence alignment</annotation>
      <transformer name="clustalw" version="1.82"
          server="http://kun.homelinux.com/cgi-bin/Pise/5.a/clustalw.pl" />
      <pipe_component>
        <parameter name="phylip_alig"> ... </parameter>
        <parameter name="gapopen"> ... </parameter>
        <parameter name="gapdist"> ... </parameter>
        <parameter name="pairgap"> ... </parameter>
        <parameter name="newtree1"> ... </parameter>
        <parameter name="newtree2"> ... </parameter>
        <parameter name="bootlabels"> ... </parameter>
        <parameter name="outfile"> ... </parameter>
        <parameter name="gde_lower"> ... </parameter>
        <parameter name="transweight"> ... </parameter>
        <parameter name="hgap"> ... </parameter>
        <parameter name="loopgap"> ... </parameter>
        <parameter name="negative"> ... </parameter>
        <parameter name="pwdnamatrix"> ... </parameter>
        <parameter name="helixgap"> ... </parameter>
        <parameter name="infile_data"> ... </parameter>
        <parameter name="outorder"> ... </parameter>
        <parameter name="maxdiv"> ... </parameter>
        <parameter name="ktuple"> ... </parameter>
        <parameter name="seqnos"> ... </parameter>
        <parameter name="strandendin"> ... </parameter>
        <parameter name="pwgapext"> ... </parameter>
        <parameter name="newtree"> ... </parameter>
        <parameter name="pwmatrix"> ... </parameter>
        <parameter name="bootstrap"> ... </parameter>
        <parameter name="nosecstr1"> ... </parameter>
        <parameter name="nosecstr2"> ... </parameter>
        <parameter name="window"> ... </parameter>
        <parameter name="tossgaps"> ... </parameter>
        <parameter name="strandgap"> ... </parameter>
        <parameter name="helixendin"> ... </parameter>
        <parameter name="helixendout"> ... </parameter>
        <parameter name="terminalgap"> ... </parameter>
        <parameter name="hgapresidues"> ... </parameter>
        <parameter name="strandendout"> ... </parameter>
        <parameter name="secstrout"> ... </parameter>
        <parameter name="pwgapopen"> ... </parameter>
        <parameter name="pgap"> ... </parameter>
        <parameter name="actions"> ... </parameter>
        <parameter name="endgaps"> ... </parameter>
        <parameter name="output"> ... </parameter>
        <parameter name="seed"> ... </parameter>
        <parameter name="gapext"> ... </parameter>
        <parameter name="matrix"> ... </parameter>
        <parameter name="kimura"> ... </parameter>
        <parameter name="dnamatrix"> ... </parameter>
        <parameter name="quicktree"> ... </parameter>
        <parameter name="outputtree"> ... </parameter>
        <parameter name="topdiags"> ... </parameter>
      </pipe_component>
    </task>
  </stage>
  <stage>
    <annotation>The alignment result from the previous step is used to build two
    phylogenies using two different methods from the Phylip package</annotation>
    <task id="2" email="a.garcia@imb.uq.edu.au">
      <annotation>Parsimony method</annotation>
      <transformer name="dnapars" version="3.6a2"
          server="http://kun.homelinux.com/cgi-bin/Pise/5.a/dnapars.pl" />
      <pipe_component>
        <parameter name="print_steps"> ... </parameter>
        <parameter name="print_treefile"> ... </parameter>
        <parameter name="use_threshold"> ... </parameter>
        <parameter name="indent_tree"> ... </parameter>
        <parameter name="outgroup"> ... </parameter>
        <parameter name="printdata"> ... </parameter>
        <parameter name="print_tree"> ... </parameter>
        <parameter name="threshold"> ... </parameter>
        <parameter name="use_transversion"> ... </parameter>
        <parameter name="replicates"> ... </parameter>
        <parameter name="seqboot_seed"> ... </parameter>
        <parameter name="method"> ... </parameter>
        <parameter name="print_sequences"> ... </parameter>
        <parameter name="jumble"> ... </parameter>
        <parameter name="user_tree"> ... </parameter>
        <parameter name="weights"> ... </parameter>
        <parameter name="seqboot"> ... </parameter>
        <parameter name="consense"> ... </parameter>
        <parameter name="times"> ... </parameter>
        <parameter name="jumble_seed"> ... </parameter>
      </pipe_component>
    </task>
    <task id="3" email="a.garcia@imb.uq.edu.au">
      <annotation>Distance method</annotation>
      <transformer name="dnadist" version="3.6a2"
          server="http://gpipe.majorlinux.com/cgi-bin/Pise/5.a/dnadist.pl" />
      <pipe_component>
        <parameter name="matrix_form"> ... </parameter>
        <parameter name="ratio"> ... </parameter>
        <parameter name="printdata"> ... </parameter>
        <parameter name="replicates"> ... </parameter>
        <parameter name="gamma"> ... </parameter>
        <parameter name="seqboot_seed"> ... </parameter>
        <parameter name="method"> ... </parameter>
        <parameter name="distance"> ... </parameter>
        <parameter name="weights"> ... </parameter>
        <parameter name="one_category"> ... </parameter>
        <parameter name="seqboot"> ... </parameter>
        <parameter name="empirical_frequencies"> ... </parameter>
      </pipe_component>
    </task>
  </stage>
</protocol>
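
A protocol file of this shape can be inspected with a few lines of Python. The sketch below is not part of G-PIPE: the file name protocol.xml is assumed, and because the parameter values are shown collapsed in the listing above, only parameter names are reported.

import xml.etree.ElementTree as ET

# Walk a G-PIPE protocol definition file of the shape shown above and
# report each stage, task and transformer; illustrative sketch only.
tree = ET.parse("protocol.xml")  # assumed file name
for stage in tree.getroot().findall("stage"):
    print("stage:", " ".join(stage.findtext("annotation", "").split()))
    for task in stage.findall("task"):
        transformer = task.find("transformer")
        if transformer is None:
            continue
        print("  task %s: %s %s via %s" % (task.get("id"),
              transformer.get("name"), transformer.get("version"),
              transformer.get("server")))
        names = [p.get("name") for p in task.findall("pipe_component/parameter")]
        print("    %d parameters, e.g. %s" % (len(names), ", ".join(names[:4])))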


INDEX

A
Acronyms 241
Activity 54, 237
Ad hoc developments 237
Application ontologies 35, 237

B
BLAST 162, 241
BRENDA 180, 192, 241
BSML 241

C
CODATA 158, 241
Communities of practice xviii, 237
Competency questions 86, 93, 97, 132, 237
Concept maps 237
Control xxviii, 60, 62
CPL 160, 172, 241

D
Documentation processes 60
Domain analysis 64, 90, 94, 101, 112, 238
Domain expert 50, 57, 64, 65, 87, 90, 93, 94, 97, 100, 110, 125, 133, 140, 226, 238, 259
Domain ontologies 35, 238
DTD 241

E
EMBOSS 175, 176, 178, 179, 191, 196, 199, 205, 211, 215, 240, 241
evolution xi, 44, 45, 49, 50, 56, 57, 72, 74, 75, 87, 88, 116, 119, 120, 121, 155, 158, 166, 186, 221, 224

F
Feasibility study and milestones 63

G
GBIF 241
GCG 175, 176, 177, 178, 191, 199, 211, 215, 238, 240
GO xvi, 31, 32, 71, 170, 171, 190, 219, 231, 233, 234, 241
G-PIPE xxvi, xxvii, 177, 178, 194, 195, 197, 198, 204, 205, 206, 207, 213, 214, 241
GUI 158, 159, 163, 164, 175, 176, 177, 178, 179, 186, 193, 195, 197, 199, 204, 211, 214, 241

H
HTML 99, 176, 177, 178, 191, 206, 241
HUSAR 178, 240, 241, 242

I
ICIS xxvii, 130, 133, 136, 140, 241
IEEE 41, 52, 54, 58, 59, 72, 73, 74, 77, 78, 106, 188, 190, 235, 241
Inbound-interaction 62

J
Jemboss 176, 179, 241

K
KAON 135, 142, 144, 247
KEGG 180, 192, 241
Knowledge xviii, xxvii, xxviii, 33, 34, 36, 42, 43, 46, 47, 51, 52, 53, 63, 71, 76, 77, 78, 87, 106, 107, 108, 116, 119, 127, 131, 142, 143, 151, 172, 189, 190, 235, 238
Knowledge acquisition 63, 71, 77, 190
Knowledge elicitation 36, 42, 43, 46, 47, 63, 119, 238

L
Life Cycle 77, 238

M
MAGE xxix, 158, 188, 220, 227, 241
MAGPIE 158, 188
Mailing lists 119
Management processes 62
Method 76, 239, 260, 265
MGED xvi, xix, xx, xxviii, xxix, 77, 80, 92, 93, 94, 105, 108, 125, 129, 145, 146, 147, 148, 150, 158, 193, 220, 227, 231, 235, 238, 241
MIAME xxi, xxix, 91, 145, 146, 150, 158, 241
MO xvi, xix, 49, 61, 94, 125, 220, 231, 232, 241
MOBY 164, 166, 175, 239

O
On-the-Ontology comments 62
Ontology 239
OQL 160, 242
Outbound-interaction 63

P
PATH 178, 191, 201, 203, 242
PISE xxvii, 193, 195, 196, 204, 205, 206, 242
PO 32, 52, 218, 227, 242
PRECIS 158, 188
Process 41, 78, 176, 239
Protégé xxvi, xxvii, 81, 95, 96, 99, 100, 101, 104, 110, 111, 113, 114, 115, 116, 118, 120, 127, 132, 133, 142, 147, 234, 239, 243, 259, 262
PSI 147, 151, 158, 242

R
Relevant scenarios 239
RSBI xxi, xxv, xxvii, 80, 92, 93, 99, 105, 108, 129, 145, 146, 147, 148, 149, 150, 193, 242, 243, 244, 245, 263, 266

S
Scheduling 60, 61, 62
SOAP 164, 179, 242
SQL 160, 162, 164, 175, 185, 202, 242
SRS 158, 160, 163, 164, 165, 178, 179, 183, 186, 188, 242
SW 44, 49, 50, 57, 73, 75, 87, 88, 119, 222, 239, 242

T
TAMBIS 171, 172, 183, 190, 221, 227, 242
Task 35, 178, 239, 242
Task ontologies 35, 239
TAVERNA 197, 204, 211, 240
Technique 240
Terminology extraction 85, 86, 132, 240
Text mining 240
Text2ONTO 134, 240
The Bernaras methodology 36, 40
The DILIGENT methodology 36
The Enterprise Methodology 36
The METHONTOLOGY methodology 36, 41

U
UNIX 175, 240

W
W2H 176, 177, 178, 183, 191, 196, 215, 240
W3H 176, 177, 178, 179, 196, 240
wiki pages 61, 62
WIT 180, 192
Workflow 177, 196, 205, 240

X
XML 99, 154, 156, 162, 164, 173, 174, 176, 177, 178, 183, 185, 190, 195, 202, 204, 205, 206, 212, 214, 242
XPath 173, 242
XQL 173, 202, 242
