Enhanced Ontological Searching of Medical Scientific Information

University of Manchester School of Computer Science Degree Programme of Advanced Computer Science
Enhanced Ontological Searching of Medical Scientific Information

Christos Karaiskos
A dissertation submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences
Master's Thesis 2013
Contents

Abstract
Declaration
Intellectual Property Statement
Acknowledgements
List of Abbreviations
List of Tables
List of Figures

1 Introduction
  1.1 Problem Context
  1.2 Motivation
  1.3 Contribution
  1.4 Thesis Organization

2 Ontologies
  2.1 Modern Ontology Definition
  2.2 Ontology vs. Terminology
  2.3 Notable Biomedical Ontologies and Terminologies
    2.3.1 SNOMED CT
    2.3.2 NDF-RT
    2.3.3 ICD-10
    2.3.4 MedDRA
    2.3.5 NCI Thesaurus

3 Similarity Metrics
  3.1 Similarity Metric vs. Distance Metric
  3.2 Lexical Similarity
    3.2.1 Character-based Similarity Measures
      Longest Common Substring
      Hamming Similarity
      Levenshtein Similarity
      Jaro Similarity
      Jaro-Winkler Similarity
      N-gram Similarity
    3.2.2 Word-based Similarity Measures
      Dice Similarity
      Jaccard Similarity
      Cosine Similarity
      Manhattan Similarity
      Euclidean Similarity
  3.3 Ontological Semantic Similarity
    3.3.1 Intra-ontology Semantic Similarity
      Distance-based Metrics
      Information-Based Metrics
      Feature-Based Measures
    3.3.2 Inter-ontology Semantic Similarity

4 Search Interfaces
  4.1 Information Seeking Models
  4.2 Query Specification
  4.3 Presentation of Search Results
  4.4 Query Reformulation

5 Requirements
  5.1 Feature Specification

6 Design
  6.1 Stage I: Access to Medical Ontologies
    6.1.1 Database and Table Creation
    6.1.2 Populating the Database Tables
  6.2 Stage II: Computation of Semantic Similarity
    6.2.1 Term Neighborhoods
    6.2.2 Semantic Similarity Calculation
  6.3 Stage III: Interface Design Data Presentation
  6.4 Summary of Technology Choices

7 Implementation
  7.1 Structure
  7.2 Search Entry Form
  7.3 Handling the Input Query
    7.3.1 Typing Speed
    7.3.2 Querying the Database
    7.3.3 Ranking and Grouping of Search Results
    7.3.4 Return-key or Mouse-click Search
    7.3.5 Auto-completion Search
  7.4 Error Correction
  7.5 Term Information Presentation
  7.6 Navigation

8 Evaluation
  8.1 Testing the Failed Queries
  8.2 Comparison to BioPortal Search Services
    8.2.1 Auto-completion
    8.2.2 Results Ranking
    8.2.3 Error Correction
    8.2.4 Visualization
  8.3 Comments from an AstraZeneca Search Specialist

9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work

Bibliography

Number of Words in the Document: 25648
ABSTRACT OF MASTER'S THESIS

Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Supervisors: Prof. Andrew Brass (University of Manchester), Dr. Jennifer Bradford (AstraZeneca)

Abstract: An enormous amount of biomedical knowledge is encoded in narrative textual format. In an attempt to discover new or hidden knowledge, extensive research is being conducted to extract and exploit term relationships from plain text, with the aid of technology. A common approach for the identification of biomedical entities in plain text involves usage of ontologies, i.e., knowledge bases which provide formal machine-understandable representations of domains of variable specificity. In addition to term extraction, ontologies may be used as controlled vocabularies or as a means for automatic knowledge acquisition through their inherent inference capabilities. Visualization of the content of ontologies is, thus, very important for researchers in the biomedical domain. Unfortunately, many of these researchers find it difficult to deal with formal logic and would prefer that ontology search interfaces completely hide any structural or functional references to ontologies. This thesis proposes a strategy for building a web-based ontology search application that exploits ontologies behind the scenes, transparently to the end user, and presents relevant concept information in such a way that searchers can successfully and quickly find what they are looking for. The proposed search interface features various search tools for enhanced ontological searching, including term auto-completion, error correction, clever results ranking, and similar term visualizations based on semantic similarity metrics. Evaluation of the developed application shows that its features can improve enterprise-strength ontology search applications, such as BioPortal.

Keywords: search interface design, ontology hiding, biomedical ontology, semantic similarity, usability, data integration
Declaration
No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.
Intellectual Property Statement

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the "Copyright") and he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the "Intellectual Property") and any reproductions of copyright works in the dissertation, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's Guidance for the Presentation of Dissertations.
Acknowledgements
I am deeply grateful to my supervisors, Prof. Andrew Brass (University of Manchester) and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and support throughout the duration of this project. I have greatly benefited from experiencing the different perspectives of academia and industry, which have both contributed to shaping the final outcome of this project. I would like to thank Sebastian Philipp Brandt (University of Manchester), for his suggestions on making the search application even better. Also, I would like to express my gratitude to Julie Mitchell (AstraZeneca), for taking the time to evaluate the application, and Paul Metcalfe (AstraZeneca), for his advice on improving the performance and security of the application. Finally, I would like to thank Matina for her patience and love, and my parents, Ioannis and Stavroula, for always being there.
List of Abbreviations
AI         Artificial Intelligence
AJAX       Asynchronous JavaScript and XML
API        Application Programming Interface
CSS        Cascading Style Sheets
DAG        Directed Acyclic Graph
HLGT       High Level Group Term
HLT        High Level Term
HTTP       Hypertext Transfer Protocol
IC         Information Content
ICD        International Classification of Diseases
JDBC       Java Database Connectivity
JSON       JavaScript Object Notation
LCS        Least Common Subsumer
MedDRA     Medical Dictionary for Regulatory Activities
NCIT       National Cancer Institute Thesaurus
NDF-RT     National Drug File Reference Terminology
NHS        UK National Health Service
NLP        Natural Language Processing
OBO        Open Biomedical Ontologies
OWL        Web Ontology Language
PHP        PHP Hypertext Preprocessor
PT         Preferred Term
RDF        Resource Description Framework
RDF-S      Resource Description Framework Schema
REST       Representational State Transfer
RF2        Release Format 2
SNOMED CT  Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT  Systematized Nomenclature of Medicine Reference Terminology
SOC        System Organ Class
UMLS       Unified Medical Language System
URI        Uniform Resource Identifier
URL        Uniform Resource Locator
UX         User Experience
VA         U.S. Department of Veterans Affairs
WHO        World Health Organization
XHTML      Extensible HyperText Markup Language
XML        Extensible Markup Language
List of Tables
5.1 Documented failed queries and suggested reasons for failure.
5.2 Documented failed queries and suggested reasons for failure (cont.).
6.1 Ontologies database table structure.
6.2 Examples of URI formats for BioPortal RESTful services.
6.3 Technology choices for the project.
7.1 PHP files used in the search application.
7.2 XHTML files used in the search application.
7.3 CSS files used in the search application.
7.4 JavaScript files used in the search application.
8.1 Testing previously failed queries.
List of Figures
2.1 The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
4.1 The Google search engine entry form.
4.2 Facebook uses grayed-out descriptive text to help in the formulation of user queries.
4.3 Bing's search interface features a powerful dynamic search suggestion, where prefixes are highlighted with grayed-out font and the remaining text is in bold.
4.4 The Safari browser's embedded search interface explicitly states which queries are suggestions and which belong to the user's recent search history.
4.5 The Firefox browser's embedded search interface contains recent queries on top, and separates them from suggestions using a solid line.
4.6 Google's search results page is a typical scrollable vertical list of captions. Metadata facets, that restrain results to a particular type of information, are also present in the interface (e.g. "Images" tab).
4.7 Amazon's search interface provides facets as a left panel to the results page, helping the user dynamically refine the initial search.
4.8 PubMed's results page includes term expansion in two ways. On the right of the screen, there is a "Related searches" panel that preserves the initial query and adds a new related term to it. Also, right below the entry form there is a "See also" feature which suggests complete or partial modifications in the initial query.
6.1 A part of the XML response for the "get all terms" query of Table 6.2.
6.2 The provided methods of the ontoCAT API Adamusiak et al. (2011).
6.3 Populating the Ontologies database is performed with the help of the ontoCAT API.
7.1 The organization of the files that comprise the web application. These files are responsible for the presentation, styling and interactive behavior of the web application.
7.2 The main window of the search application. The search box is placed at the top of the screen, with central horizontal alignment. A submit button labeled "Search" is also provided, to assist users that prefer mouse-clicking.
7.3 Once the user clicks inside the search box, the grey help message disappears and a blinking cursor takes its place.
7.4 Terms, that would appear on their own table row, are grouped under a more lexically-matching term to the query, when their semantic similarity to that term is higher than a threshold.
7.5 Pressing the Return key or clicking the Search button submits the query to index.php and a table of search results is added to the interface.
7.6 Part of the JSON response from performQuery.php, for the input query "rash". Each JSON object represents a term matching the query, and contains information that can be used for its presentation.
7.7 Pressing any other key except Return submits the query through AJAX to performQuery.php and an auto-completion pop-up menu is created from the JSON response.
7.8 Error correction when input query is "lyng". The closest term is suggested, as a clickable link.
7.9 When the user places the mouse cursor on a circle, a tooltip immediately appears, containing the full term name and the semantic similarity score with the viewed term.
7.10 Presentation page for the NCIT term "Recurrent NSCLC". On the left side, the basic term information is shown, along with an XML representation of highly similar terms. On the right side, a visualization of highly similar terms is provided, using the D3 JavaScript library.
7.11 Presentation page for the MedDRA term "Rash". The term has very close relations with terms that are not in the hierarchy. This is illustrated using blue color.
7.12 The XML representation of a term. It includes basic term information and highly similar terms.
7.13 Help is provided through tooltips that activate on mouse-over.
8.1 The term "DIHS" is not found, but this is normal, since it is not part of any of the supported ontologies. Instead, the term "DIOS" is proposed, in case the user had misspelt the query.
8.2 The term "NMDA Antagonist" is not found, but this is normal, since it is not part of any of the supported ontologies. No soundex match is found, so no error corrections are suggested.
8.3 The term "Hepatotoxicity" is shown in the auto-completion dialogue.
8.4 The term "NSCLC" is shown in the auto-completion dialogue.
8.5 The term "DRESS syndrome" is shown in the auto-completion dialogue.
8.6 The query "LHRH" produces two different 100%-matching results. Unlike in the previous search application, the user can now see that "Gonadotropin Releasing Hormone" is a preferred term for "LHRH".
8.7 The results for the query "VEGFR" illustrate a semantic grouping of 4 similar terms, namely "VEGFR", "Vascular Endothelial Growth Factor Receptor 1", "Vascular Endothelial Growth Factor Receptor 2" and "Vascular Endothelial Growth Factor Receptor 3". The latter three are grouped under the parent term.
8.8 The BioPortal interface is a simple text box, similar to this project's main page.
8.9 BioPortal also offers advanced options to improve the search results.
8.10 Only NCIT, MedDRA and ICD9CM are chosen for searching, out of the 353 ontologies offered by BioPortal, so that comparisons to this project's work are achievable.
8.11 Auto-completion pop-up menu of BioPortal NCIT widget when the user has typed "nsc". Only preferred terms are shown. The user might be confused when seeing the term "Becatecarin" in the results, since it does not contain "nsc".
8.12 Auto-completion pop-up menu of this project's search application when the user has typed "nsc".
8.13 Searching for "Denatonium Benzoate" through its preferred term name.
8.14 Searching for "Denatonium Benzoate" through its synonym "THS839".
8.15 Searching for "Denatonium Benzoate" through its synonym "WIN 16568".
8.16 BioPortal search results rankings for "nsclc". All terms are grouped according to the ontology they belong to, under the preferred name of the most lexically-relevant term to the query.
8.17 This project's search results rankings for "nsclc". Terms in the results are rearranged into groups that show high semantic similarity.
8.18 BioPortal returns no search results for the erroneously spelt term "nsclca".
8.19 BioPortal returns no search results for the erroneously spelt term "caancer".
8.20 This project's search application returns a search suggestion of "nsclc" for the erroneously spelt term "nsclca".
8.21 This project's search application returns a search suggestion of "cancer" for the erroneously spelt term "caancer".
8.22 BioPortal uses a graph to visualize hierarchical relations. Edges are annotated with a description of the relationship between the connected nodes (e.g. subclassOf).
8.23 This project's application focuses on inexperienced users and attempts to completely hide any formal-logic relationships that might confuse the user.
8.24 Search results depicting causal associations between smoking and cancer, as presented by the I2E text mining application.
8.25 Search results for the term "MEK inhibitor" in NCIT, when the I2E application is used.
Chapter 1 Introduction
Ontologies are knowledge bases which provide formal machine-understandable representations of domains of variable specificity. Given a domain of discourse, concepts that belong to the domain are well documented in formal logic, along with their inter-relations. Ontologies, as representations, cannot perfectly capture the part of the world that they attempt to describe Davis et al. (1993). They are based on the open world assumption, which states that if something is not represented in a knowledge base, it does not mean that it does not exist in the real world Hustadt et al. (1994). As our knowledge about a domain increases, ontologies are updated and they become more complex. This has become evident in the biomedical domain, where ontologies have already attained a high degree of specificity, and has led to their quick adoption for data integration and knowledge discovery purposes.
1.1 Problem Context
Within biomedicine, ontologies can help researchers communicate, by promoting consistent use of biomedical terms and concepts. The construction of an ontology itself involves mediating across multiple views and requires that a number of domain experts reach a consensus that reflects the diverse viewpoints of the community. Ontologies are viewed as tools that provide opportunities for new knowledge acquisition, due to the complex semantic relations that they model. Inferences in a huge ontology may reveal connections that the human eye would bypass. This is especially important in the pharmaceutical sector, where drug discovery has slowed down significantly as a process, and in the biological sector, where attempts to demystify genome patterns associated with disease are still at an initial stage. Another common use for ontologies in the biomedical domain is as controlled vocabularies that feed filtered terms into computer applications. Finally, ontologies may be used to connect terms found in plain text to their semantic representations. Term extraction with the help of ontologies is a hot topic in biomedicine, due to the vast amounts of medical information stored in plain text. Due to the importance of ontologies, it is usual for researchers in the biomedical field to require access to their content.
1.2 Motivation
In the past, AstraZeneca employees were provided with a web-based search form that enabled them to look for concepts in one or more biomedical ontologies and select the most suitable from a list of search results. The chosen concepts were, in turn, conveyed to a text mining application. Understanding the results required the user to be familiar with the content and structure of the ontology from which the terms were retrieved. Unfortunately, most users did not feel comfortable with the idea of ontologies and struggled, or even refused, to use the provided interfaces, even though no logic-based content was there to confuse them.

In many cases, though, this was not solely the fault of the users. The interface gave the users freedom to select the ontologies to be searched for the specified query. Inexperienced users usually did not know or care about which ontology contained the desired query term. For example, a user wished to search for "Non-small cell lung carcinoma" by its abbreviation, "NSCLC". Querying "NSCLC" in the MedDRA terminology¹ returned no results, since the concept is not present in the terminology. Although this behavior is correct, it seems wrong to the inexperienced user and may lead to loss of trust in the system. But even if the term is present in the ontology, the user should not be forced to know its exact spelling. For example, querying for "NSCLC" in NCIT also returned no results, despite the fact that the actual concept exists in the ontology. The searcher needed to know that the preferred term for the NSCLC concept is "Non-small cell lung carcinoma". Abbreviations and dissimilar synonyms are common in the biomedical field, so expecting the user to know the preferred term for each concept is considered problematic.

In addition to the above, presentation of results was not always straightforward. Terms that demonstrate a strong semantic relation to each other were presented as stand-alone terms in the search results, subconsciously misleading users to deduce that the terms were independent. It was up to the user to judge the relevance of results to the query. For example, the results for "Non-small cell lung carcinoma" in NCIT included, among others, the terms "Non-small cell lung carcinoma" and "Stage I non-small cell lung carcinoma" equally spaced, in a way that users could not infer the connection between them. In fact, the latter term is a specialization of the former. In practice, users chose all terms, even though they were looking for the broad term, because they became confused and did not want to take the risk of selecting only one.

This collapse at the human-computer interface has motivated AstraZeneca to try to build tools that take advantage of the ontology structure and, at the same time, completely hide it from the user in order to facilitate the search procedure.
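The NSCLC failure described in this section can be illustrated with a minimal Python sketch. It assumes a search index that maps normalized names to concepts; the record layout, the accession code `C0001` and the function names are hypothetical and are not taken from the original search form or from NCIT.

```python
from collections import defaultdict

# Hypothetical miniature concept records. The accession code and the
# synonym list are illustrative placeholders, not real ontology content.
CONCEPTS = [
    {
        "code": "C0001",
        "preferred": "Non-small cell lung carcinoma",
        "synonyms": ["NSCLC"],
    },
]


def normalize(text):
    """Lower-case and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(text.lower().split())


def build_index(concepts, include_synonyms):
    """Map each normalized name to the codes of the concepts it denotes."""
    index = defaultdict(set)
    for concept in concepts:
        names = [concept["preferred"]]
        if include_synonyms:
            names += concept["synonyms"]
        for name in names:
            index[normalize(name)].add(concept["code"])
    return index


def search(index, query, concepts):
    """Return the preferred terms of all concepts matching the query exactly."""
    by_code = {c["code"]: c for c in concepts}
    codes = sorted(index.get(normalize(query), ()))
    return [by_code[code]["preferred"] for code in codes]


preferred_only = build_index(CONCEPTS, include_synonyms=False)
with_synonyms = build_index(CONCEPTS, include_synonyms=True)
print(search(preferred_only, "NSCLC", CONCEPTS))
print(search(with_synonyms, "NSCLC", CONCEPTS))
```

With the preferred-term-only index the abbreviation finds nothing, which mirrors the behavior that confused users. Indexing synonyms alongside preferred terms lets the same query reach the concept without the user knowing its preferred name.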
1.3 Contribution
The outcome of this thesis is the development of a user-friendly search application that allows users to find information about concepts present in a medical ontology, without requiring them to understand the underlying structure of the ontology. Information about a concept includes its accession code within the given ontology, the term for its preferred name, its definition and all available synonym terms. In order to facilitate the search procedure and enhance User Experience (UX), the search application includes features such as dynamic term suggestion, spelling correction and similar term visualization tools.

The main challenge lies in the presentation of results; as stated in section 1.2, users are usually not sure about which term(s) to choose when multiple similarly-spelt terms appear. Ranking of terms is performed with the aid of both lexical and semantic similarity. The former screens those terms that best match the user query and ranks them according to a string relevance metric. These results are processed by the latter, so that terms showing a strong semantic connection are grouped together.

Ideally, the search application should bridge across terms from multiple ontologies. Due to the diversity in the format and annotation of different ontologies, this is not a straightforward generalization. Most importantly, within the biomedical community, the term ontology is often used erroneously to describe plain terminologies that, in fact, violate basic ontological principles.² Therefore, ontology-specific difficulties are expected to arise, if semantic similarity measures are to be deployed.

In summary, the goals of this thesis are the following:

1. To develop user-friendly search tools that allow users to build search queries based on the terms present in a medical ontology, without need for the users to understand the actual structure of the ontology.
2. To exploit the semantic annotations of the underlying ontology in order to enhance the quality and presentation of results.
3. To intermix results originating from different ontologies.

¹ The difference between terminology and ontology is described in Section 2.2.
² In MedDRA, the synonym of a term may be a child node of the term itself.
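The lexical-then-semantic ranking outlined in this section can be sketched as follows. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for the string relevance metric, the toy `PARENT` hierarchy and the fixed 0.8 parent/child score stand in for a real semantic similarity measure, and both thresholds are arbitrary. None of these choices are claimed to be the ones used in the application.

```python
from difflib import SequenceMatcher

# Toy is-a hierarchy (child -> parent), invented for illustration only.
PARENT = {
    "Stage I non-small cell lung carcinoma": "Non-small cell lung carcinoma",
    "Non-small cell lung carcinoma": "Lung carcinoma",
    "Small cell lung carcinoma": "Lung carcinoma",
}


def lexical_score(query, term):
    """Stand-in string relevance metric in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, query.lower(), term.lower()).ratio()


def semantic_similarity(a, b):
    """Placeholder measure: 1.0 if identical, 0.8 if direct parent/child."""
    if a == b:
        return 1.0
    if PARENT.get(a) == b or PARENT.get(b) == a:
        return 0.8
    return 0.0


def rank_and_group(query, terms, lexical_cutoff=0.5, semantic_threshold=0.7):
    """Screen and rank terms lexically, then group semantically close ones.

    Returns a list of (head_term, grouped_terms) pairs in rank order.
    """
    scored = [(lexical_score(query, term), term) for term in terms]
    ranked = [
        term
        for score, term in sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        if score >= lexical_cutoff
    ]
    groups = []
    for term in ranked:
        for head, members in groups:
            if semantic_similarity(head, term) >= semantic_threshold:
                members.append(term)  # show under the better-ranked head
                break
        else:
            groups.append((term, []))  # no close head found: new result row
    return groups


results = rank_and_group(
    "Non-small cell lung carcinoma",
    [
        "Stage I non-small cell lung carcinoma",
        "Non-small cell lung carcinoma",
        "Small cell lung carcinoma",
        "Rash",
    ],
)
for head, members in results:
    print(head, members)
```

In this toy run the staged term is placed under the broad term instead of appearing as an independent row, which is the presentation problem raised in section 1.2.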
1.4 Thesis Organization
The thesis is organized in a total of 9 chapters. Chapter 2 includes an introduction to ontologies and a brief description of some notable biomedical ontologies. Chapter 3 presents the background needed for understanding the different measures of lexical and semantic similarity. Chapter 4 discusses interface design principles for user-centered search applications. In chapter 5, the requirements and feature specifications for the final search application are addressed. Chapter 6 describes the design considerations that were taken into account for the ontological search application, while chapter 7 presents the final implementation. Chapter 8 includes the evaluation of the search application. Finally, conclusions are drawn in chapter 9, along with possible future directions.
Chapter 2 Ontologies
The term ontology is an uncountable noun coined in the philosophical field, by ancient Greek philosophers Guarino (1998). It involves the study of the nature of existence, at a fairly abstract level. In the world of computer science, the word ontology refers to the encoding of human knowledge in a format that allows for computational use. This chapter includes an introduction to the modern definition of ontology, along with a brief description of some of the most notable biomedical ontologies.
2.1 Modern Ontology Definition
In Artificial Intelligence (AI), an ontology is commonly defined as a specification of a (shared) conceptualization Gruber et al. (1995). A conceptualization refers to an individual's knowledge about a specific domain, acquired through experience, observation or introspection Huang et al. (2010). Ontologies are shared conceptualizations, meaning that multiple participants, usually domain experts, contribute to their construction, maintenance and expansion. Conflicts are certain to arise among the different participants, so an important aspect of ontology design is to reconcile multiple views of the desired domain into a single concrete representation. A specification, in turn, is a transformation of this shared conceptualization into a formal representation language.

The outcome of a formal representation of a domain is a collection of entities, expressions and axioms. Entities include: concepts or classes, which are sets of individuals (e.g., Country, which contains all countries); individuals, which are specific instances of classes (e.g., Greece as an instance of Country); data types (e.g., string, integer); literals, which are specific values of a given data type (e.g., 1, 2, 3, or string values); and properties (e.g., hasDisease, hasAge). Expressions refer to descriptions of entities in a formal representation language. The standardized family of languages for formal ontology representation is the Web Ontology Language (OWL), which builds on the Extensible Markup Language (XML), Resource Description Framework (RDF) and RDF-Schema (RDFS) standards to provide a highly expressive means for representing knowledge McGuinness et al. (2004). The underlying format of the resulting OWL document can vary among several types, with the most common being RDF/XML. Finally, axioms relate entities and expressions to one another. This connection can be made class-to-class (e.g., SubClassOf), individual-to-class (e.g., ClassAssertion) or property-to-property (e.g., SubPropertyOf), among others. These relations can be asserted explicitly or inferred by a reasoner. Inferences are based on the logical relations between concepts. As an example of a simple inference, a concept's ancestors can be inferred automatically once its parent concept is specified.

An ontology may be visualized as a graph, in which concepts are nodes and relations are edges between nodes. Furthermore, if transitive hierarchical relations are isolated (e.g., subsumption, also known as the is-a relation or hyponymy), the ontology can be viewed as a taxonomy. The geometrical visualization of an ontology will be presented in more detail in Chapter 3.
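The simple inference mentioned above, deriving all ancestors of a concept from its asserted parents, can be sketched in a few lines of Python. This is only an illustration of following transitive is-a edges, not the behaviour of any particular OWL reasoner; the concept names and the IS_A table are hypothetical.

```python
# Hypothetical asserted is-a edges: concept -> list of direct parents.
IS_A = {
    "viral_pneumonia": ["pneumonia", "viral_infection"],
    "pneumonia": ["lung_disease"],
    "viral_infection": ["infection"],
    "lung_disease": ["disease"],
    "infection": ["disease"],
    "disease": [],
}


def ancestors(concept, parents):
    """Return every concept reachable from `concept` by following is-a edges."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            # Transitivity: the parents of a parent are ancestors as well.
            stack.extend(parents.get(node, ()))
    return seen


inferred = ancestors("viral_pneumonia", IS_A)
```

Only the two direct parents of viral_pneumonia are asserted; the remaining ancestors are obtained by transitivity.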
2.2
Ontology vs. Terminology
A terminology is a collection of term names that are associated with a given domain. A term is a mapping of a concrete concept to natural language. This term-to-concept mapping is usually not one-to-one, especially in the biomedical domain, where term variation and term ambiguity arise Ananiadou and McNaught (2006). Term variation is a result of the richness of natural language and refers to the existence of multiple terms for the description of the same concept. For example, the terms Transmembrane 4 Superfamily Member 1, TM4SF1t and L6 Antigen all point to the same protein. Term ambiguity occurs when a term is mapped to more than one distinct concept. This is common when new abbreviations are introduced Liu et al. (2002). As an example, some of the concepts that the acronym CTX may map to are Cardiac Transplantation, Clinical Trial Exemption and Conotoxin. Their disambiguation is a matter of context.

A terminology is not constrained to being a simple list of terms. In fact, most terminologies feature some kind of structure, where terms that map to the same concept are grouped together and semantic relationships between concepts are explicitly or implicitly stated. Semantic relationships between terms include synonymy and antonymy, while semantic relationships between concepts include hyponymy, hypernymy, meronymy and holonymy Jurafsky and Martin (2000). Synonymy exists when two terms are interchangeable, while antonymy denotes that two terms have opposite meanings. Hyponymy introduces a parent-child, or is-a, relation between concepts. A concept is a hyponym of another concept if the former derives from the latter and represents a more granular concept. Hyponymy is transitive; if concept a is a hyponym of concept b, and concept b is a hyponym of concept c, then a is also a hyponym of c. Hypernymy is the reverse relation of hyponymy. Meronymy exists when a concept represents a part of another concept. Holonymy is the opposite relation, where a concept has some other concept(s) as its parts.

The difference between a terminology and an ontology is not always clear, as terminologies continue to improve their state of organization in a way that resembles ontologies. The initial scope and aim of the two, though, is clearly different; the purpose of a terminology was initially, as the name implies, to collect all terms associated with a specified domain. On the other hand, the target of an ontology has, from the start, been to provide a machine-readable specification of a shared conceptualization. Despite their many common characteristics, terminologies are not necessarily ontologies. If treated as ontologies, they may lead to inconsistencies or incorrect inferences Ananiadou and McNaught (2006). An illustrative example is the case of MedDRA, which will be discussed in Section 2.3.4.
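Term variation and term ambiguity, as described earlier in this section, can be illustrated with a small term-to-concept table. The sketch below is a minimal Python example; the terms are taken from the text, while the concept identifiers (C_TM4SF1 and so on) are hypothetical placeholders.

```python
from collections import defaultdict

# Terms come from the examples in the text; concept identifiers are hypothetical.
TERM_TO_CONCEPTS = {
    "Transmembrane 4 Superfamily Member 1": {"C_TM4SF1"},
    "L6 Antigen": {"C_TM4SF1"},
    "CTX": {"C_CARDIAC_TRANSPLANTATION", "C_CLINICAL_TRIAL_EXEMPTION", "C_CONOTOXIN"},
    "Conotoxin": {"C_CONOTOXIN"},
}


def ambiguous_terms(mapping):
    """Terms that map to more than one concept (term ambiguity)."""
    return {term for term, concepts in mapping.items() if len(concepts) > 1}


def term_variants(mapping):
    """Concepts that are named by more than one term (term variation)."""
    by_concept = defaultdict(set)
    for term, concepts in mapping.items():
        for concept in concepts:
            by_concept[concept].add(term)
    return {c: terms for c, terms in by_concept.items() if len(terms) > 1}
```

In this table, CTX is the only ambiguous term, while the protein concept and the Conotoxin concept each exhibit term variation.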
2.3
Notable Biomedical Ontologies and Terminologies
Hundreds of biomedical ontologies and terminologies have been published online. According to BioPortal¹ statistics, the five most viewed ontologies or terminologies are SNOMED Clinical Terms, National Drug File, International Classification of Diseases, MedDRA and NCI Thesaurus. This section gives a brief introduction to each of them.
2.3.1
SNOMED CT
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a biomedical terminology which covers most areas within medicine, such as drugs, diseases, operations, medical devices and symptoms. It may be used for the coding, retrieval and processing of clinical data. SNOMED CT is written purely in formal logic-based syntax (i.e., the so-called Release Format 2, or RF2) and is organized into multiple independent hierarchies. It is the result of merging the UK National Health Service's (NHS) Read codes with the SNOMED Reference Terminology (SNOMED-RT), developed by the College of American Pathologists. The basic hierarchies, or axes, are Clinical Finding and Procedure. The latest version contains more than 400,000 concepts and over 1,000,000 relationships, rendering SNOMED CT the most complete terminology in the medical domain. Only a few definitions are present in the terminology. Each concept contains a unique identifier and numerous synonymous terms that account for term variation. Also, each concept is part of at least one hierarchy and may have multiple is-a relationships with higher-level nodes. SNOMED CT is part of the Unified Medical Language System (UMLS), a biomedical ontology and terminology integration attempt which comprises hundreds of resources.

¹ BioPortal is a biomedical ontology/terminology repository which provides online ontology presentation and manipulation tools (http://bioportal.bioontology.org/).
2.3.2
NDF-RT
The National Drug File Reference Terminology (NDF-RT) was introduced by the U.S. Department of Veterans Affairs (VA) as a formalized representation for a medication terminology, written in description logic syntax VHA (2012). The terminology is organized into concept hierarchies, where each concept is a node comprising a list of term synonyms and a unique identifier. As expected, top-level concepts are more general than lower-level ones. The central hierarchy is named DRUG KIND and indicates the types of medications, the preparations used in them and clinical VA drug products. Other hierarchies include DISEASE KIND, INGREDIENT KIND, MECHANISM OF ACTION KIND, PHARMACOKINETICS KIND, PHYSIOLOGIC EFFECT KIND, THERAPEUTIC CATEGORY KIND, DOSE FORM and DRUG INTERACTION KIND. Roles exist between different concepts, and are specified only with existential restrictions (i.e., the equivalent of someValuesFrom in OWL). Mappings to other terminologies are also available. Currently, NDF-RT contains more than 45,000 concepts in hierarchies of maximum depth 12.
2.3.3
ICD-10
The International Statistical Classification of Diseases and Related Health Problems (ICD) is a terminology which attempts to classify signs, symptoms and causes of disease and morbidity WHO (1992). It appeared in the mid-19th century and is now maintained by the World Health Organization (WHO). Currently it is available in its 10th revision, although the 11th revision is claimed to be at the final stage before release. As a taxonomy, it has a relatively small maximum depth, equal to 6. Codes assigned to each concept tie it to a specific place in the taxonomy, with each code having only a single parent. It is thus not a proper application of ontological principles (nor was it meant to be; its intent is classification), since, in reality, it is not unusual for a concept to have more than one subsumer, and this is not modeled. In addition, there exist categories such as Not otherwise specified or Other, which are not needed in an ontology; the open world assumption already covers the fact that every ontology is incomplete, so stating it explicitly is redundant and may interfere with the evolution of the ontology, as new terms are not classified under their closest match.
Figure 2.1: The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
2.3.4
MedDRA
The Medical Dictionary for Regulatory Activities (MedDRA) is a terminology that is concerned with biopharmaceutical regulatory processes. It contains terms associated with all phases of the drug development cycle. MedDRA is organized in a hierarchical structure of fixed depth, as seen in Fig. 2.1. System Organ Classes (SOCs) represent the 26 predefined overlapping hierarchies to which terms belong. High Level Group Terms (HLGTs) and High Level Terms (HLTs) are general term groupings, denoting disorders or complications. Preferred Terms (PTs) denote the preferred name for a concept, while Lowest Level Terms (LLTs) include terms of maximum specificity. LLTs may be connected to their PTs with hyponymy, meronymy or synonymy relationships. This is the main problem in trying to view MedDRA as an ontology. In a formal ontology, a concept cannot be a child of itself. In MedDRA, this clearly happens when a PT and one of its LLTs share a synonymy relation.
2.3.5
NCI Thesaurus
The National Cancer Institute Thesaurus (NCIT) is a controlled terminology for cancer research. The thesaurus has been converted to formal OWL syntax and is updated at fixed intervals. The conversion was not an easy one; many inconsistencies and modeling dead-ends that were encountered in the conversion procedure have been documented Ceusters et al. (2005), along with some clear violations of ontological principles Schulz et al. (2010). The NCIT provides almost 100,000 concepts, with approximately 65% containing a definition.
Chapter 3 Similarity Metrics

Similarity metrics aim at measuring the lexical or semantic similarity between terms. Lexical similarity focuses on terms that contain similar character or word sequences, while semantic similarity tries to determine how close in meaning the terms are. Lexical similarity is simpler to calculate, since string-based algorithms only require plain text to function. On the other hand, semantic similarity requires extra information about the terms present in plain text. This extra information is usually acquired with the help of a knowledge base (e.g. ontology, terminology, etc.) or through statistical analysis of corpora, i.e., large collections of text documents that resemble real-world usage of words.
3.1
Similarity Metric vs. Distance Metric
It is common in the literature to come across the term semantic distance instead of semantic similarity. A distance metric $d(a, b)$ that compares entities $a$ and $b$ must satisfy the following properties:

1. $d(a, b) = 0$ if and only if $a = b$ (zero property),
2. $d(a, b) = d(b, a)$ (symmetric property),
3. $d(a, b) \geq 0$ (non-negativity property),
4. $d(a, b) + d(b, c) \geq d(a, c)$ (triangular inequality).

On the other hand, the requirements for a similarity metric were formally introduced not long ago Chen et al. (2009). The definition states that a similarity metric $s(a, b)$ must satisfy the following properties:

\[
\begin{aligned}
&1.\ s(a, a) \geq 0, \\
&2.\ s(a, b) = s(b, a), \\
&3.\ s(a, a) \geq s(a, b), \\
&4.\ s(a, b) + s(b, c) \leq s(a, c) + s(b, b), \\
&5.\ s(a, a) = s(b, b) = s(a, b) \text{ if and only if } a = b.
\end{aligned}
\tag{3.1}
\]

The counter-intuitive fourth property can be proven using set theory. More specifically, if $|a \cap b|$ denotes the cardinality of the common characteristics of $a$ and $b$, and $\bar{c}$ denotes the complement of $c$, the following equality holds:

\[
|a \cap b| = |a \cap b \cap c| + |a \cap b \cap \bar{c}|.
\]

Then,

\[
|a \cap b| + |b \cap c| = |a \cap b \cap c| + |a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |a \cap c| + |b|,
\tag{3.2}
\]

since $|a \cap b \cap c| \leq |a \cap c|$ and $|a \cap b \cap \bar{c}| + |a \cap b \cap c| + |\bar{a} \cap b \cap c| \leq |b|$.

Deduction of similarity from distance is a common procedure that requires simple operations. Similarity is, intuitively, a decreasing function of distance. Conversion between the two can take many forms Chen et al. (2009). In this thesis, all formulas will be presented as similarity measures.
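The five similarity properties can be checked numerically for the set-overlap similarity $s(a, b) = |a \cap b|$ used in the proof above. The following Python sketch is only a sanity check on randomly drawn sets, under the assumption that entities are represented as sets of characteristics; it is not a proof, and the sample sizes are arbitrary.

```python
import itertools
import random


def s(a, b):
    """Set-overlap similarity: number of shared characteristics."""
    return len(a & b)


def satisfies_similarity_properties(sets):
    """Check the five properties of Chen et al. on all triples of `sets`."""
    for a, b, c in itertools.product(sets, repeat=3):
        if s(a, a) < 0:                                    # property 1
            return False
        if s(a, b) != s(b, a):                             # property 2
            return False
        if s(a, a) < s(a, b):                              # property 3
            return False
        if s(a, b) + s(b, c) > s(a, c) + s(b, b):          # property 4
            return False
        if (s(a, a) == s(b, b) == s(a, b)) != (a == b):    # property 5
            return False
    return True


random.seed(0)  # fixed seed so that the sample is reproducible
samples = [frozenset(x for x in range(6) if random.random() < 0.5)
           for _ in range(25)]
all_hold = satisfies_similarity_properties(samples)
```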
3.2
Lexical Similarity
String-based methods that calculate lexical similarity can be divided into character-based and word-based. In this section, some of the most popular metrics are presented. For a more complete survey of lexical similarity measures see Navarro (2001) and Gomaa and Fahmy (2013).
3.2.1
Character-based Similarity Measures
In character-based similarity, strings are viewed as character sequences and attempts are made to discover character relevance.
Longest Common Substring

The Longest Common Substring algorithm Gusfield (1997) finds the longest run of consecutive characters that two strings share. It may be implemented using a suffix tree or dynamic programming.
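The dynamic programming variant mentioned above can be sketched as follows. This is a minimal Python illustration (quadratic time, two rows of memory) rather than the suffix tree construction; the function name is chosen here for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of consecutive characters shared by a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1      # extend the common suffix
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```

For example, the strings protein and proteome share the substring prote, of length 5.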
Hamming Similarity

Hamming similarity is a metric that can be applied to strings of equal length. It is a simple metric that measures the fraction of positions at which two strings contain the same character. Given strings $a$ and $b$, the formula for string similarity can be constructed as follows:

\[
\mathrm{sim}_{ham}(a, b) = \frac{\sum_{i} \mathbb{1}(a_i = b_i)}{|a|},
\tag{3.3}
\]

where $\mathbb{1}(\cdot)$ is the indicator function and $|\cdot|$ denotes string length, measured in characters.
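Eq. 3.3 translates almost directly into code. The Python sketch below assumes two non-empty strings of equal length and raises an error otherwise; the function name is illustrative.

```python
def sim_hamming(a: str, b: str) -> float:
    """Fraction of positions at which a and b hold the same character (Eq. 3.3)."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires strings of equal length")
    if not a:
        raise ValueError("Hamming similarity is undefined for empty strings")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)
```

As an example, karolin and kathrin agree in 4 of their 7 positions, giving a similarity of 4/7.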
Levenshtein Similarity

Levenshtein distance counts the number of character alterations that need to be made in order to transform one string into another Levenshtein (1966). This number is bounded by the length of the longer string, which is commonly used as a normalizing measure that restrains the value of distance to [0, 1]. Mathematically, the normalized Levenshtein distance of terms $a$ and $b$ is computed using the following formula:

\[
d_{lev}(a, b) = \frac{\mathrm{lev}_{a,b}(|a|, |b|)}{\max\{|a|, |b|\}},
\tag{3.4}
\]

where $|\cdot|$ denotes string length in number of characters,

\[
\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max\{i, j\}, & \text{if } \min\{i, j\} = 0, \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1, j) + 1 \\
\mathrm{lev}_{a,b}(i, j-1) + 1 \\
\mathrm{lev}_{a,b}(i-1, j-1) + [a_i \neq b_j]
\end{cases}, & \text{else,}
\end{cases}
\tag{3.5}
\]

$\max\{\cdot\}$ and $\min\{\cdot\}$ denote the maximum and minimum functions, respectively, and $[a_i \neq b_j]$ equals 1 when the two characters differ and 0 otherwise. Converting normalized distance to similarity can be done as follows:

\[
\mathrm{sim}_{lev}(a, b) = 1 - d_{lev}(a, b).
\tag{3.6}
\]

Jaro Similarity

Jaro similarity Jaro (1989, 1995) takes into account both the number and the sequence of common characters present in the two strings. Let us consider strings $a = a_1 \dots a_K$ and $b = b_1 \dots b_L$. A character $a_i$ is said to be common with $b$ if the character exists in $b$ within a window of $\frac{\min\{|a|, |b|\}}{2}$ from $b_i$. Let $a' = a'_1 \dots a'_{K'}$ be those characters in $a$ that are common with $b$, and $b' = b'_1 \dots b'_{L'}$ those characters in $b$ that are common with $a$. A transposition for $a'$, $b'$ is a position $i$ in the strings $a'$, $b'$ in which $a'_i \neq b'_i$. The number of transpositions for $a'$, $b'$ divided by two is denoted as $T_{a',b'}$. Then, Jaro's formula for similarity is given by:

\[
\mathrm{sim}_{jaro}(a, b) = \frac{1}{3} \left( \frac{|a'|}{|a|} + \frac{|b'|}{|b|} + \frac{|a'| - T_{a',b'}}{|a'|} \right).
\tag{3.7}
\]
It should be noted that Jaro similarity violates the symmetry property of Eq. 3.1; therefore it is not a true similarity metric, according to that definition.

Jaro-Winkler Similarity

Jaro-Winkler similarity Winkler (1999) is a variation of Jaro similarity which promotes strings with long common prefixes. The length of the longest prefix common to both strings $a$ and $b$ is denoted as $P$. Then, if $P' = \min(P, 4)$, Jaro-Winkler similarity is given by:

\[
\mathrm{sim}_{j\&w}(a, b) = \mathrm{sim}_{jaro}(a, b) + \frac{P'}{10} \left( 1 - \mathrm{sim}_{jaro}(a, b) \right).
\tag{3.8}
\]
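The Levenshtein, Jaro and Jaro-Winkler measures of Eqs. 3.4 to 3.8 can be sketched together in Python. Two points in this sketch are assumptions rather than part of the definitions above: common characters are matched greedily and one-to-one (each character of b is used at most once), and the common prefix is capped at four characters, as in the usual formulation of Winkler's variant. Function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance lev_{a,b}(|a|, |b|) of Eq. 3.5, computed row by row."""
    prev = list(range(len(b) + 1))                  # lev(0, j) = j
    for i, ca in enumerate(a, start=1):
        cur = [i]                                   # lev(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def sim_lev(a: str, b: str) -> float:
    """Normalized Levenshtein similarity (Eqs. 3.4 and 3.6)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # convention chosen here for two empty strings
    return 1.0 - levenshtein(a, b) / longest


def sim_jaro(a: str, b: str) -> float:
    """Jaro similarity (Eq. 3.7) with greedy one-to-one matching (assumption)."""
    if not a or not b:
        return 0.0
    window = min(len(a), len(b)) // 2
    used = [False] * len(b)
    a_common = []
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ch:
                used[j] = True
                a_common.append(ch)
                break
    if not a_common:
        return 0.0
    b_common = [b[j] for j in range(len(b)) if used[j]]
    # Half the number of positions at which the common characters disagree.
    t = sum(x != y for x, y in zip(a_common, b_common)) / 2
    m = len(a_common)
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


def sim_jaro_winkler(a: str, b: str) -> float:
    """Jaro-Winkler similarity (Eq. 3.8) with the prefix capped at 4."""
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    jaro = sim_jaro(a, b)
    return jaro + (min(prefix, 4) / 10) * (1 - jaro)
```

For the pair kitten and sitting, three edits are needed, so the normalized Levenshtein similarity is 1 - 3/7. For MARTHA and MARHTA, all six characters are common and the swapped T and H give one transposition, so Jaro similarity is (1 + 1 + 5/6)/3, and the three-character common prefix raises the Jaro-Winkler score slightly above that value.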
N-gram Similarity

A string can be split into n-grams, i.e. all possible consecutive character sequences of length $n$ in the string. As an example, the word protein can be split into the 3-grams pro, rot, ote, tei and ein. When comparing two strings, the number of common n-grams is computed and normalized by the maximum number of n-grams. More specifically, given strings $a$ and $b$, similarity is given by:

\[
\mathrm{sim}_{ngram}(a, b) = \frac{N_{com}}{N_{max}},
\tag{3.9}
\]

where $N_{com}$ denotes the number of common n-grams and $N_{max}$ denotes the maximum number of n-grams in either of the two strings.
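A minimal Python sketch of Eq. 3.9 is given below. The text does not say how repeated n-grams are counted; this sketch assumes multiset counting (a repeated n-gram counts as common as many times as it appears in both strings), which is only one possible reading.

```python
from collections import Counter


def ngrams(s: str, n: int = 3) -> list:
    """All consecutive character sequences of length n in s."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def sim_ngram(a: str, b: str, n: int = 3) -> float:
    """Common n-grams divided by the larger n-gram count (Eq. 3.9)."""
    grams_a, grams_b = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    n_max = max(sum(grams_a.values()), sum(grams_b.values()))
    if n_max == 0:
        return 0.0  # both strings are shorter than n
    n_com = sum((grams_a & grams_b).values())  # multiset intersection
    return n_com / n_max
```

With n = 3, protein yields five 3-grams and proteins yields six, five of which are shared, so the similarity is 5/6.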
3.2.2
Word-based Similarity Measures
As the name implies, word-based measures view the string as a collection of words. These measures indicate how similar two terms are word-wise, and no weight is given to character similarity.
Dice Similarity

Dice similarity considers input strings $a$ and $b$ as sets of words $A$ and $B$ respectively, and calculates similarity as follows:

\[
\mathrm{sim}_{dice}(a, b) = \frac{2\,|A \cap B|}{|A| + |B|},
\tag{3.10}
\]

where $|\cdot|$ denotes set cardinality in number of words.
Jaccard Similarity

Jaccard similarity counts the number of common words of the compared strings and divides it by the number of distinct words in both strings, i.e.

\[
\mathrm{sim}_{jacc}(a, b) = \frac{|A \cap B|}{|A \cup B|}.
\tag{3.11}
\]

Cosine Similarity

In order to compute cosine similarity, the compared strings should be converted to vectors. The dimension of the resulting vectors will be equal to the total number of distinct words present in both strings. Therefore, each element in the vector represents one word. The vector values for each string are computed as follows: a vector contains unitary values in positions that correspond to words that are contained in the respective string, and zero values in all positions that correspond to words that are not present in it. Given strings $a$ and $b$, the respective vectors $\mathbf{a}$ and $\mathbf{b}$ are computed. Cosine similarity is then given by:

\[
\mathrm{sim}_{cos}(a, b) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|},
\tag{3.12}
\]

where $\|\cdot\|$ denotes the Euclidean norm function.

Manhattan Similarity

Taxicab geometry considers that the distance between two points in a grid is given by the sum of the absolute differences of their respective coordinates. The grid resembles a uniform city road map, where diagonal movements are not permitted. This is the reason why the distance metric in this space is often called Manhattan distance or city block distance. Considering $N$-dimensional string vectors $\mathbf{a}$ and $\mathbf{b}$, Manhattan similarity can be computed as:

\[
\mathrm{sim}_{manh}(a, b) = 1 - \frac{\sum_{i=1}^{N} |a_i - b_i|}{N},
\tag{3.13}
\]

where $N$ is a normalizing constant that represents the dimension of $\mathbf{a}$ and $\mathbf{b}$.
Euclidean Similarity

Euclidean similarity also considers strings as vectors, and computes similarity as:

\[
\mathrm{sim}_{eucl}(a, b) = 1 - \sqrt{\frac{\sum_{i=1}^{N} |a_i - b_i|^2}{N}}.
\tag{3.14}
\]
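The word-based measures of Eqs. 3.10 to 3.13 can be sketched on binary word vectors as follows. The Python code below assumes whitespace tokenization and non-empty input strings; both are simplifications made here for illustration, and the function names are not taken from any library.

```python
import math


def word_sets(a: str, b: str):
    """Split both strings on whitespace and return their word sets."""
    return set(a.split()), set(b.split())


def word_vectors(a: str, b: str):
    """Binary vectors over the distinct words present in either string."""
    words_a, words_b = word_sets(a, b)
    vocabulary = sorted(words_a | words_b)
    return ([int(w in words_a) for w in vocabulary],
            [int(w in words_b) for w in vocabulary])


def sim_dice(a: str, b: str) -> float:
    A, B = word_sets(a, b)
    return 2 * len(A & B) / (len(A) + len(B))          # Eq. 3.10


def sim_jaccard(a: str, b: str) -> float:
    A, B = word_sets(a, b)
    return len(A & B) / len(A | B)                     # Eq. 3.11


def sim_cosine(a: str, b: str) -> float:
    va, vb = word_vectors(a, b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    return dot / (norm_a * norm_b)                     # Eq. 3.12


def sim_manhattan(a: str, b: str) -> float:
    va, vb = word_vectors(a, b)
    return 1 - sum(abs(x - y) for x, y in zip(va, vb)) / len(va)  # Eq. 3.13
```

For the strings lung cancer and small cell lung cancer, two of the four distinct words are shared, which gives a Dice similarity of 4/6, a Jaccard similarity of 1/2, a cosine similarity of 2/(√2 · 2) and a Manhattan similarity of 1/2.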
3.3
Ontological Semantic Similarity
An ontology is a collection of concepts and their inter-relationships. It may be visualized as a graph, in which nodes represent concepts and edges represent the relations between them. Usually, ontologies are viewed as taxonomies, where is-a and part-of relations play the most important role. Viewing the ontology as a taxonomy, one can apply semantic similarity metrics that exploit the hierarchical structure. Probably the best-known test bed for semantic similarity is the computational lexicon WordNet Miller (1995). In WordNet, closely related terms are grouped together to form synsets. These synsets, in turn, form semantic relations with other synsets. WordNet is commonly referred to as a lexical ontology, due to an obvious mapping of lexical hyponymy to ontological subsumption.
3.3.1
Intra-ontology Semantic Similarity
Intra-ontology semantic similarity metrics are meant to measure similarity between concepts that reside within the same ontology. These metrics can be roughly divided into distance-based, information-based and feature-based.
Distance-based Metrics

Distance-based metrics take advantage of the ontological topology to compute the similarity between concepts. This method requires viewing the ontology as a rooted Directed Acyclic Graph (DAG), in which nodes are concepts and edges among them are restricted to hierarchical relationships, with the most usual type being is-a relationships. At the top, there is a single concept, the root. The graph is directed: edges start from a low-level concept and point towards its ancestors through transitive relationships. The graph is also acyclic, since a finite path from a source node to a destination node cannot return to the source node. In other words, a node can never be a child of one of its descendants.

A simple look at an ontology from a geometric perspective may reveal important information about the similarity of concepts. As depth in the DAG increases, concepts become increasingly specific, thus similarity is expected to increase. Another important characteristic of the ontology DAG is that the path between concepts is not always unique, therefore distance-based similarity will depend on which path is chosen. Finally, the density of nodes is a good indicator of similarity; as density increases, concepts approach each other and similarity increases. The accuracy of distance-based methods depends on the level of detail that the ontology captures. A poorly structured ontology with many omissions might yield misleading similarity results. Fortunately, a lot of effort has been made to make biomedical ontologies as complete as possible, therefore network density in biomedical ontologies is usually high.

The most straightforward way to measure the similarity of concept nodes is given in Rada et al. (1989). In that work, all edges are assigned a unitary weight and the distance between two concepts is equal to the number of edges that are present in their shortest path. Let us consider two distinct concepts $c_1$ and $c_2$ in the hierarchy. Each path $i$ that connects these two concept nodes may be represented as a set which includes all edges $e_k$ present in the path, i.e.

\[
\mathrm{path}_i(c_1, c_2) = \{e_1, e_2, \dots, e_K\},
\tag{3.15}
\]

with cardinality $|\mathrm{path}_i(c_1, c_2)| = K$. The distance between concepts $c_1$ and $c_2$ is, then, equal to the length of the shortest path that connects them, i.e.,

\[
d_{rada}(c_1, c_2) = \min_i |\mathrm{path}_i(c_1, c_2)|.
\tag{3.16}
\]
Note that in the literature there are cases (e.g. Al-Mubaid and Nguyen (2006)) where Rada's measure is used with node counting, instead of edge counting. In those cases, each path is represented as a set of the nodes that compose it, including the end nodes. The minimum distance can be converted into a similarity metric, as in Resnik (1995):

\[
\mathrm{sim}_{rada}(c_1, c_2) = 2D - d_{rada}(c_1, c_2),
\tag{3.17}
\]

where $D$ is the maximum depth of the taxonomy. This method fails to capture the intuition that concept nodes which reside at the lower part of the hierarchy and are separated by distance $d$ are more similar than higher-level nodes with the same distance separation $d$. Also, its success highly depends on the uniformity of edge distribution within the ontology. For these reasons, other approaches have been proposed in order to achieve a more representative score of similarity.

In Wu and Palmer (1994), the relative depth of the compared concepts in the hierarchy is considered. In that work, Wu and Palmer introduce the Least Common Subsumer (LCS) of the compared concepts. The LCS is the hierarchically deepest common ancestor of the compared concepts. Similarity for concepts $c_1$ and $c_2$ is then given as:

\[
\mathrm{sim}_{w\&p}(c_1, c_2) = \frac{2h}{N_1 + N_2 + 2h},
\tag{3.18}
\]

where $N_1$ is the number of nodes in the path between concept $c_1$ and the LCS, $N_2$ is the number of nodes between concept $c_2$ and the LCS, and $h$ is the depth of the LCS, measured again in number of nodes.

In Li et al. (2003), the authors followed various strategies in their attempt to calculate similarity as a function of the shortest path between the compared concepts, the depth of their LCS and the local density of the ontology. They found that the best performance was obtained with the following non-linear function:

\[
\mathrm{sim}_{li}(c_1, c_2) = e^{-\alpha\, d_{rada}(c_1, c_2)} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}},
\tag{3.19}
\]

where $\alpha$, $\beta$ are non-negative parameters and $h = d_{rada}(\mathrm{LCS}(c_1, c_2), \mathrm{root})$ denotes the minimum depth of the LCS. Distances are measured in number of edges.
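The measures of Rada, Wu and Palmer, and Li et al. (Eqs. 3.16 to 3.19) can be compared on a toy taxonomy. The Python sketch below makes several assumptions that are not part of the text: the taxonomy is a tree (one parent per concept), so the shortest path always passes through the LCS; the maximum depth D is counted in edges; for Wu and Palmer, N1 and N2 are counted as steps from each concept up to the LCS while h counts nodes from the root to the LCS; and the values given to alpha and beta are purely illustrative. Concept names are hypothetical.

```python
import math

# Hypothetical is-a tree: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "viral_infection": "infection",
    "bacterial_infection": "infection",
    "influenza": "viral_infection",
}


def path_to_root(c):
    """The concept itself followed by all of its ancestors, up to the root."""
    path = [c]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(c):
    """Number of edges between c and the root."""
    return len(path_to_root(c)) - 1


def lcs(c1, c2):
    """Deepest common ancestor of c1 and c2 (a concept subsumes itself)."""
    ancestors_1 = set(path_to_root(c1))
    for node in path_to_root(c2):      # walks upwards, so deepest comes first
        if node in ancestors_1:
            return node


def d_rada(c1, c2):
    """Shortest path length in edges (Eq. 3.16); in a tree it runs via the LCS."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2))


def sim_rada(c1, c2):
    max_depth = max(depth(c) for c in PARENT)          # D, in edges
    return 2 * max_depth - d_rada(c1, c2)              # Eq. 3.17


def sim_wu_palmer(c1, c2):
    ancestor = lcs(c1, c2)
    h = depth(ancestor) + 1                            # LCS depth in nodes
    n1 = depth(c1) - depth(ancestor)
    n2 = depth(c2) - depth(ancestor)
    return 2 * h / (n1 + n2 + 2 * h)                   # Eq. 3.18


def sim_li(c1, c2, alpha=0.2, beta=0.6):
    h = depth(lcs(c1, c2))                             # LCS depth in edges
    # tanh(beta * h) equals the exponential ratio in Eq. 3.19.
    return math.exp(-alpha * d_rada(c1, c2)) * math.tanh(beta * h)
```

For influenza and bacterial_infection the LCS is infection, the shortest path has three edges, and the Wu and Palmer score is 4/7. Comparing influenza with cancer instead moves the LCS to the root, which lowers all three scores.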
Al-Mubaid and Nguyen attempt to combine path length and node depth in one measure. In Al-Mubaid and Nguyen (2006), they view the DAG as a composition of clusters, with each cluster having as root a child of the ontology root. The usage of clusters aims to exploit local characteristics of different branches. Given concepts $c_1$ and $c_2$, they first compute their so-called common specificity:

\[
\mathrm{CSpec}(c_1, c_2) = D_c - h,
\tag{3.20}
\]

where $D_c$ denotes the depth of the specific cluster and $h$ refers to the depth of the LCS in the ontology, with both quantities measured in number of nodes. Then similarity is computed as:

\[
\mathrm{sim}_{a\&n}(c_1, c_2) = \log\left( (\mathit{Path} - 1)^{\alpha} \times (\mathrm{CSpec})^{\beta} + k \right),
\tag{3.21}
\]

where $\mathit{Path}$ is a modified version of Rada's distance measure which is adapted according to the largest cluster, and $\alpha$, $\beta$, $k$ are constants, whose default values are unitary.
Information-Based Metrics

One of the first attempts to focus on nodes in the similarity formula is that of Leacock and Chodorow (1998). This method uses negative log likelihood in a way that resembles the formula of self-information Cover and Thomas (2012), but does not really involve a valid probability. Instead, a normalized form of the path length between the concepts is used:

\[
\mathrm{sim}_{l\&c}(c_1, c_2) = -\log\left( \frac{N_p}{2D} \right),
\tag{3.22}
\]

where $N_p$ is the number of nodes in the shortest path between concepts $c_1$ and $c_2$. This variable also includes the end nodes.

Resnik, in Resnik (1995), continues down this path by replacing the normalized path length with a probability measure $P(\cdot)$ to calculate the information content (IC) of a concept. He considers all common subsumers $CS_i$ of concepts $c_1$ and $c_2$ and calculates similarity as:

\[
\mathrm{sim}_{resn}(c_1, c_2) = \max_i \left[ -\log(P(CS_i)) \right],
\tag{3.23}
\]

or, equivalently,

\[
\mathrm{sim}_{resn}(c_1, c_2) = -\log(P(\mathrm{LCS})).
\tag{3.24}
\]

Considering that the IC of a concept $c$ is defined as the negative logarithm of its probability, i.e. $\mathrm{IC}(c) = -\log(P(c))$, equation (3.24) can also be written as:

\[
\mathrm{sim}_{resn}(c_1, c_2) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)).
\tag{3.25}
\]
Probabilities are estimated with the help of a text corpus, i.e. a collection of natural language excerpts, specifically chosen to provide a good representation of actual term usage. When dealing with biomedical ontology concepts, collections of PubMed abstracts (http://www.ncbi.nlm.nih.gov/pubmed) are commonly used as corpora to determine the probability of each concept. Given a corpus, the occurrence of a term which corresponds to concept $c$ essentially implies the occurrence of each and every concept that subsumes $c$ within the ontological structure. Conversely, the number of occurrences of a concept $c$ depends not only on the number of appearances of $c$ itself in the corpus, but also on every occurrence of its descendants in the hierarchy. Thus, the number of occurrences of concept $c$ is given by:

\[
\mathrm{occ}(c) = \sum_{n \in \mathrm{subsumed}(c)} \mathrm{count}(n),
\tag{3.26}
\]

where $\mathrm{subsumed}(c)$ represents $c$ together with all of its descendant concept nodes, and $\mathrm{count}(\cdot)$ denotes the number of occurrences of the specific concept within the given corpus. Converting occurrences to probability can be done using:

\[
P(c) = \frac{\mathrm{occ}(c)}{N},
\tag{3.27}
\]

where $N$ is the total number of occurrences of ontology terms in the corpus. This method results in higher probabilities for concepts residing at the top part of the hierarchy, with the root having unitary probability. Therefore, concepts whose LCS lies lower in the hierarchy are more similar, since their LCS has low probability (i.e., high IC).

A possible drawback of this method is that probabilities are tied to the choice of corpus. So far, in the biomedical domain, there is no widely accepted corpus that covers the domain needs Al-Mubaid and Nguyen (2006). This is due to the fact that thousands of new terms and abbreviations appear in the literature every year, thus a stable corpus might not function well. Since extensions of the corpus would need to be considered at fixed intervals, it might not serve as a useful benchmark.

Alternatively, computation of IC can be performed without the use of a corpus, by solely relying on the structure of the ontology DAG. Intrinsic computation of IC involves approximating the occurrence probability of a concept as a function of multiple variables, such as the number of descendant nodes, the number of subsumers or the number of descendant nodes which are leaves in the ontology. In Seco et al. (2004), the IC of a concept $c$ is given by:

\[
\mathrm{IC}_{seco}(c) = 1 - \frac{\log(\mathrm{descendants}(c) + 1)}{\log(\mathrm{allConcepts})},
\tag{3.28}
\]
where $\mathrm{descendants}(c)$ returns the number of nodes that concept $c$ subsumes, and $\mathrm{allConcepts}$ denotes the number of all the available concepts in the ontology. The IC function introduced by Seco et al. has the drawback that it assigns an IC equal to one to every leaf node in the ontology, and also that concepts containing the same number of descendant nodes are again given the same IC. An attempt to distinguish the IC between leaf concepts was made in Zhou et al. (2008), by also including the depth of the node in the calculation, normalized by the maximum depth of the ontology. The proposed IC formula is given by:

\[
\mathrm{IC}_{zhou}(c) = k\, \mathrm{IC}_{seco}(c) + (1 - k)\, \frac{\log(\mathrm{depth}(c) + 1)}{\log(\mathrm{maxDepth})},
\tag{3.29}
\]

where $\mathrm{depth}(c)$ represents the depth of the concept $c$ in the hierarchy, $\mathrm{maxDepth}$ is the maximum depth of the ontology, measured in number of nodes, and $k$ is a weighting constant.
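The corpus-based estimate of Eqs. 3.26 and 3.27, Resnik's similarity of Eq. 3.25 and the intrinsic IC of Eq. 3.28 can be sketched on a toy taxonomy. In the Python code below, the concept names and the corpus counts are hypothetical, the taxonomy is assumed to be a tree, and natural logarithms are used; none of these choices comes from the cited works.

```python
import math

# Hypothetical is-a tree: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "viral_infection": "infection",
    "bacterial_infection": "infection",
    "influenza": "viral_infection",
}

# Hypothetical corpus counts of the terms of each concept itself.
COUNT = {
    "disease": 10, "infection": 20, "cancer": 30,
    "viral_infection": 15, "bacterial_infection": 20, "influenza": 5,
}


def subsumers(c):
    """The concept itself followed by all of its ancestors."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def occurrences():
    """Eq. 3.26: every count of a concept also counts for its subsumers."""
    occ = dict.fromkeys(PARENT, 0)
    for concept, n in COUNT.items():
        for s in subsumers(concept):
            occ[s] += n
    return occ


OCC = occurrences()
TOTAL = sum(COUNT.values())            # N in Eq. 3.27


def ic_corpus(c):
    return -math.log(OCC[c] / TOTAL)   # IC(c) = -log P(c)


def lcs(c1, c2):
    ancestors_1 = set(subsumers(c1))
    for node in subsumers(c2):         # walks upwards, so deepest comes first
        if node in ancestors_1:
            return node


def sim_resnik(c1, c2):
    return ic_corpus(lcs(c1, c2))      # Eq. 3.25


def ic_seco(c):
    """Eq. 3.28: intrinsic IC from the number of descendants of c."""
    descendants = sum(1 for n in PARENT if n != c and c in subsumers(n))
    return 1 - math.log(descendants + 1) / math.log(len(PARENT))
```

With these counts the root has probability 1 and therefore zero IC, while infection covers 60 of the 100 occurrences. The intrinsic variant assigns IC 0 to the root and IC 1 to every leaf, which is the drawback of Seco's function noted above.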
The authors in Sánchez et al. (2011) further improve the modeling of the IC function. In that work, the IC function can also distinguish concepts that contain the same number of descendants, due to the fact that the number of subsumers of a concept is also used. The IC is given as:

\[
\mathrm{IC}_{san}(c) = -\log\left( \frac{\frac{\mathrm{leaves}(c)}{\mathrm{ancestors}(c)} + 1}{\mathrm{allLeaves} + 1} \right),
\tag{3.30}
\]

where $\mathrm{leaves}(c)$ is the number of nodes that are descendants of $c$ and have no children, $\mathrm{ancestors}(c)$ refers to the number of concepts which subsume $c$ and $\mathrm{allLeaves}$ denotes the total number of leaf nodes in the ontology. The IC functions of equations (3.28), (3.29) and (3.30) can be used in equation (3.25) to compute the similarity between two concepts without using a corpus.

Lin et al. use IC in an alteration of the similarity metric of Wu and Palmer (1994). More specifically,

\[
\mathrm{sim}_{l\&p}(c_1, c_2) = \frac{2\, \mathrm{sim}_{resn}(c_1, c_2)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}.
\tag{3.31}
\]
This approach aims to include the individual characteristics of the compared nodes that Resnik's approach neglected. Indeed, in Resnik's measure, any two pairs of nodes that have the same LCS produce the same similarity. Jiang and Conrath follow an approach similar to that of Wu and Palmer (1994), but avoid the scaling of similarity Jiang and Conrath (1997). Instead, they use a distance metric as follows:

\[
d_{j\&c}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\, \mathrm{sim}_{resn}(c_1, c_2).
\tag{3.32}
\]

Various transformations have been applied to convert this distance to similarity. Among these, the authors in Seco et al. (2004) consider a linear transformation and present the following formula of similarity normalized in the interval [0,1]:

\[
\mathrm{sim}_{j\&c}(c_1, c_2) = 1 - \frac{d_{j\&c}(c_1, c_2)}{2}.
\tag{3.33}
\]
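Equations 3.31 to 3.33 can be sketched on top of an intrinsic IC, which keeps every IC value in [0, 1] so that the linear transformation of Eq. 3.33 stays within [0, 1] as well. The Python code below uses Seco's IC of Eq. 3.28 on a hypothetical tree-shaped taxonomy; returning 1 when both IC values are zero is a convention chosen here, not something stated in the text.

```python
import math

# Hypothetical is-a tree: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "viral_infection": "infection",
    "bacterial_infection": "infection",
    "influenza": "viral_infection",
}


def subsumers(c):
    """The concept itself followed by all of its ancestors."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def ic(c):
    """Intrinsic IC of Seco et al. (Eq. 3.28), always within [0, 1]."""
    descendants = sum(1 for n in PARENT if n != c and c in subsumers(n))
    return 1 - math.log(descendants + 1) / math.log(len(PARENT))


def lcs(c1, c2):
    ancestors_1 = set(subsumers(c1))
    for node in subsumers(c2):         # walks upwards, so deepest comes first
        if node in ancestors_1:
            return node


def sim_resnik(c1, c2):
    return ic(lcs(c1, c2))                               # Eq. 3.25


def sim_lin(c1, c2):
    total = ic(c1) + ic(c2)
    if total == 0:
        return 1.0  # convention chosen here: both concepts carry no information
    return 2 * sim_resnik(c1, c2) / total                # Eq. 3.31


def dist_jiang_conrath(c1, c2):
    return ic(c1) + ic(c2) - 2 * sim_resnik(c1, c2)      # Eq. 3.32


def sim_jiang_conrath(c1, c2):
    return 1 - dist_jiang_conrath(c1, c2) / 2            # Eq. 3.33
```

Unlike Resnik's measure, both functions react to the compared concepts themselves: influenza and viral_infection share the LCS viral_infection with the pair (viral_infection, viral_infection), yet only the latter pair reaches a similarity of 1.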
Another example can be found in Zhu et al. (2009), in which an exponential function is used for the similarity formula, along with a constant (written $\lambda$ here) that accounts for curve steepness:

\[
\mathrm{sim}_{j\&c}(c_1, c_2) = e^{-\lambda\, d_{j\&c}(c_1, c_2)}.
\tag{3.34}
\]

Feature-Based Measures

Feature-based measures do not necessarily conform to the similarity metric rules of Chen et al. (2009), as they allow for similarity asymmetry. In feature-based techniques, the two compared concepts are viewed as sets of features, in contrast to the geometric view presented in previous sections. To calculate similarity, not only the common features of the concepts are taken into account, but also the differences between them. That way, common features improve similarity, while different features penalize its value Tversky et al. (1977). Given concepts $c_1$ and $c_2$, let $C_1$ and $C_2$ denote the sets that contain their features. Then, similarity between the two can be given as:

\[
\mathrm{sim}_{tve}(c_1, c_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \alpha\, |C_1 \setminus C_2| + (1 - \alpha)\, |C_2 \setminus C_1|},
\tag{3.35}
\]

where $\alpha$ is a weight which takes values in [0,1]. In Rodríguez et al. (1999), the parameter $\alpha$ is computed as follows:

\[
\alpha =
\begin{cases}
\dfrac{d(c_1, \mathrm{LCS})}{d(c_1, c_2)}, & d(c_1, \mathrm{LCS}) \leq d(c_2, \mathrm{LCS}), \\[2ex]
1 - \dfrac{d(c_1, \mathrm{LCS})}{d(c_1, c_2)}, & \text{else.}
\end{cases}
\tag{3.36}
\]

This asymmetric function stems from Tversky's observation that similarity might not be symmetric. In one of Tversky's examples, North Korea was said to be more similar to Red China than the reverse.
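The asymmetry of feature-based similarity can be seen in a short Python sketch of Tversky's ratio, with one term for the common features and two weighted terms for the features unique to each concept, together with the depth-based weight of Rodríguez et al. The feature names are hypothetical, the weight is called alpha here, and the distance between the two concepts is assumed to be the sum of their distances to the LCS.

```python
def sim_tversky(features_1: set, features_2: set, alpha: float = 0.5) -> float:
    """Common features over common plus weighted distinctive features."""
    common = len(features_1 & features_2)
    denominator = (common
                   + alpha * len(features_1 - features_2)
                   + (1 - alpha) * len(features_2 - features_1))
    if denominator == 0:
        return 1.0  # convention chosen here for two empty feature sets
    return common / denominator


def alpha_from_depths(d1_to_lcs: int, d2_to_lcs: int) -> float:
    """Weight derived from the distances of c1 and c2 to their LCS.

    Assumes d(c1, c2) = d(c1, LCS) + d(c2, LCS).
    """
    total = d1_to_lcs + d2_to_lcs
    if total == 0:
        return 0.5  # convention chosen here when both concepts are the LCS
    ratio = d1_to_lcs / total
    return ratio if d1_to_lcs <= d2_to_lcs else 1 - ratio


# Hypothetical feature sets of two concepts.
C1 = {"fever", "cough", "viral"}
C2 = {"cough", "viral", "chronic", "airway"}
```

With alpha = 0.5 the measure is symmetric, whereas with alpha = 0.8 the features unique to the first argument weigh more, so swapping the arguments changes the score.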
3.3.2
Inter-ontology Semantic Similarity
Inter-ontology semantic similarity measures try to quantify the similarity between concepts that belong to different ontologies. Fairly little research has been documented in this area, due to the inherent difficulty of comparing heterogeneous structures. A common approach is to combine the different ontologies into a single ontology through detailed concept mappings Gangemi et al. (1998). It is clear that this is very challenging and requires the help of a domain expert, as well as plenty of time and effort. Furthermore, not all biomedical terminologies are consistent and their lack of homogeneity is a major problem.

Simpler approaches have been proposed in the literature. A usual first step is to merge the different ontologies under a dummy root. This approach is found in Rodríguez and Egenhofer (2003), where the authors use a weighted version of Tversky's similarity which also takes into account geometrical features of the ontologies. A similar route is followed by Petrakis et al. (2006), where the authors substitute Tversky's similarity with a form of Jaccard similarity. The drawback of these cross-similarity metrics is that they do not consider term overlap in both ontologies. Other methods rely on extensions of single-ontology similarity metrics. Examples of such work can be found in Al-Mubaid and Nguyen (2006) and Sánchez et al. (2012).
Chapter 4 Search Interfaces

Search has risen to be one of the most commonly used tools for computer users. It can be found everywhere, from stand-alone web-based search engines to embedded search forms that appear in desktop applications and websites. To a large extent, success of the search procedure depends on the users' ability to formulate their information needs, transforming them into queries that are highly likely to produce desired results. For this reason, a lot of effort has been spent on improving the search interfaces and providing tools that will enhance user experience. In this chapter, the basic characteristics of successful search interface design are presented, with main focus on web-search interfaces.
4.1 Information Seeking Models
Information seeking models attempt to recognize and describe the strategies followed by humans from the moment they sense a search need until the moment they acquire desired results. The search procedure may be viewed as a repetition of actions. In Sutcliffe and Ennis (1998), the authors identify the following four actions in what is considered the standard model of information seeking:

1. Problem Identification
2. Articulation of Need
3. Query Formulation
4. Evaluation of Results

The first step refers to conceptualization of the search need, while the second step involves expressing this need in words. The third step requires the user to transform the articulated need into a format that will be accepted by the underlying search system. Finally, the fourth step refers to the procedure of judging the results critically, exploiting any relevant domain knowledge and deciding whether the need is satisfied. A search may be characterized as "ok", "failed" or "unsatisfactory". An "ok" search ends the cycle successfully. An "unsatisfactory" search may lead to reformulation of the query or re-articulation of the need, while a completely "failed" search might require re-identification of the problem.

Sutcliffe and Ennis's model assumes that the need does not change, unless results are disappointing. It does not capture the fact that users learn as they search. This dynamic aspect of information seeking was captured in an earlier work by Bates (1989). In that study, the user's needs are assumed to change as the process advances. Furthermore, Bates claims that the success of the search procedure does not only depend on the final list of results, but on the selections made along the way. This model is referred to as the berry-picking model, to denote that it does not result in a single set of results. A simple example of the berry-picking model is a user who attempts a broad query such as "String similarity algorithms" and refines the query to "Jaro similarity" after viewing this result in the initial result list.
4.2 Query Specification
Queries are usually specified through rectangular entry forms, as in Fig. 4.1. The width of these forms varies, with studies showing that wider forms promote formulation of longer queries Franzen and Karlgren (2000); Belkin et al. (2003). It has been observed that around 88% of search queries are composed of 1 to 4 words, with mean length equal to 2.8 words per query Jansen et al. (2007). The actual search is executed by pressing the return key or mouse-clicking a specified button (e.g. magnifying glass in Bing).

Figure 4.1: The Google search engine entry form.

In some cases, entry forms decorate their background with descriptive text that provides guidance for the user. An example is Facebook's search form, as seen in Fig. 4.2. The text disappears once the user clicks inside the form. This usually helps to narrow down the search domain.

Figure 4.2: Facebook uses grayed-out descriptive text to help in the formulation of user queries.

After query submission, processing of the query takes place before any attempt to retrieve results. This process may include removal of stopwords (i.e. words with high appearance probability such as "the", "a"), normalization of words (e.g. plural to singular) and permutation of word order. Boolean logic may also be used in the case of multiple words per query. Returning results that contain all query words (i.e. Boolean AND operator) seems more intuitive, although this might sometimes lead to overly specific queries that return no results. The actual types of processing are often hidden from the users, in an attempt to avoid confusion and promote transparency Muramatsu and Pratt (2001).

Most modern search interfaces are equipped with dynamic search suggestion, also known as auto-completion (see Fig. 4.3). As the user starts typing, a list of term suggestions appears under the entry form. The suggestions contained in the list are usually queries whose prefix matches what has been typed so far, although there are cases where interior matches are also included. The user can then mouse-click the most relevant query or navigate through the list, using keyboard arrows. Studies have shown that approximately one third of all search attempts in the Yahoo Search Assist were performed through a dynamically suggested query Anick and Kantamneni (2008). The dynamic search suggestion technique attempts to minimize unneeded typing from the user side and can alleviate spelling errors early. Most importantly, though, it reassures the user that results are available, so there is no frustration from empty result pages.

Figure 4.3: Bing's search interface features a powerful dynamic search suggestion, where prefixes are highlighted with grayed-out font and the remaining text is in bold.

An important point to consider is that searchers often return to their previously accessed information. In the empirical study undertaken by Tauscher and Greenberg (1997), it was found that there is a 58% chance that the next web page to be visited had been visited before. A more recent study Zhang and Zhao (2011) about tabbed browsing, conducted in 2010, also finds page revisitation to be around the same levels, at 59.3%. Various tools exist to help users find their intended pages, including Uniform Resource Locator (URL) history, bookmarking of pages, basic navigation buttons (e.g. Back button for short term page revisit) and change of URL font color if the page has already been visited. Among other methods documented, users may save whole webpages to their local disk or keep URLs in text documents, after enriching them with comments Jones et al. (2002). Interestingly, a common approach to revisiting documents is actually re-searching for them Obendorf et al. (2007). Users who adopt this strategy attempt to re-create the conditions of their previous search, by trying to formulate the exact same query. Another strategy requires past search queries to appear as the user types, along with regular dynamic term suggestion. Separation between suggested queries and previously generated ones varies among interfaces, as can be seen in Figures 4.4 and 4.5.

Figure 4.4: The Safari browser's embedded search interface explicitly states which queries are suggestions and which belong to the user's recent search history.

Figure 4.5: The Firefox browser's embedded search interface contains recent queries on top, and separates them from suggestions using a solid line.

Figure 4.6: Google's search results page is a typical scrollable vertical list of captions. Metadata facets, that restrain results to a particular type of information, are also present in the interface (e.g. Images tab).
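The prefix matching and the "recent queries first" ordering described in this section could be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, matching is a case-insensitive prefix test, and none of the cited search engines is claimed to work this way.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PrefixSuggester {

    /**
     * Returns at most 'limit' suggestions whose prefix matches the typed text.
     * Queries from the user's history are listed before vocabulary suggestions,
     * and a query that appears in both lists is shown only once.
     */
    static List<String> suggest(String typed, List<String> history,
                                List<String> vocabulary, int limit) {
        String prefix = typed.trim().toLowerCase(Locale.ROOT);
        Set<String> suggestions = new LinkedHashSet<>();   // keeps insertion order
        if (prefix.isEmpty()) {
            return new ArrayList<>();                       // nothing typed yet
        }
        collect(prefix, history, suggestions, limit);       // recent queries first
        collect(prefix, vocabulary, suggestions, limit);    // then regular suggestions
        return new ArrayList<>(suggestions);
    }

    private static void collect(String prefix, List<String> candidates,
                                Set<String> out, int limit) {
        for (String candidate : candidates) {
            if (out.size() >= limit) {
                return;
            }
            if (candidate.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                out.add(candidate);
            }
        }
    }
}
```

Interior matches, which some interfaces also offer, would replace `startsWith` with `contains`, at the cost of a longer and less predictable suggestion list.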
4.3 Presentation of Search Results
Search applications usually present results as a vertical list of captions, distributed along multiple pages (see Fig. 4.6). Each caption is a clickable entity which, as a minimum requirement, comprises a title and an excerpt of the target document Clarke et al. (2007). Usually, the excerpt includes some or all of the query terms, as highlighted text. In most cases, highlighting is performed using bold font or colored term background. Many search applications tend to group similar results, that originate from the same source, into the same caption. That way, result pollution from few sources is avoided and diversity is promoted.

The relevance of search results is reflected in their order of appearance. Although relevance scores were formerly used to grade the fit of the result to the query, they are usually not present anymore in modern search applications. The reasons behind their omission might be to avoid reverse-engineering of the ranking algorithms and to reduce redundancy, since the ranking itself already reflects the importance of results Hearst (2009). It has been observed that users tend to click on the uppermost captions Joachims et al. (2005). In the same study, it was found that the first caption received more attention than its successors, even if its relevance was actually lower. Furthermore, the majority of users often remain on the first page of results. The authors in Jansen et al. (2007) observed that only 30% continued to look for relevant results in the second page of the results, and only 15% looked even further. Usually, the patience of a user is a function of his/her experience in using the system. More experienced users tend to be more patient than users who are not accustomed to the search procedure. Inexperienced users, on the other hand, often prefer to refine their query or simply accept that what they search for cannot be found by the search application Hearst (2009).

Apart from plain lists of results, further organization of captions may be performed, using some form of faceted browsing. Facets attempt to refine search results, according to their characteristics. As an example, Amazon's search interface provides facets that correspond to the different departments that might contain the desired item (see Fig. 4.7).
Figure 4.7: Amazon's search interface provides facets as a left panel to the results page, helping the user dynamically refine the initial search.
4.4 Query Reformulation
It is common that desired search results are not discovered with the first try. Query reformulation is the procedure which attempts to transform the original query to a format that will match the information retrieval system's vocabulary. Studies using query logs have shown that the number of reformulated queries may reach up to 52% Jansen et al. (2005) of all queries. It has been observed that, if no help for query reformulation is given explicitly by the search application, users tend to provide simple alterations of the initial query Hertzum and Frøkjær (1996). This bias towards initial queries is referred to as anchoring, a term coined by psychologists Tversky and Kahneman (1975).

One of the most common sources of search failure is query mistyping Cucerzan and Brill (2004). A common approach, which aims to correct typographical errors, is using a dictionary and finding the most similar term to the erroneous query Kukich (1992). Among other techniques mentioned in that work are heuristic rule-based corrections, probabilistic approaches that determine how often specific sequences of characters are spelt wrong, and neural network models that train the system to automatically identify errors. The outcome of the reformulation procedure may be shown explicitly on the interface as a suggested query (e.g. Google's "Did you mean"), or be implicitly shown in the results. The former approach is preferred, since it gives users freedom to decide whether their intent is actually captured in the proposed correction. More recently, distributional approaches that take advantage of user query logs are preferred, especially by web-based search engines Li et al. (2006).

Another dimension of query reformulation is term expansion. Term expansion refers to the suggestion of queries that relate to the initial one in some way. Choice of related queries might take the form of thesaurus-based term substitution Dennis et al. (1998) or attempt to extend the present query, usually by adding single words (see Fig. 4.8). Query suggestions might also be fetched from sessions of users who previously searched for the same information. It has also been proposed that search applications ask the user to provide relevance feedback Ruthven and Lalmas (2003). Although theoretical studies approve of this feature, its appearance in commercial applications is rare.
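The dictionary-based correction mentioned above can be illustrated with a small sketch that picks the dictionary term with the smallest Levenshtein distance to the mistyped query. The choice of Levenshtein distance, the distance threshold and all names below are assumptions made for illustration; Kukich (1992) surveys several alternative techniques.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class SpellingSuggester {

    /** Classic dynamic-programming Levenshtein distance, keeping two rows in memory. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;                       // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                current[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /**
     * Returns the dictionary term closest to the query (case-insensitive),
     * or an empty result if no term lies within maxDistance edits.
     */
    static Optional<String> closest(String query, List<String> dictionary, int maxDistance) {
        String normalized = query.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String term : dictionary) {
            int distance = levenshtein(normalized, term.toLowerCase(Locale.ROOT));
            if (distance < bestDistance) {
                best = term;
                bestDistance = distance;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Returning an empty result rather than the nearest term at any distance keeps the correction explicit: the interface can show a "Did you mean" suggestion only when a plausible candidate exists.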
Figure 4.8: PubMed's results page includes term expansion in two ways. On the right of the screen, there is a "Related searches" panel that preserves the initial query and adds a new related term to it. Also, right below the entry form there is a "See also" feature which suggests complete or partial modifications in the initial query.
Chapter 5 Requirements
This chapter describes the objective of the project and the required functionality for the application, as stated by the AstraZeneca side.
5.1 Feature Specification
The objective of this project is to deliver a search application that allows researchers to quickly perform queries for terms included in medical ontologies and gain access to information about the chosen terms in intuitive ways. The application should not rely on the searcher's knowledge about the structure of specific ontologies. The interface should be enhanced with interactive tools that guide the user towards the desired term; this includes term auto-completion, input query error correction, suggestion of similar terms, clever ranking and grouping of search results. The deliverable should be straightforward to use and easy to distribute to users, independent of the different operating systems that they might use. Furthermore, it should include the terminology MedDRA, which is widely used by researchers within the company.

The previous search application used within AstraZeneca did not manage to meet the users' requirements and was abandoned, as users had to refer to external sources (e.g. Google) to refine their searches when the application presented unwanted results. The users' lack of knowledge around formal logic and ontological structure played an important role towards this result. To quote the AstraZeneca side, "Many of our users do not understand the concept of an ontology and, as a result, at best, struggle to use such an interface and, at worst, refuse to use the tool (e.g. they don't understand the concept of parent/child or if there are multiple terms which should they choose). What users are more familiar with is a google-like interface whereby they are able to type in their search terms without knowledge of an ontology or what that means for them."

Although no log file containing extensive lists of query failures is available for AstraZeneca's search application, examples of failed queries have been given. The reasons behind query failure are diverse; Tables 5.1 and 5.2 list some of the most characteristic failed queries, along with given or deduced justifications for the reason of failure. It is clear that failure of some queries was due to the content of the ontologies, therefore inevitable. Other causes of failure included wrong ontology chosen by the user, incomplete term coverage by the search application, and lack of help and guidance from the system (e.g., relevance feedback or result visualization). These application-level failures should be targeted and alleviated.

Table 5.1: Documented failed queries and suggested reasons for failure.

| Query | Comments | Suggested Reason for Failure |
|---|---|---|
| Hepatotoxicity | Searcher did not find the term and decided to search online to find a synonym for it and reformulate the query as "Liver Disease". | Wrong ontology choice by user. The term is clearly in MedDRA. It is also not an LLT, so the application would find it. |
| NSCLC | The acronym refers to "Non-Small Cell Lung Carcinoma", a concept which is listed in NCIT. Search returned no results. | Although the abbreviation NSCLC is documented in NCIT, it is not a preferred name so it was bypassed by the program. |
| DIHS | Searcher expected the concept "Drug-induced hypersensitivity syndrome" in MedDRA. No results were returned. | DIHS does not appear as an abbreviation in MedDRA, so this behavior was normal. Searcher needed to explicitly specify the preferred name, which is "Drug-induced hypersensitivity syndrome". |
| DRESS Syndrome | Refers to the same concept as DIHS. It was not found. | The term exists as an LLT in MedDRA. The application did not search for LLTs. |

Table 5.2: Documented failed queries and suggested reasons for failure (cont.).

| Query | Comments | Suggested Reason for Failure |
|---|---|---|
| VEGFR | Searcher came across multiple returned terms and did not know which one(s) to choose. Therefore, all were chosen. | The application does not help the user visualize possible relationships among results. Also, NCIT lists VEGFR as a synonym for both "Vascular Endothelial Growth Factor Receptor" and "Vascular Endothelial Growth Factor Receptor 1 (VEGFR-1)", so it is up to the searcher to decide which one is needed. |
| LHRH | Most relevant result was "Gonadotropin Releasing Hormone". The searcher did not know that term, and did not understand why the results did not contain the query. | The preferred term for LHRH is "Gonadotropin Releasing Hormone". |
| NMDA Antagonist | The searcher wanted to find a list of the different NMDA antagonists. No results were found in NCIT, MedDRA or ICD. | This is an ontology organization characteristic. For example, in NCIT, antagonists do not all reside under a general term "NMDA antagonist". The NMDA antagonist "Ketamine" is listed in NCIT as a subclass of "Anesthetic Substance", while "Aptiganel" is listed as a subclass of "Neuroprotective Agent". |
Chapter 6 Design
This chapter addresses the design considerations for each stage of the project. In particular, three distinct stages can be identified: the first involves gaining access to ontologies, the second is concerned with semantic similarity calculations, while the third covers data presentation and interface design.
6.1 Stage I: Access to Medical Ontologies
The first design stage involves gaining access to medical ontologies and terminologies. It might be argued that ontologies should be exploited in a formal ontology language representation, such as OWL. This was abandoned for the following reasons:

1. Not all medical terminologies respect ontological principles, thus they are not all representable in a formal ontology language.
2. Access to the original format of some structured vocabularies (e.g. MedDRA) is neither public, nor free.
3. Currently, using the Java OWL Application Programming Interface (API) (http://owlapi.sourceforge.net/), large OWL ontologies need to be kept in main memory for the whole duration of program execution, which would degrade performance in the case of multiple ontologies.

Fortunately, BioPortal (http://bioportal.bioontology.org/) has already represented hundreds of ontologies and terminologies in a common format, which is publicly accessible through the web Noy et al. (2009). As a result of the above observations, it was decided that the best design choice would be to maintain a local MySQL database with ontology terms. For demonstration purposes, three different structured vocabularies are used in this project:

- NCIT
- MedDRA
- ICDv9

They are downloaded from BioPortal and saved locally. Of these, only NCIT is frequently updated, at approximately monthly intervals. The used versions of NCIT, MedDRA and ICDv9 contain 97946, 69389, and 22400 concepts, respectively.

6.1.1 Database and Table Creation

Initially, a MySQL database named Ontologies is created locally. The database holds a total of seven tables, having the following names:

- CONCEPTS
- DEFINITIONS
- SYNONYMS
- ROOTS
- PARENTS
- SIMILARITY
- MDR_RELATED

Table 6.1: Ontologies database table structure

| Table | Name | Type |
|---|---|---|
| CONCEPTS | code | varchar(20) |
| | preferredName | text |
| | ontology | varchar(15) |
| DEFINITIONS | code | varchar(20) |
| | definition | text |
| SYNONYMS | code | varchar(20) |
| | synonym | text |
| ROOTS | code | varchar(20) |
| | ontology | varchar(15) |
| PARENTS | code | varchar(20) |
| | parentCode | varchar(20) |
| SIMILARITY | termcode1 | varchar(20) |
| | termcode2 | varchar(20) |
| | rada | double |
| | wu | double |
| | resnik | double |
| | li | double |
| MDR_RELATED | code | varchar(20) |
| | relatedCode | varchar(20) |
The CONCEPTS table will hold basic information about the concepts that are present in an ontology. More specifically, for each concept, a record which contains its preferred name, code, and ontology will be inserted into the table. Due to the fact that multiple definitions and synonyms might exist for a single concept, these will be held in separate tables, DEFINITIONS and SYNONYMS, respectively. The ROOTS table will contain all the top level terms of the ontology/terminology. Usually, multiple independent hierarchies exist, therefore multiple roots can be found. For example, MedDRA contains 26 parallel hierarchical structures. These so-called roots can be joined under a top-level universal imaginary node, that guarantees the presence of a single root in the ontology/terminology. The table PARENTS will contain hierarchical information about the terms. For each concept, all of its parents will be listed. This table can be exploited to compute semantic similarity at the next stage. The SIMILARITY table will hold semantic similarity scores between pairs of concepts that belong to the same ontology. The similarity metrics used are those of Rada, Wu-Palmer, Resnik and Li. Finally, the MDR_RELATED table will contain MedDRA-specific concepts that do not clearly belong to any hierarchy themselves, but are considered very close to terms that do. The detailed structure of tables is shown in Table 6.1. All tables, except SIMILARITY, will be populated at this stage.
6.1.2 Populating the Database Tables
The procedure for downloading the chosen ontologies and populating the database tables relies on the BioPortal Representational State Transfer (RESTful) services (http://www.bioontology.org/wiki/index.php/BioPortal_REST_services). These services allow the transfer of medical ontology information, from BioPortal servers to end user systems, through the Hypertext Transfer Protocol (HTTP). The response is, by default, in XML format, with limited support for JavaScript Object Notation (JSON) format. Complete support for JSON output is scheduled for the next release. Accessing the BioPortal RESTful services is performed through the usage of intuitive Uniform Resource Identifiers (URIs) of predefined structure. All that is required for gaining access to the RESTful services is a user-specific API key, which is immediately given when a free account is created on the BioPortal website. Some examples of the types of available term services are given in Table 6.2. Quantities in brackets are user-defined.

As an example request, consider the "get all terms" service for NCIT:

http://rest.bioontology.org/bioportal/virtual/ontology/1032/all?pagesize=50&pagenum=1&apikey=c6ae1b27-9f86-4e3c-9dcf-087e1156eabe

The virtual ontology id 1032 refers to NCIT. As stated before, the API key is a string identifier which is received upon free registration to BioPortal. The response includes the first 50 terms of the NCIT ontology. A part of the XML response is shown in Fig. 6.1. It should be observed that the "get all terms" service does not actually return all terms from a specific ontology at once; for each request, the user must provide a terms-per-page number, and the particular page that he/she wishes to view. All pages can be returned, if the user continues issuing page requests with increasing {pagenum}, provided that the user knows the number of concepts that the ontology includes.
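As a rough illustration of the paging scheme, the sketch below builds "get all terms" URIs in the format quoted above and derives how many page requests are needed for a known number of concepts. The class and method names are hypothetical, the API key is a placeholder, and the code only assembles strings; it does not contact BioPortal.

```java
final class BioPortalPaging {

    // Base of the URI format quoted in the text above.
    static final String BASE = "http://rest.bioontology.org/bioportal/virtual/ontology/";

    /** Builds the "get all terms" URI for one page of an ontology. */
    static String allTermsUri(int ontologyId, int pageSize, int pageNum, String apiKey) {
        return BASE + ontologyId + "/all?pagesize=" + pageSize
                + "&pagenum=" + pageNum + "&apikey=" + apiKey;
    }

    /** Number of page requests needed to cover conceptCount terms (ceiling division). */
    static int pageCount(int conceptCount, int pageSize) {
        return (conceptCount + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        int pages = pageCount(97946, 50);   // NCIT concept count reported in the text
        System.out.println(pages + " requests, first: "
                + allTermsUri(1032, 50, 1, "YourAPIKey"));
    }
}
```

For the NCIT version used here (97946 concepts) and 50 terms per page, this gives 1959 requests, the last of which returns a partially filled page.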
Table 6.2: Examples of URI formats for BioPortal RESTful services.

| Service | URI format | Comments |
|---|---|---|
| Get all terms | http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}/all?pagesize={pagesize}&pagenum={pagenum}&apikey={YourAPIKey} | Returns all terms of an ontology page by page. |
| Get concept info | http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}/{conceptid}&apikey={YourAPIKey} | Returns information about a specific term, such as synonyms and definitions. |
| Get latest ontology version | http://rest.bioontology.org/bioportal/virtual/ontology/{ontologyid}?apikey={YourAPIKey} | Returns the currently used version id of an ontology. |
Figure 6.1: A part of the XML response for the get all terms query of Table 6.2.
Access to BioPortal RESTful services can be achieved programmatically in a simpler and automated manner, using the ontoCAT Java API (http://www.ontocat.org/). This API provides classes and methods tailored to the BioPortal services. It provides a high level abstraction, that handles queries and XML responses behind the scenes and returns lists of Java objects that contain the information needed to populate the database tables. The provided methods are shown in Fig. 6.2. The ontoCAT API method getAllTerms() returns a list of all terms in the ontology, which is what is needed in this project. Its drawback is that it keeps all ontology terms in memory, causing a heavy memory burden which may lead to out of memory exceptions when further processing is needed. For this reason, I introduced a new function getAllTermsPageByPage(), which allows retrieving and processing terms page by page in a loop. Then, memory can be released after each iteration.

In order to save information to the database tables, the getAllTermsPageByPage() method is called. It is chosen that pagesize=1, so that only one concept per page is returned. Then, for each concept returned by ontoCAT, the required information is saved to the appropriate table in the Ontologies database. The procedure is shown in Fig. 6.3. The Java application, that was developed for this project, requests all concepts of a BioPortal ontology, page by page, using ontoCAT methods. OntoCAT acts as an intermediary, responsible for accessing the RESTful services of BioPortal. It returns Java object(s) back to the Java application, after processing the XML response of BioPortal. Once the Java application receives information about a term, all that is left is to choose the appropriate table(s) in the Ontologies database and, through the Java Database Connectivity (JDBC) API, insert record(s) of MySQL format. Once all pages are processed, the Java application finishes execution and all tables, except SIMILARITY, are populated.

Figure 6.2: The provided methods of the ontoCAT API Adamusiak et al. (2011).

Figure 6.3: Populating the Ontologies database is performed with the help of the ontoCAT API.
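The page-by-page idea can be sketched independently of ontoCAT. In the sketch below, PageSource is a hypothetical stand-in for a paged term service and is not part of the ontoCAT API; the handler stands for the step that writes a term to the database. The actual getAllTermsPageByPage() implementation is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class PagedProcessing {

    /** Hypothetical paged service (not ontoCAT); page numbers start at 1. */
    interface PageSource<T> {
        List<T> fetchPage(int pageNum, int pageSize);
    }

    /**
     * Fetches and handles one page at a time, so that at most one page of
     * items is referenced at any moment. Returns the number of handled items.
     */
    static <T> int processAll(PageSource<T> source, int totalItems, int pageSize,
                              Consumer<T> handler) {
        int pages = (totalItems + pageSize - 1) / pageSize;   // ceiling division
        int processed = 0;
        for (int pageNum = 1; pageNum <= pages; pageNum++) {
            List<T> page = source.fetchPage(pageNum, pageSize);
            for (T item : page) {
                handler.accept(item);      // e.g. insert the term into the database
                processed++;
            }
            // 'page' goes out of scope here and becomes eligible for garbage collection.
        }
        return processed;
    }

    /** In-memory source used only to demonstrate and test the loop. */
    static <T> PageSource<T> listBacked(List<T> all) {
        return (pageNum, pageSize) -> {
            int from = Math.min((pageNum - 1) * pageSize, all.size());
            int to = Math.min(from + pageSize, all.size());
            return new ArrayList<>(all.subList(from, to));
        };
    }
}
```

The loop relies on the total number of items being known in advance, which mirrors the requirement stated earlier that the number of concepts in the ontology must be known in order to request all pages.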
6.2 Stage II: Computation of Semantic Similarity
This stage deals with the calculation of semantic similarity scores between pairs of concepts that reside in the same ontology. Semantic similarity scores will be saved in the SIMILARITY table and will later be used in the search application for the semantic grouping of search results and the suggestion of highly similar terms to a term chosen by the user. To populate the SIMILARITY table, the already populated tables CONCEPTS, PARENTS and ROOTS will be used.
6.2.1 Term Neighborhoods
Computing semantic similarity between all concept pairs in an ontology is a tedious task which requires a lot of computational and storage resources. Let us consider NCIT as an example: there are 97946 concepts, yielding 97946² pairs whose semantic similarity must be calculated. (Due to the symmetric property of similarity, there is actually no need to calculate all 97946² pairs. Also, self-similarities can be avoided, depending on the similarity metric used. Still, the numbers are huge.) This is not the only burden; semantic similarity calculation of a single pair is, by itself, a time-consuming process. For example, even for the simple Rada edge-counting measure, all connecting paths between two concepts must first be computed (i.e. a recursive process) and, finally, the shortest one chosen. In large ontologies, it is not unusual that multiple paths of variable length exist between two concepts, so finding the minimum path is not as trivial as it may seem.

In the final search application, semantic similarity will be used for suggesting highly similar terms to the query or grouping highly similar terms. Therefore, term pairs whose semantic similarity is low will never be needed. For example, there is no point in storing or even computing the similarity between the NCIT concept "Greece" and the concept "Lung", since the resulting very low score will never be used in the search application itself. The term "Greece" will never be suggested as a highly similar term of "Lung", and vice versa.

For the above reasons, the design choice for this project is to exploit the geometrical structure of ontologies/terminologies and, for each concept, calculate semantic similarity only with concepts that are placed within a certain neighborhood from it. Given a concept c, its neighborhood is chosen to contain:

- All concepts that are descendants of c, at most two levels down in the hierarchy.
- All concepts that are siblings of c.
- All concepts that are ancestors of c, at most two levels up in the hierarchy.

This choice greatly simplifies the computational burdens associated with semantic similarity computation in huge ontologies, without threatening the performance of the search application. Furthermore, valuable MySQL storage is not wasted.
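The neighborhood definition above could be computed along the following lines. This is a minimal in-memory sketch under stated assumptions: the hierarchy is given as a map from each concept code to the set of its parent codes (the information held in the PARENTS table), siblings are taken to be concepts that share at least one parent with c, and all names are hypothetical rather than those of the project's code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TermNeighborhood {

    /** Turns a child -> parents map into a parent -> children map. */
    static Map<String, Set<String>> invert(Map<String, Set<String>> parents) {
        Map<String, Set<String>> children = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : parents.entrySet()) {
            for (String parent : entry.getValue()) {
                children.computeIfAbsent(parent, k -> new HashSet<>()).add(entry.getKey());
            }
        }
        return children;
    }

    /** Collects every node reachable from start in at most maxLevels steps along edges. */
    static Set<String> reachable(String start, Map<String, Set<String>> edges, int maxLevels) {
        Set<String> seen = new HashSet<>();
        Set<String> frontier = new HashSet<>();
        frontier.add(start);
        for (int level = 0; level < maxLevels; level++) {
            Set<String> next = new HashSet<>();
            for (String node : frontier) {
                for (String neighbour : edges.getOrDefault(node, Set.of())) {
                    if (seen.add(neighbour)) {
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    /** Ancestors and descendants within 'levels' steps, plus siblings, excluding c itself. */
    static Set<String> neighborhood(String c, Map<String, Set<String>> parents, int levels) {
        Map<String, Set<String>> children = invert(parents);
        Set<String> result = new TreeSet<>();
        result.addAll(reachable(c, parents, levels));     // ancestors
        result.addAll(reachable(c, children, levels));    // descendants
        for (String parent : parents.getOrDefault(c, Set.of())) {
            result.addAll(children.getOrDefault(parent, Set.of()));   // siblings
        }
        result.remove(c);
        return result;
    }
}
```

Calling neighborhood(c, parents, 2) corresponds to the two-level choice made in this project; a larger value of the last argument would widen the neighborhood at the cost of more stored pairs.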
6.2.2 Semantic Similarity Calculation
In this project, four different semantic similarity metrics have been chosen: Rada, Wu and Palmer, Resnik and Li. Due to lack of a specific corpus for Resnik similarity, Seco's formula is used, as presented in Chapter 3 (see Eq. 3.28). For the calculation of semantic similarity, I developed a Java application, which contains the following basic methods (method parameters and other utility methods are not shown, for simplicity):

- getAllPathsToRootDB()
- getMinimumPathToRootDB()
- getAllPathsBetweenTwoConceptsDB()
- getMinimumPathBetweenConceptsDB()
- computeLocalSimilarities()
- NormalizedRadaSimilarity()
- WuPalmerSimilarity()
- LiSimilarity()
- ResnikSimilarity()

The method getAllPathsToRootDB() uses the PARENTS table to recursively build all paths between a concept and any of the roots of an ontology. Recursion stops every time a concept which belongs to the ROOTS table is encountered. The method getMinimumPathToRootDB() simply calls getAllPathsToRootDB() and chooses the minimum path out of the returned ones. The method getAllPathsBetweenTwoConceptsDB() first computes each term's paths to the root separately, using the getAllPathsToRootDB() method. Then, it compares each of the first term's paths to root to each of the second term's paths to root; if any two paths have common nodes, it means that a common path (that passes through their LCS) can be defined between the nodes; if no common nodes are present, a common path only exists through the single (imaginary) root of the ontology. The method getMinimumPathBetweenConceptsDB() simply calls getAllPathsBetweenTwoConceptsDB() and selects the shortest path. The methods NormalizedRadaSimilarity(), WuPalmerSimilarity(), LiSimilarity(), and ResnikSimilarity() call the previously mentioned path building methods with two concepts as arguments, and produce a numerical value that corresponds to the particular similarity metric. The method computeLocalSimilarities() is the one that is called from main(). This method is responsible for computing the neighborhoods of a term, calling NormalizedRadaSimilarity(), WuPalmerSimilarity(), LiSimilarity() and ResnikSimilarity() on each pair of concepts, and saving the results to the SIMILARITY table.
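The path-building logic described above can be sketched in memory as follows. This is not the project's code: the database-backed methods are replaced by a hypothetical parents map and roots set, the names are illustrative, and path length is measured as a number of edges, with a join through the imaginary universal root costing one extra edge on each side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class OntologyPaths {

    /** Recursively builds every upward path from a concept to a root (concept first, root last). */
    static List<List<String>> allPathsToRoot(String concept,
                                             Map<String, Set<String>> parents,
                                             Set<String> roots) {
        List<List<String>> paths = new ArrayList<>();
        if (roots.contains(concept)) {
            // Recursion stops as soon as a root is encountered.
            List<String> path = new ArrayList<>();
            path.add(concept);
            paths.add(path);
            return paths;
        }
        for (String parent : parents.getOrDefault(concept, Set.of())) {
            for (List<String> tail : allPathsToRoot(parent, parents, roots)) {
                List<String> path = new ArrayList<>();
                path.add(concept);
                path.addAll(tail);
                paths.add(path);
            }
        }
        return paths;
    }

    /** Length, in edges, of the shortest path between two concepts. */
    static int shortestPathLength(String c1, String c2,
                                  Map<String, Set<String>> parents,
                                  Set<String> roots) {
        List<List<String>> paths1 = allPathsToRoot(c1, parents, roots);
        List<List<String>> paths2 = allPathsToRoot(c2, parents, roots);
        if (paths1.isEmpty() || paths2.isEmpty()) {
            throw new IllegalArgumentException("concept has no path to a root");
        }
        int best = Integer.MAX_VALUE;
        for (List<String> p1 : paths1) {
            for (List<String> p2 : paths2) {
                // Fallback: the two paths only meet at the imaginary universal root.
                int length = p1.size() + p2.size();
                for (int i = 0; i < p1.size(); i++) {
                    int j = p2.indexOf(p1.get(i));     // position of a common node, if any
                    if (j >= 0) {
                        length = Math.min(length, i + j);
                    }
                }
                best = Math.min(best, length);
            }
        }
        return best;
    }
}
```

A distance of this kind is the raw input of edge-counting measures such as Rada's; the normalization and the other three metrics are defined in Chapter 3 and are not reproduced in this sketch.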
6.3 Stage III: Interface Design and Data Presentation
At the end of stage II, the Ontologies database is complete and needs no further changes. The third stage deals with querying the available data and presenting it to the end user. Web technologies were chosen for developing the search application. Building the search application in a web environment offers, among others, the following advantages:
- The files reside on a central server, and not on each client's machine individually.
- Updates may be done transparently.
- Access to the application by client systems is independent of their operating system.
- The application can benefit from the browser's built-in functionality (e.g. no need to provide separate back-forward buttons).
- The application can benefit from the huge variety of interactivity tools that have been designed for webpages.
The information to be presented is fetched from the populated MySQL tables using the server-side scripting language PHP: Hypertext Preprocessor (PHP). Presentation and styling are achieved using the Extensible HyperText Markup Language (XHTML) and Cascading Style Sheets (CSS), respectively. Auto-completion is performed using Asynchronous JavaScript and XML (AJAX), which returns data in JSON format, to be fed to the Twitter Typeahead jQuery plugin⁷. To further favor interactivity, various jQuery plugins are selected, including Tipsy⁸ and Throttle/Debounce⁹. Finally, for visualization purposes, the D3 framework¹⁰ is used. The major advantage of all the above technology choices is that they are widely used, cross-platform and open-source, meaning that they are actively maintained, highly portable and modifiable. More details about their usage will be presented in chapter 7.
6.4
Summary of Technology Choices
A summary of the technology choices for the project is shown in Table 6.3. The table is divided into sections that refer to the three stages described previously. The technologies, languages, frameworks and APIs used at each particular stage are mentioned.
7 https://github.com/twitter/typeahead.js
8 https://github.com/jaz303/tipsy
9 http://benalman.com/projects/jquery-throttle-debounce-plugin/
10 http://d3js.org/
Table 6.3: Technology choices for the project.
Stage | Description | Technologies, Languages, Frameworks, or APIs
I | Access to Medical Ontologies/Terminologies | Java; BioPortal RESTful Web API; ontoCAT Java API; JDBC API; MySQL
II | Computation of Semantic Similarity | Java; JDBC API; MySQL
III | Interface Design and Data Presentation | PHP; MySQL; AJAX; XHTML; CSS; JavaScript; D3; jQuery; Twitter Typeahead; jQuery Tipsy; jQuery Throttle/Debounce; JSON
Chapter 7 Implementation
This chapter provides a thorough description of the features that are present in the final search application. It introduces the visual interface, which is responsible for interaction with the end user. Furthermore, it familiarizes the reader with the functionality of the individual components that are responsible for the presentation, styling and interactive behavior of the application.
7.1
Structure
The organization of the files used for building the web application is shown in Fig. 7.1. The functionality of each file is briefly described in Tables 7.1, 7.2, 7.3 and 7.4.
7.2
Search Entry Form
As mentioned in section 4.2, queries usually contain at most four words. That result reflects query specification in web-based search engines, where users can search for any topic they wish. In the more granular biomedical domain, users usually attempt more targeted searches. Furthermore, the application developed in this project is aimed at term searching, instead of document searching. Thus, users are aware that they are searching for short terms instead of multi-page documents, and it is likely that queries are even shorter than the average 2.8 words. Indeed, the example queries given by AstraZeneca consist of at most two words. Also, due to the auto-completion feature, lengthy terms will not need to be typed, but simply chosen from a dynamic list. Although short queries are expected, a wide entry form is chosen, to resemble a Google-like experience and provide better visibility for auto-completion features.
Figure 7.1: The organization of the files that comprise the web application. These files are responsible for the presentation, styling and interactive behavior of the web application.
Table 7.1: PHP files used in the search application.
File | Description
mysqli_connect.php | Script which establishes a connection to the MySQL Ontologies database. This script should not be publicly accessible, for security reasons.
index.php | The main page. It also handles enter-key or mouse-click searches, by querying the Ontologies database and presenting the search results table.
performQuery.php | Script which queries the Ontologies database and echoes a JSON array of the results.
terminfo.php | Presents information about a specific term, including its code, definitions, and synonyms. A visualization of highly similar terms is shown, using d3.v3.min.js and jquery.tipsy.js. Also, an XML version of the visualization is shown.
Combinatorics.php | Performs permutations of a set of items (e.g. words of the query).
JaccardSimilarity.php | Computes the Jaccard lexical similarity between two strings.
Table 7.2: XHTML files used in the search application.
File | Description
header.xhtml | Contains the header information shared among all web pages. This includes the search box.
footer.xhtml | Contains the footer information shared among all web pages.
Table 7.3: CSS files used in the search application.
File | Description
contentStyle.css | Defines styles for the web application interface.
tipsy.css | Defines styles for building interactive tooltips.
type.css | Defines styles for the auto-completion function.
Table 7.4: JavaScript files used in the search application.
File | Description
d3.v3.min.js | A JavaScript library that allows binding arbitrary objects to the DOM. It facilitates the development of visualization tools.
hogan-2.0.0.js | A JavaScript library that allows the sharing of templates between client and server.
jquery-1.10.1.js | A JavaScript library which facilitates DOM manipulation, event handling, animation and AJAX.
jquery.ba-throttle-debounce.js | A plug-in for throttle and debounce. Throttle limits the rate of execution of handlers. Debouncing ensures that a function is executed only once within a certain time period.
jquery.tipsy.js | A jQuery plug-in for creating Facebook-like tooltips.
typeahead.js | A jQuery plug-in for auto-completion, developed by Twitter. It may receive an array of JSON objects to build the auto-completion pop-up menu.
performAsynchronousQuery.js | A script which calls performQuery.php and feeds the returned JSON object array to typeahead.js.
The search box can be seen in Fig. 7.2, inside the main window of the search application (index.php). The search box is placed at the top-central part of the interface. It is visible on every page that a user visits, so that new queries can be performed anytime the user wishes. The box is characterized by rounded corner edges, a CSS3 feature. Also, a helpful message is set as a placeholder when the search box is out of focus. This message informs the user of the type of query that should be input. Once the user clicks inside the box, the grey message disappears and a blinking cursor appears (see Fig. 7.3). If the user clicks anywhere else within the page, the message reappears.
Figure 7.2: The main window of the search application. The search box is placed at the top of the screen, with central horizontal alignment. A submit button labeled Search is also provided, to assist users that prefer mouse-clicking.
Figure 7.3: Once the user clicks inside the search box, the grey help message disappears and a blinking cursor takes its place.
7.3
Handling the Input Query
The user may input a multi-word query in the provided search box. How the input query is handled depends on the speed at which the user types, and on the keys or buttons that are pressed or clicked. To trigger the search, the user can press the Return key, select a term from the pop-up auto-completion menu, or click the button labeled Search, which is placed on the right side of the search input form.
7.3.1
Typing Speed
If a user presses keys at a fast pace, there is no need to burden the server with consecutive requests, when only the last response will be examined by the user. To achieve this, a debounce function is used (defined in jquery.ba-throttle-debounce.js), which ensures that only the last event within a certain time period, specified in milliseconds, is taken into account. In this way, unintended requests are avoided and the application's performance is maintained at high levels.
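The debouncing idea can be sketched in a few lines of plain JavaScript. This is an illustrative stand-in, not the code of the jQuery plug-in used in the project: the function name, the 250 ms delay and the injectable `timers` object (present only so that the sketch can be exercised without waiting for real time to pass) are all assumptions.

```javascript
// Debounce: postpone calling `fn` until `delayMs` have elapsed without a
// further call, so that only the last call of a burst takes effect.
function debounce(
  fn,
  delayMs,
  timers = {
    set: (cb, ms) => setTimeout(cb, ms),
    clear: (id) => clearTimeout(id),
  }
) {
  let pending = null;
  return function (...args) {
    if (pending !== null) timers.clear(pending); // drop the earlier request
    pending = timers.set(() => {
      pending = null;
      fn.apply(this, args);
    }, delayMs);
  };
}

// Hypothetical fake timer queue, used here instead of real timeouts.
const fakeTimers = {
  queue: new Map(),
  nextId: 1,
  set(cb) {
    const id = this.nextId++;
    this.queue.set(id, cb);
    return id;
  },
  clear(id) {
    this.queue.delete(id);
  },
  flush() {
    const callbacks = [...this.queue.values()];
    this.queue.clear();
    callbacks.forEach((cb) => cb());
  },
};

// Simulate a user quickly typing "lung": four key events, one request.
const sentQueries = [];
const onKeyUp = debounce((query) => sentQueries.push(query), 250, fakeTimers);
for (const typed of ["l", "lu", "lun", "lung"]) onKeyUp(typed);
fakeTimers.flush();
```

After the burst, `sentQueries` holds only the final query, which is the behavior that spares the server the intermediate requests.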
7.3.2
Querying the Database
Once a query has been approved for processing, it is sanitized, i.e. it is ensured that its format is appropriate for insertion into a formal MySQL query and that SQL injections are avoided. The resulting MySQL query searches the CONCEPTS and SYNONYMS tables of the Ontologies database for terms that contain the input words as prefixes. For example, the input query "can lun" will return, among others, the terms "lung cancer" and "cancer of lung", since all input words are found as prefixes of words included in the terms. On the other hand, the query "carc lun" will not return these two terms, since the word "carc" is not matched. Note that the order of the input query words is not important. Also, mid-word matches are not supported, so the query "ance" will not return the term "cancer".
Finally, only a single result is returned per concept, even though a single concept might have multiple synonyms that match the same query. For example, the query "lung ca" matches both "lung cancer" and "lung carcinoma", terms which correspond to the same concept. Presenting both terms in the results would be redundant, so only the term that is lexically closest to the query is presented (i.e., "lung cancer"). Thus, a term appearing in the results is not always the preferred term for a concept, but the term that best matches the given input query.
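The matching rules above can be expressed compactly in JavaScript. The project performs this step with MySQL queries, so the sketch below only mirrors the rules on in-memory data: the row shape (`conceptId`, `term`), the concept identifiers, the extra term "lung cavity" and the placeholder scoring function are hypothetical.

```javascript
// True when every query word is a prefix of some word of the term.
// Word order is irrelevant and mid-word matches are rejected.
function matchesAllPrefixes(query, term) {
  const termWords = term.toLowerCase().split(/\s+/).filter(Boolean);
  const queryWords = query.toLowerCase().split(/\s+/).filter(Boolean);
  return queryWords.every((q) => termWords.some((w) => w.startsWith(q)));
}

// Keep one row per concept: the matching term with the highest score.
// `score(query, term)` is any lexical similarity; larger means closer.
function bestTermPerConcept(query, rows, score) {
  const best = new Map();
  for (const row of rows) {
    if (!matchesAllPrefixes(query, row.term)) continue;
    const current = best.get(row.conceptId);
    if (!current || score(query, row.term) > score(query, current.term)) {
      best.set(row.conceptId, row);
    }
  }
  return [...best.values()];
}

// Hypothetical rows, as they might come from CONCEPTS and SYNONYMS.
const rows = [
  { conceptId: "C1", term: "lung carcinoma" },
  { conceptId: "C1", term: "lung cancer" },
  { conceptId: "C1", term: "cancer of lung" },
  { conceptId: "C2", term: "lung cavity" },
];

// Placeholder score (an assumption): prefer terms whose length is closest
// to the query length. The project uses Levenshtein and Jaccard instead.
const lengthScore = (query, term) => -Math.abs(term.length - query.length);

const results = bestTermPerConcept("lung ca", rows, lengthScore);
```

With these rows, the query "lung ca" yields one entry per concept, and the entry for the first concept is "lung cancer" rather than "lung carcinoma".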
7.3.3
Ranking and Grouping of Search Results
Lexical similarity determines the ranking of search results, independent of how the search is triggered. For each term returned from the database query, the lexical similarity of its term name is computed against the input query. The final score is the maximum of a character-based and a word-based lexical similarity. In this project, Levenshtein and Jaccard similarities are used, implemented as PHP functions. The similarity takes a value in [0, 1] and is converted to a percentage for visual purposes.
Semantic similarity determines the grouping of search results. For each term in the results, its semantic similarity to all the remaining result terms that reside lower in the table is retrieved. This is achieved through MySQL queries to the SIMILARITY table. Highly similar terms (i.e., terms whose semantic similarity score is larger than a threshold, 0.75 or 75% in this project) are grouped together. Within each semantic group, the term with the highest lexical similarity to the query acts as the main concept in the table row, and similar terms appear indented. This choice preserves the lexical ranking. As an example, a search for "Lung" is shown in Fig. 7.4. The terms "Right Lung" and "Left Lung" are highly similar to "Lung", so they are presented in the same row. The main term which shelters the rest is "Lung", since it is lexically identical to the input query. Semantic grouping is performed only in the return-key or mouse-click search cases, and not in the auto-completion menu.
Figure 7.4: Terms that would otherwise appear on their own table row are grouped under a term that matches the query more closely lexically, when their semantic similarity to that term is higher than a threshold.
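The ranking and grouping steps can be sketched as follows. The project implements them in PHP with lookups in the SIMILARITY table; this JavaScript version is only an approximation under stated assumptions: Levenshtein similarity is normalized as 1 - distance / (length of the longer string), Jaccard similarity is computed on lower-cased word sets, and the semantic scores in `SEMANTIC` are hypothetical values.

```javascript
// Classic dynamic-programming edit distance, keeping one row at a time.
function levenshteinDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization to [0, 1].
function levenshteinSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(a, b) / longest;
}

// Word-based Jaccard: shared words divided by all distinct words.
function jaccardSimilarity(a, b) {
  const wa = new Set(a.split(/\s+/).filter(Boolean));
  const wb = new Set(b.split(/\s+/).filter(Boolean));
  const shared = [...wa].filter((w) => wb.has(w)).length;
  const union = new Set([...wa, ...wb]).size;
  return union === 0 ? 1 : shared / union;
}

// Final lexical score: maximum of the character- and word-based measures.
function lexicalScore(query, term) {
  const q = query.toLowerCase();
  const t = term.toLowerCase();
  return Math.max(levenshteinSimilarity(q, t), jaccardSimilarity(q, t));
}

const THRESHOLD = 0.75;

// Rank by lexical score, then attach to each row the lower-ranked terms
// whose semantic similarity to the row's main term exceeds the threshold.
function rankAndGroup(query, terms, semanticSimilarity) {
  const ranked = terms
    .map((term) => ({ term, score: lexicalScore(query, term) }))
    .sort((x, y) => y.score - x.score);
  const grouped = new Set();
  const rows = [];
  ranked.forEach((entry, i) => {
    if (grouped.has(entry.term)) return; // already shown under another term
    const similar = ranked
      .slice(i + 1)
      .filter((o) => !grouped.has(o.term))
      .filter((o) => semanticSimilarity(entry.term, o.term) > THRESHOLD);
    similar.forEach((o) => grouped.add(o.term));
    rows.push({ main: entry, similar });
  });
  return rows;
}

// Hypothetical semantic scores standing in for the SIMILARITY table.
const SEMANTIC = { "Lung|Right Lung": 0.9, "Lung|Left Lung": 0.9 };
const semanticSimilarity = (a, b) =>
  SEMANTIC[`${a}|${b}`] || SEMANTIC[`${b}|${a}`] || 0;

const resultRows = rankAndGroup(
  "Lung",
  ["Right Lung", "Lung", "Left Lung", "Lung Cancer"],
  semanticSimilarity
);
```

For the query "Lung", the term "Lung" ranks first with a score of 1 and shelters "Right Lung" and "Left Lung", while "Lung Cancer" keeps its own row, mirroring the layout of Fig. 7.4.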
7.3.4
Return-key or Mouse-click Search
If the user presses the Return key or clicks on the Search button, the query is processed by index.php. The form is submitted using the HTTP GET method, as can be seen from the URL in Fig. 7.5. The index.php script receives the query string through the predefined $_GET variable in PHP. After the MySQL database is queried, results are presented in a table with clickable entries that redirect to the specific term information page. Lexical ranking and semantic grouping are performed. Each table entry contains basic information about the specific concept, including term name, preferred name for the concept, code identifier in the ontology, abbreviation of the ontology it belongs to, and lexical similarity score from comparison to the input query.
7.3.5
Auto-completion Search
If the user presses any key other than Return, the query is processed by performAsynchronousQuery.js to produce auto-completion. Auto-completion requires that the page is not reloaded. The JavaScript function performAsynchronousQuery() uses AJAX to send an asynchronous query request to performQuery.php. The performQuery.php script queries the MySQL database and returns an array of the results as JSON objects (see Fig. 7.6), which, in turn, are fed to typeahead.js to create the auto-completion pop-up menu, as seen in Fig. 7.7. Each entry in the auto-completion pop-up menu is dedicated to a single term and presents four different types of information about it. On the top-left part, the term name that best matches the query is shown. This is not always the preferred name for the term. For this reason, the lower-left part of the entry always holds the preferred term name for the concept. The lower-right part hosts the abbreviation of the ontology/terminology from which the term is extracted. Finally, the upper-right part shows the lexical similarity to the input query. For this project, the maximum number of entries that the auto-completion pop-up menu can contain has been set to 8.
Figure 7.5: Pressing the Return key or clicking the Search button submits the query to index.php and a table of search results is added to the interface.
Figure 7.6: Part of the JSON response from performQuery.php, for the input query "rash". Each JSON object represents a term matching the query, and contains information that can be used for its presentation.
Figure 7.7: Pressing any key other than Return submits the query through AJAX to performQuery.php and an auto-completion pop-up menu is created from the JSON response.
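The step from the JSON response to the menu entries can be sketched as a small transformation. The field names of the JSON objects (`term`, `preferredName`, `ontology`, `similarity`) and the placeholder response below are assumptions made for illustration; the actual response format is the one shown in Fig. 7.6, and the rendering itself is left to typeahead.js in the project.

```javascript
const MAX_SUGGESTIONS = 8; // limit used in this project

// Turn the parsed JSON response into at most `limit` menu entries,
// ordered by lexical similarity to the query.
function buildSuggestions(results, limit = MAX_SUGGESTIONS) {
  return results
    .slice() // do not reorder the caller's array
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit)
    .map((r) => ({
      value: r.term, // top-left: best-matching term name
      preferredName: r.preferredName, // lower-left: preferred name
      ontology: r.ontology, // lower-right: ontology abbreviation
      similarity: Math.round(r.similarity * 100) + "%", // upper-right
    }));
}

// Hypothetical response with ten matches; real content would come from
// performQuery.php.
const response = JSON.stringify(
  Array.from({ length: 10 }, (_, i) => ({
    term: "rash " + i,
    preferredName: "Rash " + i,
    ontology: "MedDRA",
    similarity: 1 - i / 10,
  }))
);

const suggestions = buildSuggestions(JSON.parse(response));
```

Only the eight best-scoring entries survive, and each carries the four pieces of information that an entry of the pop-up menu displays.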
7.4
Error Correction
If no term matches are found for the input query, the application tries to guess the intended query and match it to the closest term in the CONCEPTS and SYNONYMS tables. Returning a "No results" screen was not preferred, as it is not helpful and can frustrate the user. The application uses soundex keys to perform elementary error correction for terms that sound similar, but are spelt differently due to user error. An example is shown in Fig. 7.8, where the user input is "lyng". Since there are no matches in the database, the application suggests the term "lung" as a possible correction for the user to choose. The message takes the form "Did you mean <suggestion> instead of <no result query>?". To accept the correction, the user can simply click on the provided link, instead of having to refine the query in the search box.
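To illustrate how soundex keys support this kind of correction, a minimal JavaScript sketch follows. It implements the standard four-character American Soundex code; whether the project computes the keys in PHP or in MySQL, and how it chooses among several candidates with the same key, is not specified here, so the `suggestCorrection` helper (which returns the first candidate with a matching key) is an assumption.

```javascript
// Standard American Soundex: first letter plus three digits.
function soundex(word) {
  const codes = {
    b: 1, f: 1, p: 1, v: 1,
    c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
    d: 3, t: 3,
    l: 4,
    m: 5, n: 5,
    r: 6,
  };
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let key = letters[0].toUpperCase();
  let prev = codes[letters[0]] || 0;
  for (let i = 1; i < letters.length && key.length < 4; i++) {
    const ch = letters[i];
    const code = codes[ch] || 0;
    // Skip vowels and repeats of the previous consonant code.
    if (code !== 0 && code !== prev) key += code;
    // "h" and "w" do not separate letters that share a code; vowels do.
    if (ch !== "h" && ch !== "w") prev = code;
  }
  return key.padEnd(4, "0");
}

// Hypothetical helper: first known term that sounds like the query.
function suggestCorrection(query, knownTerms) {
  const key = soundex(query);
  return knownTerms.find((term) => soundex(term) === key) || null;
}

const suggestion = suggestCorrection("lyng", ["liver", "lung", "heart"]);
```

Both "lyng" and "lung" map to the key L520, which is why the misspelt query of Fig. 7.8 can be corrected. Queries whose key matches no stored term receive no suggestion.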
Figure 7.8: Error correction when input query is lyng. The closest term is suggested, as a clickable link.
7.5
Term Information Presentation
Once the user selects a term, either from the table of results or from the auto-completion pop-up menu, the terminfo.php script is called. The script accepts four different types of information about the term: 1. term name, 2. code, 3. preferred concept name, 4. ontology it belongs to. This information is passed using the GET method. The terminfo.php script produces an XHTML page which presents this information (see Figures 7.10-7.11). Furthermore, using the term code, the SIMILARITY table is queried to look for terms that are highly similar to the currently viewed term¹. Using the D3 JavaScript library, the returned terms are mapped to SVG circles, the size of which differs depending on their semantic similarity score to the currently viewed term. These circles are organized in a spiral, whose central terms are the most similar to the currently viewed term. Moving towards the edge of the spiral, terms become less and less similar to the viewed term. Thus, larger circles reside at the center of the spiral, and their size decreases towards the periphery. Inside each circle, a substring of the term name is shown. When the user places the mouse cursor over a circle, a tooltip with the full term name and semantic similarity score to the viewed term is immediately presented (see Fig. 7.9). When the user clicks on a circle, he/she is redirected to the particular term's information page. Circle size is not the only tool used for classifying terms. It would also be desirable that the user can distinguish if a term is:
1 In the term information figures presented in this thesis, Wu-Palmer semantic similarity is being used. This can be easily changed in the terminfo.php script.
Figure 7.9: When the user places the mouse cursor on a circle, a tooltip immediately appears, containing the full term name and the semantic similarity score with the viewed term.
1. a descendant,
2. a sibling,
3. an ancestor,
4. not in the hierarchy,
when compared to the current term. To distinguish between the above cases, different colors are used. Red is used for descendants. Green is used for ancestors or siblings. Blue is used for terms not in the hierarchy. This last case is not valid for NCIT (see Fig. 7.10) or ICDv9, but can be observed in MedDRA (see Fig. 7.11). When MedDRA is stripped of the leaf level (i.e., LLT terms), it can be considered a valid hierarchy. At the same time, the removed LLT terms are no longer in any hierarchy, despite the fact that very close relations to PTs exist. There must be a way to denote this type of similarity. In MedDRA, it is denoted as RQ, meaning related or possibly synonymous terms.
The choice of color has dual usage. Different shades of the same color mean that:
- due to the same color, the terms are all of the same type (e.g. all ancestors of the viewed term);
- due to the different shade, each shade acts as a further grouping, denoting how semantically close the terms are to the viewed term.
For example, ancestor terms whose semantic similarity to the viewed term lies between 0.75 and 0.80 will have a lighter shade of green than ancestor terms whose semantic similarity lies between 0.90 and 0.95. This color clustering is a redundant measure; after all, circle size also clusters terms according to their semantic similarity score. Sometimes, though, circle sizes are very close, and the eye might be tempted to consider them equal, so a different color shade removes this possibility.
In addition to the D3 visualization, an XML representation of the similar terms is provided as an alternative. It may also be used in older browsers that do not support the JavaScript libraries used. Each term entry in XML includes basic term information, such as name and code, and a list of similar terms, as shown in Fig. 7.12. Finally, the page is equipped with help tooltips that provide information about components present on the page (see Fig. 7.13).
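The visual encoding described in this section (position on a spiral, circle size, color and shade) can be summarized in a short sketch. It deliberately avoids any D3 calls and only computes the attributes that could later be bound to SVG circles. The Archimedean spiral, the spacing and radius constants, the relation labels and the 0.05-wide shade buckets starting at 0.75 are assumptions chosen to match the description, not values taken from terminfo.php.

```javascript
// Color per relation to the viewed term, as described in the text.
const COLORS = {
  descendant: "red",
  ancestor: "green",
  sibling: "green",
  unrelated: "blue", // term not in the hierarchy
};

// Shade bucket: 0 for [0.75, 0.80), 1 for [0.80, 0.85), and so on.
// Integer percentages avoid floating-point surprises at bucket edges.
function shadeBucket(similarity) {
  const percent = Math.round(similarity * 100);
  return Math.max(0, Math.floor((percent - 75) / 5));
}

// Place terms on an Archimedean spiral, most similar term at the center.
function spiralLayout(
  terms,
  { spacing = 12, angleStep = 0.9, minRadius = 10, maxRadius = 30 } = {}
) {
  return terms
    .slice()
    .sort((a, b) => b.similarity - a.similarity)
    .map((t, i) => {
      const angle = i * angleStep;
      const distance = spacing * angle; // grows with every step outward
      const weight = Math.max(0, (t.similarity - 0.75) / 0.25);
      return {
        name: t.name,
        x: distance * Math.cos(angle),
        y: distance * Math.sin(angle),
        radius: minRadius + (maxRadius - minRadius) * weight,
        color: COLORS[t.relation],
        shade: shadeBucket(t.similarity),
      };
    });
}

// Hypothetical similar terms for some viewed term.
const circles = spiralLayout([
  { name: "A", similarity: 0.78, relation: "ancestor" },
  { name: "B", similarity: 0.95, relation: "descendant" },
  { name: "C", similarity: 0.92, relation: "unrelated" },
]);
```

The most similar term lands at the center with the largest radius, and both radius and shade bucket fall as the spiral unwinds, which is the redundancy between size and shade discussed above.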
Figure 7.10: Presentation page for the NCIT term Recurrent NSCLC. On the left side, the basic term information is shown, along with an XML representation of highly similar terms. On the right side, a visualization of highly similar terms is provided, using the D3 JavaScript library.
Figure 7.11: Presentation page for the MedDRA term Rash. The term has very close relations with terms that are not in the hierarchy. This is illustrated using blue color.
Figure 7.12: The XML representation of a term. It includes basic term information and highly similar terms.
Figure 7.13: Help is provided through tooltips that activate on mouse-over.
7.6
Navigation
Only two main pages are presented to the user during a search: index.php, which acts as the main and results presentation screen, and terminfo.php, which provides information about a chosen concept. The user can reach a specific term by performing four different actions:
1. by clicking on a term entry which appears in the auto-completion pop-up menu (from either index.php or terminfo.php),
2. by clicking on a term entry which appears in the results table of index.php,
3. by clicking inside a circle in the term visualization tool in terminfo.php,
4. by clicking on a suggested correction term in index.php.
Navigation is further assisted by exploiting the browser's built-in functionality. Navigating through pages can be performed through the Back and Forward buttons, or explicitly through the history log of the browser. As far as individual items are concerned, access to the search box can be achieved through the keyboard, using the Tab key. The jQuery plugins used also support common keyboard shortcuts. As an example, the entries inside the auto-completion pop-up menu can be selected using the up and down keys. Pressing the Return key changes the page location to the appropriate term.
Chapter 8 Evaluation
The search application developed in this project is evaluated as follows:
- the failed queries of AstraZeneca's previous search application are tested again,
- the application is compared to the BioPortal online search service,
- the application's potential use is commented on by an AstraZeneca search specialist.
8.1
Testing the Failed Queries
In this section, the failed queries of the previous search application used at AstraZeneca are re-tested, using the new search application that was developed in this project. The failed queries and their reasons for failure have been given in Tables 5.1 and 5.2 of Chapter 5. The results of testing the same queries with the newly developed application are summarized in Table 8.1. Only two queries did not produce better results, "DIHS" and "NMDA Antagonist" (see Figures 8.1 and 8.2), but this behavior was already expected from the specification; these two terms do not appear in the supported ontologies. They are listed neither as preferred terms nor as synonyms, so it is normal that they cannot be found.
Of the other terms, "Hepatotoxicity" (see Fig. 8.3), "NSCLC" (see Fig. 8.4) and "DRESS Syndrome" (see Fig. 8.5) appear unambiguously in the auto-completion pop-up menu as the user starts typing, so the user can quickly jump to the desired term page. The query "LHRH" returns two different results, with preferred names "GNRH1 wt Allele" and "Gonadotropin Releasing Hormone", respectively (see Fig. 8.6). The NCIT lists LHRH as a synonym for both concepts, so the user must decide which one is desired. In contrast to the previous search application, though, the connection between "Gonadotropin Releasing Hormone" and "LHRH" is clear (i.e., the former is a preferred name for the latter), so the user does not question the validity of the results.
Finally, the query "VEGFR" yields greatly improved results compared to the previous application (see Fig. 8.7). The term "VEGFR" appears as the best matching entry in the results list, and contains the similar terms "Vascular Endothelial Growth Factor Receptor 1", "Vascular Endothelial Growth Factor Receptor 2" and "Vascular Endothelial Growth Factor Receptor 3", which are more specific terms. At this point, it should be noted that both concepts "Vascular Endothelial Growth Factor Receptor" and "Vascular Endothelial Growth Factor Receptor 1" contain "VEGFR" as a synonym. Since "VEGFR" is the synonym which is lexically closest to the input query (i.e. a 100% match), it is the representative name for both concepts. This should not cause confusion, though; in both cases, the representative concept name is immediately followed by the preferred term name.
Table 8.1: Testing previously failed queries.
Query | Comments
DIHS | The term is not found (see Fig. 8.1). This is normal, since this abbreviation is not listed in the synonyms for the MedDRA term Drug-induced hypersensitivity syndrome.
NMDA Antagonist | No results (see Fig. 8.2), since the term does not appear in the currently supported ontologies. Also, no term is proposed for error correction.
Hepatotoxicity | The term is found (see Fig. 8.3). The user can see that it belongs to MedDRA.
NSCLC | The term is found (see Fig. 8.4). The preferred name is listed too.
DRESS Syndrome | The term is found (see Fig. 8.5). This project's search application supports MedDRA LLT terms.
LHRH | There are two results for LHRH (see Fig. 8.6). Unlike in the previous search application, the user can now see that Gonadotropin Releasing Hormone is a preferred term for LHRH.
VEGFR | Semantic similarity has grouped the similar terms together (VEGFR-1, VEGFR-2, VEGFR-3) under the term VEGFR, which is an enhancement over the previous search application (see Fig. 8.7). The fact that VEGFR-1 contains VEGFR as a synonym in NCIT might confuse matters in the listing, but the preferred term Vascular Endothelial Growth Factor Receptor 1 is also mentioned next to it, immediately clearing any doubts.
Figure 8.1: The term DIHS is not found, but this is normal, since it is not part of any of the supported ontologies. Instead, the term DIOS is proposed, in case the user had misspelt the query.
Figure 8.2: The term NMDA Antagonist is not found, but this is normal, since it is not part of any of the supported ontologies. No soundex match is found, so no error corrections are suggested.
Figure 8.3: The term Hepatotoxicity is shown in the auto-completion dialogue.
Figure 8.4: The term NSCLC is shown in the auto-completion dialogue.
Figure 8.5: The term DRESS syndrome is shown in the auto-completion dialogue.
Figure 8.6: The query LHRH produces two different 100%-matching results. Unlike in the previous search application, the user can now see that Gonadotropin Releasing Hormone is a preferred term for LHRH.
Figure 8.7: The results for the query VEGFR illustrate a semantic grouping of 4 similar terms, namely VEGFR, Vascular Endothelial Growth Factor Receptor 1, Vascular Endothelial Growth Factor Receptor 2 and Vascular Endothelial Growth Factor Receptor 3. The latter three are grouped under the parent term.
8.2
Comparison to BioPortal Search Services
Among other tools, BioPortal provides an online search form that allows users to search ontologies and terminologies for terms. Comparison of this project's application to BioPortal does not aim to prove one better than the other; clearly, BioPortal is a complete, multi-feature search application that allows searching of hundreds of ontologies and terminologies simultaneously. The intent of the comparison is to highlight some of the different design choices that this project has adopted, which could further improve the usability of the search services provided by BioPortal. The BioPortal search interface is shown in Fig. 8.8. Similarly to this project's search application, the interface simply contains a search box. The interface also offers advanced options, shown in Fig. 8.9. For comparison purposes, the advanced option to narrow the search to NCIT, MedDRA and ICD9CM is used (see Fig. 8.10).
8.2.1
Auto-completion
BioPortal search does not offer auto-completion through its main search interface; auto-completion widgets are offered only for individual ontologies. Therefore, the user is not helped throughout the procedure, and needs to press the Return key to check whether the query returns any results at all. A possible justification for not providing auto-completion is the large number of hosted ontologies, 353 as of August 2013. On the other hand, even when the user chooses a very small subset of ontologies to search, no auto-completion is provided.
Let us consider the auto-completion widgets for individual ontologies. The widget for NCIT is chosen and "nsc" is typed. The resulting auto-completion pop-up menu is shown in Fig. 8.11. This project's auto-completion results for "nsc" are shown in Fig. 8.12. It can be observed that many of the terms present in BioPortal's auto-completion menu do not even contain "nsc". BioPortal chooses to show only the preferred names for terms. Indeed, let us consider the example of "Becatecarin", shown third in BioPortal's auto-completion menu. This term is a preferred name, whose synonym list includes the term "NSC 655649". Clearly, the search for "nsc" matches "NSC 655649", but instead of returning that term, BioPortal chooses to return its preferred name, "Becatecarin". The entry is then annotated as a synonym, stating that a synonym of the matching term is returned. For an inexperienced user, this is not clear. Unless the user knows every synonym of a given concept, it might be confusing to see result terms that do not even contain the search words. This project's application alleviates this problem. Both the lexically closest term to the query and its preferred name are shown, so the user has no reason to doubt the result. This is very helpful in cases where the synonyms are highly dissimilar. For example, the term with preferred name "Denatonium Benzoate" can be sought by any of its diverse synonyms: "THS-839", "WIN 16568", "Aversion", "Anispray" and "Lidocaine Benzyl Benzoate" (see Figures 8.13-8.15).
Figure 8.8: The BioPortal interface is a simple text box, similar to this project's main page.
Figure 8.9: BioPortal also offers advanced options to improve the search results.
Figure 8.10: Only NCIT, MedDRA and ICD9CM are chosen for searching, out of the 353 ontologies offered by BioPortal, so that comparisons to this project's work are achievable.
8.2.2
Results Ranking
The main search application of BioPortal ranks results depending on the ontology they belong to. Let us examine the complete search results for "nsclc", both in BioPortal's application (see Fig. 8.16) and in this project's application (see Fig. 8.17). BioPortal presents the closest preferred term name, and groups the remaining results from the same ontology under this term. Each term holds a single entry, and no hints are given about possible connections among terms. On the other hand, our application does not group all the results of the same ontology together. It provides another type of results grouping, according to semantic similarity. The user can then see which terms are indeed very close semantically. The extra semantic grouping does come at the cost of extra computation on the server side.
Figure 8.11: Auto-completion pop-up menu of the BioPortal NCIT widget when the user has typed "nsc". Only preferred terms are shown. The user might be confused when seeing the term Becatecarin in the results, since it does not contain "nsc".
Figure 8.12: Auto-completion pop-up menu of this project's search application when the user has typed "nsc".
Figure 8.13: Searching for Denatonium Benzoate through its preferred term name.
Figure 8.14: Searching for Denatonium Benzoate through its synonym THS-839.
Figure 8.15: Searching for Denatonium Benzoate through its synonym WIN 16568.
8.2.3
Error Correction
Error correction is not supported in BioPortal search. If the user misspells even a single letter in the query, a "No Matches Found" message will appear. In this project's search application, soundex-based error correction is used to correct simple spelling mistakes. The application suggests a term that might match the intended user query. The user can simply click on the term, and is immediately reassured that the term exists. Otherwise, the user would be uncertain, and would possibly refer to external sources, such as Google, to identify any possible errors. Figures 8.18-8.21 illustrate how erroneous queries are handled in the two applications. The terms "nsclca" and "caancer" are used as queries. BioPortal's application does not offer any error correction, while our application suggests the terms "nsclc" and "cancer".
Figure 8.16: BioPortal search results rankings for "nsclc". All terms are grouped according to the ontology they belong to, under the preferred name of the most lexically-relevant term to the query.
8.2.4
Visualization
BioPortal includes a visualization for each term, which illustrates the terms position in the hierarchy (see Fig. 8.22). In our application, the visualization is simplied, and does not refer to formal logic syntax (e.g. subclassOf). Our 114
Figure 8.17: This projects search results rankings for nsclc. Terms in the results are rearranged into groups that show high semantic similarity.
Figure 8.18: BioPortal returns no search results for the erroneously spelt term nsclca.
Figure 8.19: BioPortal returns no search results for the erroneously spelt term caancer.
application attempts to hide the underlying ontology and simplify the data visualization, so that inexperienced users can search without being consumed by
Figure 8.20: This project's search application returns a search suggestion of nsclc for the erroneously spelt term nsclca.
Figure 8.21: This project's search application returns a search suggestion of cancer for the erroneously spelt term caancer.
Figure 8.22: BioPortal uses a graph to visualize hierarchical relations. Edges are annotated with a description of the relationship between the connected nodes (e.g. subclassOf).
Figure 8.23: This project's application focuses on inexperienced users and attempts to completely hide any formal-logic relationships that might confuse the user.
formal-logic references that would puzzle them (see Fig. 8.23). Allowing users to choose between the two visualizations would be ideal, so that users of different levels all benefit.
8.3
Comments from an AstraZeneca Search Specialist
This second part of the evaluation attempts to examine the search application's potential use in the area of medical knowledge acquisition. A short interview was conducted with a search specialist in research and development information at AstraZeneca. The search specialist is a researcher, responsible for running literature searches that ensure patient safety and other functions (e.g. the prediction of drug efficacy and safety at an early stage during drug development).
Figure 8.24: Search results depicting causal associations between smoking and cancer, as presented by the I2E text mining application.
In particular, the search specialist needs to examine the presence of certain term relationships and patterns in a corpus of medical research documents, which are retrieved from databases such as Clinicaltrials.gov. Efficient full-text search can be achieved through a text mining application named I2E, developed by Linguamatics. This tool features natural language processing (NLP)-based querying. It receives an NLP query as input, searches a predefined collection of documents, and presents the relevant results in a structured format. As an example, let us assume that the searcher wishes to search through a list of medical documents for associations of smoking and cancer. The terms smoking and cancer are entered, along with the base form of the verb cause, to denote the association. The results are shown in Fig. 8.24. Each result row indicates the document in which the specified hit appears, and provides a textual excerpt of its context within the document. The tool also features plain search for terms within a set of ontologies, as shown in Fig. 8.25. Each result row contains the preferred term name, the code, and the path from the term's parent to the root. To achieve full results coverage, the search specialist needs to ensure that all possible variations of the input query have been examined. For example, an input query of the form "has adverse event been seen in MEK inhibitors?" should consider all possible synonyms of the terms that compose the query. The term MEK inhibitor may be present in the literature in various forms, including MKK Inhibitor, MAPK/ERK Kinase Inhibitor, MAP2K Inhibitor, and MAPKK Inhibitor. The term adverse event may also be found as AdverseEvent, Adverse Experience or AE. Similarly, the verb cause might as well be replaced by similar
Figure 8.25: Search results for the term MEK inhibitor in NCIT, when the I2E application is used.
verb base forms such as associate or result. Furthermore, when the number of results is too large, the search specialist should be able to quickly refine the input query and target it to more specific terms. The search application developed for this project can assist in finding synonyms for biomedical terms, and in quickly changing the granularity of searches. Each term page presents a complete list of synonyms for that term, retrieved from an up-to-date version of the ontology that the term belongs to. Furthermore, visualizations offer quick browsing of similar terms, both of higher and lower specificity. For example, by following red circles, the searcher can delve deeper into the hierarchy and immediately view information about more specific terms, without the need for re-searching. The search expert's comments about the application were very positive. It was commented that the application would be very helpful for refining queries before feeding them to a tool like I2E. The interface was considered simple and the search procedure intuitive. The auto-completion feature and the presence of lexical similarity scores in the rankings greatly simplified the search procedure, and allowed the search specialist to quickly reach her goal and focus on the result rather than on the means of reaching it. Visualization of suggested terms was valued most of all. Through the developed application, the search specialist could easily browse neighborhoods of similar terms and refine the search granularity on demand. The use of colors instead of typical expanding menu hierarchies was also complimented for its usability.
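The coverage problem described by the search specialist can be made concrete with a short sketch that enumerates every combination of synonyms for the parts of a query. The synonym lists are the examples quoted in this section; the `SYNONYMS` dictionary and the `expand_query` function are hypothetical names introduced only for illustration, and neither I2E nor the search application is claimed to work this way internally.

```python
from itertools import product

# Synonym lists copied from the examples in the text. In practice they would
# be read from the term pages of the search application (assumption).
SYNONYMS = {
    "MEK inhibitor": ["MEK inhibitor", "MKK Inhibitor",
                      "MAPK/ERK Kinase Inhibitor", "MAP2K Inhibitor",
                      "MAPKK Inhibitor"],
    "adverse event": ["adverse event", "AdverseEvent",
                      "Adverse Experience", "AE"],
    "cause": ["cause", "associate", "result"],
}


def expand_query(slots):
    """Return every combination of known synonyms for the given query slots."""
    # A slot without known synonyms is kept unchanged.
    alternatives = [SYNONYMS.get(slot, [slot]) for slot in slots]
    return [tuple(combo) for combo in product(*alternatives)]


variants = expand_query(["MEK inhibitor", "cause", "adverse event"])
print(len(variants))  # 5 * 3 * 4 = 60 variants to check
```

Even for this three-part query the number of variants grows multiplicatively, which suggests why a complete and current synonym list per term is valuable before a query is handed to a text mining tool.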
Chapter 9 Conclusions and Future Work

Ontologies are expected to play a major role in the discovery of new knowledge within the biomedical sector. Providing user-friendly tools that help researchers navigate efficiently through ontologies, without requiring them to fully comprehend ontological principles, is more likely to help them reach their final goals quickly, without confusion and frustration.
9.1
Conclusions
In this thesis, proposals have been made for enhancing the user experience in ontological search, through the design of a search application that features enhanced searching tools such as auto-completion, semantic grouping of results, query reformulation and similar concept suggestion. The outcome is a web-based application that allows searching and browsing ontologies of heterogeneous structure and format. The web application uses current web technologies to produce a user-friendly environment. Focus has been placed on promoting usability and a positive user experience, by designing the search service from a user-centric perspective, such that even inexperienced users can quickly become acquainted with it. The search application relies heavily on pre-calculated semantic similarity scores; semantic similarity allows expressing the relationships between terms as decimal numbers in the range [0, 1]. Mapping term relations to real numbers allows for the development of the innovative visualizations and results clustering that are used in this application. The chosen design for the search application manages to improve certain aspects that even enterprise-strength ontological search applications, such as BioPortal, have not considered yet.
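To illustrate how scores in [0, 1] make result grouping straightforward, the following sketch groups a ranked result list using a table of pre-calculated pairwise similarities. It is only one possible approach: the greedy strategy, the 0.8 threshold, the term labels and the score values are all assumptions made for the example and do not describe the clustering actually used in the application.

```python
def group_by_similarity(ranked_terms, scores, threshold=0.8):
    """Greedily group ranked terms by pre-calculated similarity.

    scores maps frozenset({term_a, term_b}) to a similarity in [0, 1];
    missing pairs are treated as similarity 0.0.
    """
    groups = []
    for term in ranked_terms:
        for group in groups:
            # Compare against the group's first (highest-ranked) member.
            if scores.get(frozenset((group[0], term)), 0.0) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])  # no similar group found: start a new one
    return groups


# Hypothetical terms and scores, for illustration only.
ranked = ["term_a", "term_b", "term_c", "term_d"]
precomputed = {
    frozenset(("term_a", "term_b")): 0.92,
    frozenset(("term_a", "term_c")): 0.35,
    frozenset(("term_c", "term_d")): 0.81,
}
print(group_by_similarity(ranked, precomputed))
```

Because the scores are computed in advance, grouping at query time reduces to dictionary lookups, at the price of keeping the score table up to date when ontology versions change.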
9.2
Future Work
The application can be further improved in the following ways:
- It may be connected to other medical applications. For example, it may assist in directly feeding lists of terms to text mining applications.
- It may be enhanced to accept ontologies in the OWL and Open Biomedical Ontologies (OBO) formats. Currently, BioPortal versions of ontologies are used to populate the local database, so the application relies on BioPortal.
- More features may be added to the interface, including advanced search options such as searching by code or searching only specific ontologies.
- The update of ontology versions and the calculation of semantic similarities could be automated by checking BioPortal at fixed time intervals.
- It may be improved to be compatible with previous versions of web browsers. Since it relies heavily on JavaScript and novel libraries, alternative methods for presenting visualizations might be needed. Currently, it has been successfully tested in the latest versions of all major browsers.