Beyond
University of Moratuwa
June 2017
Intelligent Q&A System
Declaration
We declare that this thesis is our own work and has not been submitted in any form for
another degree or diploma at any university or other institution of tertiary education.
Information derived from the published or unpublished work of others has been
acknowledged in the text and a list of references is given.
K.M.K. Hasantha
P.N. Udawatta
I.A. Abeysekera
Date:
Supervised by
Date:
We dedicate this work to
our Parents
Acknowledgement
We wish to express our sincere gratitude to our supervisor, Dr. (Ms.) A.T.P. Silva, for the
excellent supervision, guidance, support, encouragement and patience she has given in making
this research a success.
It also gives us great pleasure in acknowledging the generous support of all academic staff at
IT Faculty of University of Moratuwa and to all my external lecturers for sharing their
knowledge and providing all kinds of supportive guidance which immensely contributed in
making our project successful.
We would like to express our appreciation and gratitude to all our friends and fellow batch
mates for their support and help given throughout this project.
Last but not least, we would like to offer our deepest gratitude to our loving parents for the
continuing love and support extended to us throughout this project, not forgetting our
supportive brothers and sisters, helpful cousins and all our relatives for being there for us
to make this project a success.
Abstract
With exponentially growing human knowledge it is impossible for a person to learn or memorize
all the knowledge even within a very limited field of study. Knowledge on demand is the next
big requirement in any profession. The intention of this project is to build a computer system
that can understand the meaning and intention of text content and provide answers to user
questions. Our proposed solution is an ontology-based intelligent question-and-answering
system that relies on collecting information via a text-mining process and mapping it to an
ontology. A question classifier is then used to classify questions into a predefined class map,
which becomes the input to the answer generation module that generates SPARQL queries to
retrieve the expected answer.
Table of Contents
1. Table of Figures ................................................................................................................. x
1. Introduction ........................................................................................................................ 1
1. Objectives ................................................................................................................ 2
2. Literature Review............................................................................................................... 3
2.1. Text mining approaches and ontology learning related approaches ........................ 3
2.3.2. Natural Language Query Interpretation into SPARQL Using Patterns [9] ....... 13
3.1. NLTK..................................................................................................................... 17
3.7. OwlReady .............................................................................................................. 18
4. Our Approach................................................................................................................... 19
6.2.10. RDF- Ontology............................................................................................... 29
6.3.3. N-grams.............................................................................................................. 33
6.3.6. Classification...................................................................................................... 35
7. Implementation ................................................................................................................ 41
7.2. Question Classification Module ............................................................................ 46
8. Evaluation ........................................................................................................................ 53
References ................................................................................................................................ 60
Appendix A .............................................................................................................................. 62
Appendix B .............................................................................................................................. 65
1. Table of Figures
Figure 2:1 Architecture of the ontology learning environment [2]............................................ 4
Figure 2:2 Hasti project architecture [4] .................................................................................... 6
Figure 2:3 the Structure of a Halex Entry [4] ............................................................................ 7
Figure 2:4 A ............................................................................................................................... 9
Figure 2:5 B ............................................................................................................................... 9
Figure 2:6 Table P [6] .............................................................................................................. 10
Figure 2:7 Table Q [7] ............................................................................................................. 10
Figure 2:8 Table R ................................................................................................................... 11
Figure 2:9 Components and search for synonyms using WordNet ......................................... 12
Figure 2:10 generic query pattern used in this approach ......................................................... 14
Figure 4:1 Answer Generation Module ................................................................................... 22
Figure 6:1 high level architecture ............................................................................................ 23
Figure 6:2 Text to Ontology Design ........................................................................................ 24
Figure 6:3 Question Classification Flow ................................................................................. 30
Figure 6:4 Question Pre-processing model .............................................................................. 32
Figure 6:5 Process of disease word identification and replacement ........................................ 33
Figure 6:6 Answer Generation Module High Level Architecture ............................................ 36
Figure 7:1 NER model operation ............................................................................................. 41
Figure 7:2 Relationship Construction Mechanism .................................................................. 45
Figure 7:3 Key steps of question classification module .......................................................... 46
Figure 7:4 Question conversion to preprocessed question ...................................................... 47
Table 7:8 Ability to provide the correct answer ....................................................................... 57
Chapter 1
1. Introduction
The main goal of the intelligent Q & A system is to provide accurate answers to natural language
questions using a given context. The system consists of three main components: an ontology
automation module, a question classification module and an answer generation module. The system
currently focuses on answering questions about the medical domain. An ontology is created for
the medical domain and populated with information about diseases and their symptoms, causes,
risks and prevention methods using unstructured-data extraction methods. The system is able to
identify meaningful links between various parts of text and to understand how words are
interrelated to make meaningful sentences. Further, the system can identify the hidden
content of sentences by retrieving information from relationships and analyzing dependencies
between words and clauses. Since questions can take various forms, a question classification
module is used to identify the question type and classify questions into predefined classes.
The answer generation module takes the question and the question class from the classifier and
extracts the key information from the question. The collected information is then used to build
SPARQL queries that traverse the ontology to generate meaningful answers. The accuracy of the
system depends on the accuracy of each module: since every module's output becomes an input
to the next, an error in one module can easily propagate to another. Thus the accuracy
of each module is individually measured and improved.
1.1. Background
It is said that until 1900 human knowledge doubled approximately every century; by the end of
World War II knowledge was doubling every 25 years [1]. Current numbers show that all of
human knowledge is doubling about every 13 months. With exponentially growing human
knowledge it has become increasingly difficult for one person to become an expert in a particular
field of study or profession. We are at a point where traditional approaches such as memorizing
and summarizing information are not enough to function efficiently in any profession. Humans
are no longer efficient enough to process all the information available and generate valuable
knowledge out of it; this is a bottleneck that limits human creativity [1]. The requirement
therefore arises to automate the process of providing accurate knowledge on demand. Since the
majority of human knowledge is in written form, a system that understands the complexity of
human language well enough to recognize the meaning and intention of written knowledge, so
that it can abstract that information and knowledge to address human demands, could easily be
the next revolutionary step of mankind.
1.2. Aim
The aim of this project is to develop an intelligent agent capable of generating an accurate
answer for a given question by extracting details from a given knowledge base. The system must
be able to answer all sorts of questions, ranging from direct questions to questions that require
a deep understanding. In addition, the system will be able to continuously improve itself during
both the training process and continued usage.
1.3. Objectives
Building an ontology model to represent the knowledge of the text.
Analyzing the distribution of semantics using latent semantic analysis [2].
Mapping the text (the given inputs: question and document) in natural language into useful
representations.
Improving the system knowledge base through a continuous training process.
Extracting meaning from text using statistics and machine learning.
Generating accurate answers using natural language generation (text planning, sentence
planning, text realization).
Developing an interface to get user inputs (questions) and display the output.
Chapter 2
2. Literature Review
2.1. Text mining approaches and ontology learning related approaches
In this section we consider the techniques and methodologies used by several researchers
for information retrieval and extraction from documents. Text mining generally
consists of a set of natural language processing steps before information is actually extracted.
To extract information related to concepts, superficial syntactic analysis is used, which
includes pattern matching and local context (NPs) with word-sense disambiguation. Relation
extraction is carried out automatically, based on the linguistic properties of noun components
and the inheritance hierarchy. This system is mainly based on corpus-based learning.
Problems with this approach are recognizing different sentences that talk about the same
concept and word-sense ambiguities. It uses heterogeneous resources as its data sources for
producing the ontology.
2.1.2. TEXT-TO-ONTO Ontology
The TEXT-TO-ONTO Ontology Learning Environment is a project which constructs ontologies
from texts and conceptual structures by discovering general architectures for those
ontologies and structures [2].
The text-processing server returns text annotated with XML, and this XML-tagged text
is fed to the Learning & Discovering component for further evaluation, which models the
ontology. This architecture makes the process of text mining into an ontology very convenient.
The main components of this system are Text & Processing Management, the Text Processing
Server, Learning & Discovering Algorithms, the lexical database and domain lexicon, and the
Ontology Modeling Environment. TextToOnto proceeds through ontology import, extraction,
pruning, and refinement stages. The main advantage of this system is its diverse algorithms
that help in term extraction and taxonomy construction. It also provides ontology
maintenance algorithms, such as ontology pruning and refinement algorithms.
have gathered data from free text available over the internet. In their methodology
they used a tokenizer, morphological analysis, named-entity recognition, part-of-speech
tagging and a chunk parser throughout the concept-extraction process. To extract relations
between data they used co-occurrence clustering of concepts, which clusters similar data,
as well as heuristic rules based on linguistic dependency relations and general association
rules from machine learning.
Further relations are identified as follows: term extraction, and synonym extraction using
the Pointwise Mutual Information (PMI) measure.
The PMI of two events x and y is defined as:
PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
where P(x, y) is the probability of a joint occurrence of x and y, and P(x) is the probability
of the event x [3]. If P(x, y) >= P(x) * P(y), PMI takes a positive value; otherwise it is
negative or zero. This approach can be used to calculate the statistical dependence of two
words on the Web; the following shows a PMI value generated from Google hit counts.
At the concept-learning stage this system focuses on approaches which induce concepts using
concept clustering, linguistic analysis and inductive methods [3]. In concept clustering,
concepts are formed and ordered hierarchically at the same time. Linguistic analysis is used
to derive intensional descriptions of concepts in natural-language form.
Ontology learning related to this study has used several tools such as SMES, TextToOnto and
OntoEditor. The learned ontology primitives are stored in a meta-model called the Possible
Ontologies Model (POM). The advantage of this model is that it gives the ontology engineer
more control, since changes can easily be traced back to the original corpus.
corpus as an automated system. This kernel consists of primitive concepts, relations and
operators used to build a suitable ontology [5]. The main steps are: morphological and
syntactic analysis and extraction of new word features; building sentence structures (SSTs);
extracting conceptual-relational knowledge (primary concepts); adding primary concepts to the
ontology; and ontology reorganization.
In the Hasti project the lexicon is called Halex (Hasti Lexicon). The Halex structure can
briefly be identified as a set of knowledge-type entries such as morphological knowledge,
syntactic knowledge, semantic knowledge, and pragmatic knowledge. These are used to mitigate
ambiguity in natural language. Halex contains the N different senses a word may take.
In this project the ontology is defined as O = (C, R, A, Top) [5], where O is the ontology,
C is the set of all concepts, R is the set of all relations, A is the set of axioms, and Top
is the top level of the hierarchy.
Hasti has a natural-language-processing component with lexical, morphology, syntax, SST, and
predicator analyzers. Input text passes through these analyzers, which generate knowledge
about words.
Despite the fact that classification of text and documents is a common and well-studied area
of natural language processing, it is quite challenging to classify questions in order to
support an intelligent question-and-answer system. This is because question answering differs
from a common search-engine process: it requires finding a concise answer instead of a
matching set of documents. The target text is less likely to match the question text, so it is
important to understand the syntax and semantics of the question [6]. This is commonly
achieved via a machine-learning approach.
This research article focuses on different types of classification methods used to create
intelligent question-answering systems. It emphasizes the importance of machine-learning
approaches over manually constructed rule-based question-mapping techniques by pointing
out the following advantages.
Advantages of the machine-learning approach [6]
They compared two classifiers with different levels of features extracted from the questions
and reported their accuracy; the results are presented in <Table P>. The two classifiers used
are support vector machines (SVM) and maximum entropy (ME) models. The features are:
Wh-word.
Head word.
Word grams.
The wh-word is the question word, which is one of who, why, where, which, when, how,
what and rest; "rest" covers questions that do not have any question word, e.g. "Name of
a disease that cause rashes?"
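The wh-word feature can be sketched as a simple lookup; this snippet is illustrative only, not the article's implementation:

```python
WH_WORDS = ("who", "why", "where", "which", "when", "how", "what")

def wh_word(question):
    """Return the question's wh-word, or 'rest' when none is present."""
    for token in question.lower().split():
        if token in WH_WORDS:
            return token
    return "rest"

print(wh_word("What is a group of fish called?"))       # what
print(wh_word("Name of a disease that cause rashes?"))  # rest
```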
The head word is the noun or verb considered the key word of the question, e.g. in
"What is a group of fish called?" the head word is fish, and its type is ENTITY:animal
(refer Table Q) [6]. However, as the article explains, identifying the head word is not
straightforward. The first approach to head-word extraction is via a syntactic parser
(Charniak parser / Stanford parser) and a modified form of the Collins rules.
Figure A [6] shows the result of a syntactic parser using the Collins rules; the head word is
identified as "did". However, after modifying the rules to give priority to a noun over a
verb or verb phrase, the head word is identified as "year" (Figure B) [6].
Figure 2:4 A
Figure 2:5 B
Before the above method is used for head-word searching, a set of regular-expression-based
rules is applied to identify commonly known question types and find the head word. Questions
with when, where or why return no head word, since those wh-words are considered highly
informative themselves. If this algorithm fails to identify the head word, parse trees based
on the modified Collins rules are used to find it. If the extracted head word does not carry
a noun or noun-phrase tag, then as a last step the first word of the question that is a noun
or noun phrase is taken as the head word.
The WordNet semantic feature identifies meaningfully related words for the extracted head
word. WordNet is a tool for semantic analysis and is used here to identify hypernyms of the
head words found. A hypernym is a more generic word for a given word, and can be taken at
different depths.
Example
The hypernym of "dog" at depth 1 is "domestic animal" and at depth 2 is "animal". Hypernyms
exist in both verb senses and noun senses. The sense type can introduce ambiguity to the head
word and cause noise in the data set. The correct sense for the head word is identified using
the Lesk algorithm, which selects the sense whose words have the maximum overlap with the
words of the question.
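A much-simplified sketch of the Lesk overlap idea follows; the glosses below are toy stand-ins for WordNet senses, not real WordNet data:

```python
def simplified_lesk(question_tokens, senses):
    """Pick the sense whose gloss words overlap most with the question.
    `senses` maps a sense name to its gloss tokens."""
    q = set(question_tokens)
    return max(senses, key=lambda s: len(q & set(senses[s])))

# Toy glosses for two senses of "dog".
senses = {
    "dog.n.01": {"domesticated", "animal", "canine", "pet"},
    "dog.v.01": {"pursue", "follow", "track"},
}
best = simplified_lesk({"what", "animal", "is", "a", "pet"}, senses)
print(best)  # dog.n.01 (2 overlapping words vs 0)
```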
The fourth feature extracted from the question is word grams. An n-gram is a subsequence of
n words from a given question; unigram, bigram and trigram features were used for
the experiment.
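For illustration, word grams can be produced with a few lines of generic code (a sketch, not the experiment's code):

```python
def ngrams(tokens, n):
    """All contiguous n-word subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "what causes heart attack".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```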
The last feature considered is word shape, which is one of all upper case, all lower case,
mixed case, or all digits.
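This feature reduces to a small classification function; the sketch below is illustrative:

```python
def word_shape(word):
    """Classify a token into one of the four shape categories."""
    if word.isdigit():
        return "all_digits"
    if word.isupper():
        return "all_upper"
    if word.islower():
        return "all_lower"
    return "mixed"

print(word_shape("HIV"), word_shape("dengue"),
      word_shape("Malaria"), word_shape("2017"))
```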
Experiments were done using subsets of the total feature set with both SVM and ME
classifiers, and the results are reported in the following Table P.
Classification is done by identifying the type of answer expected for the question. The
answer type is first classified into 5 coarse classes using a coarse classifier, and each
coarse class has fine classes classified by a fine classifier [7]. Both classifiers are fed
with features extracted from the questions by the same feature extractor. Six primitive
features of the question are considered: nouns, POS tags, chunks, head chunks, named entities
and semantically related words [7].
The noticeable difference in this approach is the multi-class classification. In the previous
method using SVM and ME classifiers, each question is identified as belonging to one class
and subclass. In this approach, multiple candidate coarse classes are selected by the coarse
classifier, and all the fine classes of each selected coarse class are used to identify
candidate fine classes using the fine classifier.
However, they only managed to construct a multi-classifier for coarse classes, and
classification is done using one feature at a time.
2.2.3. Conclusion
Comparing the experimental results of both approaches (Table P and Table R), it is clear
that significant accuracy improvements can be achieved using multi-class classifiers. Even
when classification is done using all 6 features, accuracy remains below 90% in the first
approach, whereas a multi-class classifier with only one feature achieves accuracy close to,
if not over, 90% in most cases.
2.3. Natural Language to SPARQL Generation Approach
This section summarizes research conducted on extracting information from ontologies by
converting natural language to SPARQL queries.
2.3.1. QASYO[8]
QASYO is a question-answering system that uses the YAGO ontology as its knowledge base. YAGO
is an ontology which extracts information from WordNet, Wikipedia and GeoNames, and it has
been integrated into the linked-data cloud by linking to the DBpedia and SUMO ontologies.
QASYO integrates semantic ontologies with natural language processing in a unified framework.
It extracts keywords from the question to identify the question type using semantic analysis.
The question-answering process of QASYO consists of a question-analysis phase and an
answer-retrieval phase.
The question-analysis phase generates a query pattern by classifying and parsing the
question. A query pattern is a natural-language query labelled with ontology concepts and
morphological information. Classification is done by categorizing the question as a
wh-question (who, what, when, which and where) or a yes/no question. The answer type is
identified based on the question category, and the question type is matched with the entity
type in the ontology. Then semantic triples are generated using linguistic components, and
synonyms for the detected unknown components are searched for using WordNet. The system
gives the answer if it exists in the ontologies, or simply replies "don't know" if it is
not in the knowledge base.
2.3.2. Natural Language Query Interpretation into SPARQL Using Patterns [9]
This system suggests a way of designing queries expressed in terms of conceptual graphs,
adapted to Semantic Web languages instead of graphs. It introduces a pivot language that
allows relations to be expressed in keyword queries. A pivot query is a new query obtained
by translating the dependency graph into a new graph using the identified named entities and
dependencies; such queries contain relationships connected to keywords. Pivot queries are
matched against predefined patterns to obtain a list of potential query interpretations. The
system takes a natural-language query as input and outputs ranked SPARQL queries with the
associated answers. It justifies patterns from the literature by translating natural language
into pivot queries; those patterns contain repeatable sub-patterns and optional patterns.
Each generated pattern is a 4-tuple (G, Q, SP, S). G is an RDF graph which represents the
query family and generalizes the structure of the pattern. Q shows the quantification of
elements and is a subset of G. SP contains the set of sub-patterns sp of p such that
v is a cut vertex of G. Minimum and maximum cardinalities (card min and card max) are used
to categorize sub-patterns: sub-patterns with minimum cardinality zero are optional, while
patterns with a cardinality value of at least 1 are not. S is a descriptive sentence
template.
In a descriptive sentence, the n substrings sw_i correspond to the n sub-patterns, and the
substrings w_j correspond to the m selected elements in m unique substrings. The following
image shows the generic query pattern used in this approach.
This is the sub-pattern obtained by instantiating q by the resource in the pattern p. This
instantiation is only possible if the resource is compatible with q. Sub-patterns are
generated by nesting a pattern p = (G, Q, SP, s) in the main pattern recursively. A pattern
is considered a sub-pattern if it is not nested in another pattern and its maximal and
minimal cardinalities equal 1. The instantiation mechanism of patterns and sub-patterns
remains the same.
This approach generates a relevancy mark and suggested query interpretations to reformulate
the query and mitigate the habitability problem in answer generation. The habitability
problem occurs when a user enters questions that are beyond the system's capabilities.
AutoSPARQL proposes the QTL (Query Tree Learner) algorithm, which fills one of the gaps in
research and practice in the area of generating SPARQL queries from natural language. A
query tree is the structure used internally by the QTL algorithm and roughly represents a
SPARQL query. The system uses supervised machine-learning techniques to allow users to ask
questions without knowing the underlying knowledge-base schema beforehand. It generates
SPARQL queries based on positive and negative examples: positive examples are resources
included in the results of the SPARQL query, and negative examples are resources that are
not. This approach gives users the freedom to ask questions as in other question-answering
systems, or to search directly for a resource of interest.
This is the definition of a query tree. RDF resources are denoted as R, L represents the set
of RDF literals, S represents the set of strings, and SQ denotes the set of SPARQL queries.
The restriction of a function to a domain D is denoted f|D. The definition of a sub-tree is
given below.
A query tree is mapped from each resource in the RDF graph. When mapping a resource to a
query tree, the system limits the recursion depth to increase efficiency; the recursion depth
corresponds to the maximum nesting of triple patterns, which can be learned by the QTL
algorithm.
This is the workflow of AutoSPARQL. The system suggests questions if the query result does
not match the user's intent. If the user is interested in a suggested question, the QTL
process is executed again to generate answers. This is an active-learning environment on top
of QTL.
Chapter 3
3. Technology Adapted
3.1. NLTK
NLTK is an easy-to-use Python toolkit that provides techniques for most natural-language
processes. In our context we use NLTK for several processes, such as tokenizing sentences
and words and POS tagging in the new-relationship-type builder section.
It additionally has built-in packages for other POS-tagging and parsing facilities which we
have not used in our approach. NLTK is useful for carrying out most of the initial-level
processing, such as word tokenization, sentence tokenization, regexp-based tokenization, and
accessing several corpora to train modules.
3.2. WordNet
WordNet is a lexical database of the English language. It provides synonyms for each word
via synsets, and stores short definitions and usage examples for these synonym sets.
WordNet's hypernym hierarchy allows users to traverse up and down to find relationships
between word classes.
3.3. Scikit-learn
Scikit-learn is a machine-learning library for Python, used for classification, regression,
clustering, dimensionality reduction, model selection and preprocessing. SVM is a binary
classifier, but our requirement is a multi-class classifier. Scikit-learn's OneVsOneClassifier
enables multi-class classification from binary classifiers by constructing one classifier per
pair of classes; at prediction time the most-voted class is selected.
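A minimal sketch of this pairwise scheme, using the standard iris dataset in place of our question features:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

# OneVsOneClassifier trains one binary SVM per pair of classes
# (3 classes -> 3 pairwise classifiers) and predicts by majority vote.
X, y = load_iris(return_X_y=True)
clf = OneVsOneClassifier(LinearSVC()).fit(X, y)

print(len(clf.estimators_))   # number of pairwise classifiers
print(clf.predict(X[:5]))     # predicted classes for the first 5 samples
```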
17
3.4. SPARQL Wrapper
This is a wrapper for SPARQL services, used to run SPARQL queries against locally hosted
.owl files. (It is more efficient than the rdflib Python library in query execution.)
3.6. SPARQL
SPARQL is the RDF query language, used to retrieve and manipulate data stored in RDF format.
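A query of the kind generated later in this thesis might look as follows; the prefix and property names here are assumptions for illustration, not the actual ontology schema:

```sparql
PREFIX med: <http://example.org/medical#>

# Hypothetical query: retrieve everything recorded as a cause of Dengue.
SELECT ?cause
WHERE {
  med:Dengue med:is_caused_by ?cause .
}
```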
3.7. OwlReady
OwlReady is a Python 3 library for carrying out OWL-related operations. The operations
relevant to our system are adding new classes, inserting new individuals under
classifications, adding new property types, and mapping each relationship using a predefined
property type.
The library also allows creating new ontologies and synchronizing an ontology between online
and local files.
Chapter 4
4. Our Approach
4.1. Ontology generation through text mining
Text mining to learning ontology is critical process which we found through this project. In
order proceed to content based question and answering first we must provide high quality data
extracted through content. Answer quality, trueness and validity is mostly depend on the data
extracted from the source. To support to answer generation for the request query we are
providing an ontology which has ability to learning through the context provided.
Throughout this project we selected medical web sources as our domain, aiming to provide
solutions for questions asked about diseases. Initial data gathering is carried out from the
content available at these web sources with the help of a web spider that crawls through web
pages. The crawler gathers documented data from web sources as list-type and paragraph-type
content separately. These unstructured documents are stored in data directories that can
later be used for processing.
In this approach, aligned with the design, we first feed data to the system as documents. An
input document is unstructured data crawled from the web as-is. The data then goes through a
set of preprocessing steps. Tokenization is the process of breaking a stream of text into
sentences, words, phrases, symbols, or other meaningful elements, which are then called
tokens. A token is a group of characters with a collective meaning.
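A minimal, library-free sketch of word tokenization (our system uses NLTK; this regex is only illustrative):

```python
import re

def tokenize(text):
    """Minimal regex tokenizer: words (optionally with internal hyphens or
    apostrophes) and standalone punctuation marks become separate tokens."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)

print(tokenize("Dengue is caused by a virus; it spreads via mosquitoes."))
```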
19
After tokenizing, we use a NER model to recognize named entities and tag them using the
BIO-NER tagging approach. The tags include DISEASE (e.g. Heart Attack), CREATURE (e.g.
Mosquito), MICROORGANISM (e.g. Virus), BIOLOGY (e.g. Muscles), and NN (e.g. Airborne
Droplets). Each entity tag carries a B or I prefix, indicating the beginning of the entity
name or the inside of the entity name respectively; O is used to represent any word outside
a chunk.
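The scheme can be illustrated with a toy example; the tagged sentence and the helper below are illustrative, not the output of our trained NER model:

```python
# Hypothetical BIO-NER output for one sentence: multi-word entities are
# reassembled by joining a B- token with the I- tokens that follow it.
tagged = [("Heart", "B-DISEASE"), ("Attack", "I-DISEASE"),
          ("is", "O"), ("caused", "O"), ("by", "O"),
          ("a", "O"), ("Virus", "B-MICROORGANISM")]

def extract_entities(tagged_tokens):
    entities, current, label = [], [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [word], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(word)
        else:  # O tag closes any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tagged))
# [('Heart Attack', 'DISEASE'), ('Virus', 'MICROORGANISM')]
```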
From these recognized named entities we identify the subject and object between which a
relationship must be created. The sentence part between them is then analyzed by the
relationship builder, which is associated with several internal and external modules. For
common scenarios the relationship builder identifies several common relationship types.
The semantic relationship types identified are synonymic relationships, causal relationships
and hyponymic relationships.
Synonymic relationships take the form "Entity1 is equivalent to Entity2", e.g.
"Dengue is a type of Virus."
Causal relationships capture causative relationships between entities, e.g. "Brain
disorder is caused by human immunodeficiency virus."
Hyponymic relationships take the form "Entity1 and Entity2 are similar types", e.g.
"Myocarditis is similar to Influenza."
If none of these is identified, we fall back to an alternate approach inside this model: following POS tagging, we identify verb and adverb tags (e.g. VB, VBD, RB, RBR, etc.) and create new relationship types based on them.
The external module is a separately trained module that infers the relationship expressed by a sentence and classifies sentences into the types disease_is, is_caused_by, cause_symptom, has_treatment and other. This helps to capture relations that the internal module may have missed.
These semantic relationships are then mapped to the concepts identified by the previous module and arranged into semantic templates that put the subject, the object and their properties into a fixed format.
Knowledge is extracted at sentence level and word level and represented as concepts and their relations, which are later converted into ontology elements by the ontology creator.
The ontology creator's task is to define primary concepts such as the disease, creature and microorganism types. It can then create inter-relations between noun and verb phrases and place them at the correct position in the ontology.
This ontology is used for further development and for answering the questions users are allowed to ask the system. The learned ontology is the output of text mining and ontology development.
(Table C)
(Questions such as Can you bring me the pen? are not considered here.)
The output is based on:
Question type: the type of entities present in the question.
Expected answer type: the semantic type of the expected answer.
Goal: categorize the question into different semantic classes based on the nature of the question and the expected answer type.
Question semantic classes: used to identify the matching rdf:type of the search phrase.
Example:-
SELECT ?cause
WHERE{
Chapter 5
6. Analysis and Design
6.1. High Level Architecture
Figure 6:2 Text to Ontology Design
6.2.1. Web Spider - Crawler
The task of this crawler is to absorb data from a given web source, focused on web content predefined for the crawler. From that, the system gathers a set of documents for each web page and web context separately. These documents are classified into answer types and question types and stored in separate data directories for future reference in the ontological learning procedure.
6.2.2. Tokenizer
After a document is fed into the system, the tokenizer first identifies sentences and then splits them on whitespace to generate tokens. With the help of regular expressions we made this process more accurate by removing punctuation. At this stage we also handled stopwords (e.g. is, am, the, with, etc.) in order to optimize the tokenization process.
e.g.
Sentence: People with flu can spread it to others up to about 6 feet away.
Tokenized: ['People', 'with', 'flu', 'can', 'spread', 'it', 'to', 'others', 'up', 'to', 'about', '6', 'feet', 'away']
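This step can be sketched in a few lines of Python. This is a simplified stand-in for the NLTK-based pipeline; the regular expression and stopword list below are illustrative assumptions, not the exact ones used in the system:

```python
import re

# Illustrative stopword list; the list used in the actual system is larger.
STOPWORDS = {"is", "am", "are", "the", "a", "an", "with"}

def tokenize(sentence, remove_stopwords=False):
    # \w+ keeps word characters, which also drops punctuation.
    tokens = re.findall(r"\w+", sentence)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens

sentence = "People with flu can spread it to others up to about 6 feet away."
print(tokenize(sentence))
```

With `remove_stopwords=True` the same call also drops words such as "with", which matches the optimization described above.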
In the design, POS tagging comes under new relation type generation, which involves identifying verbs (VB, VBD, VBG, VBN, VBP and VBZ) and adverbs (RB, RBR and RBS). This module also includes a semi-supervised method based on bootstrapping and hand-built patterns.
6.2.6.2. Relationship Categorization via Verb Relations
This is a further approach to identify relationships that are not categorized under the hand-built patterns. The module uses a set of natural language processing techniques to categorize sentences by identifying the verb and adverb relations they contain.
Using the NLTK POS tagger, the words in a sentence can be tagged appropriately. From the tagged verbs and adverbs we can then create new relationship types to match the named entities identified by the previous modules. The POS tags considered for verbs are VB, VBP, VBZ, VBD, VBG and VBN, and those for adverbs are RB, RBR and RBS. A simple regex pattern that recognizes tags starting with V or R can therefore extract the verbs and adverbs. Words are fed to this section in order, and by keeping that order we create new relationship types from the verbs and adverbs.
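A minimal sketch of this filtering step, assuming the sentence has already been POS-tagged (the tagged pairs below are hand-written for illustration rather than produced by the NLTK tagger):

```python
import re

# (word, POS tag) pairs as a POS tagger such as NLTK's would produce them.
tagged = [("Flu", "NNP"), ("spreads", "VBZ"), ("mainly", "RB"),
          ("through", "IN"), ("droplets", "NNS")]

def verb_adverb_phrase(tagged_words):
    # Keep words whose tag starts with V (verbs) or R (adverbs),
    # preserving sentence order to form a new relationship type.
    kept = [w for w, tag in tagged_words if re.match(r"^[VR]", tag)]
    return "_".join(kept).lower()

print(verb_adverb_phrase(tagged))  # spreads_mainly
```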
The feature vectors are then fitted to the Random Forest classification model, which allows the system to predict classes for newly entered sentences. To do this we store the trained model in a dump file using pickle. When a new data element needs to be classified, we simply load the model file from the dump and use it as the classification model.
The model has been trained to classify sentences into five classes: is_caused_by, cause_symptom, has_treatment, disease_is and other. These are very basic classes for the data, which keeps the model simple and increases its accuracy.
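A minimal sketch of this train, dump and reload cycle, assuming tf-idf feature vectors are already available (the toy vectors and the file name below are illustrative):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors and class labels standing in for the real tf-idf data.
X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y = ["disease_is", "is_caused_by", "disease_is", "has_treatment"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Persist the trained model so prediction does not require retraining.
with open("relationship_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: reload the dump and classify a new sentence vector.
with open("relationship_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict([[1, 0, 1]])[0])
```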
6.2.6.3.1. TF-IDF
TF-IDF stands for Term Frequency and Inverse Document Frequency. The idea behind this concept is to identify the most important words. Words such as is, a and the appear in almost all documents and are therefore least significant. Applying tf-idf identifies the significant words by scoring them with the following equations.
TF-IDF = TF x IDF
where TF stands for Term Frequency and IDF stands for Inverse Document Frequency.
TF = (number of times term t appears in a document) / (total number of words in that document)
This implies that a term which appears in few documents, but many times within a single document, is a highly important term.
Eg:
D1: [1,1,1,0,1,0,0]
D2: [1,0,1,1,0,1,1]
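The scoring can be sketched in a few lines of Python, here using the common idf definition log(N / df(t)); the two-document corpus is a toy example:

```python
import math

docs = [["flu", "spreads", "flu"], ["flu", "causes", "fever"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "flu" appears in every document, so its idf (and tf-idf) is zero.
print(tf_idf("flu", docs[0], docs))
# "spreads" is rare across the corpus, so it scores higher.
print(tf_idf("spreads", docs[0], docs))
```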
In our case we configured the algorithm with 100 estimators (100 decision tree subsets).
The algorithm is found under the sklearn ensemble module and can be used in the same standard way as the SVM and Naive Bayes algorithms.
6.2.7. OWL - Ontology Creator
A list of semantic relations is extracted from the given sentence. This structure is guided by semantic templates, which allow the system to find the most suitable method to update the ontology based on the extracted knowledge. Primary concepts are created as noun phrases, with verb phrases as inter-related concepts, and this guides the concepts towards their original placement in the ontology.
The following is a sample semantic template which demonstrates how a subject and an object are arranged on a specific template to match and map into semantic RDF triples.
eg.
An ontology is an explicit, formal specification of a shared conceptualization [a formal definition]. In other words, an ontology is an abstract model, understandable by machines, that represents shared knowledge from which information can be extracted or new knowledge inferred.
An OWL ontology has entities that represent OWL classes and two types of property assertions: object properties and data properties. The individuals of the classes are mapped to each other using these property relationships.
:disease a owl:Class
:people a owl:Class
In this case a (rdf:type), isDiseasTypeOf and :infected can be identified as object properties which create the relations in each context, and the corresponding individuals have been mapped into their classes.
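A minimal sketch of how such class and property assertions can be emitted as Turtle text from extracted (subject, predicate, object) triples; the prefix and the individual/property names below are illustrative assumptions:

```python
# Extracted concepts and relations standing in for the ontology creator's input.
classes = ["disease", "people"]
triples = [("dengue", "a", "disease"),
           ("dengue", "infected", "people")]

def to_turtle(classes, triples, prefix=":"):
    # Declare each primary concept as an owl:Class, then assert the triples.
    lines = [f"{prefix}{c} a owl:Class ." for c in classes]
    for s, p, o in triples:
        pred = "a" if p == "a" else f"{prefix}{p}"
        lines.append(f"{prefix}{s} {pred} {prefix}{o} .")
    return "\n".join(lines)

print(to_turtle(classes, triples))
```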
The ontology can be updated as required after the first upload to the server via HTTP PUT requests. The advantage of using PUT is that it overwrites the entire ontology, so all modifications made to the local ontology are also reflected in the online ontology.
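A minimal sketch of building such a PUT request, assuming a Fuseki-style graph store endpoint (the server URL, dataset name and file name are illustrative assumptions):

```python
from urllib import request

def build_put_request(server, dataset, ontology_path):
    # Fuseki exposes the default graph of a dataset at /<dataset>/data.
    url = f"{server}/{dataset}/data?default"
    with open(ontology_path, "rb") as f:
        body = f.read()
    # PUT replaces the whole graph, so local edits overwrite the remote copy.
    return request.Request(url, data=body, method="PUT",
                           headers={"Content-Type": "text/turtle"})

# Example (requires a running SPARQL server):
#   req = build_put_request("http://localhost:3030", "medical", "ontology.ttl")
#   with request.urlopen(req) as resp:
#       print(resp.status)
```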
The natural language question is first presented to the preprocessor, which removes or replaces unnecessary features and enhances the features that benefit classification. Once the question is processed it is presented to the vectorizer, which uses its predetermined feature set to vectorize the question so that it can be fed to the classifier in vector form. The classifier outputs the class of the question based on the expected answer and the question type.
ABT_diseases: Questions that directly request information about a particular disease. (ex: what is chickenpox?, want to know more about arthritis)
ABT_symptoms: Questions that request information about the symptoms of a particular disease. (ex: what are the symptoms of heart attack?, how to diagnose dengue fever?)
ABT_prev: Questions that request information about the prevention methods for a disease. (ex: how to prevent cholera?, how to counter osteonecrosis?)
IS_sym_dis: Questions that ask to guess the disease from given symptoms. (ex: Are red spots and fever symptoms of dengue fever?, Difficult breathing and muscle twitching)
IS_cau_dis: Questions that check whether a particular behaviour leads to a disease. (ex: Does STD transmit via public bathrooms?, Does excessive drinking lead to liver problems?)
IS_prev_dis: Questions that check whether a particular prevention method works for a disease. (ex: Do vaccines prevent dengue fever?, Can I reduce the risk of having cholera by boiling drinking water?)
ABT_risk: Questions about the risk from particular diseases. (ex: Is chickenpox fatal?, can cataract lead to blindness?)
ABT_treatment: Questions asking about the treatment for a particular disease. (ex: how to treat skin-burns?, how to cure chickenpox?)
6.3.1. Question Preprocessing
The question preprocessor takes the natural language question and processes it into a form more suitable for vectorization. Its purpose is to enhance the features of the question in order to increase the accuracy of classification.
In each question a disease can be represented by a different number of words; here only unigrams and bigrams are considered. First, each bigram of the question is checked against words and phrases representing diseases, and if a match is found it is replaced with a single replacement word (ex: disease). The rest of the words are then used to form unigrams, and each unigram is checked in the same way. Two methods are used to identify the disease words. The first method uses the NLTK wordnet library as a source to identify hypernyms for the given word or phrase. The list of hypernyms is then compared with a set of root words that represent diseases; if a similarity is found, the word is identified as a disease and replaced with the replacement term.
The following is the set of root words used:
{'disease', 'illness', 'cancer', 'contamination', 'defect', 'disorder', 'epidemic', 'fever', 'flu', 'sickness', 'syndrome'}
The second method looks for disease words in the ontology. It uses a predefined SPARQL query to check whether a particular term is represented as a disease in the ontology, searching for instances that are of type disease and that share a common name with the compared word or phrase.
6.3.3. N-grams
If the feature set consists of only unigram words, different sentences consisting of the same words (but in a different order) will be represented as the same sentence.
Example: I can go home and can i go home will have the same vector if only unigrams are considered as the feature set. With N-grams, however, can i and i can are represented as two different bigrams, which results in two different vectors. It is therefore important to use N-grams to preserve the sentence structure when a sentence is represented as a vector.
6.3.4. Vectorization
Questions must be represented in a vector form in order to train a classifier. There are several methods
that can be used to vectorize a text content.
Bag of words
Word2vec
Vector space model representation
Bag of words creates a feature vector using each word available in the dataset (documents), and that feature vector is used to vectorize a given statement or document. It considers the availability of each word and the frequency with which features appear; every word carries a similar weight in this representation, and it does not consider the semantics or structure of the documents. Word2Vec is a recent deep learning approach that vectorizes words in a way that captures their relations with other words. However, to apply this method the dataset must contain a minimum of over one million words, and when initially tested it resulted in very poor accuracy. Thus, as a moderate approach, the vector space model representation is used for vectorization.
idf(t, d) = ln( (1 + n) / (1 + df(t, d)) ) + 1
Here t is the term (word) and d is the document (in our case the question) for which the idf is needed, n is the number of documents (questions) in the dataset, and df(t, d) is the number of documents (questions) in which the term t is present. The idf is multiplied by tf (term frequency) to calculate the tf-idf value for each term. The tf-idf representation ensures a good balance between common words and rare words. After the tf-idf vector is calculated it is normalized using the following method.
v_norm = v / ||v||, where ||v|| = sqrt(v1^2 + v2^2 + ... + vn^2)
6.3.6. Classification
Vectorized questions are subjected to a classifier to identify the class of each question. To find the better classification method, three types of classifiers are used and their accuracies compared with each other. Since the dataset is quite small, the following classification methods were selected; the better classifier is chosen based on its accuracy on the training and testing sets.
Logistic Regression.
Gaussian Naive Bayes.
Support Vector Machines.
Logistic regression is a regression model used for classification. It is a binary classifier that can be extended to multiple classes using one-vs-rest or cross-entropy loss schemes. The hypothesis is a sigmoid function; inputs for which the hypothesis outputs a value greater than or equal to 0.5 are classified as class 1 and the others as class 0. The cost function and the gradient descent approach used to minimize it are given below.
J(theta) = -(1/m) * sum_i [ y(i) * log h_theta(x(i)) + (1 - y(i)) * log(1 - h_theta(x(i))) ]
theta_j := theta_j - (alpha/m) * sum_i ( h_theta(x(i)) - y(i) ) * x_j(i)
Gaussian naive Bayes is considered the baseline (popular) classification method for text data. Since it considers the frequency with which words appear in documents, it is better suited to larger document classification tasks; however, it is a simple and fast model that can be implemented with little preprocessing. Naive Bayes is a conditional probability model that calculates the probability of a hypothesis given a set of conditions. Assume X is an n-dimensional vector representing a certain sentence. Conditional probability can be used to calculate the probability that the sentence belongs to a particular class (type) given that vector X (P(Class=ABT_diseases | X) is the probability that the class of the question is ABT_diseases given the question vector X). To achieve this, Gaussian naive Bayes uses Bayes' theorem in combination with the chain rule. In this experiment the Gaussian naive Bayes classifier is used as a baseline against which to measure the accuracy of the other two classifiers.
6.3.9. Support Vector Machine
A support vector machine is a binary classifier that uses a separating hyperplane to classify data points. Given labeled training data, the algorithm calculates an optimal hyperplane which categorizes new examples. Since this is a binary classifier, a one-against-one approach is used for multi-class classification problems. An advantage of the SVM is that it can handle data points that are not linearly separable by using a non-linear kernel (ex: RBF).
Using the same dataset and vectorizer, each classifier is trained, and their accuracies on the training and testing sets are compared to identify the best classifier for this application. Once the question type is identified by the selected classifier, the question class is presented to the answer generation module.
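A minimal sketch of this comparison, assuming the questions are already vectorized (the toy vectors and the tiny train/test split below are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Toy tf-idf-style vectors and question classes standing in for the real data.
X_train = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
y_train = ["ABT_diseases", "ABT_symptoms", "ABT_prev",
           "ABT_diseases", "ABT_symptoms", "ABT_prev"]
X_test, y_test = [[1, 0, 0], [0, 1, 0]], ["ABT_diseases", "ABT_symptoms"]

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # score() reports accuracy = correct predictions / total number of tests.
    print(name, clf.score(X_test, y_test))
```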
The answer generation module takes the natural language query (question) as its initial input and generates answers (output) using a semantic approach. First, the module recognizes the entities in the question using natural language processing, with the question type supplied by the question classification module as an input. It then recognizes the medical terms in the question using the Unified Medical Language System (UMLS) and generates SPARQL queries from the recognized semantic types and entities. The generated SPARQL queries search the ontology, which is the output of the ontology generation module and the input to the answer generation module. When ambiguity arises in selecting an answer, the module asks the user clarifying questions so as to generate the best answer.
This phase uses the user's question and the question type from the classification module as its initial inputs.
First, the question is tokenized into sentences and words. A filtering process then removes stop words from the tokenized words. Next, synonyms are generated for each filtered word using WordNet, and the entities that have a connection to the ontology are identified. In general, in this phase the module identifies what the user is asking about and the ontology classes that hold the answers to the specific user question.
For the question above, this phase recognizes that there is a disease in the question and that the user needs an answer about the symptoms of that specific disease.
The medical term recognition phase identifies medical terms such as disease names and symptoms using the Unified Medical Language System (UMLS). UMLS is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems. This phase uses the UMLS Metathesaurus, which includes the Semantic Network, Lexicon and Lexical Tools. Searching in UMLS has two sub-phases: search by term and search by Concept Unique Identifier (CUI).
Ex: Arthritis
Search Results (2219)
C0003864 Arthritis
C0003865 Arthritis, Adjuvant-Induced
C0003868 Arthritis, Gouty
C0003869 Arthritis, Infectious
C0003872 Arthritis, Psoriatic
C0003873 Rheumatoid Arthritis
C0003875 Arthritis, Viral
C0003892 Neurogenic arthropathy ...
CUI: C0003864
Semantic type: Disease or Syndrome
At first, medical term detection was attempted with WordNet, but it failed to detect some diseases whose names have more than one word. To increase the accuracy of the process, the search for medical terms is therefore done with the Unified Medical Language System.
6.4.3. SPARQL Query Generation
The final output of the Named Entity Recognition and Medical Term Recognition phases is an array consisting of all the classes, medical terms and their values.
Ex: What leads to diabetes and what are the symptoms of it?
Output: [[Class:Disease, Value:Diabetes], [Class:Cause, Value:''NULL''], [Class:Symptom, Value:''NULL'']]
The SPARQL query generation phase generates semantic triples, consisting of subject, predicate and object, for every sub-array consisting of a class and a value. The phase identifies NULL values as the terms that need to be answered. The query uses random variables for the unrecognized subjects and then builds relations between the triples. Using this process the system generates SPARQL queries. The generated queries are then run on an Apache Jena Fuseki SPARQL server to extract the answer from the ontology, which is the output of the Ontology Generation Module. The reason for using an ontology instead of a relational database is that ontologies are capable of inferring implicit information from existing details and relations.
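A minimal sketch of turning such class/value pairs into a SPARQL query; the predicate naming scheme and variable names are illustrative assumptions, not the exact ones the system generates:

```python
def build_query(pairs):
    # pairs: list of (class, value); NULL values become variables to solve for.
    select_vars, patterns = [], []
    for i, (cls, value) in enumerate(pairs):
        var = f"?rv{i}"
        if value == "NULL":
            select_vars.append(var)
            patterns.append(f"?s base:has{cls} {var} .")
        else:
            patterns.append(f'?s base:has{cls} "{value}" .')
    body = "\n  ".join(patterns)
    return f"SELECT {' '.join(select_vars)}\nWHERE {{\n  {body}\n}}"

query = build_query([("Disease", "Diabetes"),
                     ("Cause", "NULL"), ("Symptom", "NULL")])
print(query)
```

Here the two NULL slots become the variables to be answered, while the recognized disease value is fixed in the graph pattern.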
When there is ambiguity in selecting an answer for a given question, the system asks the user a question to clarify the original one, then regenerates the SPARQL queries for the clarified question and extracts the answer from the ontology.
CASE 01
Ex: I have excessive thirst and increased urination. My vision is getting blurry. Do I have diabetes?
Output: You have a high probability of having diabetes.
Here the medical term recognition phase recognizes excessive thirst, increased urination and blurry vision as symptoms, then searches for these symptoms in the ontology using SPARQL queries. The search does not find any other disease with the same symptom set, so no ambiguity occurs and diabetes is selected as the answer.
CASE 02
Ex: I have fatigue, breathing difficulties and coughing up blood. Recently I have lost a lot of weight. What might I be having?
Output 1: Do you have pain in your bones?
If the user inputs YES, Output 2: You have a probability of having lung cancer.
If the user inputs NO, Output 2: You have a probability of having bronchiectasis.
In this question the recognized symptoms are fatigue, breathing difficulties, coughing and weight loss. When searching for the symptoms, the module finds the same symptom set in two places: both lung cancer and bronchiectasis have all the given symptoms. This is the case where ambiguity occurs. The system therefore compares the symptoms of the two selected diseases and finds a symptom unique to one of them. For this question the system identifies that bronchiectasis does not show the symptom of bone pain and asks the user whether he has bone pain, so as to disambiguate the disease. Based on the user's input, the system selects the best answer.
Chapter 6
7. Implementation
7.1. Knowledge Extraction to Ontology automation
This section gives a detailed explanation of the experiments carried out and the conditions under which they were conducted. In some cases the experiments were carried out at several levels, and the details and accuracy improvements are stated within this section. For knowledge extraction we focused on identifying key concepts and their relations using the following models: an NER model to classify entities and a relationship classification model to predict the classification of a given sentence.
The flow model [Figure 6.1] describes the way the process has been conducted. Unstructured data were gathered from a large corpus relating to diseases, from the CDC [11]. Those data were then preprocessed into words and their labels with the guidance of the CDC and other medical information providers such as PubMed and the WHO, in order to classify the words into appropriate categories. The dataset that was constructed is attached in the Appendix.
These manually tagged data were split into a training set and a test set to train the model and test its accuracy. The labeled data were trained and fitted into a model using the CRF classifier. To train the model, a separate batch file was written; it includes the path of the Java libraries to be included and the path of the property file, which holds all other details, from feature selection to the model to be trained from the dataset.
trainFileList = classifiers/dis_train4_BOI.tok
serializeTo = classifiers/dis-bio-ner-model.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
useDisjunctive = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 3
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
pandas data extraction objects are assigned separately. The data are read with the following pandas settings: the first line of the document contains the column names; the encoding is set to ISO-8859-1 because some characters inside the documents were not supported by UTF-8; the delimiter explains how the columns are separated, with \t indicating tab separation; and quoting is set to 3 to state that any double quotes inside the document are ignored.
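These settings correspond to a read_csv call along the following lines (the sample file written here is illustrative; the real run reads the crawled dataset):

```python
import pandas as pd

# Create a tiny tab-separated sample file for demonstration.
with open("dataset.tsv", "w", encoding="ISO-8859-1") as f:
    f.write("sentence\tlabel\n")
    f.write('Flu spreads via "droplets"\tcause_symptom\n')

# header=0: first line holds the column names; quoting=3 (csv.QUOTE_NONE)
# keeps double quotes as ordinary characters; \t separates the columns.
data = pd.read_csv("dataset.tsv", header=0, delimiter="\t",
                   quoting=3, encoding="ISO-8859-1")
print(list(data.columns), len(data))
```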
This vectorizer helps fit our preprocessed data into a vector model and transform it into features. In our case we set the maximum number of features to 5000. The feature vector then has to be converted into an array to perform training with the classification models.
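A minimal sketch of this step with scikit-learn (the sample sentences are illustrative; the real run uses the crawled corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["flu is caused by a virus",
          "dengue fever causes muscle pain",
          "rest is a treatment for flu"]

# Cap the vocabulary at 5000 features, as in the described configuration.
vectorizer = CountVectorizer(max_features=5000)
features = vectorizer.fit_transform(corpus)

# Classifiers such as the Random Forest expect a dense array.
X = features.toarray()
print(X.shape)
```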
This model has been trained using several algorithms: RFC, the logistic regression model and the SVM LinearSVC model. The training results are evaluated in the Evaluation chapter of this dissertation.
When training the Random Forest classifier, the number of estimators was set to 100 trees. This creates 100 decision trees to predict the final result for an input sentence.
7.1.3. Hand built patterns to generate relations
Hand-built patterns form a relationship extraction model that recognizes the relationship between two concepts: a subject and its object, linked by a property. From the NER model we recognize the subject and object with their related classes; we then use these patterns to recognize the general relationship category that the relationship expresses.
Figure 7:2 Relationship Construction Mechanism
requires data to be arranged in the way that the semantic templates have been organized. A more detailed structure of the semantic templates is given in the design part.
The dataset includes over 900 questions, with a similar number of questions representing each class. Since the collected questions were unstructured, the class of each question had to be entered manually. The next batch of testing questions was generated as a byproduct of the ontology generation process; instead of manually going through each question, the trained classifier was used to classify them into classes, and the errors were then corrected manually. The testing question set is, however, slightly biased towards particular question classes. It is important to have roughly the same number of questions for each class when training the classifier. The questions were collected from various medical forums using web crawlers and the Python library beautifulsoup (which is used to collect data from various forms of web pages).
Training dataset: 909 questions, 9 classes, roughly 100 questions per class.
Testing dataset: 600 questions, with the number of questions slightly biased towards some classes.
7.2.1. Preprocessor
The preprocessor is used to enhance the features of each question before it is submitted to the vectorizer. The NLTK lemmatization and tokenization modules are used for the early stages of the preprocessor.
question_list = dataset
preprocessed_questions = []
FOREACH question IN question_list:
    tokenized_question = TOKENIZER(question)
    lemmatized_tokenized_question = []
    FOREACH word IN tokenized_question:
        lemmatized_tokenized_question += LEMMATIZATION(word)
    preprocessed_questions += lemmatized_tokenized_question
RETURN preprocessed_questions
Once a question is tokenized and lemmatized, it is presented to the disease word replacement module. This module searches each sentence for words or phrases that represent a disease and replaces them with the single term disease. To increase efficiency, bigrams representing disease phrases are removed and replaced first; after that, the remaining words are checked as unigrams.
NLTK is a natural language processing toolkit/platform. It contains a corpus named WordNet, a lexical database for the English language. WordNet can be used to find the meanings of words, synonyms, antonyms and also hypernyms.
Hypernyms: a hypernym of a word is a word with a related but broader meaning; for example, a hypernym of car is automobile. Using the NLTK WordNet corpus, the hypernym path for a given word can be found.
Example: the hypernym path for the word heart_attack is
entity > physical_condition > disorder > cardiovascular_disease
If a particular word's hypernym path leads to words such as disease, illness, cancer, contamination, defect, disorder, epidemic, fever, flu, sickness or syndrome, it can be decided that the word represents some kind of disease name. The following function is used to find hypernyms using the NLTK WordNet corpus.
FIND_DISEASE_WORD(word):
    hypernyms = FIND_HYPERNYMS(word)
    FOREACH hypernym IN hypernyms:
        IF hypernym IN predefined_disease_word_set:
            RETURN TRUE
    RETURN FALSE
The second method is to use the central ontology from which the answers are generated. There, diseases are recorded as instances of rdf:type disease. Thus, SPARQL queries can be used to find out whether a particular disease exists with the given name.
SELECT (COUNT(?disease) AS ?count)
WHERE {
?disease rdf:type base:disease.
?disease base:label ?strName.
FILTER regex(?strName, "chickenpox", "i").
}
If the count returns a value of 1 or more, it can be considered that a disease exists with the name chickenpox.
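A small helper can generate this query for any disease name before sending it to the SPARQL endpoint. The prefixes follow the query above; the helper function itself is illustrative:

```python
def disease_exists_query(name):
    # Case-insensitive label match against disease instances in the ontology.
    return (
        "SELECT (COUNT(?disease) AS ?count)\n"
        "WHERE {\n"
        "  ?disease rdf:type base:disease.\n"
        "  ?disease base:label ?strName.\n"
        f'  FILTER regex(?strName, "{name}", "i").\n'
        "}"
    )

print(disease_exists_query("chickenpox"))
```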
7.2.3. Vectorization Process
vectorizer =TfidfVectorizer(min_df=1,ngram_range=(1,2),token_pattern=r'\b\w+\b')
min_df, the minimum document frequency, is set to one: a word is only included in the feature set if it appears at least once in the training dataset. ngram_range is set to (1, 2) to consider both unigrams and bigrams. TfidfVectorizer performs the function of CountVectorizer (creating the feature set and forming vectors from the term frequencies) and additionally weights each frequency value by the rarity of the word, converting it to a tf-idf value; its output vectors are also normalized. Once the vectorizer is created it is saved using pickle so that it can be reused in the training, testing and usage phases of the classifier.
Preprocessor
Without disease word replacement.
With disease word replacement.
Vectorizer
ngram_range set to (1, 1)
ngram_range set to (1, 2)
ngram_range set to (1, 3)
Logistic regression
multi_class = multinomial, which means cross-entropy loss is used for multi-class classification (recommended in [PAS_6]).
7.2.5. Accuracy calculation
For both training set and testing set accuracy is calculated as shown below.
Accuracy = Correct Predictions / Total number of tests
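This calculation is a single line of code; a sketch with illustrative predictions:

```python
def accuracy(predicted, actual):
    # Accuracy = correct predictions / total number of tests.
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

predicted = ["ABT_symptoms", "ABT_prev", "ABT_risk", "ABT_prev"]
actual    = ["ABT_symptoms", "ABT_prev", "ABT_treatment", "ABT_prev"]
print(accuracy(predicted, actual))  # 0.75
```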
The Python Natural Language Toolkit (NLTK) was used to tokenize the questions and remove stop words in the Named Entity Recognition phase. The first approach to finding entities is to search for synsets in WordNet and obtain the hypernym path of each synonym; the classes are then identified by matching synonyms. To ensure that the module does not miss any entities in a question, a second entity-finding approach is executed over the tokens. This approach has a predefined synonym set for each class in the ontology: the tokens are iterated through those synonym sets to identify the corresponding ontology classes. Before searching for matching synonyms, morphological affixes are removed from the tokens and from the given synonym sets so as to reduce all words to their basic form. This is done using the Porter Stemmer function in NLTK. Pseudo code for this approach is given below.
FOR i IN RANGE(token_size):
    token = tokens[i]
    row_number = 0
    FOR syn_set IN synonyms_list:
        row_number += 1
        FOR synonym IN syn_set:
            IF STEMMING(token) == STEMMING(synonym):
                class = ontology_entities[row_number - 1]
                RETURN class
7.3.2. Medical Term Recognition
Medical term recognition is done through the UMLS API; UMLS version 2016AB is used for the search.
Authentication for accessing the API is obtained with the API key given by UMLS. First, words are searched through the API to identify the matching CUI (Concept Unique Identifier); the selected CUI is then searched again in the API to obtain the semantic type. The output of the API requests is returned in JSON format.
7.3.2.1. Pseudo code for searching by term:
pageNumber = 0
WHILE TRUE:
    access = AuthClient.getst(AuthClient.gettgt())
    pageNumber += 1
    query = {'string': string, 'ticket': access, 'pageNumber': pageNumber}
    query['searchType'] = "exact"
    r = requests.get(uri + content_endpoint, params=query)
    r.encoding = 'utf-8'
    items = json.loads(r.text)
    jsonData = items["result"]
    FOR result IN jsonData["results"]:
        cui = result["cui"]
        RETURN cui
7.3.3. SPARQL Query Generation
The input to this phase is an array of class and value tuples. The value the module needs to search for is marked as NULL. For each class-value tuple a SPARQL query is generated, and the queries are then matched with relations. Pseudo code for generating the queries is given below (?rv is a random string variable).
SPARQL queries will be run via Apache Jena Fuseki SPARQL server.
IF answer_count == 1:
    RETURN answer
ELSE IF answer_count > 1:
    unique_value = NONE
    FOR value IN answer1:
        IF value NOT IN answer2:
            unique_value = value
    IF unique_value IS NOT NONE:
        question_generation(unique_value)
        IF answer_unique_value == "YES":
            RETURN answer1
        ELSE:
            RETURN answer2
    ELSE:
        RETURN answer_list
ELSE:
    RETURN "not enough facts"
Chapter 7
8. Evaluation
8.1. Ontology Automation Modules Evaluation
8.1.1. Data Retrieval
The objective of this model is to automate data gathering from online web sources. This has been successfully achieved via a web spider which crawls through web pages and organizes the data into local directories. Initial approaches were carried out by retrieving information from individual semi-structured web pages; however, the approach was further developed from semi-structured to unstructured data retrieval. The control experiments were limited to a single web domain in order to gather most of the data significant to ontology development in the MEDICAL domain.
8.1.2. Conceptualization
The Conceptualization model was developed using Stanford's CRF classifier by training an
NER tagging model. The CRF classifier is a suitable approach for many applications such as
POS tagging, binary classification and named entity classification.
These accuracies were obtained via a Python-based accuracy score tester which calculates
precision, recall and their F1-score measures.
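The arithmetic behind such a score tester can be reproduced in a few lines; this is a generic per-label precision/recall/F1 computation, not the project's exact script:

```python
def prf1(gold, predicted, label):
    """Precision, recall and F1 for one label over parallel tag lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["B-DISEASE", "O", "O", "B-DISEASE"]
pred = ["B-DISEASE", "B-DISEASE", "O", "O"]
print(prf1(gold, pred, "B-DISEASE"))  # (0.5, 0.5, 0.5)
```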
8.1.3. Relationship Identification
During the implementation of this model it was difficult to find datasets relevant to the
selected domain. Therefore, datasets were prepared using web-based data sources and
tagged iteratively while training the model and improving its accuracy. Figure 6.1.2.1
indicates the level of training carried out and the accuracy gained after improving the
classification results.
This shows that the model reaches higher accuracies as the training data set
improves. Zhou et al. [reference] report that a similar approach obtained a maximum
accuracy of around 69%.
this ontology generation. By following several Named Entity Identification and Relationship
Extraction techniques, this module has achieved its objective.
SVM 86.55
Table 8:7 without disease word replacement module
SVM 73.12
The test question set consisted of a small number of questions that had varied meanings from their respective
classes. Questions combining symptoms and preventions were often identified incorrectly.
For example, "What are the symptoms and preventions of heart attack" was identified as ABT_symptoms,
and "What are the preventions and symptoms of heart attack" was identified as ABT_Prevention.
However, the best class for such questions is ABT_diseases, since users can then get the overall
information about the disease, including the symptoms and prevention methods. The testing set did not have
enough examples from the IS_sym_dis, IS_caus_dis and IS_prev_dis classes; however, those questions
were correctly predicted.
From both the training set and testing set accuracies it is clear that using bigrams can increase the accuracy
by a significant amount. However, going from bigrams to trigrams did not increase the accuracy in
most cases. Logistic regression showed high accuracy in all situations. The disease word replacement
module has a significant impact on the accuracy of the classifier: roughly a 10% increase in accuracy
can be seen between running with and without it.
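The bigram effect can be illustrated with a minimal feature extractor (the function and its feature naming are illustrative, not the module's actual code):

```python
def ngram_features(tokens, n_max=2):
    """Unigram and bigram string features for a tokenised question."""
    feats = list(tokens)  # unigrams
    if n_max >= 2:
        # Join adjacent token pairs into bigram features.
        feats += [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return feats

q = "what are the symptoms and preventions of DISEASE".split()
print(ngram_features(q))
```

Bigrams such as "symptoms_and" and "and_preventions" preserve word order, which is exactly what lets a classifier separate "symptoms and preventions" from "preventions and symptoms"; the two questions have identical unigram sets.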
The multiclass classification module of 6 primary classes in the literature reported a maximum of 92.2% accuracy using
bigrams, headwords and wh-words [6] and over 5000 training examples. However, this classifier with 9 classes
managed to achieve similar accuracy with fewer training examples. Since this work is focused on the medical
domain, the classes used for classification are more specific than the classes used in the multi-class classifier
mentioned in the literature review section [6]. In order to change the domain, a new generalized class system
needs to be introduced for the classification process.
more information. Some types of questions and the system's ability to provide a correct answer
are listed below.
Table 8:8 Ability of providing correct answer
Chapter 9
9. Conclusion and Further Work
Throughout the one-year time frame of this project, the data extraction and ontology
automation model has certainly taken its first steps towards ontology definition for any given
domain. In other words, by following the initial steps of this work, another
ontology definition can be developed for another domain. However, a drawback of this model is that it requires
finding large data sets, and the trained data has to be improved manually to improve the accuracy of
the model. For general topics this model can be fixed with the availability of datasets. Relationship
extraction is heavily dependent on the NER and supervised relationship recognition models. The
NER model is at a satisfactory level, with accuracies above 75%, so entities can be
recognized with a high probability. But relationship extraction is below the satisfactory
level (around 63%). This undesired outcome could be a result of the domain-specific
approach used to train the model. However, with an increased volume of data this can be further
improved, at least up to 70%.
This work presents a semantic and linguistic based approach for the extraction of medical
entities using semantic relations in the medical domain. The approach has five main steps.
The accuracy of the module is based on the precision of the identified terms and entities. The usage of
UMLS and WordNet has increased the performance of the system. In addition, the effectiveness
and accuracy of the system depend on the accuracy of the question classification module. This
module is useful for common users, and it can be developed further for use by doctors through
ontology expansion and pattern recognition between diseases and other entities and relations.
Further work
References
[1] R. J. Bayardo et al., "InfoSleuth: Agent-Based Semantic Integration of Information in Open
and Dynamic Environments", Microelectronics and Computer Technology Corporation,
Austin, Texas.
[2] A. Maedche and S. Staab, "The TEXT-TO-ONTO Ontology Learning Environment",
Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany.
[3] P. Cimiano, A. Maedche, S. Staab and J. Volker, "Ontology Learning", Institute AIFB,
University of Karlsruhe, Karlsruhe, Germany.
[4] S. Mishra and S. Jain, "Automatic Ontology Acquisition and Learning", Department of
Computer Applications, Teerthankar Mahaveer University, Moradabad, U.P., 2014.
[5] M. Shamsfard and A. Barforoush, "Learning Ontologies from Natural Language Texts",
Computer Engineering Dept., Amir Kabir University of Technology, Hafez Ave., Tehran,
Iran.
[6] Z. Huang, M. Thint and Z. Qin, "Question Classification Using Head Words and Their
Hypernyms", Proceedings of the Conference on Empirical Methods in Natural Language
Processing - EMNLP '08, 2008. <http://www.aclweb.org/anthology/D08-1097>
[7] X. Li and D. Roth, "Learning Question Classifiers", 2004.
<http://cogcomp.cs.illinois.edu/Data/QA/QC/>
[9] C. Pradel, O. Haemmerle and N. Hernandez, "Natural Language Query Interpretation into
SPARQL Using Patterns", IRIT, Universite de Toulouse le Mirail, 2013.
[10] J. Lehmann and L. Buhmann, "AutoSPARQL: Let Users Query Your Knowledge Base",
University of Bonn, University of Leipzig, 2011.
[16] [Online]. Available: https://jena.apache.org/tutorials/sparql.html
Appendix A
Individual Contribution
As a member of the Beyond project group I selected the Question Classification module of
the intelligent question answering system. The goal of the Question Classification module is to train
a question classifier that can identify the question type based on the expected answer of the
question. During the early stages (before the interim) the following were my contributions to the project:
- Populating the testing ontology using data collected by traversing web pages with the
  beautifulsoup module (for testing purposes).
- Identifying the set of classes for the question classifier.
- Building a training and testing dataset based on the classes defined.
- Training the prototype classifier.
After the interim period my focus was on improving the accuracy using different methods:
- Creation of the question preprocessing system.
- Implementation of the vector based model.
- Using 3 different classifiers and tweaking small details of the training process to reach
  a higher accuracy.
- Studying the possibility of a generalized classification model that can be used to expand
  the domain.
Name of the student: I. A. Abeysekera (124003M)
I was responsible for researching a new concept: how to collect data from data sources and
map that unstructured data into an ontology. This seemed a minor task at the
beginning; however, when digging deeper I found that retrieving high quality information
is essential in order to provide quality answers. There was a high correlation
between question answering and data extraction. Therefore, I went through several related
works that have carried out similar approaches. While gathering this information my
mindset changed, and I understood how those principles are applied in text mining and data
gathering.
The following are the areas in which I was most involved during this research study:
As a group member I had to communicate constantly with my colleagues to keep the
individual processes on track. I carried out the design phase while always asking what
they needed as inputs from my system.
During the final implementation I worked mostly on improving the accuracies of the training
models. At this stage I had to carry out several test cases related to the ontology results. Because the
relations extracted from the system kept showing new patterns which had to be
categorized properly, I had to test several grammar types on chunk parsers and other relation
extraction models. I trained two separate models to recognize entities: a named entity
recognition model and a supervised relationship extraction model. This was a great new
experience for learning several machine learning approaches and their internal mechanisms.
Name of student: K.M.K.Hasantha (124069T)
As a member of the project group I chose the Answer Generation module of the Intelligent
Question Answering System. This module includes key term extraction from the natural language
query and auto-generating SPARQL queries to extract the answers from ontologies.
Appendix B
Sample Dataset used during NER classification Module
Dengue B-DISEASE
fever I-DISEASE
is O
a O
disease B-NN
caused O
by O
a O
family O
of O
viruses B-NN
that O
are O
transmitted O
by O
mosquitoes B-CREATURE
. O
Symptoms O
of O
dengue B-DISEASE
fever I-DISEASE
include O
severe O
joint O
and O
muscle O
pain O
, O
swollen O
lymph O
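Token/tag pairs in this BIO format can be grouped back into entity spans; a short, generic decoding sketch (not part of the thesis code):

```python
def bio_spans(pairs):
    """Collect (entity_text, entity_type) spans from (token, tag) pairs."""
    spans, current, ctype = [], [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            if current:  # close any open span before starting a new one
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)  # continuation of the open span
        else:  # an "O" tag ends any open span
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

pairs = [("Dengue", "B-DISEASE"), ("fever", "I-DISEASE"), ("is", "O"),
         ("a", "O"), ("disease", "B-NN")]
print(bio_spans(pairs))  # [('Dengue fever', 'DISEASE'), ('disease', 'NN')]
```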
Sample Dataset used to train Supervised Relationship extraction model
id  classification  relation
301 cause_symptom   They can also build up and cause inflammation.
302 other           Normally your blood doesn't have a large number of eosinophils.
303 is_caused_by    Your body may produce more of them in response to allergic disorders, skin conditions, parasitic and fungal infection, autoimmune diseases, some cancers, and bone marrow disorders.
304 cause_symptom   In some conditions, the eosinophils can move outside the bloodstream and build up in organs and tissues.
305 cause_symptom   Symptoms of EoE include nausea, vomiting, and abdominal pain after eating.
306 cause_symptom   A person may also have symptoms that resemble acid reflux from the stomach.
307 cause_symptom   In older children and adults, it can cause more severe symptoms, such as difficulty swallowing solid food or solid food sticking in the esophagus for more than a few minutes.
308 is_caused_by    In infants, this disease may be associated with failure to thrive.
309 has_treatement  In some situations, avoiding certain food allergens will be an effective treatment for EoE.
310 disease_is      Eosinophilic fasciitis is a very rare syndrome in which muscle tissue under the skin, called fascia, becomes swollen and thick.