

Extracting Medical Knowledge from Crowdsourced Question Answering Website

Yaliang Li, Chaochun Liu, Nan Du, Wei Fan, Qi Li, Jing Gao, Member, IEEE, Chenwei Zhang, and Hao Wu

Abstract—Medical crowdsourced question answering (Q&A) websites have boomed in recent years, and an increasingly large number of patients and doctors are involved. The valuable information from these medical crowdsourced Q&A websites can benefit patients, doctors, and society. One key to unleashing the power of these Q&A websites is to extract medical knowledge from the noisy question-answer pairs and filter out unrelated or even incorrect information. Given the daunting scale of information generated on medical Q&A websites every day, it is unrealistic to fulfill this task via supervised methods due to the expensive annotation cost. In this paper, we propose a Medical Knowledge Extraction (MKE) system that can automatically provide high-quality knowledge triples extracted from the noisy question-answer pairs and, at the same time, estimate expertise for the doctors who give answers on these Q&A websites. The MKE system is built upon a truth discovery framework, in which we jointly estimate the trustworthiness of answers and doctor expertise from the data without any supervision. We further tackle three unique challenges in the medical knowledge extraction task, namely the representation of noisy input, multiple linked truths, and the long-tail phenomenon in the data. The MKE system is applied to real-world datasets crawled from xywy.com, one of the most popular medical crowdsourced Q&A websites. Both quantitative evaluation and case studies demonstrate that the proposed MKE system can successfully provide useful medical knowledge and accurate doctor expertise. We further demonstrate a real-world application, Ask A Doctor, which automatically gives patients suggestions for their questions.

Index Terms—Crowdsourced Question Answering, Medical Knowledge Extraction, Truth Discovery

• Most of this work was done during an internship of the first author in Baidu Research Big Data Lab, Sunnyvale.
• Y. Li, Q. Li, and J. Gao are with the State University of New York at Buffalo, 338 Davis Hall, Buffalo, NY 14260. E-mail: {yaliangl, qli22, jing}@buffalo.edu.
• C. Liu, N. Du, and W. Fan are with Baidu Research Big Data Lab, 1195 Bordeaux Drive, Sunnyvale, CA 94089. E-mail: {liuchaochun, dunan, fanwei03}@baidu.com.
• C. Zhang is with University of Illinois at Chicago, 851 S. Morgan, Chicago, IL 60607. E-mail: czhang99@uic.edu.
• H. Wu is with University of Southern California, 941 Bloom Walk, Los Angeles, CA 90089. E-mail: hwu732@usc.edu.

1 INTRODUCTION

With the booming of new technology, the traditional health care system is undergoing an evolution. Besides visiting a doctor in person for their health concerns, the younger generation also prefers to search for information readily available on the Web, or to ask doctors through the Internet. A recent report shows that tens of millions of health-related queries are issued every day on the Baidu search engine [1]. Globally, online health service has become a billion-dollar industry. Many health service websites have been developed, such as medhelp.org in the USA and xywy.com in China. The latter website has millions of registered users and hundreds of thousands of registered doctors.

As an emerging industry, this new type of health care service brings opportunities and challenges to doctors, patients, and service providers. Compared with the traditional one-to-one service, online medical crowdsourced question answering (Q&A) websites provide crowd-to-crowd service, so the crowd-generated information grows tremendously. For example, xywy.com alone receives thousands of new health-related questions every day. There is no doubt that the information from these crowdsourced Q&A websites is valuable [2]–[5], but how to make good use of such information is a big question. One way to utilize such information is to extract knowledge from the medical Q&A websites. The extracted knowledge can facilitate the development of many online health services. For example, the knowledge can contribute to the construction of a robot doctor that can automatically generate answers to new health-related questions.

One of the most important challenges in extracting knowledge from medical crowdsourced Q&A websites is that the quality of question-answer pairs is not guaranteed. The questions asked by patients can be noisy and ambiguous. The answers' quality varies due to reasons such as doctors' expertise, their level of commitment, and their purpose in answering questions. To extract useful knowledge, it is important to distinguish relevant and correct information from unrelated or incorrect information.

In light of this challenge, one possible solution is to label the quality of question-answer pairs and then learn classification or regression models. Such models are then used to judge the quality of each question-answer pair. However, this approach may not work on the medical Q&A dataset. To build a training set, we need to hire experts to annotate the labels, but the cost can be prohibitive to annotate enough training examples. Besides, medical knowledge is highly domain specific, and because of this, multiple experts may be needed, which further boosts the cost.


In addition, the privacy protection of patients adds to the difficulty of annotation. These issues motivate us to develop unsupervised learning methods that extract knowledge from noisy crowdsourced data on medical Q&A websites.

The key component in the medical knowledge extraction task is to find reliable answers without any supervision. Some existing truth discovery methods [6]–[10] assume that the answers from high-expertise users are reliable, and that the users who provide reliable answers should have high expertise. Thus these methods can iteratively estimate the source reliability (i.e., user expertise) and infer reliable answers from user-contributed data. Although truth discovery may be adopted, there are some unique challenges in the medical knowledge extraction task that existing truth discovery methods cannot handle, as discussed below.

First, truth discovery methods are developed for structured data (e.g., relational databases), but for crowdsourced Q&A websites, all the inputs are unstructured data (i.e., texts). We address this challenge by transforming unstructured texts into an entity-based representation. Second, trustworthy answers to a health-related question are usually not unique. There may be multiple possible answers to the same question, and those answers are likely to be correlated. Such correlation violates the assumptions held by existing truth discovery methods. To address this challenge, we model the correlation via a similarity function defined on the word vectors of answers. Third, we observe a severe long-tail phenomenon in the medical Q&A data, which makes it difficult to estimate doctor expertise and trustworthy answers. Specifically, most doctors only provide a few answers, so it is difficult to estimate their expertise accurately; many questions receive only a few answers, which may not include any trustworthy answers. To reduce the effect of the long-tail phenomenon, we propose to incorporate a pseudo count for each doctor in the estimation of doctor expertise, and we adopt the proposed entity-based representation to merge similar questions, which enlarges the answer set for many questions.

To evaluate the proposed medical knowledge extraction system, we collect medical Q&A data from xywy.com, a popular online health website in China. To quantitatively evaluate the extracted knowledge, we compare the knowledge with expert annotations. We also validate the estimated doctor expertise by comparing it with external information indicating doctor expertise crawled from xywy.com. Further, we provide some case studies which clearly demonstrate the meaningful knowledge extracted by our system, and we demonstrate the usefulness of the extracted knowledge in a real-world medical application.

In summary, our contributions in this paper are:

• We propose a truth discovery method to automatically extract medical knowledge from noisy crowdsourced question answering websites without any supervision. The proposed method provides a cost-efficient and effective way to mine knowledge from crowdsourced question answering websites.
• The proposed truth discovery method is designed to tackle the new challenges in the medical knowledge extraction task, and the experimental results on a real-world dataset confirm its effectiveness.
• Last but not least, we demonstrate a real-world medical application built upon the proposed method. This application, Ask A Doctor, shows that the extracted knowledge can enable and facilitate many online healthcare applications.

In the following, we first present an overview of the proposed Medical Knowledge Extraction system (MKE system for short). Then, in Section 3, we formally formulate the problem and present solutions to tackle the challenges in the knowledge extraction task. In Section 4, we conduct experiments on real medical Q&A datasets to validate the effectiveness of the proposed system. In Section 5, we demonstrate a mobile App that uses the proposed MKE system. We discuss related work in Section 6, and conclude the paper in Section 7.

2 OVERVIEW OF THE SYSTEM

The objective of this system is to extract knowledge triples <question, diagnosis, trustworthiness degree> from noisy question-answer pairs in medical crowdsourced Q&A websites. Meanwhile, for the doctors who give answers on these Q&A websites, their expertise will be automatically estimated.

We propose an MKE (Medical Knowledge Extraction) system that can jointly conduct the medical knowledge extraction and doctor expertise estimation without any supervision. We give an overview of the proposed system in this section. Figure 1 shows the pipeline of the proposed MKE system, and a concrete example is adopted for the purpose of illustration.

In medical crowdsourced Q&A websites, patients have various intentions when asking questions. For example, they may want to find out the possible diseases based on their symptoms, or the particular side-effects of a drug. Doctors, who play essential roles in these Q&A websites, provide answers to these questions. For the same question, multiple doctors may give different answers due to their diverse expertise.

In order to distill trustworthy medical knowledge, we propose a truth discovery method to automatically estimate doctors' expertise, and conduct weighted aggregation based on the estimated doctor expertise. To apply the truth discovery framework, we first extract entities from the texts and transform the texts into entity-based representations. The new representations are then fed into the proposed truth discovery method, which outputs the medical knowledge triples <question, diagnosis, trustworthiness degree> and the estimated doctor expertise.

Based on these outputs, various real-world applications can be built. For example, the extracted medical knowledge triples can be used to answer medical questions in Automatic Diagnosis and Medical Robot. Besides, the estimated doctor expertise can be applied in tasks such as Doctor Ranking and Question Routing, which play important roles in crowdsourced Q&A websites. The data flow of this pipeline is sketched below.
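Purely as an illustration of this data flow, the following minimal Python sketch shows how the pipeline's inputs and outputs might be represented; the types and names (Claim, KnowledgeTriple, extract_entities) are hypothetical and are not part of any published interface of the system.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set

@dataclass(frozen=True)
class Claim:
    """One doctor's answer to one question (an <object, value, source> tuple)."""
    question: FrozenSet[str]   # entity-based representation of the question text
    answer_entity: str         # e.g., a disease name found in the answer text
    doctor_id: str

@dataclass
class KnowledgeTriple:
    question: FrozenSet[str]
    diagnosis: str
    trustworthiness: float     # normalized over all diagnoses of this question

def extract_entities(text: str, dictionary: Set[str]) -> FrozenSet[str]:
    """Dictionary lookup: keep only the tokens listed in the medical dictionary."""
    return frozenset(tok for tok in text.split() if tok in dictionary)

# Toy usage with a hypothetical dictionary and one question-answer pair.
dictionary = {"headache", "fever", "cough", "flu", "cold"}
question = extract_entities("29 years old headache fever cough", dictionary)
claims: List[Claim] = [Claim(question, "flu", "doctor_17")]
```

The truth discovery stage of Section 3 would consume such claims and fill in the trustworthiness values.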


[Figure 1 appears here. It walks a concrete example through the pipeline: the question "I am 29 years old. I have been feeling the pain in my throat since the day before yesterday. I got runny nose and kept sneezing. Any idea or suggestions?" receives three answers (Doctor 1: "Hi, according to the symptoms you have, you probably got cold due to infection."; Doctor 2: "Based on your descriptions, while at the same time, considering the current weather, I would like to say it is sinus infection caused by cold weather."; Doctor 3: "Hello. Your condition very likely belongs to symptoms of upper respiratory infection."). Entity Extraction converts these into claims, Truth Discovery outputs the Discovered Knowledge triples and the Estimated Doctor Expertise (e.g., 0.70, 0.58, 0.49), and the outputs feed Applications such as Medical Robot, Automatic Diagnosis, Doctor Ranking, and Question Routing.]

Fig. 1: Overview of the Medical Knowledge Extraction (MKE) System. The illustrative example is translated from xywy.com, a Chinese medical crowdsourced question answering website.

3 METHODOLOGY

In this section, we present the technical details of the proposed MKE system, which extracts information from noisy question-answer pairs and summarizes it into medical knowledge. After formally defining the task, we introduce a basic truth discovery method, and then propose solutions to address the unique challenges in the medical knowledge extraction task.

3.1 Problem Formulation and Notations

We first introduce some important terms used in this paper:

• Question: a question from a patient contains a set of statements and a particular health concern (for example, describing symptoms and asking for possible diseases, or describing diseases and asking for drugs). It also contains some other information about the patient, such as her/his age.
• Question topic: for each question, we assume that it belongs to one particular topic, such as Pulmonology. In practice, most medical question answering websites, including xywy.com, already assign each question a pre-defined topic. We directly keep such question topic information for our system.
• Doctor: a doctor is a person who answers questions on the medical Q&A websites. On the website from which we crawl the data, the "doctors" are real doctors, though this may not be the case for other websites.
• Answer: an answer is a diagnosis provided by a doctor for a particular question. There may be multiple answers provided by different doctors for the same question, and these answers may be noisy and unreliable. Note that in the collected dataset, all the answers to the same question are independently provided by different doctors.
• Claim: a claim is a tuple that consists of a question, a doctor ID, and the corresponding answer from this doctor to this question.
• Knowledge triple: a knowledge triple consists of a question, a diagnosis, and a trustworthiness degree of the diagnosis. Knowledge triples are created by aggregating claims from multiple doctors.
• Doctor expertise: each doctor who answers the questions of a certain topic is associated with an expertise score that indicates his probability of providing trustworthy answers on this topic. As we know neither the trustworthiness of answers nor the doctor expertise a priori, we need to estimate the doctor expertise from the data, and incorporate the estimated doctor expertise into a weighted aggregation to derive the knowledge triples.

Here we formally define our task. Suppose there is a set of medical questions Q and a set of doctors D who provide answers. Let x_q^d denote the answer to the q-th question provided by the d-th doctor, and w_d denote the expertise score of the d-th doctor. The objective is to conduct weighted aggregation on the noisy data {x_q^d}_{q∈Q, d∈D} to derive knowledge triples <question, diagnosis, trustworthiness degree> and estimate the doctor expertise; a toy rendering of this input is sketched below.
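Concretely, and with made-up data only, the noisy input {x_q^d} and the expertise scores w_d can be held as plain mappings; the variable names below are our own.

```python
from typing import Dict

# x[q][d] = answer entity given by doctor d to question q (x_q^d in the text).
x: Dict[str, Dict[str, str]] = {
    "q1": {"doc_a": "common cold", "doc_b": "flu", "doc_c": "common cold"},
    "q2": {"doc_a": "cirrhosis", "doc_c": "hepatitis"},
}

# w[d] = expertise score of doctor d (w_d in the text), initialized uniformly
# because nothing is known about the doctors before the iterations start.
doctors = {d for answers in x.values() for d in answers}
w: Dict[str, float] = {d: 1.0 for d in doctors}
```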


3.2 Basic Truth Discovery Method

Since both answer trustworthiness and doctor expertise are unknown, this problem can be formulated as a truth discovery problem [10]–[12], which jointly estimates the trustworthiness of answers and doctor expertise from the Q&A data. Various truth discovery methods have been developed, and their success has been demonstrated in many real-world applications, such as healthcare [13], crowd sensing [14]–[16], and knowledge base construction [17], [18]. In this section, we introduce a basic truth discovery method for the medical knowledge extraction task.

Truth discovery methods take input tuples of <object, value, source>. In our specific task, the object is a question, the value is an answer, and the source is a doctor. In the truth discovery problem setting, an object may receive conflicting claimed values from different sources, and a source may provide values for different objects. The goal of truth discovery is to resolve the conflicts and find the truth (i.e., the most trustworthy answer) for each object by estimating the source reliability (i.e., doctor expertise). A straightforward solution is voting, which takes the majority answer as the reliable one. However, this solution has an underlying assumption that all sources are equally reliable.

To capture the variation in source reliability, truth discovery methods are developed to jointly conduct truth computation and source reliability estimation. These methods hold a common principle: if an answer is provided by doctors with high expertise, it is regarded as a trustworthy answer; meanwhile, if a doctor often provides trustworthy answers, she/he is assigned a high expertise score. Based on this principle, we can iteratively update the trustworthiness of answers and doctor expertise as follows:

• Estimate the trustworthiness degree of a possible answer x_q for the q-th question:

    T(x_q) = \sum_{d \in D} w_d \cdot 1(x_q, x_q^d),    (1)

where 1(·,·) is an indicator function: 1(x, y) = 1 if x = y; otherwise, 1(x, y) = 0. Eq. (1) is formulated based on the truth discovery principle: the trustworthiness degree of an answer is determined by the expertise scores of the doctors who provide that answer, and the trustworthy answers are the ones that are supported by doctors with high expertise. If the w_d's are higher, then correspondingly, the trustworthiness degree T(x_q) is higher. The trustworthiness degrees are normalized such that the trustworthiness degrees of all possible answers to a particular question sum to 1. Thus T(x_q) can be interpreted as the probability that x_q is trustworthy.

• Update the doctor expertise score:

    w_d = -\log\left(1 - \frac{\sum_{x \in V_d} T(x)}{|V_d|}\right),    (2)

where V_d is the set of answers provided by the d-th doctor. Eq. (2) is also formulated based on the truth discovery principle: a higher expertise score is assigned if the doctor provides more trustworthy answers. In this equation, the term \sum_{x \in V_d} T(x) / |V_d| is the average trustworthiness degree of the d-th doctor's answers, so we can treat 1 - \sum_{x \in V_d} T(x) / |V_d| as the probability of the d-th doctor providing wrong answers. The logarithm function is used to re-scale the expertise scores so that the differences among the scores are enlarged. From Eq. (2), it can be seen that a doctor who is more likely to provide wrong answers gets a lower expertise score.

Eq. (1) estimates the trustworthiness degree of each possible answer by conducting weighted voting, where the weights are doctor expertise scores, and Eq. (2) updates the expertise score of each doctor based on the answer trustworthiness degrees. These equations follow the general principle of truth discovery. To be more specific, the truth discovery method starts with a uniform initialization of the doctor expertise, and then iteratively estimates answer trustworthiness degrees and updates doctor expertise. The iterative procedure ends when a stopping criterion is met, e.g., the maximum number of iterations is reached.

The main advantage of truth discovery is that it can discover trustworthy information supported by only a few information sources. This advantage comes from the fact that truth discovery estimates a reliability score for each source and conducts weighted aggregation. In our scenario, answers supported by a few high-expertise doctors can be selected as trustworthy answers, which greatly helps us to extract trustworthy medical knowledge from noisy crowdsourced question answering websites. A minimal code sketch of this basic iteration follows.
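The following is a minimal sketch of the basic iteration of Eqs. (1) and (2), not the authors' released code; the helper names, the fixed iteration count, and the numerical floor inside the logarithm are our own choices.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

def basic_truth_discovery(x: Dict[str, Dict[str, str]],
                          iterations: int = 20) -> Tuple[dict, dict]:
    """x[q][d] is doctor d's answer to question q; returns (T, w)."""
    doctors = {d for answers in x.values() for d in answers}
    w = {d: 1.0 for d in doctors}                 # uniform initialization
    T: Dict[str, Dict[str, float]] = {}
    for _ in range(iterations):
        # Eq. (1): weighted voting, then normalize per question so that the
        # trustworthiness degrees of a question's answers sum to 1.
        T = {}
        for q, answers in x.items():
            scores = defaultdict(float)
            for d, ans in answers.items():
                scores[ans] += w[d]
            total = sum(scores.values())
            T[q] = {ans: s / total for ans, s in scores.items()}
        # Eq. (2): w_d = -log(1 - average trustworthiness of d's answers),
        # with the log argument floored at a small positive value so it
        # stays finite when the average approaches 1.
        for d in doctors:
            vals = [T[q][ans] for q, answers in x.items()
                    for doc, ans in answers.items() if doc == d]
            avg = sum(vals) / len(vals)
            w[d] = -math.log(max(1.0 - avg, 1e-12))
    return T, w
```

On the toy input from Section 3.1, the two doctors who agree on "common cold" end up with higher expertise scores than the dissenting doctor after a few iterations.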


3.3 Challenges and Solutions

Although the medical knowledge extraction task can be formulated as a truth discovery problem, the basic truth discovery method overlooks some unique challenges of the task. Therefore, the basic truth discovery method has to be adapted to the medical knowledge extraction task. In this section, we discuss these challenges and present the corresponding solutions.

3.3.1 Noisy Input

The first challenge we are facing is how to clean the noisy input. Existing truth discovery methods can only work on structured data, but the knowledge extraction task deals with unstructured and noisy text data. In order to achieve better performance, we need to derive better representations of the questions and answers. In this section, we present the solution to convert text into structured data, using the question text as an example; the answer text can be handled in a similar way.

Specifically, we propose an entity-based representation of text. That is, we extract a set of entities from each question q ∈ Q to represent the original question text, where an entity can be a particular symptom, disease, drug, etc. Correspondingly, the entities we are looking for in the answer text are diseases, drugs, drug side-effects, etc. We use an available medical entity dictionary for entity extraction. If a word from a question text exists in the dictionary, then we put that word into the entity set for this question. As the age of the patient is important for diagnosis, we also include the age information in the entity set.

For the answer text, a similar procedure is performed. Note that the truth discovery framework requires that each answer contain only one diagnosis. Therefore, if more than one answer entity is extracted from a doctor's answer text, we assume that this doctor provides multiple answers, where each entity corresponds to a different answer.

Using the proposed entity-based representation of text enables us to convert text input into structured representations. During this process, text with similar meanings is mapped into similar or even the same representations. For example, "I have a headache, fever, and coughing" is converted into a set with three entities <headache, fever, cough>, and so is "I cough, and have a fever and a headache". Therefore, this entity-based representation is able to merge questions with similar meanings.

3.3.2 Multiple Linked Truths

The second challenge of the medical knowledge extraction task is that there may be multiple trustworthy answers to a question, and these answers can be correlated with each other. For example, a patient describes his symptoms as "headache, fever, and cough", and asks what disease he might have. The doctors may suggest that the potential disease is common cold or flu based on his description. According to common sense, we know that these two answers are both possible, and these two diseases share many common symptoms, so they are not independent answers. This challenge violates an important assumption of many truth discovery methods [9], [11], [12], namely the single truth assumption: there exists one and only one trustworthy answer for each question. There is some recent work that relaxes the single truth assumption and considers the multiple truths scenario, such as [19], [20]. However, these works require another assumption to hold, namely that the truths for the same question are independent of each other. For example, if A and B are listed as the authors of a book by a website, then the probabilities of A being correct and B being correct are independent. Obviously, the knowledge extraction task violates this assumption, and thus these works [19], [20] are not applicable.

We can easily adapt the truth discovery method in Section 3.2 to handle the multiple truths scenario. Since we calculate the trustworthiness degrees of all the answers, we can regard the answers whose scores are higher than a threshold as trustworthy answers. Some further effort is needed to capture the relationship between answers, as discussed in the following.

To capture the correlation among multiple possible answers, we propose to represent the answer entity using the neural word embedding method [21]–[23], where each word is represented by a real-valued word vector. The key idea behind neural word embedding is that the meaning of a word can be characterized by its context words. Therefore, the vector representation of words can be obtained by training on a large corpus without syntax analysis or any manual labeling. For example, according to their similar context words, the neural word embedding methods can automatically learn similar real-valued word vectors for "common cold" and "flu", which capture the relation between these two diseases. The benefit of using neural word embedding is that we can easily calculate the similarity of the words (answer entities) as a real value. If two words have similar meanings, then the similarity of their vectors will be high, and vice versa.

The correlation between words can then be used to improve the calculation of the answer trustworthiness. Let's revisit the earlier example. If "common cold" is a trustworthy answer for that question, then "flu" should also be considered as a trustworthy answer, since "common cold" and "flu" are highly correlated. Therefore, we modify Eq. (1) based on the idea of "implication" (i.e., a similarity function) between answers [7], [11]. We incorporate the cosine similarity between answers into Eq. (1). Then the trustworthiness degree of a possible answer is calculated as:

    T(x_q) = \sum_{d \in D} w_d \cdot 1(x_q, x_q^d) + \sum_{x'_q \neq x_q} \mathrm{Sim}(v_{x_q}, v_{x'_q}) \cdot T(x'_q),    (3)

where Sim(v, v') is the cosine similarity between two vectors, and x'_q is another possible answer to the q-th question. The similarity between two possible answers plays the role of a coefficient, and automatically controls how much influence should be considered. Thus the trustworthiness of an answer is enhanced if it is supported by other similar answers; on the other hand, if an answer is not supported or is even opposed by other answers, then the cosine similarity gives a negative value, so the answer's trustworthiness is discounted. By incorporating answer similarity, the correlation among different answers can be modeled; a small sketch of this update follows.
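To make Eq. (3) concrete, here is an illustrative sketch of one similarity-augmented update for a single question, assuming the entity vectors are available in a plain dict (e.g., from a pre-trained embedding). The function and parameter names are ours, and the per-question normalization follows the basic method.

```python
import numpy as np
from typing import Dict

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_augmented_update(votes: Dict[str, float],
                                T_prev: Dict[str, float],
                                vec: Dict[str, np.ndarray]) -> Dict[str, float]:
    """One Eq. (3) update for a single question.

    votes[a]  = sum of expertise w_d over doctors who answered a (the Eq. (1) term)
    T_prev[a] = trustworthiness of answer a from the previous iteration
    vec[a]    = word vector of answer entity a
    """
    T_new = {}
    for a in votes:
        # Support (or opposition, via negative cosine) from the other answers.
        support = sum(cosine(vec[a], vec[b]) * T_prev[b]
                      for b in votes if b != a)
        T_new[a] = votes[a] + support
    # Normalize per question, as in the basic method; this assumes the
    # similarity terms do not drive the total non-positive.
    total = sum(T_new.values())
    return {a: t / total for a, t in T_new.items()}
```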


3.3.3 Long-Tail Phenomenon

In many crowdsourcing applications, a long-tail phenomenon is observed [24]. This phenomenon holds for the medical crowdsourced Q&A data studied in this paper as well, as demonstrated in Figure 2: most doctors provide answers to only a few questions, and only a small set of doctors provide answers to many questions; likewise, most questions receive only a few answers, and only a small set of questions receive a large number of answers. Both types of long-tail distribution exist, but the existing truth discovery work that handles the long-tail phenomenon [24] only considers the long tail from the perspective of sources, i.e., most sources provide only a few answers. However, Figure 2 clearly demonstrates that the long-tail phenomenon is also seen on questions: most questions get only one or two answers, while only a few questions get plenty of answers. The long-tail phenomenon of both types poses challenges to the basic truth discovery method. Without sufficient answers from each doctor, we cannot accurately estimate the doctors' expertise. Without sufficient answers to each question, the trustworthy answer cannot be identified, since it is probable that none of the answers is trustworthy.

[Figure 2 appears here: (a) the distribution of the number of answers per doctor and (b) the distribution of the number of answers per question; both distributions are heavily skewed toward small counts.]

Fig. 2: Long-Tail Phenomenon

To tackle the challenge brought by the long-tail phenomenon on sources, we modify Eq. (2) based on the solution proposed in [24]: when calculating the source weights, a chi-square parameter is used to capture the effect of source size in the calculation of source reliability, so that the weights of sources that provide only a few answers are discounted. Inspired by this idea, we add a pseudo count C_pseudo for each source when estimating its expertise score:

    w_d = -\log\left(1 - \frac{\sum_{x \in V_d} T(x)}{|V_d| + C_{\mathrm{pseudo}}}\right).    (4)

In this equation, if a doctor provides only a few answers, then C_pseudo dominates the term |V_d| + C_pseudo, so the doctor's expertise score will be low. On the other hand, if a doctor provides many answers, then |V_d| dominates the term |V_d| + C_pseudo, and his expertise score will be close to the original estimation.

The challenge brought by the long-tail phenomenon on questions can be naturally solved by the entity-based representation introduced in Section 3.3.1. As discussed earlier, the entity-based representation can merge similar questions. Consequently, the number of answers to those questions increases after the merging, which mitigates the long-tail issue.

3.3.4 Summary of MKE System

We adopt the iterative procedure of updating answer trustworthiness and doctor expertise. Based on this framework, we propose to further conquer the unique challenges in the medical knowledge extraction task. Specifically, the formulas to estimate the answers' trustworthiness degrees and the doctor expertise are modified accordingly. Compared with the basic truth discovery solution, we incorporate the similarity function to capture the relationship among possible answers when calculating the trustworthiness degrees of answers, as shown in Eq. (3); and we add a pseudo count to each source in order to mitigate the effect of the long-tail phenomenon in the doctor expertise calculation, as shown in Eq. (4).

The pseudo code of the proposed MKE system is summarized in Algorithm 1. The MKE system starts with the preprocessing of the text. Then, based on an available entity dictionary, the MKE system conducts entity extraction. Note that at this step, different types of entities are extracted for different tasks. For example, if the question contains statements of symptoms, then the patient is asking for the possible disease, so the disease type will be extracted from the answer text. The MKE system also extracts the patient's age from the question text and incorporates it into the representation. Next, the MKE system constructs the tuples <{age, entities from question text}, entity from answer text, doctor ID> as the input to the truth discovery approach. Finally, the MKE system applies the proposed truth discovery method to build the knowledge triples <question, diagnosis, trustworthiness degree> and compute the doctor expertise {w_d}.

Algorithm 1: MKE System
Input: set of medical questions Q and their corresponding answers {x_q^d}_{q∈Q, d∈D}, an external entity dictionary with entity types, and real-valued vector representations of entities.
Output: discovered knowledge triples <question, diagnosis, trustworthiness degree>, and doctor expertise {w_d}.
1: pre-processing: segment text into words;
2: entity extraction: extract one type of entity (for example, symptom) from the question text and another type of entity (for example, disease) from the answer text;
3: input tuple construction: form tuples <{age, entities from question text}, entity from answer text, doctor ID> as input to truth discovery;
4: initialize doctors' expertise uniformly;
5: repeat
6:   calculate the trustworthiness degree of each answer according to Eq. (3), where the similarity function is computed on the real-valued vector representations of the entities;
7:   estimate the doctor expertise w_d according to Eq. (4);
8: until the stopping criterion is satisfied.
9: return the discovered knowledge triples <question, diagnosis, trustworthiness degree> and the estimated doctor expertise {w_d}.
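For illustration, the sketch below strings the steps of Algorithm 1 together end to end. It is our reading of the pseudo code rather than the authors' implementation: the segmentation and entity-extraction steps (1-3) are assumed to have produced the claims already, and the value of C_pseudo is a placeholder, as the paper does not fix it here.

```python
import math
from collections import defaultdict
import numpy as np

C_PSEUDO = 5.0  # placeholder value; the paper does not report the one used

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mke(claims, vec, iterations=20):
    """claims: (question_key, answer_entity, doctor_id) tuples built in
    steps 1-3 of Algorithm 1; vec maps an answer entity to its word vector.
    Returns the knowledge triples and the doctor expertise scores."""
    by_q = defaultdict(list)                 # question -> [(doctor, answer)]
    for q, a, d in claims:
        by_q[q].append((d, a))
    doctors = {d for _, _, d in claims}
    w = {d: 1.0 for d in doctors}            # step 4: uniform initialization
    T = {}
    for q, das in by_q.items():              # uniform starting trustworthiness
        answers = {a for _, a in das}
        T[q] = {a: 1.0 / len(answers) for a in answers}

    for _ in range(iterations):              # steps 5-8
        for q, das in by_q.items():          # step 6: Eq. (3), then normalize
            votes = defaultdict(float)
            for d, a in das:
                votes[a] += w[d]
            new = {a: votes[a] + sum(cosine(vec[a], vec[b]) * T[q][b]
                                     for b in votes if b != a)
                   for a in votes}
            total = sum(new.values())
            T[q] = {a: t / total for a, t in new.items()}
        for d in doctors:                    # step 7: Eq. (4) with pseudo count
            mine = [T[q][a] for q, das in by_q.items()
                    for dd, a in das if dd == d]
            avg = sum(mine) / (len(mine) + C_PSEUDO)
            w[d] = -math.log(max(1.0 - avg, 1e-12))

    # step 9: flatten the per-question trustworthiness into knowledge triples
    triples = [(q, a, t) for q, ts in T.items() for a, t in ts.items()]
    return triples, w
```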


4 EXPERIMENT

In this section, we evaluate the proposed method from several perspectives. (1) First, in Section 4.2, we give a quantitative evaluation of the discovered knowledge by comparing it with a small set of annotations from a medical expert. (2) Then, in Section 4.3, the estimated doctor expertise is validated using external information. (3) In Section 4.4, we also give a qualitative evaluation by presenting case studies on tasks with different question intentions, namely the symptom-disease, disease-drug, and disease-examination tasks. (4) We further demonstrate the effectiveness of the similarity function and the entity-based representation in Sections 4.5 and 4.6, respectively.

4.1 Data Collection and Pre-processing

All the datasets adopted in this paper are crawled from a Chinese medical Q&A website, xywy.com. On this website, patients can ask health-related questions and get answers from registered doctors. For all the questions asked on the website, xywy.com also provides topic labels, such as Pulmonology and Hypertension. Since doctors should have different expertise on different topics, we use this provided fine-grained topic information to partition the crawled dataset. Eventually, we obtained datasets for seven topics; the number of questions and the number of involved doctors are summarized in Table 1.

Topic | Number of Questions | Number of Involved Doctors
1 | 30,104 | 2,233
2 | 9,540 | 1,617
3 | 10,939 | 1,689
4 | 5,409 | 1,419
5 | 10,470 | 1,550
6 | 4,049 | 1,051
7 | 4,957 | 1,347

TABLE 1: Statistics of the Datasets

The crawled questions and answers are in Chinese. Unlike English, Chinese text is not divided by word delimiters, so word segmentation is required to cut the Chinese strings into their component words. This pre-processing step is performed by applying an existing word segmentation package [25].

4.2 Quantitative Evaluation on the Extracted Medical Knowledge

Feedback information from patients would be valuable for the purpose of evaluation. However, such information is so limited in the collected dataset that we cannot rely on it. Thus, in this section, we quantitatively evaluate the quality of the extracted medical knowledge by comparing it with human annotations from a medical expert.

For each question, the proposed MKE system outputs possible diagnoses with their corresponding trustworthiness degrees. It is much harder for humans to assign precise probabilities to the possible diagnoses, but it is relatively easy for them to rank the possible diagnoses by likelihood. Therefore, we rank the results of the MKE system by their estimated trustworthiness degrees, and adopt a ranking comparison between MKE's output and the human annotations to quantitatively evaluate the quality of the extracted medical knowledge. We hire a medical expert to annotate the disease-drug task, where patients describe their diseases and doctors suggest drugs to take. Due to the cost of annotation, we randomly select 25 cases for labeling. Each case is associated with 5 to 15 possible diagnoses.

Two widely used rank correlation coefficients are adopted as the performance metrics: the Kendall rank correlation coefficient [26] and Spearman's rank correlation coefficient [27]. Both coefficients compare two ranked lists and output a value in the range of −1 to 1. A positive output indicates that the two ranked lists are positively correlated, and a negative output indicates negative correlation. The absolute value of a coefficient indicates the strength of the correlation: the larger the absolute value, the stronger the correlation.

The results for the 25 randomly selected cases are listed in Table 2. From the table, we can see that for all the cases, both the Kendall and Spearman's rank correlation coefficients are positive, which indicates positive correlation between the knowledge discovered by the proposed method and the human expert's annotations. The mean and median of both coefficients are considerably high, which implies that, on average, the extracted medical knowledge from the MKE system is meaningful and of high quality. This experiment quantitatively demonstrates the effectiveness of the proposed method.

Case | Kendall Coefficient | Spearman Coefficient
1 | 0.5238 | 0.6071
2 | 0.3333 | 0.4424
3 | 0.2143 | 0.2619
4 | 0.8 | 0.9
5 | 0.4 | 0.6
6 | 0.2444 | 0.3091
7 | 0.4222 | 0.5394
8 | 0.2 | 0.3
9 | 0.2 | 0.3
10 | 0.1429 | 0.1905
11 | 0.0667 | 0.0061
12 | 0.3333 | 0.4667
13 | 0.1556 | 0.1273
14 | 0.3333 | 0.4424
15 | 0.3333 | 0.4303
16 | 0.3333 | 0.3697
17 | 0.3056 | 0.3875
18 | 0.8222 | 0.9394
19 | 0.2889 | 0.5515
20 | 0.2778 | 0.3167
21 | 0.4222 | 0.5152
22 | 0.3889 | 0.5333
23 | 0.1111 | 0.0667
24 | 0.5556 | 0.6833
25 | 0.4444 | 0.5333
Mean | 0.3461 | 0.4328
Median | 0.3333 | 0.4424
Standard Deviation | 0.1855 | 0.2252

TABLE 2: Evaluation on the Quality of Extracted Medical Knowledge
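As a small usage sketch of the two metrics (assuming SciPy is available; the ranked lists below are invented and not taken from Table 2):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical example: MKE's ranks vs. the expert's ranks for the same
# five candidate drugs of one case (1 = most likely).
mke_rank    = [1, 2, 3, 4, 5]
expert_rank = [2, 1, 3, 5, 4]

tau, _ = kendalltau(mke_rank, expert_rank)   # Kendall rank correlation
rho, _ = spearmanr(mke_rank, expert_rank)    # Spearman's rank correlation
print(f"Kendall tau = {tau:.4f}, Spearman rho = {rho:.4f}")
```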


4.3 Quantitative Evaluation on the Estimated Doctor Expertise

The proposed MKE system automatically learns both the medical knowledge triples and the expertise of doctors. In the following experiments, we conduct a quantitative evaluation of the estimated doctor expertise.

The xywy.com website maintains a profile for each registered doctor. Based on the profile as well as the doctor's historical activities, such as the number of thanks she/he receives, the website assigns a level score to each doctor. This external information cannot provide a precise measure of doctor expertise, as the score is topic independent, but it can still give us some guidelines regarding the expertise of doctors. Generally speaking, if a doctor is assigned a high level score by xywy.com, it is likely that the doctor is better than one who has a low level score.

We compare the estimated doctor expertise distributions for different doctor levels and plot the results for two randomly selected topics in Figure 3. From Figure 3, we observe that for both topics, the mean of the learned doctor expertise is positively correlated with the level scores. For the doctors with higher level scores, the mean of the estimated doctor expertise is also higher, which is expected according to our intuition. Although it is difficult to prove that the estimated doctor expertise is accurate due to the lack of ground truth information, this observation confirms that the MKE system can automatically estimate meaningful doctor expertise even without any supervision.

[Figure 3 appears here: for two randomly selected topics, (a) topic 3 and (b) topic 5, the distribution of the estimated doctor expertise is plotted against the xywy.com doctor level scores 1 to 4.]

Fig. 3: Evaluation on Doctor Expertise Estimation

In fact, the doctor expertise estimated by the proposed MKE system may be more useful than the level score assigned by xywy.com, because the former infers fine-grained topic expertise for each doctor. From common sense, we know that doctors have their own specialties; therefore, a doctor's expertise scores should differ across topics. Here, we confirm this by showing the estimated doctor expertise on different topics. Since there are thousands of doctors in the crawled dataset, it is impossible to plot the estimated expertise for all of them. We randomly select six doctors and plot their estimated expertise over the seven topics in Figure 4.

[Figure 4 appears here: the estimated expertise of six randomly selected doctors (Doctors 1 to 3 and Doctors 4 to 6, in two panels) across the seven topics.]

Fig. 4: Various Doctor Expertise over Topics

Figure 4 clearly demonstrates that the estimated doctor expertise scores are quite different over topics. For example, doctor 1 might be an expert on topic 5, but not on other topics. Doctor 2 has a high expertise score on topic 2, while his expertise scores on other topics are quite low. These observations confirm our intuition about the necessity of fine-grained doctor expertise estimation.

4.4 Case Studies

The above experiments quantitatively validate the performance of the proposed MKE system. In this section, we give some case studies on several tasks with different question intentions. The question intentions include describing symptoms and asking for possible diseases (the symptom-disease task), describing diseases and asking for effective drugs (the disease-drug task), and describing diseases and asking for proper medical examinations (the disease-examination task).

Table 3 shows two case studies for the symptom-disease task. Since the original datasets are in Chinese, we also provide the English translations in Table 3. We choose these cases because their symptoms and diseases are common, so general readers without a medical background can still judge the quality of the results. The first patient describes himself as 40 years old, with symptoms of a headache and a stuffed nose. The proposed method suggests that the possible diseases are bronchitis with probability 0.2254, common cold with probability 0.2908, and pharyngitis with probability 0.2349. The second patient describes himself as 60 years old, with symptoms of distending pain around the abdomen, inappetency, and ascites. The proposed method suggests that the possible diseases are cirrhosis with probability 0.3453, hepatitis with probability 0.3203, and liver cancer with probability 0.3343. From these two cases, we can see that the diagnoses and their trustworthiness degrees provided by the proposed MKE system are reasonable.

Tables 4 and 5 show more case studies for the disease-drug task and the disease-examination task, respectively, with English translations also provided. All the case studies demonstrate that the MKE system can extract valuable medical knowledge from medical crowdsourced Q&A websites. Therefore, it is helpful for both patients and doctors.


Symptom | Disease | Trustworthiness Degree
40岁, 头痛, 鼻塞 (40 years old, headache, stuffed nose) | 支气管炎 (bronchitis) | 0.2254
| 感冒 (common cold) | 0.2908
| 咽炎 (pharyngitis) | 0.2349
| ··· | ···
60岁, 胀痛, 食欲不振, 腹水 (60 years old, distending pain around abdomen, inappetency, ascites) | 肝硬化 (cirrhosis) | 0.3453
| 肝炎 (hepatitis) | 0.3203
| 肝癌 (liver cancer) | 0.3343
| ··· | ···

TABLE 3: Case Studies: Symptom-Disease Task

Disease | Drug | Trustworthiness Degree
20岁, 胃炎 (20 years old, gastritis) | 奥美拉唑 (omeprazole) | 0.2060
| 吗丁啉 (domperidone) | 0.1796
| 西咪替丁 (cimetidine) | 0.0977
| ··· | ···
20岁, 鼻炎 (20 years old, rhinitis) | 通窍鼻炎片 (TongQiao rhinitis tablet) | 0.2518
| 氯雷他定 (loratadine) | 0.1355
| 复方薄荷脑滴鼻液 (compound menthol nasal drops) | 0.1244
| ··· | ···

TABLE 4: Case Studies: Disease-Drug Task

Disease | Medical Examination | Trustworthiness Degree
40岁, 过敏性鼻炎 (40 years old, allergic rhinitis) | 过敏原检测 (allergen test) | 0.3614
| 支气管激发试验 (bronchial provocation test) | 0.1558
| 鼻腔内镜检查 (nasal endoscopy) | 0.1439
| ··· | ···
60岁, 肺心病 (60 years old, pulmonary heart disease) | 胸部x线 (chest X-ray) | 0.2559
| 肺功能 (lung function test) | 0.1177
| 心电图检查 (ECG examination) | 0.1174
| ··· | ···

TABLE 5: Case Studies: Disease-Examination Task

4.5 Effectiveness of Similarity Function

In Section 3.3.2, we discussed the challenge of multiple linked truths: there may be several trustworthy answers to the same question, and the answers can be correlated with each other. To address this challenge, we add a similarity function to the calculation of the answer trustworthiness degree to quantify the similarity between answers. In this section, we experimentally illustrate the importance and effectiveness of the proposed similarity function.

To learn the real-valued vector representations of entities, we adopt a large corpus, which contains 64 million question-answer pairs. The word2vec package [28] is a popular tool for training vector representations of words, and we use its Skip-gram architecture in our experiment. The dimensionality of the learned vectors is set to 100, the context window size is set to 8, and we specify the minimum occurrence count to be 5. For more details, please refer to [22].

To illustrate the effect of the similarity function, we show two examples in Table 6. In the first example, if we do not consider the correlation of the answer entities and treat them independently, the second and third diagnosed diseases have very different trustworthiness degrees: the second is significantly higher than the third. However, the second disease is allergic rhinitis and the third is rhinitis, and these two diseases are highly correlated: if one is trustworthy, so should the other be. Successfully capturing this correlation, the similarity function helps to correct the trustworthiness degrees of the diagnoses. Similarly, in the second example, the trustworthiness degree of the third diagnosis increases after using the similarity function, because the third disease, diarrhea, is correlated with the second disease, enteritis. Both examples indicate that the proposed similarity function significantly improves the answer trustworthiness estimation by successfully modeling the links among correlated answer entities.
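For reference, an equivalent configuration could be expressed with the gensim library (version 4 or later) as follows. This is our sketch, not the training script used in the paper, and the tiny corpus below is a placeholder for the 64 million segmented question-answer pairs.

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of token lists from the segmented Q&A texts.
# Repeated only so this toy snippet satisfies min_count.
corpus_sentences = [["头痛", "发烧", "咳嗽", "感冒"],
                    ["打喷嚏", "流鼻涕", "鼻炎", "感冒"]] * 10

model = Word2Vec(
    sentences=corpus_sentences,
    sg=1,             # Skip-gram architecture
    vector_size=100,  # dimensionality of the learned vectors
    window=8,         # context window size
    min_count=5,      # minimum occurrence count
)
# Cosine similarity between two answer entities, as needed by Eq. (3):
sim = model.wv.similarity("感冒", "鼻炎")
```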


Symptom | Diagnosis | Trustworthiness without Similarity Function | Trustworthiness with Similarity Function
40岁, 打喷嚏, 流鼻涕 (40 years old, sneezing, runny nose) | 感冒 (common cold) | 0.3253 | 0.3022
| 过敏性鼻炎 (allergic rhinitis) | 0.5556 | 0.3565
| 鼻炎 (rhinitis) | 0.1190 | 0.3412
| ··· | ··· | ···
10岁, 胸闷, 气短, 四肢无力 (10 years old, chest tightness, shortness of breath, weak limbs) | 贫血 (anemia) | 0.4946 | 0.3271
| 肠炎 (enteritis) | 0.4946 | 0.3630
| 腹泻 (diarrhea) | 0.0071 | 0.3097
| ··· | ··· | ···

TABLE 6: Case Study: Effect of Similarity Function

4.6 Effect of Question Text Representation

As mentioned in Section 3.3.3, the long-tail phenomenon on questions is severe in the medical Q&A dataset: most questions get only one or two answers, while only a few questions get more. However, as the entity-based representation of text can merge similar questions, it can enlarge the set of answers for many questions and reduce the side-effect of the long-tail phenomenon on questions. We validate this claim in Table 7, which compares the average number of answers per question before and after using the entity-based representation for questions. It clearly demonstrates that the size of the answer set is boosted.

Topic | Average Number of Answers per Question (before) | Average Number of Answers per Question (after)
1 | 2.25 | 4.05
2 | 1.87 | 3.07
3 | 2.49 | 4.66
4 | 2.08 | 3.23
5 | 2.60 | 4.63
6 | 1.54 | 5.69
7 | 1.85 | 3.68

TABLE 7: Effectiveness of Entity-Based Representation on Long-Tail Phenomenon

Note that in Algorithm 1, we include the patient's age information as part of the question text representation, because patients' age plays an important role in doctors' diagnosis. In Table 8, we use an example from the extracted medical knowledge to illustrate how the age information in the question text representation can affect the diagnoses.

Table 8 shows the top recommended drugs to cure the common cold (感冒) for patients of different ages. For the infant and the young kid (ages 1 and 4), the top recommended drug is pediatric paracetamol; for the teenager and the adults (ages 10, 20, and 40), one of the top recommended drugs is amoxicillin; for the elderly (age 60), the recommended drugs are antibiotics and antiviral drugs. Although the patients have the same disease, the recommended drugs are quite different for different age groups: the recommendation for the young kid is weaker and safer than the recommendation for the adult. This demonstrates that it is necessary to include the patient's age information in the question text representation.

5 REAL-WORLD APPLICATIONS

As mentioned in Section 2, various real-world applications can be built based on the extracted medical knowledge triples and the estimated doctor expertise. Here, we describe one application called Ask A Doctor. This application can analyze the descriptions from patients and automatically infer the possible diseases that the patients might have. Ask A Doctor has been implemented, and its beta version is available in the App Store (iOS).

For the purpose of illustration, we give a series of snapshots of the App in Figure 5. The App asks the patient to describe her/his symptoms, and the patient can reply in speech. The App converts the voice input into question text and displays it on the screen, as shown in the first snapshot in Figure 5: "I am 29 years old. I have been feeling the pain in my throat, and I got runny nose and kept sneezing. Any idea or suggestion?" Once the patient confirms her/his question, the App automatically analyzes the patient's symptoms, retrieves the extracted medical knowledge triples of the MKE system, and provides a list of possible diseases with corresponding confidence probabilities. For example, the patient may have a cold with a probability of 0.67, or rhinitis with a probability of 0.07. The App further provides useful information about these diseases by linking with Baidu Baike.

Though we only describe one application, the outputs of the proposed MKE system, both the discovered medical knowledge and the estimated doctor expertise, have great potential to benefit various real-world medical applications. For example, the estimated doctor expertise can help the task of question routing, which studies how to send questions to appropriate doctors in order to receive high-quality answers in a short time. We plan to build more applications based on the proposed MKE system in the future. A minimal sketch of the retrieval step behind such an application is given below.
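As an illustration of the retrieval step only (the App's internals are not described beyond this flow), matching a patient's extracted entities against the stored knowledge triples could look like the following hypothetical sketch; the overlap-based scoring rule is invented.

```python
from typing import FrozenSet, List, Tuple

def suggest_diseases(patient_entities: FrozenSet[str],
                     triples: List[Tuple[FrozenSet[str], str, float]],
                     top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank candidate diagnoses by trustworthiness among the stored triples
    whose question representation overlaps the patient's entities."""
    scored = []
    for question, diagnosis, trust in triples:
        overlap = len(question & patient_entities) / max(len(question), 1)
        if overlap > 0:
            scored.append((diagnosis, trust * overlap))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical data in the spirit of Figure 5.
triples = [(frozenset({"29", "sore throat", "runny nose", "sneezing"}), "cold", 0.67),
           (frozenset({"29", "sore throat", "runny nose", "sneezing"}), "rhinitis", 0.07)]
print(suggest_diseases(frozenset({"sore throat", "runny nose", "sneezing"}), triples))
```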


Age | Top Recommended Drugs
1 | 小儿氨酚黄那敏颗粒 (pediatric paracetamol)
4 | 小儿氨酚黄那敏颗粒 (pediatric paracetamol)
10 | 双黄连口服液, 阿莫西林 (Shuang-Huang-Lian oral solution, amoxicillin)
20 | 阿莫西林, 阿奇霉素 (amoxicillin, azithromycin)
40 | 阿莫西林, 抗生素 (amoxicillin, antibiotics)
60 | 抗生素, 抗病毒药物 (antibiotics, antiviral drugs)

TABLE 8: A Case Study on the Effect of Age Information

[Figure 5 appears here: four snapshots of the App, panels (a) to (d).]

Fig. 5: Snapshots of the App: Ask A Doctor. (a) The patient enters his question "I have been feeling the pain in my throat, and I got runny nose and kept sneezing. Any idea or suggestion?" (b) The App is analyzing the patient's symptoms. (c) A list of possible diseases and their corresponding probabilities is provided: cold with a probability of 0.67, rhinitis with a probability of 0.07, etc. (d) The App links with Baidu Baike to provide more information about the cold.

The above work simply concatenates different features. However, textual and non-textual features usually have different representations, and the correlations between them are non-linear. Thus, in [32], the authors propose a multi-modal Deep Belief Network to learn an informative unified representation of both textual and non-textual features. The drawback of this group of methods, however, is that training a good classifier needs a large amount of labeled data, which is expensive or even infeasible to obtain for large-scale knowledge extraction on medical crowdsourced Q&A websites.

To tackle the challenge of lacking labeled data, the second line of work tries to infer the quality of question-answer pairs by analyzing the expertise of the answerers. In [33], the authors make the assumption that, for a particular question-answer pair, if the user has contributed the best answers for similar questions before, the answer from this user should be of high quality. However, this assumption is restrictive, as the "best answer" information might not be available on many websites. In [34], the authors estimate fine-grained expertise for answerers using various types of information, such as the best answer voting information, the link structure among users, the context of question-answer pairs, and the topic tagging information about questions. In this line of related work, methods require various external information, such as the best answer voting information, to estimate the expertise of answerers; the quality of question-answer pairs is then inferred based on the derived answerer expertise. However, on medical crowdsourced Q&A websites, patients seldom give such feedback, and the interaction between patients and doctors (answerers) is also very rare, which makes it difficult to apply these methods. Therefore, in this paper, we propose a method to evaluate the quality of question-answer pairs without any supervision.
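To illustrate the kind of signal this second line of work relies on, the following is a minimal sketch of the best-answer heuristic assumed in [33]; the history records and the similarity test are hypothetical placeholders.

# A sketch of the assumption in [33]: an answer is likely of high
# quality if its author contributed best answers to similar questions
# before. History data and the similarity test are hypothetical.
def best_answer_rate(user, question, history, similar):
    # history: list of (user, question, was_best_answer) records.
    related = [was_best for u, q, was_best in history
               if u == user and similar(q, question)]
    if not related:
        return 0.0  # no evidence about this user on similar questions
    return sum(related) / len(related)

history = [
    ("doc_a", "sore throat and fever", True),
    ("doc_a", "sore throat for a week", True),
    ("doc_a", "knee pain when running", False),
]
similar = lambda a, b: "throat" in a and "throat" in b
print(best_answer_rate("doc_a", "pain in my throat", history, similar))
# 1.0 -> doc_a's answer to this throat question is predicted high quality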
Besides the above related work in the crowdsourced question answering research area, there is some other work that focuses on how to define the quality of question-answer pairs. Some work proposes that the quality should be examined from the perspective of the questioners, i.e., measured as the satisfaction degree of the questioners [35], [36], while other work [29], [34] defines the quality of question-answer pairs as their value to the public. In this paper, we extract useful medical knowledge from Q&A websites to support various applications, and thus our definition of quality belongs to the latter.


6.2 Truth Discovery

The main component of the proposed MKE system belongs to the topic of truth discovery [10]–[12]. Truth discovery methods can automatically estimate source reliability (doctor expertise) from the data without any supervision, and incorporate the estimated source reliability into the aggregation of noisy multi-source information. Representative work includes TruthFinder [6], [11], AccuSim [7], [12], 3-Estimates [37], Investment [8] and CRH [9]. Although their mathematical formulations differ, these methods share the same main principle: the information from a reliable source is more trustworthy, and a source is reliable if it often provides trustworthy information.

Most truth discovery methods work on clean structured data. Recently, more attention has been paid to noisy textual data, e.g., in [17], [38], [39]. In this work, we also study how to apply truth discovery to textual data. From the perspective of techniques, as discussed in Section 3, we address some unique challenges in medical knowledge extraction that cannot be solved by existing truth discovery methods. First, most truth discovery methods make the single-truth assumption. Recently, some work [19], [20] has studied the multiple-truth scenario; however, it assumes that the multiple truths are independent. In our case, this assumption does not hold, as the relations among multiple possible diseases should be taken into account. To capture this linked multiple truths challenge, we adopt a similarity function. Although a similarity function is also adopted in [7], [11], those methods still make the single-truth assumption. Second, we observe the long-tail phenomenon in the collected data. Previous work [24] also addresses this challenge, but only from the perspective of sources, while in our case the long-tail phenomenon exists on both sources and questions. Moreover, the solution in [24] is derived for continuous data, whereas we deal with textual (categorical) data.
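As a concrete illustration of this shared principle, below is a minimal sketch of such an iterative estimation, including a similarity-weighted support step for linked answers and a pseudo count for long-tail doctors; the update rules are deliberately simplified and are not the exact MKE formulation.

# A sketch of the iterative truth discovery principle: answer
# trustworthiness and source (doctor) reliability reinforce each
# other. Similarity-weighted support and the pseudo count follow the
# ideas above, but the update rules are simplified illustrations.
# claims[question] = list of (doctor, answer) pairs; toy data only.
claims = {
    "q1": [("doc_a", "cold"), ("doc_b", "cold"), ("doc_c", "flu")],
    "q2": [("doc_a", "rhinitis"), ("doc_c", "rhinitis")],
}
doctors = {d for answers in claims.values() for d, _ in answers}

def sim(a, b):
    # Placeholder answer similarity; MKE derives similarity from
    # learned representations of the diagnoses.
    if a == b:
        return 1.0
    return 0.3 if {a, b} == {"cold", "flu"} else 0.0

PSEUDO = 1.0                      # pseudo count for long-tail doctors
weights = {d: 1.0 for d in doctors}

for _ in range(10):
    # Answer trustworthiness: similarity-weighted support from all
    # (weighted) doctors answering the same question, so that linked
    # answers such as related diseases reinforce each other.
    trust = {}
    for q, answers in claims.items():
        for _, a in answers:
            trust[(q, a)] = sum(weights[d] * sim(a, a2)
                                for d, a2 in answers)
    # Doctor expertise: smoothed average trustworthiness of the
    # doctor's answers; the pseudo count keeps the estimate reasonable
    # for doctors who answered only a few questions (the long tail).
    for d in doctors:
        given = [trust[(q, a)] for q, answers in claims.items()
                 for d2, a in answers if d2 == d]
        weights[d] = (PSEUDO + sum(given)) / (PSEUDO + len(given))
    # Normalize so that expertise scores stay on a comparable scale.
    mean = sum(weights.values()) / len(weights)
    weights = {d: w / mean for d, w in weights.items()}

print(weights)  # doc_a and doc_b come out more reliable than doc_c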
7 CONCLUSIONS

The medical crowdsourced Q&A websites provide valuable but noisy health-related information. To extract high-quality medical knowledge from the question-answer pairs, we propose a Medical Knowledge Extraction (MKE) system in this paper. The MKE system can extract knowledge triples <question, diagnosis, trustworthiness degree> and estimate doctors' expertise simultaneously without any supervision. Three unique challenges in medical knowledge extraction tasks are recognized and tackled in the MKE system: we use an entity-based representation to clean the noisy text input and merge similar questions; a similarity function is applied to model the correlation between answers; and, to handle the long-tail phenomenon on sources, a pseudo count is added so that we can estimate reasonable expertise for each doctor. A set of experiments on real-world datasets crawled from xywy.com validates the effectiveness of the proposed MKE system in automatically extracting meaningful knowledge and estimating fine-grained doctor expertise from medical crowdsourced Q&A websites. We also show a real-world application, Ask A Doctor, to demonstrate the impact of the MKE system. Beyond this App, the MKE system has great potential to benefit more applications such as robot doctors and question routing in Q&A websites.

ACKNOWLEDGMENTS

This work was sponsored in part by the US National Science Foundation under grant IIS-1553411. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agency.

REFERENCES

[1] "Baidu health related queries," http://www.ebrun.com/20150812/144515.shtml.
[2] L. Nie, Y.-L. Zhao, M. Akbari, J. Shen, and T.-S. Chua, "Bridging the vocabulary gap between health seekers and healthcare knowledge," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 396–409, 2015.
[3] L. Nie, M. Wang, L. Zhang, S. Yan, B. Zhang, and T.-S. Chua, "Disease inference from health-related questions via sparse deep learning," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 8, pp. 2107–2119, 2015.
[4] L. Nie, T. Li, M. Akbari, J. Shen, and T.-S. Chua, "Wenzher: Comprehensive vertical search for healthcare domain," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'14), 2014, pp. 1245–1246.
[5] L. Nie, M. Akbari, T. Li, and T.-S. Chua, "A joint local-global approach for medical terminology assignment," in SIGIR 2014 Workshop on Medical Information Retrieval, 2014, pp. 24–27.
[6] X. Yin, J. Han, and P. S. Yu, "Truth discovery with multiple conflicting information providers on the web," in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'07), 2007, pp. 1048–1052.
[7] X. L. Dong, L. Berti-Equille, and D. Srivastava, "Integrating conflicting data: The role of source dependence," The Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 1, pp. 550–561, 2009.
[8] J. Pasternack and D. Roth, "Knowing what to believe (when you already know something)," in Proc. of the International Conference on Computational Linguistics (COLING'10), 2010, pp. 877–885.
[9] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han, "Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation," in Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD'14), 2014, pp. 1187–1198.
[10] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han, "A survey on truth discovery," arXiv preprint arXiv:1505.02463, 2015.
[11] X. Yin, J. Han, and P. S. Yu, "Truth discovery with multiple conflicting information providers on the web," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 6, pp. 796–808, 2008.
[12] X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava, "Truth finding on the deep web: Is the problem solved?" The Proceedings of the VLDB Endowment (PVLDB), vol. 6, no. 2, pp. 97–108, 2012.
[13] S. Mukherjee, G. Weikum, and C. Danescu-Niculescu-Mizil, "People on drugs: credibility of user statements in health communities," in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), 2014, pp. 65–74.
[14] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, "On truth discovery in social sensing: A maximum likelihood estimation approach," in Proc. of the International Conference on Information Processing in Sensor Networks (IPSN'12), 2012, pp. 233–244.
[15] L. Su, Q. Li, S. Hu, S. Wang, J. Gao, H. Liu, T. Abdelzaher, J. Han, X. Liu, Y. Gao, and L. Kaplan, "Generalized decision aggregation in distributed sensing systems," in Proc. of the IEEE Real-Time Systems Symposium (RTSS'14), 2014, pp. 1–10.
[16] C. C. Aggarwal and T. Abdelzaher, "Social sensing," in Managing and Mining Sensor Data, 2013, pp. 237–297.
[17] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang, "From data fusion to knowledge fusion," The Proceedings of the VLDB Endowment (PVLDB), vol. 7, no. 10, pp. 881–892, 2014.
[18] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge vault: A web-scale approach to probabilistic knowledge fusion," in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), 2014, pp. 601–610.


[19] B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han, "A bayesian approach to discovering truth from conflicting sources for data integration," The Proceedings of the VLDB Endowment (PVLDB), vol. 5, no. 6, pp. 550–561, 2012.
[20] R. Pochampally, A. D. Sarma, X. L. Dong, A. Meliou, and D. Srivastava, "Fusing data with correlations," in Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD'14), 2014, pp. 433–444.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems (NIPS'13), 2013, pp. 3111–3119.
[22] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. of the International Conference on Machine Learning (ICML'08), 2008, pp. 160–167.
[23] A. Mnih and G. E. Hinton, "A scalable hierarchical distributed language model," in Advances in Neural Information Processing Systems (NIPS'09), 2009, pp. 1081–1088.
[24] Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han, "A confidence-aware approach for truth discovery on long-tail data," The Proceedings of the VLDB Endowment (PVLDB), vol. 8, no. 4, pp. 425–436, 2015.
[25] "Jieba, chinese word segmentation package," https://github.com/fxsjy/jieba.
[26] "Kendall rank correlation coefficient," https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient.
[27] "Spearman's rank correlation coefficient," https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient.
[28] "Word2vector package," https://code.google.com/p/word2vec/.
[29] J. Jeon, W. B. Croft, J. H. Lee, and S. Park, "A framework to predict the quality of answers with non-textual features," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'06), 2006, pp. 228–235.
[30] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, "Finding high-quality content in social media," in Proc. of the ACM International Conference on Web Search and Data Mining (WSDM'08), 2008, pp. 183–194.
[31] C. Shah and J. Pomerantz, "Evaluating and predicting answer quality in community QA," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'10), 2010, pp. 411–418.
[32] H. Hu, B. Liu, B. Wang, M. Liu, and X. Wang, "Multimodal DBN for predicting high-quality answers in cQA portals," in Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL'13), 2013, pp. 843–847.
[33] T. Zhao, C. Li, M. Li, S. Wang, Q. Ding, and L. Li, "Predicting best responder in community question answering using topic model method," in Proc. of the International Joint Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT'12), 2012, pp. 457–461.
[34] L. Yang, M. Qiu, S. Gottipati, F. Zhu, J. Jiang, H. Sun, and Z. Chen, "Cqarank: Jointly model topics and expertise in community question answering," in Proc. of the ACM Conference on Information and Knowledge Management (CIKM'13), 2013, pp. 99–108.
[35] Y. Liu, J. Bian, and E. Agichtein, "Predicting information seeker satisfaction in community question answering," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'08), 2008, pp. 483–490.
[36] L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman, "Knowledge sharing and yahoo answers: Everyone knows something," in Proc. of the International Conference on World Wide Web (WWW'08), 2008, pp. 665–674.
[37] A. Galland, S. Abiteboul, A. Marian, and P. Senellart, "Corroborating information from disagreeing views," in Proc. of the ACM International Conference on Web Search and Data Mining (WSDM'10), 2010, pp. 131–140.
[38] J. Pasternack and D. Roth, "Making better informed trust decisions with generalized fact-finding," in Proc. of the International Joint Conference on Artificial Intelligence (IJCAI'11), 2011, pp. 2324–2329.
[39] D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss, and M. Magdon-Ismail, "The wisdom of minority: Unsupervised slot filling validation based on multi-dimensional truth-finding," in Proc. of the International Conference on Computational Linguistics (COLING'14), 2014.

Yaliang Li is a Ph.D. student in the Department of Computer Science and Engineering at SUNY Buffalo. He received his B.S. degree from Nanjing University of Posts and Telecommunications in 2010. His research topics include truth discovery, text and web mining, privacy-preserving data mining, and data mining applications in healthcare.

Chaochun Liu is a Research Scientist at the Baidu Research Big Data Lab in Sunnyvale, California. He received his Ph.D. in Mathematics from Sun Yat-sen University, China, in 2009. After graduation, he worked as a Senior Research Associate at the City University of Hong Kong, and then as a Postdoctoral Researcher and Research Scientist at the New York State Department of Health. His research interests include deep learning, text mining, computer vision and bioinformatics. He has published 20+ papers in the fields of machine learning and bioinformatics.

Nan Du is a Research Scientist at the Baidu Research Big Data Lab in Sunnyvale, California. He received his Ph.D. degree from the Department of Computer Science and Engineering at the State University of New York at Buffalo, NY, supervised by Prof. Aidong Zhang. Prior to that, he received the M.S. degree from South China University of Technology in 2009. His research interests are in the areas of machine learning, natural language processing, health care and bioinformatics.

Wei Fan is currently the Senior Director and Head of the Baidu Research Big Data Lab in Sunnyvale, California. He received his Ph.D. in Computer Science from Columbia University in 2001. His main research interests and experiences are in various areas of data mining and database systems, such as deep learning, stream computing, high performance computing, extremely skewed distributions, cost-sensitive learning, risk analysis, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, time series analysis, bioinformatics, social network analysis, novel applications and commercial data mining systems. His co-authored papers received ICDM'06/KDD'11/KDD'12/KDD'13/KDD'97 Best Paper and Best Paper Runner-up Awards. He led the team that used his Random Decision Tree (www.dice.com) method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD). During his time as Associate Director of Huawei Noah's Ark Lab in Hong Kong from August 2012 to December 2014, he led his colleagues to develop Huawei StreamSMART, a streaming platform for online and real-time processing, querying and mining of very fast streaming data. StreamSMART is 3 to 5 times faster than STORM and 10 times faster than Spark Streaming, and was used by Beijing Telecom, Saudi Arabia STC, Norway Telenor and a few other mobile carriers in Asia. Since joining the Baidu Big Data Lab, Wei has been working on medical and healthcare research and applications, such as deep learning-based disease diagnosis from natural language input as well as a medical dialogue robot.


Qi Li received the B.S. degree in Mathematics from Xidian University and the M.S. degree in Statistics from the University of Illinois at Urbana-Champaign, in 2010 and 2012 respectively. She is currently working toward the Ph.D. degree in the Department of Computer Science and Engineering, University at Buffalo. Her research interests include truth discovery, data aggregation, and crowdsourcing. She has published papers on these topics in SIGMOD, VLDB, KDD, and WSDM.

Jing Gao is currently an assistant professor in the Department of Computer Science at the University at Buffalo (UB), State University of New York. She received her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2011, and subsequently joined UB in 2012. She is broadly interested in data and information analysis with a focus on truth discovery, crowdsourcing, multi-source data analysis, anomaly detection, information network analysis, transfer learning, data stream mining, and ensemble learning. She is also interested in various data mining applications in health care, bioinformatics, transportation, cyber security and education. She is the recipient of an NSF CAREER award and an IBM faculty award.

Chenwei Zhang received the B.S. degree in computer science and technology from Southwest University, China, in 2014 and is currently working toward the Ph.D. degree in the Department of Computer Science at the University of Illinois at Chicago. His research interests include crowdsourcing, web and healthcare data mining, and deep learning.

Hao Wu is a Ph.D. student in the Computer Science Department at the University of Southern California. He obtained M.S. and B.E. degrees from Zhejiang University, China. His primary research interests are in machine learning, text mining and social network analysis.
