This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TBDATA.2016.2612236, IEEE Transactions on Big Data

IEEE TRANSACTIONS ON BIG DATA
Abstract—Medical crowdsourced question answering (Q&A) websites have boomed in recent years, and an increasingly large number of patients and doctors are involved. The valuable information on these medical crowdsourced Q&A websites can benefit patients, doctors, and society. One key to unleashing the power of these Q&A websites is to extract medical knowledge from the noisy question-answer pairs and filter out unrelated or even incorrect information. Facing the daunting scale of information generated on medical Q&A websites every day, it is unrealistic to fulfill this task via supervised methods due to the expensive annotation cost. In this paper, we propose a Medical Knowledge Extraction (MKE) system that can automatically provide high-quality knowledge triples extracted from the noisy question-answer pairs and, at the same time, estimate expertise for the doctors who give answers on these Q&A websites. The MKE system is built upon a truth discovery framework, in which we jointly estimate the trustworthiness of answers and doctor expertise from the data without any supervision. We further tackle three unique challenges in the medical knowledge extraction task, namely the representation of noisy input, multiple linked truths, and the long-tail phenomenon in the data. The MKE system is applied to real-world datasets crawled from xywy.com, one of the most popular medical crowdsourced Q&A websites. Both quantitative evaluation and case studies demonstrate that the proposed MKE system can successfully provide useful medical knowledge and accurate doctor expertise. We further demonstrate a real-world application, Ask A Doctor, which can automatically give patients suggestions for their questions.
2332-7790 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more
information.
the privacy protection of patients adds to the difficulty of annotation. These issues motivate us to develop unsupervised learning methods that extract knowledge from noisy crowdsourced data on medical Q&A websites.

The key component in the medical knowledge extraction task is to find reliable answers without any supervision. Some existing truth discovery methods [6]–[10] assume that the answers from high-expertise users are reliable, and the users who provide reliable answers should have high expertise. Thus these methods can iteratively estimate the source reliability (i.e., user expertise) and infer reliable answers from user-contributed data. Although truth discovery may be adopted, there are some unique challenges in the medical knowledge extraction task that existing truth discovery methods cannot handle, as discussed below.

First, truth discovery methods are developed for structured data (e.g., relational databases), but on crowdsourced Q&A websites all the inputs are unstructured data (i.e., texts). We address this challenge by transforming unstructured texts into an entity-based representation. Second, trustworthy answers to a health-related question are usually not unique. There may be multiple possible answers to the same question, and those answers are likely to be correlated. Such correlation violates the assumptions held by existing truth discovery methods. To address this challenge, we model the correlation via a similarity function defined on the word vectors of answers. Third, we observe a severe long-tail phenomenon in the medical Q&A data, which makes it difficult to estimate doctor expertise and trustworthy answers. Specifically, most doctors only provide a few answers, so it is difficult to estimate their expertise accurately; many questions receive only a few answers, which may not include any trustworthy answer. To reduce the effect of this long-tail phenomenon, we propose to incorporate a pseudo count for each doctor in the estimation of doctor expertise, and we adopt the proposed entity-based representation to merge similar questions, which enlarges the answer set for many questions.

To evaluate the proposed medical knowledge extraction system, we collect medical Q&A data from xywy.com, a popular online health website in China. To quantitatively evaluate the extracted knowledge, we compare the knowledge with expert annotations. We also validate the estimated doctor expertise by comparing it with external information indicating doctor expertise crawled from xywy.com. Further, we provide some case studies which clearly demonstrate the meaningful knowledge extracted by our system, and we demonstrate the usefulness of the extracted knowledge in a real-world medical application.

In summary, our contributions in this paper are:

• We propose a truth discovery method to automatically extract medical knowledge from noisy crowdsourced question answering websites without any supervision. The proposed method provides a cost-efficient and effective way to mine knowledge from crowdsourced question answering websites.
• The proposed truth discovery method is designed to tackle the new challenges in the medical knowledge extraction task, and the experimental results on real-world datasets confirm its effectiveness.
• Last but not least, we demonstrate a real-world medical application built upon the proposed method. This application, Ask A Doctor, shows that the extracted knowledge can enable and facilitate many online healthcare applications.

In the following, we first present an overview of the proposed Medical Knowledge Extraction system (MKE system for short). Then, in Section 3, we formally formulate the problem and present solutions to tackle the challenges in the knowledge extraction task. In Section 4, we conduct experiments on real medical Q&A datasets to validate the effectiveness of the proposed system. In Section 5, we demonstrate a mobile app that uses the proposed MKE system. We discuss related work in Section 6, and conclude the paper in Section 7.

2 OVERVIEW OF THE SYSTEM

The objective of this system is to extract knowledge triples <question, diagnosis, trustworthiness degree> from noisy question-answer pairs on medical crowdsourced Q&A websites. Meanwhile, for the doctors who give answers on these Q&A websites, their expertise will be automatically estimated.

We propose the MKE (Medical Knowledge Extraction) system, which can jointly conduct medical knowledge extraction and doctor expertise estimation without any supervision. We give an overview of the proposed system in this section. Figure 1 shows the pipeline of the proposed MKE system, and a concrete example is adopted for the purpose of illustration.

On medical crowdsourced Q&A websites, patients have various intentions when asking questions. For example, they may want to find out possible diseases based on their symptoms, or particular side-effects of a drug. Doctors, who play essential roles in these Q&A websites, provide answers to these questions. For the same question, multiple doctors may give different answers due to their diverse expertise.

In order to distill trustworthy medical knowledge, we propose a truth discovery method that automatically estimates doctors' expertise and conducts weighted aggregation based on the estimated doctor expertise. To apply the truth discovery framework, we first extract entities from the texts and transform the texts into entity-based representations. The new representations are then fed into the proposed truth discovery method, which outputs the medical knowledge triples <question, diagnosis, trustworthiness degree> and the estimated doctor expertise.

Based on these outputs, various real-world applications can be built. For example, the extracted medical knowledge triples can be used to answer medical questions in Automatic Diagnosis and Medical Robot applications. Besides, the estimated doctor expertise can be applied in tasks such as Doctor Ranking and Question Routing, which play important roles in crowdsourced Q&A websites.

3 METHODOLOGY

In this section, we present the technical details of the proposed MKE system, which extracts information from
[Figure 1 here. The figure walks through a running example: a patient question — "I am 29 years old. I have been feeling the pain in my throat since the day before yesterday. I got runny nose and kept sneezing. Any idea or suggestions?" — receives answers from several doctors. Doctor 1: "Hi, according to the symptoms you have, you probably got cold due to infection." Doctor 2: "Based on your descriptions, while at the same time, considering the current weather, I would like to say it is sinus infection caused by cold weather." Doctor 3: "Hello. Your condition very likely belongs to symptoms of upper respiratory infection." An Entity Extraction step converts each question-answer pair into an entity-based tuple, and a Truth Discovery step outputs the candidate diagnoses with trustworthiness degrees (e.g., 0.58 and 0.49).]
Fig. 1: Overview of the Medical Knowledge Extraction (MKE) System. The illustrative example is translated from xywy.com,
a Chinese medical crowdsourced question answering website.
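The Entity Extraction step shown in Figure 1 is, at its core, dictionary lookup over the question and answer texts plus age extraction (detailed in Section 3.3.1). The following is a minimal sketch of that step; the dictionary contents, the helper name `extract_entities`, and the example strings are illustrative assumptions, not the paper's implementation.

```python
import re

# Toy medical entity dictionary mapping surface forms to entity types.
# The real system uses an external medical entity dictionary (Section 3.3.1).
ENTITY_DICT = {
    "sore throat": "symptom",
    "runny nose": "symptom",
    "sneezing": "symptom",
    "common cold": "disease",
    "upper respiratory infection": "disease",
}

def extract_entities(text):
    """Return the set of dictionary entities found in the text.

    Longer surface forms are matched first so that, e.g., 'upper
    respiratory infection' is not shadowed by a shorter entry.
    """
    found = set()
    lowered = text.lower()
    for surface in sorted(ENTITY_DICT, key=len, reverse=True):
        if surface in lowered:
            found.add(surface)
    # Age matters for diagnosis, so it is kept in the entity set too.
    age = re.search(r"(\d+)\s*years?\s*old", lowered)
    if age:
        found.add("age:" + age.group(1))
    return found

question = "I am 29 years old. I have a sore throat, runny nose and sneezing."
print(sorted(extract_entities(question)))
# ['age:29', 'runny nose', 'sneezing', 'sore throat']
```

Two questions with different wording but the same symptoms map to the same entity set, which is what later lets the system merge similar questions.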
noisy question-answer pairs and summarizes them into medical knowledge. After formally defining the task, we introduce a basic truth discovery method, and then propose solutions to address the unique challenges in the medical knowledge extraction task.

3.1 Problem Formulation and Notations

We first introduce some important terms used in this paper:

• Question: a question from a patient contains a set of statements and a particular health concern (for example, describing symptoms and asking for possible diseases, or describing diseases and asking for drugs). It also contains some other information about the patient such as her/his age.
• Question topic: for each question, we assume that it belongs to one particular topic such as Pulmonology. In practice, most medical question answering websites such as xywy.com already assign each question a pre-defined topic. We directly keep such question topic information for our system.
• Doctor: a doctor is a person who answers questions on the medical Q&A websites. On the website from which we crawl the data, the "doctors" are real doctors, though this may not be the case for other websites.
• Answer: an answer is a diagnosis provided by a doctor for a particular question. There may be multiple answers provided by different doctors for the same question, and these answers may be noisy and unreliable. Note that in the collected dataset, all the answers to the same question are independently provided by different doctors.
• Claim: a claim is a tuple that consists of a question, a doctor ID, and the corresponding answer from this doctor to this question.
• Knowledge triple: a knowledge triple consists of a question, a diagnosis, and a trustworthiness degree of the diagnosis. Knowledge triples are created by aggregating claims from multiple doctors.
• Doctor expertise: each doctor who answers the questions of a certain topic is associated with an expertise score that indicates his probability of providing trustworthy answers on this topic. As we know neither the trustworthiness of answers nor the doctor expertise a priori, we need to estimate the doctor expertise from data, and incorporate the estimated doctor expertise into a weighted aggregation to derive the knowledge triples.

Here we formally define our task. Suppose there is a set of medical questions Q and a set of doctors D who provide answers. Let x_q^d denote the answer to the q-th question provided by the d-th doctor, and let w_d denote the expertise score of the d-th doctor. The objective is to conduct weighted aggregation on the noisy data {x_q^d}_{q∈Q, d∈D} to derive knowledge triples <question, diagnosis, trustworthiness degree> and estimate doctor expertise.

3.2 Basic Truth Discovery Method

Since both answer trustworthiness and doctor expertise are unknown, this problem can be formulated as a Truth Discovery problem [10]–[12], which jointly estimates the trustworthiness of answers and doctor expertise from the Q&A data. Various truth discovery methods have been developed and their success has been demonstrated in many real-world applications, such as healthcare [13], crowd sensing [14]–[16], and knowledge base construction [17], [18]. In this
section, we introduce a basic truth discovery method for the medical knowledge extraction task.

Truth discovery methods take input tuples of <object, value, source>. In our specific task, the object is a question, the value is an answer, and the source is a doctor. In the truth discovery problem setting, an object may receive conflicting claimed values from different sources, and a source may provide values for different objects. The goal of truth discovery is to resolve the conflicts and find the truth (i.e., the most trustworthy answer) for each object by estimating the source reliability (i.e., doctor expertise). A straightforward solution is voting, which takes the majority answer as the reliable one. However, this solution has an underlying assumption that all sources are equally reliable.

To capture the variety in source reliability, truth discovery methods are developed to jointly conduct truth computation and source reliability estimation. These methods hold a common principle: if an answer is provided by doctors with high expertise, it is regarded as a trustworthy answer; meanwhile, if a doctor often provides trustworthy answers, she/he is assigned a high expertise score. Based on this principle, we can iteratively update the trustworthiness of answers and doctor expertise as follows:

• Estimate the trustworthiness degree of a possible answer x_q for the q-th question:

    T(x_q) = Σ_{d∈D} w_d · 1(x_q, x_q^d),    (1)

where 1(·, ·) is an indicator function: 1(x, y) = 1 if x = y; otherwise, 1(x, y) = 0. Eq. (1) is formulated based on the truth discovery principle: the trustworthiness degree of an answer is determined by the expertise scores of the doctors who provide that answer, and the trustworthy answers are the ones that are supported by doctors with high expertise. If the w_d's are higher, then correspondingly, the trustworthiness degree T(x_q) is higher. The trustworthiness degrees are normalized such that the trustworthiness degrees of all possible answers to a particular question sum to 1. Thus T(x_q) can be interpreted as the probability that x_q is trustworthy.

• Update the doctor expertise score:

    w_d = −log(1 − Σ_{x∈V_d} T(x) / |V_d|),    (2)

where V_d is the set of answers provided by the d-th doctor. Eq. (2) is also formulated based on the truth discovery principle: a higher expertise score is assigned if the doctor provides more trustworthy answers. In this equation, the term Σ_{x∈V_d} T(x) / |V_d| is the average trustworthiness degree of the d-th doctor's answers, so we can treat 1 − Σ_{x∈V_d} T(x) / |V_d| as the probability of the d-th doctor providing wrong answers. The logarithm function is used to re-scale the expertise scores so that the differences among the scores are enlarged. From Eq. (2), it can be seen that a doctor who is more likely to provide wrong answers gets a lower expertise score.

Eq. (1) estimates the trustworthiness degree for each possible answer by conducting weighted voting where the weights are doctor expertise scores, and Eq. (2) updates the expertise score for each doctor based on the answer trustworthiness degrees. These equations follow the general principle of truth discovery. To be more specific, the truth discovery method starts with a uniform initialization of the doctor expertise, and then iteratively estimates answer trustworthiness degrees and updates doctor expertise. The iterative procedure ends when a stopping criterion is met, e.g., the maximum number of iterations is reached.

The main advantage of truth discovery is that it can discover trustworthy information supported by only a few information sources. This advantage comes from the fact that truth discovery estimates a reliability score for each source and conducts weighted aggregation. In our scenario, answers supported by a few high-expertise doctors can be selected as trustworthy answers, which greatly helps us extract trustworthy medical knowledge from noisy crowdsourced question answering websites.

3.3 Challenges and Solutions

Although the medical knowledge extraction task can be formulated as a truth discovery problem, the basic truth discovery method overlooks some unique challenges of the task. Therefore, the basic truth discovery method has to be adapted to the medical knowledge extraction task. In this section, we discuss these challenges and present the corresponding solutions.

3.3.1 Noisy Input

The first challenge we face is how to clean the noisy input. Existing truth discovery methods can only work on structured data, but the knowledge extraction task deals with unstructured and noisy text data. In order to achieve better performance, we need to derive better representations of the questions and answers. In this section, we present the solution to convert text into structured data, using the question text as an example; the answer text can be handled in a similar way.

Specifically, we propose an entity-based representation of text. That is, we extract a set of entities from each question q ∈ Q to represent the original question text, where an entity can be a particular symptom, disease, drug, etc. Correspondingly, the entities we look for in the answer text are diseases, drugs, drug side-effects, etc. We use an available medical entity dictionary for entity extraction. If a word from a question text exists in the dictionary, then we put that word into the entity set for this question. As the age of the patient is important for diagnosis, we also include the age information in the entity set.

For the answer text, a similar procedure is performed. Note that the truth discovery framework requires that each answer contain only one diagnosis. Therefore, if more than one answer entity is extracted from a doctor's answer text, we assume that this doctor provides multiple answers, where each entity corresponds to a different answer.

Using the proposed entity-based representation of text enables us to convert text input into structured representations. During this process, text with similar meanings is mapped into similar or even the same representations. For example, "I have a headache, fever, and coughing"
is converted into a set with three entities <headache, fever, cough>, as is "I cough, and have a fever and a headache". Therefore, this entity-based representation is able to merge questions with similar meanings.

3.3.2 Multiple Linked Truths

The second challenge of the medical knowledge extraction task is that there may be multiple trustworthy answers to a question, and these answers can be correlated with each other. For example, a patient describes his symptoms as "headache, fever, and cough", and asks for the disease he might have. The doctors may suggest that the potential disease is common cold or flu based on his description. According to common sense, we know that these two answers are both possible, and these two diseases share many common symptoms, so they are not independent answers. This challenge violates an important assumption of many truth discovery methods [9], [11], [12], namely the single truth assumption: there exists one and only one trustworthy answer for each question. There is some recent work that relaxes the single truth assumption and considers the multiple truths scenario, such as [19], [20]. However, these works rely on another assumption, namely that the truths for the same question are independent of each other. For example, if A and B are listed as the authors of a book by a website, then the probabilities of A being correct and B being correct are independent. Obviously, the knowledge extraction task violates this assumption, and thus these works [19], [20] are not applicable.

We can easily adapt the truth discovery method in Section 3.2 to handle the multiple truths scenario. Since we calculate the trustworthiness degrees for all the answers, we can regard the answers whose scores are higher than a threshold as trustworthy answers. Some further efforts are needed to capture the relationship between answers, as discussed in the following.

To capture the correlation among multiple possible answers, we propose to represent the answer entity using the neural word embedding method [21]–[23], where each word is represented by a real-valued word vector. The key idea behind neural word embedding is that the meaning of a word can be characterized by its context words. Therefore, the vector representation of words can be obtained by training on a large corpus without syntax analysis or any manual labeling. For example, according to their similar context words, the neural word embedding methods can automatically learn similar real-valued word vectors for "common cold" and "flu", which capture the relation between these two diseases. The benefit of using neural word embedding is that we can easily calculate the similarity of the words (answer entities) as a real value. If two words have similar meanings, then the similarity of their vectors will be high, and vice versa.

The correlation between words can then be used to improve the calculation of the answer trustworthiness. Let us revisit the earlier example. If "common cold" is a trustworthy answer for that question, then "flu" should also be considered as a trustworthy answer, since "common cold" and "flu" are highly correlated. Therefore, we modify Eq. (1) based on the idea of "implication" (i.e., a similarity function) between answers [7], [11]. We incorporate the cosine similarity between answers into Eq. (1). Then the trustworthiness degree of a possible answer is calculated as:

    T(x_q) = Σ_{d∈D} w_d · 1(x_q, x_q^d) + Σ_{x'_q ≠ x_q} Sim(v_{x_q}, v_{x'_q}) · T(x'_q),    (3)

where Sim(v, v') is the cosine similarity between two vectors, and x'_q is another possible answer to the q-th question. The similarity between two possible answers plays the role of a coefficient, and automatically controls how much influence should be considered. Thus the trustworthiness of an answer is enhanced if it is supported by other similar answers; on the other hand, if an answer is not supported or is even opposed by other answers, then the cosine similarity gives a negative value, so the answer's trustworthiness is discounted. By incorporating answer similarity, the correlation among different answers can be modeled.

3.3.3 Long-Tail Phenomenon

In many crowdsourcing applications, the long-tail phenomenon is observed [24]. This phenomenon holds for the medical crowdsourced Q&A data studied in this paper as well, as demonstrated in Figure 2. That is, most doctors provide answers to only a few questions, and only a small set of doctors provide answers to many questions. Likewise, for most questions, the number of received answers is small, and only a small set of questions receive a large number of answers. Both types of long-tail distribution exist, but the existing truth discovery work that handles the long-tail phenomenon [24] only considers the long tail from the perspective of sources, i.e., most sources provide only a few answers. However, Figure 2 clearly demonstrates that the long-tail phenomenon is also seen on questions: most questions get only one or two answers while only a few questions get plenty of answers. The long-tail phenomenon of both types poses challenges to the basic truth discovery method. Without sufficient answers from each doctor, we cannot accurately estimate the doctors' expertise. Without sufficient answers to each question, the trustworthy answer cannot be identified, since it is probable that none of the answers is trustworthy.

To tackle the challenge brought by the long-tail phenomenon on sources, we modify Eq. (2) based on the solution proposed in [24]: when calculating the source weights, a chi-square parameter is used to capture the effect of source size in the calculation of source reliability. The weights of those sources that provide only a few answers are discounted. Inspired by this idea, we add a pseudo count C_pseudo for each source when estimating its expertise score:

    w_d = −log(1 − Σ_{x∈V_d} T(x) / (|V_d| + C_pseudo)).    (4)

In this equation, if a doctor provides only a few answers, then C_pseudo will dominate the term |V_d| + C_pseudo, so the doctor's expertise score will be low. On the other hand, if a doctor provides many answers, then |V_d| will dominate the term |V_d| + C_pseudo, and his expertise score will be close to the original estimation.

The challenge brought by the long-tail phenomenon on questions can be naturally solved by the entity-based representation introduced in Section 3.3.1. As discussed earlier,
the answer text. The MKE system also extracts the patient's age from the question text and incorporates it into the representation. Next, the MKE system constructs the tuples <{age, entities from question text}, entity from answer text, doctor ID> as the input to the truth discovery approach. Finally, the MKE system applies the proposed truth discovery method to build knowledge triples <question, diagnosis, trustworthiness degree> and compute the doctor expertise {w_d}.

[Figure 2a here: a histogram of the number of answers per doctor (y-axis: Number of Doctors, 0–600; x-axis: Number of Answers, 0–25), captioned "(a) Distribution of Number of Answers per Doctor". Figure 2b, the corresponding distribution over questions, follows in the original.]

Algorithm 1: MKE System
Input: set of medical questions Q and their corresponding answers {x_q^d}_{q∈Q, d∈D}, an external entity dictionary with entity types, and real-valued vector representations of entities.
Output: discovered knowledge triples <question, diagnosis, trustworthiness degree>, and doctor expertise {w_d}.
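The iterative core behind this input/output contract can be sketched as follows: uniform initialization of doctor expertise, weighted voting with answer-similarity smoothing in the spirit of Eq. (3), per-question normalization, and the pseudo-count expertise update of Eq. (4). This is a minimal sketch under simplifying assumptions, not the authors' implementation; the similarity function is stubbed out (it would come from the word vectors of Section 3.3.2), and names such as `run_mke` are illustrative.

```python
import math
from collections import defaultdict

def run_mke(claims, sim=lambda a, b: 0.0, c_pseudo=5.0, n_iters=20):
    """claims: list of (question, answer, doctor) tuples after entity extraction.

    sim(a, b) should return the cosine similarity of the word vectors of
    two answer entities (Eq. (3)); the default ignores answer correlations.
    Returns (trust, expertise): trust[q][answer] and expertise[doctor].
    """
    doctors = {d for _, _, d in claims}
    answers_by_q = defaultdict(set)
    for q, a, _ in claims:
        answers_by_q[q].add(a)

    expertise = {d: 1.0 for d in doctors}  # uniform initialization
    trust = {}
    for _ in range(n_iters):
        # Eq. (1)/(3): expertise-weighted voting plus similarity smoothing.
        votes = {q: {a: 0.0 for a in ans} for q, ans in answers_by_q.items()}
        for q, a, d in claims:
            votes[q][a] += expertise[d]
        trust = {}
        for q, ans in answers_by_q.items():
            raw = {a: votes[q][a]
                      + sum(sim(a, b) * votes[q][b] for b in ans if b != a)
                   for a in ans}
            clipped = {a: max(v, 0.0) for a, v in raw.items()}
            total = sum(clipped.values()) or 1.0
            trust[q] = {a: v / total for a, v in clipped.items()}  # sums to 1

        # Eq. (4): expertise update with a pseudo count against the long tail.
        for d in doctors:
            scores = [trust[q][a] for q, a, d2 in claims if d2 == d]
            avg = sum(scores) / (len(scores) + c_pseudo)
            expertise[d] = -math.log(max(1.0 - avg, 1e-12))
    return trust, expertise

claims = [("q1", "common cold", "doc1"), ("q1", "common cold", "doc2"),
          ("q1", "flu", "doc3")]
trust, expertise = run_mke(claims)
# "common cold" ends up more trustworthy than "flu" for q1.
```

Note how the pseudo count keeps the expertise of a doctor with very few answers close to zero, while a prolific doctor's score approaches the original estimate of Eq. (2).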
have different expertise on different topics, we use this that on average, the extracted medical knowledge from
provided fine-grained topic information to partition the the MKE system is meaningful and of high quality. This
crawled dataset. Eventually, we obtained the datasets for experiment quantitatively demonstrates the effectiveness of
seven topics, and the number of questions and the number the proposed method.
of involved doctors are summarized in Table 1.
Case Kendall Coefficient Spear Coefficient
Topic Number of Questions Number of Involved Doctors
1 0.5238 0.6071
1 30,104 2,233 2 0.3333 0.4424
2 9,540 1,617 3 0.2143 0.2619
3 10,939 1,689 4 0.8 0.9
4 5,409 1,419 5 0.4 0.6
5 10,470 1,550 6 0.2444 0.3091
6 4,049 1,051 7 0.4222 0.5394
7 4,957 1,347 8 0.2 0.3
9 0.2 0.3
TABLE 1: Statistics of the Datasets 10 0.1429 0.1905
11 0.0667 0.0061
12 0.3333 0.4667
The crawled questions and answers are in Chinese. Dif- 13 0.1556 0.1273
ferent from English, Chinese strings are not divided by word 14 0.3333 0.4424
15 0.3333 0.4303
delimiter. Thus word segmentation is required, where the
16 0.3333 0.3697
Chinese strings are cut into Chinese component words. This 17 0.3056 0.3875
pre-processing step is performed by applying an existing 18 0.8222 0.9394
word segmentation package [25].

4.2 Quantitative Evaluation on the Extracted Medical Knowledge

The feedback information from patients is valuable for the purpose of evaluation. However, such information is so limited in the collected dataset that we cannot rely on it. Thus, in this section, we quantitatively evaluate the quality of the extracted medical knowledge by comparing it with annotations from a medical expert.

For each question, the proposed MKE system outputs possible diagnoses with their corresponding trustworthiness degrees. It is much harder for humans to assign precise probabilities to the possible diagnoses than to rank the diagnoses by their likelihood. Therefore, we rank the results of the MKE system by their estimated trustworthiness degrees, and adopt ranking comparison between MKE's output and the human annotations to quantitatively evaluate the quality of the extracted medical knowledge. We hire a medical expert to annotate for the disease-drug task, where patients describe their diseases and doctors suggest drugs to take. Due to the cost of annotation, we randomly select 25 cases for labeling. Each case is associated with 5 to 15 possible diagnoses.

Two widely used rank correlation coefficients are adopted as the performance metrics: the Kendall rank correlation coefficient [26] and Spearman's rank correlation coefficient [27]. Both coefficients compare two ranked lists and output a value in the range of −1 to 1. A positive output indicates that the two ranked lists are positively correlated, and a negative output indicates negative correlation. The absolute value of a coefficient indicates the strength of the correlation: the bigger, the stronger.

The results for the 25 randomly selected cases are listed in Table 2. From the table, we can see that for all the cases, both the Kendall and Spearman's rank correlation coefficients are positive, which indicates positive correlation between the knowledge discovered by the proposed method and the human expert's annotations. The mean and median of both coefficients are considerably high, which implies that the extracted knowledge agrees well with the expert's judgment.

TABLE 2: Evaluation on the Quality of Extracted Medical Knowledge

Case                 Kendall    Spearman
…                    …          …
19                   0.2889     0.5515
20                   0.2778     0.3167
21                   0.4222     0.5152
22                   0.3889     0.5333
23                   0.1111     0.0667
24                   0.5556     0.6833
25                   0.4444     0.5333
Mean                 0.3461     0.4328
Median               0.3333     0.4424
Standard Deviation   0.1855     0.2252

4.3 Quantitative Evaluation on the Estimated Doctor Expertise

The proposed MKE system automatically learns both the medical knowledge triples and the expertise of doctors. In the following experiments, we conduct a quantitative evaluation on the estimated doctor expertise.

The xywy.com website maintains a profile for each registered doctor. Based on the profile, as well as the doctor's historical activities such as the number of thanks she/he gets, the website assigns a level score to each doctor. This external information cannot provide a precise measure of doctor expertise, as the score is topic independent, but it can still give us some guidelines regarding the expertise of doctors. Generally speaking, if a doctor is assigned a high level score by xywy.com, it is likely that the doctor is better than one with a low level score.

We compare the estimated doctor expertise distributions for different doctor levels and plot the results for two randomly selected topics in Figure 3. From Figure 3, we observe that for both topics, the mean of the learned doctor expertise is positively correlated with the level scores. For doctors with higher level scores, the mean of the estimated doctor expertise is also higher, which is expected according to our intuition. Although it is difficult to prove that the estimated doctor expertise is accurate due to the lack of ground truth information, this observation confirms that the estimates are reasonable.
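The correlation checks above can be reproduced with standard tools. The sketch below uses SciPy to compute both coefficients on a pair of hypothetical rankings (the ranks are illustrative, not the paper's data):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical rankings of five candidate diagnoses: the MKE system's
# trustworthiness-based ranking vs. a medical expert's ranking.
mke_rank = [1, 2, 3, 4, 5]
expert_rank = [1, 2, 3, 5, 4]  # the expert swaps the last two diagnoses

tau, _ = kendalltau(mke_rank, expert_rank)
rho, _ = spearmanr(mke_rank, expert_rank)

# Both coefficients lie in [-1, 1]; positive values indicate agreement.
print(tau, rho)  # tau ≈ 0.8, rho ≈ 0.9
```

With the xywy.com level scores in place of the expert ranks, the same call also serves as a quick sanity check that the estimated expertise rises with the level score.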
[Figure 3: two panels plotting the estimated doctor expertise against the doctor level score (1–4) for two randomly selected topics; panel (b) shows topic 5.]
Fig. 3: Evaluation on Doctor Expertise Estimation

In fact, the doctor expertise estimated by the proposed MKE system may be more useful than the level score assigned by xywy.com, because the former infers fine-grained topic expertise for each doctor. From common sense, we know that doctors have their own specialties. Therefore, a doctor's expertise scores should differ across topics. Here, we confirm this by showing the estimated doctor expertise on different topics. Since there are thousands of doctors in the crawled dataset, it is impossible to plot the estimated expertise for all of them. We randomly select six doctors and plot their estimated expertise over the seven topics in Figure 4.

Figure 4 clearly demonstrates that the estimated doctor expertise scores differ considerably across topics. For example, doctor 1 might be an expert on topic 5, but not on the other topics. Doctor 2 has a high expertise score on topic 2, while his expertise scores on the other topics are quite low. These observations confirm our intuition about the necessity of fine-grained doctor expertise estimation.

4.4 Case Studies

The above experiments quantitatively validate the performance of the proposed MKE system. In this section, we give some case studies on several tasks with different question intentions. The question intentions include describing symptoms and asking for possible diseases (symptom-disease task), describing diseases and asking for effective drugs (disease-drug task), and describing diseases and asking for proper medical examinations (disease-examination task).

[Figure 4: estimated expertise of six randomly selected doctors over topics 1–7.]
Fig. 4: Various Doctor Expertise over Topics

Table 3 shows two case studies for the symptom-disease task. Since the original datasets are in Chinese, we also provide the English translation in Table 3. We choose these cases because their symptoms and diseases are common, so general readers without a medical background can still judge the quality of the results. The first patient describes himself as 40 years old, with symptoms of a headache and a stuffed nose. The proposed method suggests that the possible diseases he might have are bronchitis with probability 0.2254, common cold with probability 0.2908, and pharyngitis with probability 0.2349. The second patient describes himself as 60 years old, with symptoms of distending pain around the abdomen, inappetence, and ascites. The proposed method suggests that the possible diseases he might have are cirrhosis with probability 0.3453, hepatitis with probability 0.3203, and liver cancer with probability 0.3343. From these two cases, we can see that the diagnoses and the trustworthiness degrees provided by the proposed MKE system are reasonable.

Tables 4 and 5 show more case studies for the disease-drug task and the disease-examination task, respectively; their English translations are also provided in these tables. All the case studies demonstrate that the MKE system can extract valuable medical knowledge from medical crowdsourced Q&A websites. Therefore, it is helpful for both patients and doctors.

4.5 Effectiveness of Similarity Function

In Section 3.3.2, we discussed the challenge of multiple linked truths: there may be several trustworthy answers for the same question, and these answers are correlated with each other.
To address this challenge, we add a similarity function to the calculation of the answer trustworthiness degree to quantify the similarity between answers. In this section, we experimentally illustrate the importance and effectiveness of the proposed similarity function.

To learn real-valued vector representations for entities, we adopt a large corpus that contains 64 million question-answer pairs. The word2vec package [28] is a popular tool for training vector representations of words, and we use its Skip-gram architecture in our experiments. The dimensionality of the learned vectors is set to 100, the context window size is set to 8, and the minimum occurrence count is set to 5. For more details, please refer to [22].

To illustrate the effect of the similarity function, we show two examples in Table 6. In the first example, if we do not consider the correlation of the answer entities and treat them independently, the second and third diagnosed diseases have very different trustworthiness degrees: the second is significantly higher than the third. However, the second disease is allergic rhinitis and the third is rhinitis, and these two diseases are highly correlated: if one is trustworthy, so should be the other. By successfully capturing this correlation, the similarity function helps to correct the trustworthiness degrees of the diagnoses. Similarly, in the second example, the trustworthiness degree of the third diagnosis increases after using the similarity function, because the third disease, diarrhea, is correlated with the second disease, enteritis. Both examples indicate that the proposed similarity function significantly improves the answer trustworthiness estimation by successfully modeling the links among correlated answer entities.

4.6 Effect of Question Text Representation

As mentioned in Section 3.3.3, the long-tail phenomenon on questions is severe in the medical Q&A dataset: most questions get only one or two answers, while only a few questions get more. However, as the entity-based representation for text can merge similar questions, it can enlarge the set of answers for many questions and reduce the side effect of the long-tail phenomenon on questions. We validate this claim in Table 7, which compares the average number of answers per question before and after using the entity-based representation for questions. It clearly demonstrates that the size of the answer set is boosted.

Note that in Algorithm 1, we include the patients' age information as part of the question text representation, because patients' age plays an important role in doctors' diagnosis.
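For reference, the entity vectors of Section 4.5 can be trained with the public word2vec tool [28]; the flags below mirror the stated hyperparameters (`-cbow 0` selects Skip-gram), while the corpus path is a hypothetical placeholder:

```shell
# Train 100-dimensional Skip-gram vectors on the segmented Q&A corpus.
#   -cbow 0       use the Skip-gram architecture (not CBOW)
#   -size 100     dimensionality of the learned vectors
#   -window 8     context window size
#   -min-count 5  discard tokens occurring fewer than 5 times
./word2vec -train qa_corpus.txt -output entity_vectors.bin \
    -cbow 0 -size 100 -window 8 -min-count 5 -binary 1
```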
TABLE 7: Effectiveness of Entity-Based Representation on the Long-Tail Phenomenon

Topic   Average Number of Answers per Question
        before      after
1       2.25        4.05
2       1.87        3.07
3       2.49        4.66
4       2.08        3.23
5       2.60        4.63
6       1.54        5.69
7       1.85        3.68

In Table 8, we use an example from the extracted medical knowledge to illustrate how age information in the question text representation can affect the diagnoses.

Table 8 shows the top recommended drugs to cure the common cold (感冒) for patients of different ages. For the infant and the young kid (ages 1 and 4), the top recommended drug is Pediatric Paracetamol; for the teenager and the adult (ages 10, 20 and 40), one of the top recommended drugs is amoxicillin; for the elderly (age 60), the recommended drugs are antibiotics and antiviral drugs. Although the patients have the same disease, the recommended drugs are quite different for the different age groups: the recommendation for the young kid is weaker and safer compared to the recommendation for the adult. This demonstrates that it is necessary to include patients' age information in the question text representation.

5 REAL-WORLD APPLICATIONS

As mentioned in Section 2, various real-world applications can be built on top of the extracted medical knowledge triples and the estimated doctor expertise. Here, we describe one application called Ask A Doctor. This application analyzes the descriptions from patients and automatically infers the possible diseases that the patients might have. Ask A Doctor has been implemented, and its beta version is available in the App Store (iOS).

For the purpose of illustration, we give a series of snapshots of this App in Figure 5. The App asks the patient to describe her/his symptoms, and the patient can reply in speech. The App converts the voice input into question text and displays it on the screen, as shown in the first snapshot in Figure 5: "I am 29 years old. I have been feeling the pain in my throat, and I got runny nose and kept sneezing. Any idea or suggestion?" Once the patient confirms her/his question, the App automatically analyzes the patient's symptoms, retrieves the extracted medical knowledge triples of the MKE system, and provides a list of possible diseases with corresponding confidence probabilities. For example, the patient may have a cold with a probability of 0.67, or rhinitis with a probability of 0.07. The App further provides useful information about these diseases by linking to Baidu Baike.

Though we only describe one application, the outputs of the proposed MKE system, both the discovered medical knowledge and the estimated doctor expertise, have great potential to benefit various real-world medical applications. For example, the estimated doctor expertise can help the task of question routing, which studies how to send questions to appropriate doctors in order to receive high-quality answers in a short time. We plan to build more applications based on the proposed MKE system in the future.

6 RELATED WORK

6.1 Crowdsourced Question Answering

The core problem of the proposed Medical Knowledge Extraction (MKE) system is to evaluate the quality of question-answer pairs on crowdsourced Q&A websites. Related work that aims to solve similar problems can be categorized into two groups: the first group formulates the quality evaluation task as a classification problem, while the second group infers the quality of question-answer pairs based on the expertise of the users who provide the answers.

We first introduce the related work that treats the quality evaluation task as a classification problem and solves it via supervised learning methods. The authors in [29] extract 13 non-textual features from each question-answer pair, and train a maximum entropy model to classify the unlabeled question-answer pairs into three quality levels, i.e., Bad, Medium and Good. In [30], the authors propose a general classification framework for the quality evaluation task, into which various features can be incorporated: content-based features such as n-grams, link-based features such as asker-answerer links, and usage-based features such as the number of clicks. Feature construction and selection for the quality evaluation task are further studied in [31]: the authors ask a crowd of people to annotate the quality of question-answer pairs in terms of 13 predefined criteria, such as the novelty and the helpfulness of the answers. By analyzing these annotations, they find that the profile of the answerer is quite useful for the quality evaluation task.
Fig. 5: Snapshots of the App, Ask A Doctor. (a) The patient enters his question "I have been feeling the pain in my throat, and I got runny nose and kept sneezing. Any idea or suggestion?" (b) The App is analyzing the patient's symptoms. (c) A list of possible diseases and their corresponding probabilities is provided: cold with a probability of 0.67, rhinitis with a probability of 0.07, etc. (d) The App links to Baidu Baike to provide more information about the cold.

This finding indicates the necessity of considering answerers' expertise. The above work simply concatenates different features. However, textual and non-textual features usually have different representations, and the correlations between them are non-linear. Thus, in [32], the authors propose a multi-modal Deep Belief Network to learn an informative unified representation of both textual and non-textual features. The drawback of this group of methods is that, to train a good classifier, they need a large amount of labeled data, which is inefficient or even infeasible for large-scale knowledge extraction on medical crowdsourced Q&A websites.

… the topic tagging information about questions. In this line of related work, methods require various external information, such as best-answer voting information, to estimate the expertise of answerers; the quality of question-answer pairs is then inferred based on the derived answerer expertise. However, on medical crowdsourced Q&A websites, patients seldom give such feedback, and the interaction among patients and doctors (the answerers) is also very rare, which makes it difficult to apply these methods. Therefore, in this paper, we propose a method to evaluate the quality of question-answer pairs without any supervision.
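The unsupervised alternative rests on the truth discovery principle reviewed in Section 6.2: reliable sources give trustworthy answers, and trustworthy answers come from reliable sources. A minimal sketch of this mutual reinforcement, with toy data and a simplified weighted-voting update (not the exact MKE model), is:

```python
# Minimal truth discovery sketch: iteratively re-estimate claim
# trustworthiness and source (doctor) reliability without any labels.

# answers[source] = set of claims that source asserts for one question
answers = {
    "doctor_a": {"common cold"},
    "doctor_b": {"common cold"},
    "doctor_c": {"pneumonia"},
}
claims = {c for cs in answers.values() for c in cs}

weights = {s: 1.0 for s in answers}  # source reliability, uniform at first
for _ in range(10):
    # 1) claim trustworthiness = normalized reliability-weighted votes
    votes = {c: sum(w for s, w in weights.items() if c in answers[s])
             for c in claims}
    total = sum(votes.values())
    trust = {c: v / total for c, v in votes.items()}
    # 2) source reliability = average trustworthiness of its claims
    weights = {s: sum(trust[c] for c in cs) / len(cs)
               for s, cs in answers.items()}

best = max(trust, key=trust.get)
print(best)  # prints "common cold"
```

After a few iterations, the weights of the agreeing doctors rise and the majority-supported claim dominates; the MKE system follows the same principle with probabilistic trustworthiness degrees and topic-level doctor expertise.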
6.2 Truth Discovery

The main component of the proposed MKE system belongs to the topic of truth discovery [10]–[12]. Truth discovery methods can automatically estimate source reliability (here, doctor expertise) from the data without any supervision, and incorporate the estimated source reliability into the aggregation of noisy multi-source information. Representative work includes TruthFinder [6], [11], AccuSim [7], [12], 3-Estimates [37], Investment [8] and CRH [9]. Although their mathematical formulations differ, they share the same main principle: information from a reliable source is more trustworthy, and a source is reliable if it often provides trustworthy information.

Most truth discovery methods work on clean structured data. Recently, more attention has been paid to noisy textual data, such as in [17], [38], [39]. In this work, we also study how to apply truth discovery to textual data. From the perspective of techniques, as discussed in Section 3, we address some unique challenges in medical knowledge extraction that cannot be solved by existing truth discovery methods. First, most truth discovery methods make the single-truth assumption. Recently, some work [19], [20] has studied the multiple-truths scenario; however, it assumes that the multiple truths are independent. In our case, this assumption does not hold, as the relations among the multiple possible diseases should be taken into account. To capture the linked multiple truths, we adopt the similarity function. Although a similarity function is also adopted in [7], [11], those methods still make the single-truth assumption. Second, we observe the long-tail phenomenon in the collected data. Previous work [24] also addresses this challenge, but only from the perspective of sources, while in our case the long-tail phenomenon exists for both sources and questions. Moreover, the solution in [24] is derived for continuous data, whereas we deal with textual (categorical) data.

7 CONCLUSIONS

Medical crowdsourced Q&A websites provide valuable but noisy health-related information. To extract high-quality medical knowledge from the question-answer pairs, we propose a Medical Knowledge Extraction (MKE) system in this paper. The MKE system can extract knowledge triples <question, diagnosis, trustworthiness degree> and estimate doctors' expertise simultaneously, without any supervision. Three unique challenges in medical knowledge extraction tasks are recognized and tackled in the MKE system: we use an entity-based representation to clean the noisy text input and merge similar questions; a similarity function is applied to model the correlation between answers; and, to handle the long-tail phenomenon on sources, a pseudo count is added so that we can estimate reasonable expertise for each doctor. A set of experiments on real-world datasets crawled from xywy.com validates the effectiveness of the proposed MKE system in automatically extracting meaningful knowledge and estimating fine-grained doctor expertise from medical crowdsourced Q&A websites. We also show a real-world application, Ask A Doctor, to demonstrate the impact of the MKE system. Beyond this App, the MKE system has great potential to benefit more applications such as robot doctors and question routing in Q&A websites.

ACKNOWLEDGMENTS

This work was sponsored in part by the US National Science Foundation under grant IIS-1553411. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agency.

REFERENCES

[1] "Baidu health related queries," http://www.ebrun.com/20150812/144515.shtml.
[2] L. Nie, Y.-L. Zhao, M. Akbari, J. Shen, and T.-S. Chua, "Bridging the vocabulary gap between health seekers and healthcare knowledge," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 396–409, 2015.
[3] L. Nie, M. Wang, L. Zhang, S. Yan, B. Zhang, and T.-S. Chua, "Disease inference from health-related questions via sparse deep learning," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 8, pp. 2107–2119, 2015.
[4] L. Nie, T. Li, M. Akbari, J. Shen, and T.-S. Chua, "Wenzher: Comprehensive vertical search for healthcare domain," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'14), 2014, pp. 1245–1246.
[5] L. Nie, M. Akbari, T. Li, and T.-S. Chua, "A joint local-global approach for medical terminology assignment," in SIGIR 2014 Workshop on Medical Information Retrieval, 2014, pp. 24–27.
[6] X. Yin, J. Han, and P. S. Yu, "Truth discovery with multiple conflicting information providers on the web," in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'07), 2007, pp. 1048–1052.
[7] X. L. Dong, L. Berti-Equille, and D. Srivastava, "Integrating conflicting data: The role of source dependence," The Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 1, pp. 550–561, 2009.
[8] J. Pasternack and D. Roth, "Knowing what to believe (when you already know something)," in Proc. of the International Conference on Computational Linguistics (COLING'10), 2010, pp. 877–885.
[9] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, and J. Han, "Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation," in Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD'14), 2014, pp. 1187–1198.
[10] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han, "A survey on truth discovery," arXiv preprint arXiv:1505.02463, 2015.
[11] X. Yin, J. Han, and P. S. Yu, "Truth discovery with multiple conflicting information providers on the web," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 6, pp. 796–808, 2008.
[12] X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava, "Truth finding on the deep web: Is the problem solved?" The Proceedings of the VLDB Endowment (PVLDB), vol. 6, no. 2, pp. 97–108, 2012.
[13] S. Mukherjee, G. Weikum, and C. Danescu-Niculescu-Mizil, "People on drugs: Credibility of user statements in health communities," in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), 2014, pp. 65–74.
[14] D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, "On truth discovery in social sensing: A maximum likelihood estimation approach," in Proc. of the International Conference on Information Processing in Sensor Networks (IPSN'12), 2012, pp. 233–244.
[15] L. Su, Q. Li, S. Hu, S. Wang, J. Gao, H. Liu, T. Abdelzaher, J. Han, X. Liu, Y. Gao, and L. Kaplan, "Generalized decision aggregation in distributed sensing systems," in Proc. of the IEEE Real-Time Systems Symposium (RTSS'14), 2014, pp. 1–10.
[16] C. C. Aggarwal and T. Abdelzaher, "Social sensing," in Managing and Mining Sensor Data, 2013, pp. 237–297.
[17] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang, "From data fusion to knowledge fusion," The Proceedings of the VLDB Endowment (PVLDB), vol. 7, no. 10, pp. 881–892, 2014.
[18] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge vault: A web-scale approach to probabilistic knowledge fusion," in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), 2014, pp. 601–610.
[19] B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han, "A Bayesian approach to discovering truth from conflicting sources for data integration," The Proceedings of the VLDB Endowment (PVLDB), vol. 5, no. 6, pp. 550–561, 2012.
[20] R. Pochampally, A. D. Sarma, X. L. Dong, A. Meliou, and D. Srivastava, "Fusing data with correlations," in Proc. of the ACM SIGMOD International Conference on Management of Data (SIGMOD'14), 2014, pp. 433–444.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems (NIPS'13), 2013, pp. 3111–3119.
[22] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. of the International Conference on Machine Learning (ICML'08), 2008, pp. 160–167.
[23] A. Mnih and G. E. Hinton, "A scalable hierarchical distributed language model," in Advances in Neural Information Processing Systems (NIPS'09), 2009, pp. 1081–1088.
[24] Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han, "A confidence-aware approach for truth discovery on long-tail data," The Proceedings of the VLDB Endowment (PVLDB), vol. 8, no. 4, pp. 425–436, 2015.
[25] "Jieba, Chinese word segmentation package," https://github.com/fxsjy/jieba.
[26] "Kendall rank correlation coefficient," https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient.
[27] "Spearman's rank correlation coefficient," https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient.
[28] "Word2vec package," https://code.google.com/p/word2vec/.
[29] J. Jeon, W. B. Croft, J. H. Lee, and S. Park, "A framework to predict the quality of answers with non-textual features," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'06), 2006, pp. 228–235.
[30] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, "Finding high-quality content in social media," in Proc. of the ACM International Conference on Web Search and Data Mining (WSDM'08), 2008, pp. 183–194.
[31] C. Shah and J. Pomerantz, "Evaluating and predicting answer quality in community QA," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'10), 2010, pp. 411–418.
[32] H. Hu, B. Liu, B. Wang, M. Liu, and X. Wang, "Multimodal DBN for predicting high-quality answers in cQA portals," in Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL'13), 2013, pp. 843–847.
[33] T. Zhao, C. Li, M. Li, S. Wang, Q. Ding, and L. Li, "Predicting best responder in community question answering using topic model method," in Proc. of the International Joint Conferences on Web Intelligence and Intelligent Agent Technology (WI-IAT'12), 2012, pp. 457–461.
[34] L. Yang, M. Qiu, S. Gottipati, F. Zhu, J. Jiang, H. Sun, and Z. Chen, "CQArank: Jointly model topics and expertise in community question answering," in Proc. of the ACM Conference on Information and Knowledge Management (CIKM'13), 2013, pp. 99–108.
[35] Y. Liu, J. Bian, and E. Agichtein, "Predicting information seeker satisfaction in community question answering," in Proc. of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'08), 2008, pp. 483–490.
[36] L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman, "Knowledge sharing and Yahoo Answers: Everyone knows something," in Proc. of the International Conference on World Wide Web (WWW'08), 2008, pp. 665–674.
[37] A. Galland, S. Abiteboul, A. Marian, and P. Senellart, "Corroborating information from disagreeing views," in Proc. of the ACM International Conference on Web Search and Data Mining (WSDM'10), 2010, pp. 131–140.
[38] J. Pasternack and D. Roth, "Making better informed trust decisions with generalized fact-finding," in Proc. of the International Joint Conference on Artificial Intelligence (IJCAI'11), 2011, pp. 2324–2329.
[39] D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss, and M. Magdon-Ismail, "The wisdom of minority: Unsupervised slot filling validation based on multi-dimensional truth-finding," in Proc. of the International Conference on Computational Linguistics (COLING'14), 2014.

Yaliang Li is a Ph.D. student in the Department of Computer Science and Engineering at SUNY Buffalo. He received his B.S. degree from Nanjing University of Posts and Telecommunications in 2010. His research topics include truth discovery, text and web mining, privacy-preserving data mining, and data mining applications in healthcare.

Chaochun Liu is a Research Scientist in the Baidu Research Big Data Lab in Sunnyvale, California. He received his Ph.D. in Mathematics from Sun Yat-sen University, China, in 2009. After graduation, he worked as a Senior Research Associate at City University of Hong Kong, then as a Postdoctoral Researcher and Research Scientist at the New York State Department of Health. His research interests include deep learning, text mining, computer vision and bioinformatics. He has published 20+ papers in the fields of machine learning and bioinformatics.

Nan Du is a Research Scientist in the Baidu Research Big Data Lab in Sunnyvale, California. He received his Ph.D. degree from the Computer Science and Engineering department at the State University of New York at Buffalo, NY, supervised by Prof. Aidong Zhang. Prior to that, he received his M.S. degree from South China University of Technology in 2009. His research interests are in the areas of machine learning, natural language processing, health care and bioinformatics.

Wei Fan is currently the Senior Director and Head of the Baidu Research Big Data Lab in Sunnyvale, California. He received his Ph.D. in Computer Science from Columbia University in 2001. His main research interests and experience are in various areas of data mining and database systems, such as deep learning, stream computing, high performance computing, extremely skewed distributions, cost-sensitive learning, risk analysis, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, time series analysis, bioinformatics, social network analysis, novel applications and commercial data mining systems. His co-authored papers received ICDM'06/KDD'11/KDD'12/KDD'13/KDD'97 Best Paper and Best Paper Runner-up Awards. He led the team that used his Random Decision Tree (www.dice.com) method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD). During his time as Associate Director of the Huawei Noah's Ark Lab in Hong Kong, from August 2012 to December 2014, he led his colleagues to develop Huawei StreamSMART, a streaming platform for online and real-time processing, querying and mining of very fast streaming data. StreamSMART is 3 to 5 times faster than Storm and 10 times faster than Spark Streaming, and was used by Beijing Telecom, Saudi Arabia STC, Norway Telenor and a few other mobile carriers in Asia. Since joining the Baidu Big Data Lab, Wei has been working on medical and healthcare research and applications, such as deep learning-based disease diagnosis from NLP input as well as a medical dialogue robot.