Wenqing Sun
wsun2@miners.utep.edu
Jessica Rebollosa
rebollosa.jr@gmail.com
Marcus Gutierrez
mgutierrez22@miners.utep.edu
lewaldrop@miners.utep.edu
ABSTRACT
ClinicalTrials.gov [1] houses information regarding clinical trials
that are currently underway. In addition to information about the
background, purpose, and design of a specific clinical trial, the
webpages also provide links to affiliated papers that can be found
in PubMed [2] (a repository of citations in biomedical research).
These links are explicit, but implicit links between clinical trials
and publications are likely to exist as well. For example, a researcher
may want to know whether a given clinical trial is related to more
publications than just the ones listed on the clinical trial webpage.
This relation could be the result of similar key terms embedded
within the clinical trial webpages and PubMed abstracts. By
using a dependent clustering algorithm [3] and a novel Naïve Bayes
approach for heterogeneous datasets, we aim to give
scientists in the biological community insight not only into related
terms, but also into clinical trials and publications that may
not have explicit links.
Keywords
Clustering
2. DATA DESCRIPTION
The data for this project consists of terms found in webpages from
ClinicalTrials.gov and abstracts in PubMed. The data is organized
into two weighted term-document matrices: one for data gathered
from ClinicalTrials.gov, and the other for PubMed abstracts.
Each matrix row is associated with a single document, while each
column is associated with a term found within the corpus.
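As a concrete illustration, a weighted term-document matrix of this kind can be built with standard TF-IDF weighting; the three-document toy corpus below is a made-up stand-in for the scraped pages, not the paper's data:

```python
import numpy as np

# Hypothetical toy corpus standing in for the real ClinicalTrials.gov
# pages and PubMed abstracts.
docs = [
    "tumor growth trial",
    "tumor drug response",
    "drug trial outcome",
]

# Vocabulary defines the columns; documents define the rows.
vocab = sorted({t for d in docs for t in d.split()})
tf = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for t in d.split():
        tf[i, vocab.index(t)] += 1

# Standard TF-IDF weighting: idf(t) = log(N / df(t)).
df = (tf > 0).sum(axis=0)
tfidf = tf * np.log(len(docs) / df)

print(vocab)
print(tfidf.round(3))
```

Terms that occur in many documents receive lower weights, which is why a TF-IDF cutoff can later be used to prune uninformative terms.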
3. METHODOLOGY
In this project, we take two different but complementary
approaches. The first involves a dependent clustering
algorithm, developed by Dr. M. Shahriar Hossain, used to
find implicit relationships and similar terms between the clinical
trials dataset and the PubMed abstracts dataset. The second
involves the development of a novel technique for
classifying documents from heterogeneous datasets via Naïve
Bayes. A short description of both approaches follows.
The first step of the dependent clustering algorithm is to
separately assign the vectors in each of the two datasets to clusters via
k-means. The second step prepares contingency tables
based on the clustering results and the pre-existing relationships
between the two datasets. Finally, each contingency table
is evaluated by minimizing a cost function such that the
relationships in one cluster of the clinical trials dataset are
exclusive to a single cluster in the PubMed dataset. These
steps are repeated until convergence.
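The first two steps above can be sketched as follows. The minimal k-means routine, the random matrices, and the link list are all illustrative stand-ins, and the cost-minimization step that re-arranges clusters is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means; returns a cluster label per row of X."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# Hypothetical stand-ins for the two weighted term-document matrices.
A = rng.random((30, 5))   # e.g. clinical-trial documents
B = rng.random((40, 5))   # e.g. PubMed abstracts

# Hypothetical explicit links: pairs (row in A, row in B).
links = [(i, i) for i in range(30)]

k = 3
la, lb = kmeans(A, k), kmeans(B, k)

# Contingency table: entry (p, q) counts links joining
# cluster p of A to cluster q of B.
C = np.zeros((k, k), dtype=int)
for i, j in links:
    C[la[i], lb[j]] += 1
print(C)
```

The cost function would then reward contingency tables whose rows concentrate their mass in a single column, i.e. links from one A-cluster landing in only one B-cluster.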
Finally, Naïve Bayes classification will be applied to the
heterogeneous dataset. This algorithm uses training data to
predict the classes of new data entries. In this context, Naïve
Bayes classification will either reinforce existing links or suggest
new links that may better cluster the data and provide insight into
the architecture of the dataset. The result may also further
advance the data preprocessing phase by adding previously
implied links, further connecting the two datasets.
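As a sketch of the classification step only (the term counts, class labels, and Laplace smoothing choice here are illustrative, not the paper's full link-prediction pipeline), multinomial Naïve Bayes over term-count vectors looks like:

```python
import numpy as np

# Training documents as term-count vectors, with a hypothetical
# class label per document (e.g. which linked group it belongs to).
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 2],
              [1, 3, 3]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])
# Laplace-smoothed per-class term likelihoods P(t | c).
lik = np.array([(X[y == c].sum(0) + 1) / (X[y == c].sum() + X.shape[1])
                for c in classes])

def predict(x):
    # argmax over c of log P(c) + sum_t x_t * log P(t | c)
    scores = np.log(priors) + (x * np.log(lik)).sum(1)
    return classes[scores.argmax()]

print(predict(np.array([4, 0, 1])))  # term profile resembling class 0
```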
4. RESULTS
4.1 Term Elimination via Term Variance
In the initial phases of our work, we wanted to see how our data
would cluster using the implementation of dependent clustering
authored by Dr. M. Shahriar Hossain. Using cluster sizes
of 2, 3, 4, and 5, we found that a majority of the documents
(over 98%) clustered into one group, both before and after
dependent clustering. It's important to note that during this phase,
the datasets used were generated with a TF-IDF cut-off value of
0.3. Only much later in the data exploration process did it
prove beneficial to decrease this threshold for both
datasets (more on this later).
K-means tends to place roughly the same number of instances in
each cluster, and since our preliminary results deviated
significantly from this tendency, the number of terms in each
dataset was reduced. To do this, terms in both the PubMed and
Clinical Trials datasets were eliminated by leveraging the variance
of each term. Variance is a good measure of the discriminative
power of a term. Low variance indicates that either the term is not
present in any of the documents, or the term is present in most or
all of the documents. Thus, terms with low variance do not carry
any discriminatory power.
Figure 1. Term Variance Plots for both the PubMed dataset (left)
and the Clinical Trials dataset (right).
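The variance-based filtering idea can be sketched as follows; the small matrix and the threshold values are made up for illustration (the actual thresholds were determined experimentally):

```python
import numpy as np

# Hypothetical weighted term-document matrix: column 0 appears in
# every document, column 1 in none, column 2 varies across documents.
X = np.array([[1.0, 0.0, 0.2],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.9],
              [1.0, 0.0, 0.1]])

var = X.var(axis=0)          # per-term (column) variance
lo, hi = 1e-4, 1.0           # illustrative thresholds, not the paper's values
keep = (var > lo) & (var < hi)
X_reduced = X[:, keep]
print(var.round(4), X_reduced.shape)
```

Columns 0 and 1 have zero variance (present everywhere or nowhere) and are dropped, leaving only the term that actually discriminates between documents.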
Using these plots, a set of minimum and a set of maximum
variance thresholds were established for each dataset. Once the
variance thresholds were determined, k-means was run with
different combinations of minimum and maximum thresholds.
The best combination of thresholds was determined by
investigating the percentage of documents in the largest cluster.
The combination of thresholds that reduced this percentage the
most was considered to be the best. Figure 2 shows the results of
these variance threshold experiments. Figures 2a and 2b illustrate
the results of the experiments using a TF-IDF cutoff of 0.3, while
Figures 2c and 2d illustrate the results using a
TF-IDF cutoff of 0.01 for both the PubMed and Clinical Trials
datasets. Combinations of minimum and maximum thresholds are
represented by a combination of an index along the x-axis
(maximum threshold) and a color of a line (minimum threshold).
From the graphs in Figure 2, it was determined that for the
PubMed dataset the ideal maximum threshold is 1.5 × 10^-3
and the ideal minimum threshold is 1.2 × 10^-4. These thresholds
effectively reduce the number of terms in the PubMed dataset
from 6,640 to 2,057. For the Clinical Trials dataset, the ideal
maximum threshold is 1.2 × 10^-3 and the ideal minimum threshold
is 5.0 × 10^-5. These thresholds reduce the number of terms in the
Clinical Trials dataset from 7,599 to 4,393. Figure 3 demonstrates
where these thresholds lie on the variance graphs
displayed in Figure 1. It's important to point out that in
Figure 2 the overall percentage of documents in the largest cluster
is lower for the datasets with a TF-IDF threshold of 0.01 than
for the datasets with a TF-IDF threshold of 0.3. Moving
forward, the datasets with TF-IDF thresholds of 0.01 will be used
for further testing.
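The threshold search just described can be sketched as a small grid search; the candidate thresholds, the random data, and the cluster count below are arbitrary stand-ins for the experimental values:

```python
import numpy as np

rng = np.random.default_rng(1)

def largest_cluster_pct(X, k=4, iters=15):
    """Fraction of documents landing in the biggest k-means cluster."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return np.bincount(labels, minlength=k).max() / len(X)

X = rng.random((60, 20))                 # hypothetical term-document matrix
var = X.var(0)
best = None
for lo in (1e-4, 1e-3):                  # candidate minimum thresholds
    for hi in (0.05, 0.08, 0.12):        # candidate maximum thresholds
        keep = (var > lo) & (var < hi)
        if keep.sum() < 2:
            continue                     # too few terms survive this combo
        pct = largest_cluster_pct(X[:, keep])
        if best is None or pct < best[0]:
            best = (pct, lo, hi)
print(best)                              # (largest-cluster %, min, max)
```

The combination that most reduces the largest-cluster percentage is kept, mirroring the selection criterion used for Figure 2.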
To rule out terms with both low and exceptionally high variance
in each dataset, document-term matrices were generated, the
variance of each term was calculated, the terms were ranked by
decreasing variance, and the results were plotted. Figure 1
illustrates the resulting plots for both datasets.
all the term frequencies from document C from corpus A and lists
the classification as document 5. This idea would allow for the
prediction of new links and grant new information from
documents in corpus A to link to documents in corpus B. Note:
this algorithm is unidirectional; if predicted links are desired
for corpus B, the two tables would need to be swapped.
Two large criticisms plagued this approach. The first is that
the terms in corpus B went unused; this is a large
amount of information, simply wasted, that could help
predict links more accurately. The second is that links
can only be predicted for documents in corpus B that are already
linked. Perhaps document C from corpus A would, based on content,
link strongly with document 17 in corpus B, which currently
has no links. In that case, the link could not be predicted because
document 17 was never present in the training phase of the
algorithm, disqualifying it from the testing phase. A new
algorithm that uses some of the same principles but addresses both
criticisms had to be considered.
P(d_B | d_A) = Σ_{i=1}^{n} [ P(t_i) × Π_{j=1}^{m} P(t_i | s_j) ]

where t_1, …, t_n are the terms of the corpus B document d_B and
s_1, …, s_m are the terms of the corpus A document d_A.
The extended tables generated from this algorithm tower over the
extended table from the first algorithm in terms of size. However,
classing by corpus B terms does not allow one simply to apply
Naïve Bayes classification, because the goal is to link documents
from corpus A to documents in corpus B, not to link documents
from corpus A to terms. To manage this, the probabilities of terms
in documents from corpus B, given terms from documents in corpus
A, are calculated from the explicit links. The probability
of document 17, given document C, can then be expressed as the sum,
over the terms of document 17, of each term's probability multiplied
by the product of its probabilities given each term in document C.
The resulting document-to-document probabilities are shown in
Figure 10.
Figure 10. Document conditional probabilities. Documents A-Z
are from corpus A and Documents 1-6 are from corpus B. Entry
(doc A, doc 1) represents P(doc 1| doc A).
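A minimal sketch of this document-to-document score follows; the probability tables here are made up for illustration, standing in for values estimated from the explicit links:

```python
# Sum over the terms t of the corpus B document of P(t) times the
# product over the terms s of the corpus A document of P(t | s).
# Both tables below are invented, not estimated from real data.
p_t = {"gene": 0.3, "assay": 0.7}                  # P(t), corpus B terms
p_t_given_s = {                                     # P(t | s)
    ("gene", "tumor"): 0.4, ("gene", "trial"): 0.5,
    ("assay", "tumor"): 0.2, ("assay", "trial"): 0.1,
}

def doc_given_doc(terms_b, terms_a):
    total = 0.0
    for t in terms_b:
        prod = p_t[t]
        for s in terms_a:
            prod *= p_t_given_s[(t, s)]
        total += prod
    return total

score = doc_given_doc(["gene", "assay"], ["tumor", "trial"])
print(round(score, 4))
```

Computing this score for every (corpus A, corpus B) document pair yields a matrix of the kind shown in Figure 10, where high entries suggest implicit links.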
A toy dataset was created to test this new algorithm. The toy
dataset included terms that appeared in every document, documents
without explicit links, terms that appeared in only one document,
and more. This variety was added in the hope that it would lead to
interesting results, and it did: implicit links were discovered for
documents that had no explicit links, and, even more interestingly,
the algorithm suggested a correlation between document C and
document 5 even though no explicit link between them existed. The
results can be seen in Figure 10, with red squares indicating
explicit links in the training data and higher numbers representing
stronger correlation.
Displaying the key terms for each cluster may reveal significant links
that have not previously been explored. For example, if key
words related to a biological pathway and a drug target show up in
the same cluster but their relationship has never been investigated,
then our users may have found something worth exploring. By
depicting a prototype associated with each cluster, we hope
to give our users a quick and stress-free way to find the
cluster most related to their current work. In this way, a
researcher could quickly home in on their specific cluster of
interest and begin investigating the individual components of that
cluster. Finally, once researchers have found their cluster of
interest, they would be able to explore the documents within that
cluster.
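One way to sketch such a cluster prototype (with an invented vocabulary and invented centroids in place of the real clustering output) is to list each centroid's top-weighted terms:

```python
import numpy as np

# Hypothetical vocabulary and cluster centroids; in practice the
# centroids would come from k-means over the term-document matrix.
vocab = np.array(["pathway", "inhibitor", "dose", "cohort", "receptor"])
centroids = np.array([[0.9, 0.7, 0.1, 0.0, 0.6],
                      [0.1, 0.0, 0.8, 0.9, 0.2]])

def top_terms(centroid, n=3):
    """Return the n highest-weighted terms of a centroid."""
    return list(vocab[np.argsort(centroid)[::-1][:n]])

for c in centroids:
    print(top_terms(c))
```

A researcher scanning these term lists could immediately tell which cluster touches their area, e.g. pathway/inhibitor work versus trial-design terms.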
7. ACKNOWLEDGMENTS
8. REFERENCES
[1] ClinicalTrials.gov, A service of the U.S. National Institutes
of Health. Retrieved February 9, 2015 from:
https://clinicaltrials.gov/
[2] PubMed.gov, US National Library of Medicine, National
Institutes of Health. Retrieved February 9, 2015 from:
http://www.ncbi.nlm.nih.gov/pubmed