Professional Documents
Culture Documents
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 105
Abstract— Protein sequence clustering has been widely used as a part of the analysis of protein structure and function. In most cases
single link or graph-based clustering algorithms have been applied. Clustering of proteins using SEQOPTICS (sequence clustering with
OPTICS), which is based on OPTICS (Ordering Points To Identify the Clustering Structure),is an attractive approach due to its emphasis
on visualization of results and support for interactive work such as choosing parameters. Ordering Points To Identify the Clustering
Structure can be implemented for Visualization of the sequence clustering structure. The system was evaluated by comparison with other
existing methods.
—————————— ——————————
1. INTRODUCTION
contain most protein sequences and are very popular in
XTRACTING useful information from biological biological research.
sequence. This approach may yield fairly good results, alignments and Hidden Markov Models(HMMs). Pfam
but often a majority of sequences are grouped into one multiple alignments come in two forms. In the first,
single huge cluster resulting from a massive chain “seed” alignments are representative, non-redundant
effect due to multi-domain proteins. sets of sequences that are checked reasonably carefully
Blastclust program, one part of BLAST in a manual alignment editor.
package from NCBI, is an example of single linkage
protein sequence clustering. Another category, graph-
Pairwise
based clustering algorithms, are also commonly
Distance
employed due to the clustering quality. BAG[13] is a
Computing
sequence clustering algorithms based on graph theory.
OPTICS (Ordering Points To Identify the
Clustering Structure)[14] is a density-based clustering
method and is popular because it orders the data into a
density-based clustering structure corresponding to a Data Sets SEQOPTIC
broad range of parameter settings. For density-based Extraction S Clustering
methods, it is difficult to decide the input parameters
that the algorithm is sensitive to. OPTICS is a good
solution to density-based cluster ordering. Although it
does not produce clusters explicitly, OPTICS generates Result
an augmented ordering of data sets representing its Analysis
density-based clustering structure, and this structure
can be visualized. Since OPTICS does not limit cluster Figure. 1 SEQOPTICS Overview
extraction to global parameters, it is possible to extract
cluster information interactively as well as
SEQOPTICS expands the use of OPTICS, a
automatically.
method that has not been used in protein sequence
For any protein sequences clustering method a
analysis. Figure 1 shows the overview of our method.
suitable distance measure needs to be chosen. Some
First, data sets are extracted from data sources (mostly
functionally related sequences share little or no
protein databases), then mixed and randomized. Three
discernible sequence similarity and detection of these
data sources are Pfam, Swiss-Prot and NCBI. Secondly,
relationships is difficult. The general practice to carry
the pairwise distances between any two proteins are
out protein sequence clustering is based on pair-wise
computed. Here we used a normalized Smith-
sequence similarity/dissimilarity computed by
Waterman score. Several other options may be chosen,
algorithms such as Smith-Waterman[15]. Some other
such as BLAST or FASTA, for distance measure. Then
protein distance measurement such as BLAST[9],
the OPTICS algorithm is adopted to execute the
FASTA[16] are also very commonly taken in existing
clustering and the clustering structure is graphically
methods.
presented. Finally the clustering results are analyzed
How to evaluate clustering results quality is
and compared to some other methods by Jaccard
an important issue in clustering analysis. For two-
coefficient.
dimensional data, it is clear that we can plot the data
and read the distribution to tell how good the
clustering results are. But in high dimension data or 2.1 DATA SETS
sequence clustering, direct visualization is normally The alignments are HMM-generated automatic
not feasible. In protein sequence clustering, a popular alignments of every homologous domain [1]. Two other
measure of clustering quality is based on how well the data sets are from NCBI and Swiss-Prot separately.
clusters identified by the clustering algorithm match Each protein sequence is labelled by its original
the protein families defined in some database by notation. This labelling defines the assumed “true”
biological experts[8]. Another method is to compare clusters.
SEQOPTICS with some existing methods by using
certain validation techniques [17].
For example, if a sequence is extracted with
“IGA1” from NCBI, then it is labelled as “IGA1” and
2 METHODS assumed to be in “IGA1” cluster. The size of each data
Four data sets are extracted from different protein set ranges from 197 to 319 sequences just for testing our
repositories as shown in table 1. Two of them are from method.For each data set, protein sequences from
Pfam since it is a protein families database of
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 12, DECEMBER 2010, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 107
different families are mixed and randomized to understanding how OPTICS works: An object p is in the
exclude manual clustering. ε- neighborhood of q if the distance from p to q is less
than ε; A core object has at least MinPts neighbors in its
TABLE 1 ε-neighborhood. The reachability distance of p is the
PROTEIN SEQUENCE DATA SETS smallest distance such that p is density-reachable from
a core object o. A cluster is a set of density-connected
objects which is maximal with respect to density-
reachability.
A reachability plot is a bar chart that shows
each object’s reachability distance in the order the object
was processed which demonstrates the cluster structure
of data. The final clusters can be extracted by either ε-
cutoff or steepness of the plot. For more detailed
information about OPTICS algorithm, please refer the
original paper[14].
SEQOPTICS is implemented with a distance
measure of sequences based on Smith-Waterman
Note: The number in parenthesis is the number of algorithm. The core OPTICS part was tested with the
sequences in each family
data sets from the author. Two parameters need to be
chosen, ε and MinPts. In this paper, since the distance
2.2 COMPUTING DISTANCE between any two protein sequences is between 0 and 1,
Our approach, consonant with others, starts with a we can use a single ε for all data set, for example, set ε
distance measure. When data sets are from different as 0.99, which is slightly smaller than 1. The MinPts
protein families, it is a common practice to use a used here is 10 just for experiment purposes. For the
normalized pairwise local alignment score by Smith- whole protein database, ε can still use any value
Waterman dynamic programming algorithm. There between 0.95 to 0.99, MinPts should be set as the
are several parameters in Smith-Waterman, for average number of sequences in a family.
example, scoring matrix, open gap penalty and There are two main advantages to apply
extending gap penalty. Different scoring matrices OPTICS in protein sequences clustering analysis: 1)
including BLOSUM50 and PAM250 have been tried. OPTICS can find the local density region; 2) OPTICS
BLOSUM50, which is also used in FASTA[16], is used produces an augmented ordering of the database
in this paper. The open gap penalty taken is 12 and the representing its density based clustering structure and
extending gap penalty is 2. The final similarity score this ordering can be visualized, for example, in
between two protein sequences is then normalized by reachability plot. The cluster ordering actually contains
the following: information about every cluster, i.e., OPTICS enables
S ( a , b) the extraction of not only “traditional” cluster
SN (a, b) information, but also intrinsic clustering structure.
Min( S (a, a ), S (b, b))
where S(a, b) is the Smith-Waterman local alignment
score between two sequences a and b; S(a, a) is the
3 RESULTS
score of sequence a to itself; S(b, b) is the score of SEQOPTICS is used to cluster the data sets. These
sequence b to itself; and SN(a, b) is the normalized provide some clues about clustering structure. The final
score. The distance between two protein sequences is density-based cluster sets are defined from the ordering
defined as: Distance (a, b) = 1 − SN (a, b); reachability distance. To judge the resulting clustering
set’s biological accuracy, we need to compare it to a
With this normalization, every distance score is “true” cluster set. However, there is no generally
between 0 and 1. If other scoring methods are used accepted “true” cluster set.
instead of Smith-Waterman, the distance measure All automatical protein clustering methods are
needs to be adjusted appropriately. based on “all against all” sequence comparison. Real
clusters need to be verified by biological expertise.
Since it is impossible to have “real” clustering, we have
2.3 OPTICS CLUSTERING to assume the original database clusters are the “real”
Some preliminary remarks on OPTICS have been given clusters. That is the way of most automatic protein
in the introduction. Some definitions might help us clustering does.
3.1 VISUALIZATION OF THE CLUSTER STRUCTURE a is “true positive”, i.e., the number of
sequence pairs clustered together in both sets,
A reachability distance plot was made for each data
which can be define as :
set. In the figure, the horizontal axis represents the
ordering of each sequence, the vertical axis represents a ( i , j ) M ij 1 Tij 1, i j
the reachability distance, and each valley stands for a b is “false negative”, i.e., the number of
cluster set. For data set 1, there are five valleys in sequence pairs clustered together in the true
Figure 2: The first two valleys are composed of cluster set, but not in the current
sequences from cytochrom B562; The third valley clustersolution defined as:
consists of sequences from glucokinase; The fourth b ( i , j ) M ij 0 Tij 1, i j
valley contains sequences from GABAR family; The
fifth valley are sequences from bac globin family. c is “false positive”, i.e., the number of
sequence pairs clustered in the current
solution, but not in the true cluster set, defined
3.2 EXTRACTION OF THE CLUSTERS as:
The final density-based clusters were extracted by c ( i , j ) M ij 1 Tij 0 , i j
using a cutoff value. For example, in Figure 2, the d is “true negative”, i.e., the number of
cutoff value is set as 0.860 (the line reachability sequence pairs not clustered in either current
distance = 0.860). Under this cutoff condition, each solution or the true cluster set, defined as :
valley in Figure 2 between two sequences with
d ( i , j ) M ij 0 Tij 0 , i j
reachability distance higher than the cutoff is a cluster.
The sequence starting a valley with reachability There are many validation techniques in references[18].
distance higher than the cutoff is also in the same Three parameter based on the above are as follows:
cluster as rest sequences in the valley. Precision is defined as:
a
P ------ (1)
(a c)
Recall is defined as:
a
R ------ (2)
(a b)
Jaccard Coefficient is defined as:
a
S ------ (3)
(a b c)
All three values range between 0 and 1.The
better the clustering, the bigger the values. In a perfect
clustering which is identical to the true cluster, P = 1, R
= 1, and S = 1. Most existing sequence clustering
Figure 2. Cluster Structure of Date Set 1: Pfam (Protein
Familiy) methods perform well in terms of Precision but not in
Recall. This is shown in later validation result in Table 2
Any sequence with reachability distance along with additional comparative values.
higher than the cutoff is noise if it does not start a
valley. Therefore, in Figure 2, there are four clusters The clusters were formed from the same data
give the cutoff value 0.860. Similarly, there are four sets with two other clustering methods, blastclust [9]
clusters in Figure 3 given cutoff 0.745, six clusters in and BAG [13], using default parameters of these
Figure 4 given cutoff 0.860. methods. BAG is a graph based clustering method and
most existing methods are based on graph clustering.
Blastclust is chosen because it is from NCBI blast
3.3 VALIDATION OF THE CLUSTER SET package. This hierarchical sequence clustering method
A cluster set of n data points from experiment can be is very popular in biological research.
n * ( n 1)
represented by m values in a triangular
2 Using the Jaccard Coefficient, Precision and
matrix M, where for i < j, Mi j = 1, if and only if i and j Recall, we developed comparison values shown in the
are in the same cluster and Mi j = 0 otherwise. If T is Table 2. From the Table 2, we see that SEQOPTICS
the matrix of the “true” clusters, two cluster sets produces good results relative to each original cluster
(“true” and “experimental”) can be compared based on set in terms of Jaccard Coefficient. Every SEQOPTICS
the following numbers: Jaccard Coefficient is higher than 0.65 and the highest
being 0.85. It is also seen in the table that SEQOPTICS adequacy of our approach for small-scale data and the
outperforms BAG and blastclust on all the data sets usefulness of the cluster structure visualization.
chosen on Jaccard Coefficient. According to Ankerst[14], one good feature of OPTICS
is that it is unnecessary to limit oneself to a single set of
TABLE 2 global parameters. An augmented cluster ordering
COMPARISON OF CLUSTERING RESULTS OF BAG, contains information equivalent to density based
BLASTCLUST AND SEQOPTICS METHODS clusterings corresponding to a broad range of
parameter settings; as such, the cluster ordering is a
versatile base for both automatic and interactive cluster
analysis. A second good feature lies in the visualization
of the data set distribution. Depending on data set size,
one can either represent the cluster-ordering
graphically for a small data set, or, employ an alternate
technique (appropriate) for large data sets. In this
paper, we demonstrated in SEQOPTICS the
visualization of cluster structure is meaningful.
SEQOPTICS has proved its value for small data sets
(<1000 sequences)in this paper. If we are applying this
method to a large data set, such as whole database,
The performance with BAG exceeds blastclust future improvements are necessary to make it more
for the same reason. However, BAG and blastclust useful. We may list specifically: 1) use some other
tends to give more clusters than the “true” clusters, distance measure for protein sequence distance, e.g.,
explaining why the Precision of those two methods on BLAST or FASTA; 2) apply parallel computing tools, for
all data sets are 1. But neither of these two performs example, Message Passing Interface(MPI) for large data
well in terms of Recall. Overall, SEQOPTICS performs sets; 3) implement visualization techniques for large
better than BAG and blastclust and seems a promising data sets; 4) consider incremental cluster ordering
method in terms of both clustering quality coupled algorithms since protein databases are very frequently
with its graphical representation of clustering being updated.
structure.
REFERENCES
4 EVALUATION OF THE EXECUTION TIME
[1] Alex Bateman, Ewan Birney, Lorenzo Cerruti, Richard Durbin, The
The time complexity of Smith-Waterman is O(n2l2), Pfam Protein Families Database. Nucl. Acids. Res., 30(1):276–280,
where n is the number of sequences and l is the 2002.
average length of the sequence. The time complexity of [2] Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James
OPTICS is O(n2). Therefore the total time complexity is Ostell, and David L. Wheeler. Gen- Bank: update. Nucl. Acids. Res.,
O(n2l2). This is an expensive method if Smith- 32(90001):D23–26, 2004.
Waterman is the only choice of the distance measure. [3] Amos Bairoch and Rolf Apweiler. The swiss-prot protein sequence
Fortunately there are some other options for the database and its supplement trembl in 2000. Nucl. Acids. Res.,
distance between two protein sequences, such as 28(1):45–48, 2000.
BLAST or FASTA which will dramatically decrease the [4] Cathy H. Wu, Hongzhan Huang, Leslie Arminski, Jorge Castro-
overall time complexity. Alvear,. The Protein Information Resource: an integrated public
resource of functional annotation of proteins. Nucl. Acids. Res.,
30(1):35–37,2002.
5 CONCLUSION
[5] Evgenia V Kriventseva, Margaret Biswas, and Rolf Apweiler.
In this paper a programmed system, SEQOPTICS, for Clustering and analysis of protein families. Current Opinion in
protein sequences clustering as shown in Figure 1 is Structural Biology, 11(3):334– 339, 2001.
described. A core portion(phase) of the system is based [6] S. Mohseni-Zadeh, P. Brezellec, and J. L. Risler. Cluster-c, an
on OPTICS clustering and visualization method, algorithm for the large-scale clustering of protein sequences based on
which we believe is being used here for the first time in the extraction of maximal cliques. Computational Biology and
protein sequence clustering. Prior to this phase, it is Chemistry, 28:211–218, 2004.
necessary to compute the distance between (protein) [7] P. Pipenbacher, A. Schliep, S. Schneckener, A. Schonhuth, D.
sequences. A normalized Smith-Waterman score is Schomburg, and R. Schrader. ProClust: improved clustering of
used in this paper to compute the required distance. protein sequences with an extended graph-based approach.
The final system phase, Results Analysis, demonstrates Bioinformatics, 18(90002):182S–191, 2002.