2009 ISECS International Colloquium on Computing, Communication, Control, and Management

A Semi-supervised Clustering via Orthogonal Projection

Cui Peng
Harbin Engineering University
Harbin 150001, China

Zhang Ru-bo
Harbin Engineering University
Harbin 150001, China

Abstract-Because image feature spaces are very high-dimensional, they are usually complex. To process such spaces effectively, dimensionality reduction is widely used. Semi-supervised clustering incorporates limited supervisory information into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method via constrained orthogonal projection. Experimental results on different datasets show that the method achieves good clustering performance on high-dimensional data.

Keywords-dimension reduction; clustering; projection; semi-supervised learning

In recent years, because of the rapid growth of feature information and of the volume of image data, many tasks in multimedia processing have become increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved to be very useful in image retrieval, classification and clustering. There are a number of dimensionality reduction techniques in the literature. One of the classical methods is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. One of the disadvantages of PCA is that it is likely to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3-4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix, but LPP, on the other hand, may overlook the global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6-10]. Various reduction techniques have been developed to utilize this form of knowledge [11-12]. The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

Because many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data. Thus, feature reduction must be incorporated into the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

Figure 1. COPFC framework

Figure 1 shows the framework of the COPFC method. Given a set of instances and a set of supervision in the form of must-link constraints $C_{ML} = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ must reside in the same cluster, and cannot-link constraints $C_{CL} = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method is exploited to reduce the unlabelled instances and pairwise constraints according to the transitivity property of must-link constraints. In the second step, a constraint-guided orthogonal projection method, called COPFCproj, is used to project the original data into a low-dimensional space. Finally, we apply a semi-supervised fuzzy clustering algorithm, called COPFCfuzzy, to produce the clustering results on the projected low-dimensional dataset.
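The paper does not spell out the preprocessing procedure of the first step. One plausible reading, shown here only as a minimal Python sketch, groups instances with a union-find structure so that must-links become transitive, and then extends each cannot-link to every pair drawn from the two linked groups. The function name `close_constraints` and the integer indexing of instances are assumptions made for illustration, not part of COPFC as published.

```python
from itertools import combinations


def close_constraints(n, must_link, cannot_link):
    """Return (components, ml_closed, cl_closed) for instances 0..n-1."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every must-link pair (transitivity of must-links).
    for i, j in must_link:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    components = sorted(groups.values())

    # Every pair inside a group is an implied must-link.
    ml_closed = {pair for g in components for pair in combinations(g, 2)}

    # A cannot-link between two instances extends to their whole groups.
    cl_closed = set()
    for i, j in cannot_link:
        ri, rj = find(i), find(j)
        if ri == rj:
            raise ValueError(f"inconsistent constraints on ({i}, {j})")
        for a in groups[ri]:
            for b in groups[rj]:
                cl_closed.add((min(a, b), max(a, b)))
    return components, ml_closed, cl_closed


components, ml, cl = close_constraints(5, [(0, 1), (1, 2)], [(2, 3)])
print(components)
print(sorted(ml))
print(sorted(cl))
```

In this toy call, instances 0, 1 and 2 collapse into one group, so the single cannot-link (2, 3) also separates instances 0 and 1 from instance 3. Each group could then be represented by one point in the later steps, which is one way to "reduce" the instances and constraints.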

978-1-4244-4246-1/09/$25.00 ©2009 IEEE CCCM 2009

III. COPFCPROJ - A CONSTRAINED ORTHOGONAL PROJECTION METHOD

In a typical image retrieval system, each image is represented by an $m$-dimensional feature vector $x$ whose $j$th value is denoted as $x_j$. During the retrieval process, the user is allowed to mark several images that match his query interest with must-links, and also to indicate those that are apparently irrelevant with cannot-links. COPFCproj is a linear method and depends on a set of $l$ axes $p_i$. For a given image $x$, its embedding coordinates are the projections of $x$ onto the $l$ axes, which are

$$P_i^x = \sum_{j=1}^{m} x_j p_{ij}, \quad 1 \le i \le l.$$

As the images in the set ML are considered mutually similar to each other, they should be kept compact in the new space. In other words, the distances among them should be kept small, while the irrelevant images in CL are to be mapped as far apart from those in ML as possible. The above two criteria can be formally stated as follows:

$$\min \sum_{x \in ML} \sum_{y \in ML} \sum_{i=1}^{l} (P_i^x - P_i^y)^2 \tag{1}$$

$$\max \sum_{x \in ML} \sum_{y \in CL} \sum_{i=1}^{l} (P_i^x - P_i^y)^2 \tag{2}$$

Intuitively, equation (1) forces the embedding to have the image points in ML reside in a small local neighborhood in the new feature space, and equation (2) reflects our objective to prevent the points in ML and CL from lying close together after the embedding. To construct a salient embedding, COPFCproj combines these two criteria and finds the axes one by one, each optimizing the following objective:

$$\min \sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2 \tag{3}$$

subject to

$$\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = 1 \tag{4}$$

$$p_i^T p_1 = p_i^T p_2 = p_i^T p_3 = \dots = p_i^T p_{i-1} = 0 \tag{5}$$

Here $T$ denotes the transpose of a vector. The choice of the constant 1 on the right-hand side of equation (4) is rather arbitrary, as any other value (except 0) would not cause any substantial change in the embedding produced. The constraint in equation (5) forces all the axes to be mutually orthogonal. Equations (3) and (4) are implicit functions of the axes $p_i$ and should be rewritten in explicit form. First, we introduce the necessary notation. For a given set $X$ of image points, the mean of $X$ is an $m$-dimensional column vector $M(X)$, whose $i$th component is

$$M_i(X) = \frac{1}{|X|} \sum_{x \in X} x_i \tag{6}$$

and its covariance matrix $C(X)$ is an $m \times m$ matrix:

$$C_{ij}(X) = \frac{1}{|X|} \sum_{x \in X} \bigl( x_i x_j - M_i(X) M_j(X) \bigr) \tag{7}$$

For two sets $X$ and $Y$, define an $m \times m$ matrix $M(X,Y)$ as

$$M(X,Y) = (M(X) - M(Y))(M(X) - M(Y))^T.$$

Accordingly, we can rewrite equation (3) as follows:

$$\sum_{x \in ML} \sum_{y \in ML} (P_i^x - P_i^y)^2 = 2\, p_i^T \bigl( |ML|^2\, C(ML) \bigr) p_i \tag{8}$$

Similarly, we can rewrite equation (4) as follows, with $X = ML$ and $Y = CL$:

$$\sum_{x \in ML} \sum_{y \in CL} (P_i^x - P_i^y)^2 = p_i^T \bigl( |ML|\,|CL|\, ( C(X) + C(Y) + M(X,Y) ) \bigr) p_i \tag{9}$$

Hence, the problem to be solved is $\min p_i^T A p_i$, subject to $p_i^T B p_i = 1$ and $p_i^T p_1 = \dots = p_i^T p_{i-1} = 0$, where

$$A = 2\,|ML|^2\, C(ML), \qquad B = |ML|\,|CL|\, ( C(X) + C(Y) + M(X,Y) ).$$

It is easy to see that both $A$ and $B$ are symmetric and positive semi-definite. The above problem can be solved using the method of Lagrange multipliers. Below we discuss the procedure to obtain the optimal axes.

The first projection axis is the eigenvector of the generalized eigen-problem $A p_1 = \lambda B p_1$ corresponding to the smallest eigenvalue. After that, we compute the remaining axes one by one in the following fashion. Suppose we have already obtained the first $(k-1)$ axes, and define:

$$P^{(k-1)} = [p_1, p_2, \dots, p_{k-1}], \qquad Q^{(k-1)} = [P^{(k-1)}]^T B^{-1} P^{(k-1)} \tag{10}$$

Then the $k$th axis $p_k$ is the eigenvector associated with the smallest eigenvalue of the eigen-problem:

$$\bigl( I - B^{-1} P^{(k-1)} [Q^{(k-1)}]^{-1} [P^{(k-1)}]^T \bigr) B^{-1} A\, p_k = \lambda p_k \tag{11}$$

We adopt the above procedure to determine the optimal $l$ orthogonal projection axes, which can preserve the metric structure of the image space for the given relevance feedback information. The new coordinates for the image data points can then be derived accordingly.

COPFCfuzzy is a new search-based semi-supervised clustering algorithm that allows the constraints to guide the clustering process towards an appropriate partition. To this end, we define an objective function that takes into account both the feature-based similarity between data points and the pairwise constraints [14-16]. Let ML be the set of must-link constraints, i.e., $(x_i, x_j) \in ML$ implies that $x_i$ and $x_j$ should be assigned to the same cluster, and CL the set of cannot-link constraints, i.e., $(x_i, x_j) \in CL$ implies that $x_i$ and $x_j$ should be assigned to different clusters. We can write the objective function that COPFCfuzzy must minimize as:

$$J(V,U) = \sum_{k=1}^{C} \sum_{i=1}^{N} (u_{ik})^2 d^2(x_i, \mu_k) + \lambda \left( \sum_{(x_i,x_j) \in ML} \sum_{k=1}^{C} \sum_{l=1,\, l \neq k}^{C} u_{ik} u_{jl} + \sum_{(x_i,x_j) \in CL} \sum_{k=1}^{C} u_{ik} u_{jk} \right) - \gamma \sum_{k=1}^{C} \left[ \sum_{i=1}^{N} u_{ik} \right]^2 \tag{12}$$
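Equation (12) can be evaluated directly once memberships and prototypes are given. The following Python sketch assumes that $d$ is the Euclidean distance (the text does not fix the metric) and that the memberships are stored as an N-by-C nested list `U`; the function name `copfc_objective` and its argument names are hypothetical and chosen only for illustration.

```python
def copfc_objective(X, centers, U, must_link, cannot_link, lam, gamma):
    """Evaluate J(V, U) of equation (12) for points X and prototypes centers."""
    n, c = len(X), len(centers)

    # First term: squared distances to prototypes, weighted by u_ik squared.
    # Assumption: d is the Euclidean distance.
    compactness = sum(
        U[i][k] ** 2 * sum((a - b) ** 2 for a, b in zip(X[i], centers[k]))
        for i in range(n)
        for k in range(c)
    )

    # Must-link cost: membership mass placing x_i and x_j in different clusters.
    ml_cost = sum(
        U[i][k] * U[j][l]
        for i, j in must_link
        for k in range(c)
        for l in range(c)
        if l != k
    )

    # Cannot-link cost: membership mass placing x_i and x_j in the same cluster.
    cl_cost = sum(U[i][k] * U[j][k] for i, j in cannot_link for k in range(c))

    # Third term: sum over clusters of the squared fuzzy cardinality.
    cardinality = sum(sum(U[i][k] for i in range(n)) ** 2 for k in range(c))

    return compactness + lam * (ml_cost + cl_cost) - gamma * cardinality


X = [[0.0], [1.0], [10.0]]
centers = [[0.5], [10.0]]
must_link, cannot_link = [(0, 1)], [(1, 2)]

respecting = [[1, 0], [1, 0], [0, 1]]   # satisfies both constraints
violating = [[1, 0], [0, 1], [0, 1]]    # breaks both constraints

j_good = copfc_objective(X, centers, respecting, must_link, cannot_link, 1.0, 0.5)
j_bad = copfc_objective(X, centers, violating, must_link, cannot_link, 1.0, 0.5)
print(j_good, j_bad)
```

With these toy values the partition that respects both constraints scores lower than the one that violates them, both because its clusters are more compact and because the penalty weighted by λ vanishes.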

The first term in equation (12) is the sum of squared distances to the prototypes weighted by constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters.

The second component in equation (12) is composed of the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints. This term is weighted by λ, a constant factor that specifies the relative importance of the supervision.

The third component in equation (12) is the sum of the squares of the cardinalities of the clusters, which controls the competition between clusters. It is weighted by γ.

When the parameters are well chosen, the final partition will minimize the sum of intra-cluster distances, while partitioning the data set into the smallest number of clusters such that the specified constraints are respected as well as possible.

A. Dataset selection and evaluation criterion

We performed experiments on the COREL image database and 2 datasets from UCI as follows:

(1) We selected 1500 images from the COREL image database. They were divided into 15 sufficiently distinct classes of 100 images each. In our experiments, each image was represented by a 37-dimensional vector, which included 3 types of features extracted from the image. We compared the COPFCproj algorithm against PCA and SSDR. The performance of each technique was evaluated under various amounts of domain knowledge and different reduced dimensionalities. In each scenario, after the dimensionality reduction, Kmeans was applied to cluster the test images.

(2) Iris and Wine datasets from the UCI repository. The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains 178 instances in three classes and 13 numerical attributes. The simplicity and low dimension of these data sets also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFCfuzzy, we compared the COPFCfuzzy algorithm against the Kmeans and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure:

$$\mathrm{CRI} = \frac{A - C}{n \times (n-1)/2 - C} \tag{13}$$

where $A$ is the number of instance pairs whose assigned clusters agree with the actual clusters; $n$ is the number of all instances in the dataset, so $n \times (n-1)/2$ is the number of all instance pairs in the dataset; and $C$ is the number of all constraints.

For each dataset, we run each experiment 20 times. To study the effect of constraints, 100 constraints are generated randomly for the test set. Each point on the learning curve is an average of results over 20 runs.

B. The effectiveness of COPFC

In Figure 2, we apply three different dimensionality reduction methods (COPFCproj, PCA, SSDR) to the original images. The dimensionality is reduced to 15 and 20, respectively. For the data of reduced dimension, we used Kmeans for clustering. The curves in Figure 2 show that the clustering performance of PCA is independent of the number of constraints, whereas the clustering performance of SSDR changes only slightly. For COPFCproj, the clustering performance improves substantially as the number of constraints increases. When there are only a small number of constraints, the clustering performance of COPFCproj is the worst of the three methods. In general, COPFCproj outperforms PCA and SSDR for reducing dimensionalities.

Figure 2. Clustering performance with different number of constraints (panels (a) and (b); horizontal axis: number of constraints)

Figure 3 shows the clustering performance of three methods on the Iris and Wine datasets. On both datasets, COPFCfuzzy obtains the best performance. Of the three methods, Kmeans performs worst. Although the clustering performance of PCKmeans is effectively improved, it is still worse than that of COPFCfuzzy.

Figure 3. Clustering performance on UCI datasets ((a) Iris dataset, (b) Wine dataset; horizontal axis: number of constraints)

VI. CONCLUSION AND FUTURE WORK

We propose a semi-supervised fuzzy clustering method via orthogonal projection to handle high-dimensional sparse data in image feature space. The method reduces the dimensionality of images via orthogonal projection, and clusters the reduced-dimensionality data with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right number for the reduced dimensionality based on the background knowledge rather than providing a pre-specified value. Second, we plan to explore alternative methods to employ supervision in guiding the unsupervised clustering.
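As a supplementary illustration of the evaluation measure in equation (13), the Python sketch below counts agreeing instance pairs and applies the correction by the number of constraints. It reads A as the number of pairs on which the clustering and the true labels agree about same-cluster versus different-cluster membership, which is an interpretation of the definition given with the equation; the function name `corrected_rand_index` is hypothetical.

```python
from itertools import combinations


def corrected_rand_index(true_labels, pred_labels, n_constraints):
    """Compute (A - C) / (n(n-1)/2 - C) as in equation (13)."""
    n = len(true_labels)
    total_pairs = n * (n - 1) // 2

    # A: pairs on which both labelings agree, i.e. the pair is placed
    # together in both or apart in both.
    agreements = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in combinations(range(n), 2)
    )
    return (agreements - n_constraints) / (total_pairs - n_constraints)


cri = corrected_rand_index([0, 0, 1, 1], [0, 0, 1, 0], n_constraints=1)
perfect = corrected_rand_index([0, 0, 1], [1, 1, 0], n_constraints=0)
print(cri, perfect)
```

In the first call, three of the six pairs agree and one constraint is discounted, giving (3 - 1) / (6 - 1) = 0.4. The second call shows that the measure depends only on pair relationships and not on the label names, so a relabeled but otherwise identical partition scores 1.0.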


[1] X. Yang, H. Fu and H. Zha. “Semi-Supervised Nonlinear
Dimensionality Reduction”. In Proc. of the 23rd Intl. Conf. on
Machine Learning, 2006.
[2] C. Ding and X. He. “K-Means Clustering via Principal Component
Analysis”. In Proc. of the 21st Intl. Conf. on Machine Learning, 2004.
[3] D. Cai and X. F. He. “Orthogonal Locality Preserving Projection”. In
Proc. of the 28th Intl. ACM SIGIR Conf. on Research and
Development in Information Retrieval, 2005.
[4] X. F. He and P. Niyogi. “Locality Preserving Projections”. Neural
Information Processing Systems. NIPS ’03, 2003.
[5] H. Cheng, K. Hua, and K. Vu. “Semi-Supervised Dimensionality
Reduction in Image Feature Space”. Technical Report, University of
Central Florida, 2007.
[6] Wagstaff K and Cardie C. “Clustering with instance-level
constraints”. Proc. of the 17th Int’l Conf. on Machine Learning. San
Francisco: Morgan Kaufmann Publishers, 2000.
[7] S. Basu. “Semi-supervised Clustering: Probabilistic Models,
Algorithms and Experiments”. Austin: The University of Texas, 2005.
[8] S. Basu , A. Banerjee and R.J. Mooney, “Semi-supervised clustering
by seeding”. Proceedings of the 19th Int’l Conf. on Machine Learning
(ICML 2002). 19−26
[9] Wagstaff K, Cardie C and Rogers S. “Constrained K-means clustering
with background knowledge”. Proc. of the 18th Int’l Conf. on
Machine Learning. Williamstown: Williams College, Morgan
Kaufmann Publishers, 2001. 577−584.
[10] Klein D, Kamvar SD and Manning CD. “From instance-Level
constraints to space-level constraints: Making the most of prior
knowledge in data clustering”. In Proc. of the 19th Int’l Conf. on
Machine Learning. University of New South Wales. Sydney: Morgan
Kaufmann Publishers, 2002. 307−314.
[11] Hertz T, Shental N and Bar-Hillel A. “Enhancing image and video
retrieval: Learning via equivalence constraint”. Proc. of the IEEE
Conf. on Computer Vision and Pattern Recognition. Madison: IEEE
Computer Society, 2003. pp.668−674.
[12] T. Deselaers, D. Keysers, and H. Ney. “Features for Image Retrieval
– a Quantitative Comparison”. In Pattern Recognition, 26th DAGM
Symposium, 2004.
[13] D. Zhang, Z. H. Zhou, and S. Chen. “Semi-Supervised
Dimensionality Reduction”. In Proc. of the 2007 SIAM Intl. Conf. on
Data Mining. SDM ’07, 2007.
[14] N. Grira, M. Crucianu, N. Boujemaa. “Semi-supervised fuzzy
clustering with pairwise-constrained competitive agglomeration”, in:
IEEE International Conference on Fuzzy Systems, 2005.
[15] H. Frigui, R. Krishnapuram. “Clustering by competitive
agglomeration”, Pattern Recognition 30 (7), 1997, 1109–1119.
[16] M. Bilenko, R.J. Mooney. “Adaptive duplicate detection using
learnable string similarity measures”. in: International Conference on
Knowledge Discovery and Data Mining, Washington, DC, 2003, pp.