Abstract—Because its dimensionality is very high, the image feature space is usually complex. Dimensionality reduction techniques are widely used to process this space effectively. Semi-supervised clustering incorporates limited supervision into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method via constrained orthogonal projection. Experiments on different datasets show that the method has good clustering performance on high-dimensional data.

I. INTRODUCTION

In recent years, because of the fast growth of feature information and the volume of image data, many tasks in multimedia processing have become increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved very useful in image retrieval, classification and clustering. There are a number of dimensionality reduction techniques in the literature. One of the classical methods is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. One of the disadvantages of PCA is that it is likely to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3-4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix; LPP, on the other hand, may overlook the global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6-10]. Various reduction techniques have been developed to utilize this form of knowledge [11-12]. The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

Because many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data. Thus, dimensionality reduction must be incorporated into the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

II. COPFC METHOD FRAMEWORK

Figure 1. COPFC framework

Figure 1 shows the framework of the COPFC method. Given a set of instances and supervision in the form of must-link constraints CML = {(xi, xj)}, where xi and xj must reside in the same cluster, and cannot-link constraints CCL = {(xi, xj)}, where xi and xj should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method is exploited to reduce the unlabelled instances and pairwise constraints according to the transitivity property of must-link constraints. In the second step, a constraint-guided orthogonal projection method, called COPFCproj, is used to project the original data into a low-dimensional space. Finally, we apply a semi-supervised fuzzy clustering algorithm, called COPFCfuzzy, to produce the clustering results on the projected low-dimensional dataset.
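The first step relies on the transitivity of must-link constraints: if xi must link with xj and xj with xk, then xi must link with xk. The paper does not give this routine explicitly; as a minimal sketch (function and variable names are ours), the transitive closure can be computed with a union-find structure:

```python
# Hypothetical sketch of COPFC's first step: group must-linked instances
# into "chunklets" via the transitive closure of the must-link constraints.
def transitive_closure(n_instances, must_links):
    parent = list(range(n_instances))

    def find(i):
        # Path-compressing find: walk up to the root representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in must_links:
        union(i, j)

    # Group instance indices by their root representative.
    chunklets = {}
    for i in range(n_instances):
        chunklets.setdefault(find(i), []).append(i)
    return list(chunklets.values())
```

For example, transitive_closure(5, [(0, 1), (1, 2)]) groups instances 0, 1 and 2 together and leaves 3 and 4 as singletons, reducing five instances to three units for the subsequent projection and clustering steps.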
III. COPFCPROJ - A CONSTRAINED ORTHOGONAL PROJECTION METHOD

In a typical image retrieval system, each image is represented by an m-dimensional feature vector x whose jth value is denoted as x_j. During the retrieval process, the user is allowed to mark several images with must-links which match his query interest, and also to indicate those that do not.

For a set of instances X, the mean of the ith feature is

M_i(X) = \frac{1}{|X|} \sum_{x \in X} x_i    (6)

and its covariance matrix C(X) is an m \times m matrix:

C_{ij}(X) = \frac{1}{|X|} \sum_{x \in X} x_i x_j - M_i(X) M_j(X)    (7)

For two sets X and Y, define an m \times m matrix M(X, Y), in which

M(X, Y) = (M(X) - M(Y))(M(X) - M(Y))^T.

Accordingly, we can rewrite equation (3) as follows:

\sum_{x \in ML} \sum_{y \in ML} (P_i x - P_i y)^2 = 2 p_i^T \left( |ML|^2 C(ML) \right) p_i    (8)

Similarly, we can rewrite equation (4) as follows:

\sum_{x \in ML} \sum_{y \in CL} (P_i x - P_i y)^2 = p_i^T \left( |ML| |CL| \left( C(ML) + C(CL) + M(ML, CL) \right) \right) p_i    (9)

The COPFCfuzzy algorithm minimizes the following objective function:

J(V, U) = \sum_{k=1}^{C} \sum_{i=1}^{N} (u_{ik})^2 d^2(x_i, \mu_k) + \lambda \left( \sum_{(x_i, x_j) \in ML} \sum_{k=1}^{C} \sum_{l=1, l \neq k}^{C} u_{ik} u_{jl} + \sum_{(x_i, x_j) \in CL} \sum_{k=1}^{C} u_{ik} u_{jk} \right) - \gamma \sum_{k=1}^{C} \left[ \sum_{i=1}^{N} u_{ik} \right]^2    (12)
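The statistics in equations (6)-(8) can be sketched directly; the following is our own illustration (function names are not from the paper), including a numerical check that the pairwise projected scatter within a set equals the quadratic form on the right-hand side of equation (8):

```python
import numpy as np

# Sketch of the quantities behind equations (6)-(8): per-feature means,
# the (biased) covariance matrix C(X), and the mean-difference outer
# product M(X, Y). Names are ours, chosen to mirror the paper's symbols.
def mean_vec(X):
    # Equation (6): M_i(X) = (1/|X|) * sum over x in X of x_i.
    return X.mean(axis=0)

def cov_mat(X):
    # Equation (7): C_ij(X) = (1/|X|) * sum x_i x_j - M_i(X) M_j(X).
    M = mean_vec(X)
    return (X.T @ X) / len(X) - np.outer(M, M)

def mean_diff_outer(X, Y):
    # M(X, Y) = (M(X) - M(Y))(M(X) - M(Y))^T.
    d = mean_vec(X) - mean_vec(Y)
    return np.outer(d, d)

# Numerical check of the identity in equation (8): for a direction p,
# the sum of squared projected pairwise differences within X equals
# 2 * p^T (|X|^2 C(X)) p.
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))
p = rng.normal(size=3)
lhs = sum((p @ x - p @ y) ** 2 for x in X for y in X)
rhs = 2 * p @ (len(X) ** 2 * cov_mat(X)) @ p
assert np.isclose(lhs, rhs)
```

The check holds because the double sum over ordered pairs expands to 2|X| times the projected second moment minus twice the squared projected sum, which is exactly 2|X|^2 times the projected variance.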
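The objective in equation (12) can be evaluated directly. Below is a minimal sketch of our own (not the paper's implementation), assuming d is the Euclidean distance; u is the N x C membership matrix, mu holds the C prototypes, and ML / CL are lists of index pairs:

```python
import numpy as np

# Sketch of evaluating the COPFCfuzzy objective in equation (12).
# Helper and parameter names are ours; d^2 is assumed squared Euclidean.
def copfc_objective(X, u, mu, ML, CL, lam, gamma):
    N, C = u.shape
    # First term: squared distances to prototypes, weighted by u_ik^2.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x C
    j_fcm = ((u ** 2) * d2).sum()
    # Second term: must-link violations (members in different clusters)
    # and cannot-link violations (members in the same cluster).
    j_ml = sum(u[i, k] * u[j, l]
               for (i, j) in ML
               for k in range(C) for l in range(C) if l != k)
    j_cl = sum(u[i, k] * u[j, k] for (i, j) in CL for k in range(C))
    # Third term: squared cluster cardinalities, weighted by gamma.
    j_card = (u.sum(axis=0) ** 2).sum()
    return j_fcm + lam * (j_ml + j_cl) - gamma * j_card
```

With crisp memberships the second term reduces to counting violated constraints, which is why lambda acts as a per-violation penalty, while the negative gamma term rewards concentrating mass into fewer clusters.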
The first term in equation (12) is the sum of squared distances to the prototypes, weighted by the constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters.

The second component in equation (12) is composed of the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints. This term is weighted by λ, a constant factor that specifies the relative importance of the supervision.

The third component in equation (12) is the sum of the squares of the cardinalities of the clusters, which controls the competition between clusters. It is weighted by γ.

When the parameters are well chosen, the final partition will minimize the sum of intra-cluster distances, while partitioning the data set into the smallest number of clusters such that the specified constraints are respected as well as possible.

The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains three classes, 178 instances, and 13 numerical attributes. The simplicity and low dimension of this dataset also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFCfuzzy, we compared the COPFCfuzzy algorithm against the Kmeans and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure:

CRI = \frac{A - C}{n(n-1)/2 - C}    (13)

where A is the number of instance pairs whose assigned cluster agrees with the actual cluster; n is the number of instances in the dataset, so n(n-1)/2 is the number of instance pairs in the dataset; and C is the number of constraints.

For each dataset, we ran each experiment 20 times. To study the effect of constraints, 100 constraints were generated randomly for the test set. Each point on the learning curve is an average of the results over 20 runs.

Figure 3. Clustering performance on UCI datasets: (a) Iris dataset, (b) Wine dataset (CRI versus number of constraints)

B. The effectiveness of COPFC

In figure 2, we use three different dimensionality reduction methods (COPFCproj, PCA, SSDR) on the original images. The dimensionalities are reduced to 15 and 20, respectively. For the data of reduced dimension, we used Kmeans for clustering. The curves in figure 2 show that the clustering performance of the PCA method is independent of the number of constraints, while the clustering performance of SSDR changes only slightly. For COPFCproj, clustering performance improves substantially as the number of constraints increases. When there are only a small number of constraints, the clustering performance of COPFCproj is the worst of the three methods. In general, COPFCproj outperforms PCA and SSDR for dimensionality reduction.

VI. CONCLUSION AND FUTURE WORK

We propose a semi-supervised fuzzy clustering method via orthogonal projection to handle high-dimensional sparse data in the image feature space. The method reduces the dimensionality of images via orthogonal projection, and clusters the data of reduced dimensionality with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right number for the reduced dimensionality based on background knowledge, rather than providing a pre-specified value. Second, we plan to explore alternative methods to employ supervision in guiding the unsupervised clustering.
REFERENCES