
A Semi-supervised Clustering via Orthogonal Projection


Harbin Engineering University, Harbin 150001, China
cuipeng83@163.com

Harbin Engineering University, Harbin 150001, China
zrbzrb@hrbeu.edu.cn

Abstract—Because its dimensionality is very high, an image feature space is usually complex, and dimensionality reduction is widely used to process such spaces effectively. Semi-supervised clustering incorporates limited supervision into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method via constrained orthogonal projection. Experiments on several datasets show that the method has good clustering performance on high-dimensional data.

Keywords: supervised learning

I. INTRODUCTION

In recent years, the rapid growth of feature information and of the volume of image data has made many tasks in multimedia processing increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved very useful in image retrieval, classification and clustering. There are a number of dimensionality reduction techniques in the literature. One of the classical methods is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. One disadvantage of PCA is that it tends to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3-4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix, but LPP, on the other hand, may overlook global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6-10]. Various reduction techniques have been developed to utilize this form of knowledge [11-12]. The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

Because many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data; dimensionality reduction must therefore be added to the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

II. COPFC METHOD FRAMEWORK

Figure 1. COPFC framework

Figure 1 shows the framework of the COPFC method. Given a set of instances and supervision in the form of must-link constraints CML = {(xi, xj)}, where xi and xj must reside in the same cluster, and cannot-link constraints CCL = {(xi, xj)}, where xi and xj should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method reduces the unlabelled instances and pairwise constraints according to the transitivity property of must-link constraints. In the second step, a constraint-guided orthogonal projection method, called COPFCproj, is used to project the original data into a low-dimensional space. Finally, a semi-supervised fuzzy clustering algorithm, called COPFCfuzzy, produces the clustering results on the projected low-dimensional dataset.
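The first step's use of must-link transitivity can be illustrated with a short sketch. This is our own hypothetical helper, not the authors' code: it groups instances by the transitive closure of the must-link constraints, so each group can be treated as a unit in the later steps.

```python
# Hypothetical sketch of COPFC step 1: collapse must-link constraints by
# transitivity using union-find. Our own illustration, not the paper's code.

def must_link_groups(n, must_links):
    """Group instance indices 0..n-1 by the transitive closure of must-links."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in must_links:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Example: (0,1) and (1,2) imply that {0, 1, 2} must share one cluster.
print(must_link_groups(5, [(0, 1), (1, 2), (3, 4)]))  # → [[0, 1, 2], [3, 4]]
```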

III. COPFCPROJ - A CONSTRAINED ORTHOGONAL PROJECTION METHOD

In a typical image retrieval system, each image is represented by an m-dimensional feature vector x, whose jth value is denoted xj. During the retrieval process, the user is allowed to mark several images with must-links which match his query interest, and also to indicate with cannot-links those that do not. COPFCproj is a linear method and depends on a set of l axes pi. For a given image x, its embedding coordinates are the projections of x onto the l axes:

P_i^x = \sum_{j=1}^{m} x_j p_{ij},  1 ≤ i ≤ l.

Since the relevant images in ML are similar to each other, they should be kept compact in the new space. In other words, the distances among them should be kept small, while the irrelevant images in CL are to be mapped as far apart from those in ML as possible. These two criteria can be formally stated as follows:

min \sum_{x∈ML} \sum_{y∈ML} \sum_{i=1}^{l} (P_i^x − P_i^y)^2  (1)

max \sum_{x∈ML} \sum_{y∈CL} \sum_{i=1}^{l} (P_i^x − P_i^y)^2  (2)

Intuitively, equation (1) forces the embedding to have the image points in ML reside in a small local neighborhood in the new feature space, and equation (2) reflects our objective to prevent the points in ML and CL from lying close together after the embedding. To construct a salient embedding, COPFCproj combines these two criteria and finds the axes one by one by optimizing the following objective:

min \sum_{x∈ML} \sum_{y∈ML} (P_i^x − P_i^y)^2  (3)

subject to

\sum_{x∈ML} \sum_{y∈CL} (P_i^x − P_i^y)^2 = 1  (4)

p_i^T p_1 = p_i^T p_2 = ... = p_i^T p_{i−1} = 0  (5)

where T denotes the transpose of a vector. The choice of the constant 1 on the right-hand side of equation (4) is rather arbitrary, as any other value (except 0) would not cause any substantial change in the embedding produced. The constraint in equation (5) forces all the axes to be mutually orthogonal. Equations (3) and (4) are implicit functions of the axes pi and should be rewritten in explicit form. First, we introduce the necessary notation. For a given set X of image points, the mean of X is an m-dimensional column vector M(X), whose ith component is

M_i(X) = (1/|X|) \sum_{x∈X} x_i  (6)

and its covariance matrix C(X) is an m×m matrix:

C_{ij}(X) = (1/|X|) \sum_{x∈X} x_i x_j − M_i(X) M_j(X)  (7)

For two sets X and Y, define the m×m matrix M(X, Y) = (M(X) − M(Y))(M(X) − M(Y))^T. Accordingly, we can rewrite equation (3) as follows:

\sum_{x∈ML} \sum_{y∈ML} (P_i^x − P_i^y)^2 = 2|ML|^2 p_i^T C(ML) p_i  (8)

Similarly, we can rewrite equation (4) as follows:

\sum_{x∈ML} \sum_{y∈CL} (P_i^x − P_i^y)^2 = |ML||CL| p_i^T (C(ML) + C(CL) + M(ML, CL)) p_i  (9)

Hence, the problem to be solved is min p_i^T A p_i, subject to p_i^T B p_i = 1 and p_i^T p_1 = ... = p_i^T p_{i−1} = 0, where

A = 2|ML|^2 C(ML),  B = |ML||CL| (C(ML) + C(CL) + M(ML, CL)).

It is easy to see that both A and B are symmetric and positive semi-definite. The above problem can be solved with the method of Lagrange multipliers. Below we describe the procedure to obtain the optimal axes.

The first projection axis p_1 is the eigenvector of the generalized eigen-problem A p_1 = λ B p_1 corresponding to the smallest eigenvalue. After that, we compute the remaining axes one by one in the following fashion. Suppose we have already obtained the first (k−1) axes; define

P^{(k−1)} = [p_1, p_2, ..., p_{k−1}],
Q^{(k−1)} = [P^{(k−1)}]^T B^{−1} P^{(k−1)}  (10)

Then the kth axis p_k is the eigenvector associated with the smallest eigenvalue of the eigen-problem:

(I − B^{−1} P^{(k−1)} [Q^{(k−1)}]^{−1} [P^{(k−1)}]^T) B^{−1} A p_k = λ p_k  (11)

We adopt the above procedure to determine the l optimal orthogonal projection axes, which preserve the metric structure of the image space for the given relevance feedback information. The new coordinates for the image data points can then be derived accordingly.

IV. COPFCFUZZY SEMI-SUPERVISED CLUSTERING

COPFCfuzzy is a new search-based semi-supervised clustering algorithm that allows the constraints to guide the clustering process towards an appropriate partition. To this end, we define an objective function that takes into account both the feature-based similarity between data points and the pairwise constraints [14-16]. Let ML be the set of must-link constraints, i.e. (xi, xj) ∈ ML implies that xi and xj should be assigned to the same cluster, and CL the set of cannot-link constraints, i.e. (xi, xj) ∈ CL implies that xi and xj should be assigned to different clusters. We can write the objective function that COPFCfuzzy must minimize:

J(V, U) = \sum_{k=1}^{C} \sum_{i=1}^{N} (u_{ik})^2 d^2(x_i, μ_k)
        + λ ( \sum_{(x_i,x_j)∈ML} \sum_{k=1}^{C} \sum_{l=1, l≠k}^{C} u_{ik} u_{jl} + \sum_{(x_i,x_j)∈CL} \sum_{k=1}^{C} u_{ik} u_{jk} )
        − γ \sum_{k=1}^{C} [ \sum_{i=1}^{N} u_{ik} ]^2  (12)
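The axis-by-axis procedure of Section III can be sketched numerically. The following is a hedged illustration under our own assumptions (NumPy, B invertible, a dense eigensolver), not the authors' implementation:

```python
import numpy as np

# Sketch of equations (10)-(11): find l projection axes one by one.
def copfc_axes(A, B, l):
    m = A.shape[0]
    Binv = np.linalg.inv(B)                      # assumes B is invertible
    axes = []
    for k in range(l):
        if not axes:
            M = Binv @ A                         # first axis: B^-1 A p = lambda p
        else:
            P = np.column_stack(axes)            # P^(k-1)
            Q = P.T @ Binv @ P                   # Q^(k-1), equation (10)
            M = (np.eye(m) - Binv @ P @ np.linalg.inv(Q) @ P.T) @ Binv @ A  # eq. (11)
        vals, vecs = np.linalg.eig(M)
        p = np.real(vecs[:, np.argmin(np.real(vals))])   # smallest eigenvalue
        axes.append(p / np.linalg.norm(p))
    return np.column_stack(axes)                 # columns are p_1 ... p_l

# Toy data: A built as a covariance matrix, B = identity for simplicity.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
A = X.T @ X / 10
B = np.eye(4)
P = copfc_axes(A, B, 2)
print(P.shape)  # → (4, 2)
```

With B = I the first axis reduces to the smallest eigenvector of A; in general the quality of later axes depends on how the deflation in equation (11) interacts with the spectrum of B^{-1}A.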

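As a concrete reading of equation (12), the sketch below evaluates the COPFCfuzzy objective for a given fuzzy partition. The array names and shapes (U as an N×C membership matrix) are our own choices for illustration, not the authors' code:

```python
import numpy as np

def copfc_objective(X, centers, U, must_links, cannot_links, lam, gamma):
    """Evaluate equation (12). U[i, k] is the membership of instance i in cluster k."""
    # Fuzzy C-Means term: squared memberships times squared distances.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    j_fcm = ((U ** 2) * d2).sum()
    # Must-link violations: x_i and x_j placed in different clusters (k != l).
    ml = sum(np.outer(U[i], U[j]).sum() - U[i] @ U[j] for i, j in must_links)
    # Cannot-link violations: x_i and x_j placed in the same cluster k.
    cl = sum(U[i] @ U[j] for i, j in cannot_links)
    # Cluster-cardinality competition term.
    card = (U.sum(axis=0) ** 2).sum()
    return j_fcm + lam * (ml + cl) - gamma * card

# Crisp toy partition: two tight clusters, one satisfied ML and one satisfied CL pair.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
centers = np.array([[0., 0.5], [5., 5.5]])
U = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
J = copfc_objective(X, centers, U, [(0, 1)], [(1, 2)], lam=1.0, gamma=0.5)
print(J)  # → -3.0 (distance term 1.0, no violations, cardinality term 0.5 * 8)
```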

The first term in equation (12) is the sum of squared distances to the prototypes weighted by the constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters.

The second component in equation (12) is composed of the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints. This term is weighted by λ, a constant factor that specifies the relative importance of the supervision.

The third component in equation (12), the sum of the squares of the cardinalities of the clusters, controls the competition between clusters. It is weighted by γ.

When the parameters are well chosen, the final partition will minimize the sum of intra-cluster distances while partitioning the data set into the smallest number of clusters such that the specified constraints are respected as well as possible.

V. EXPERIMENTS

A. Dataset selection and evaluation criterion

We performed experiments on the COREL image database and on two datasets from UCI, as follows:

(1) We selected 1500 images from the COREL image database. They were divided into 15 sufficiently distinct classes of 100 images each. In our experiments, each image was represented by a 37-dimensional vector, which included 3 types of features extracted from the image. We compared the COPFCproj algorithm against PCA and SSDR. The performance of each technique was evaluated under various amounts of domain knowledge and different reduced dimensionalities. In each scenario, after the dimensionality reduction, Kmeans was applied to classify the test images.

(2) The Iris and Wine datasets from the UCI repository. The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains three classes, 178 instances, and 13 numerical attributes. The simplicity and low dimension of these datasets also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFCfuzzy, we compared the COPFCfuzzy algorithm against the Kmeans and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure:

CRI = (A − C) / (n × (n − 1)/2 − C)  (13)

where A is the number of instance pairs whose assigned clusters agree with the actual clusters; n is the number of instances in the dataset, so n × (n − 1)/2 is the number of all instance pairs in the dataset; and C is the number of constraints.

For each dataset, we ran each experiment 20 times. To study the effect of constraints, 100 constraints were generated randomly for the test set. Each point on the learning curves is an average of the results over the 20 runs.

B. The effectiveness of COPFC

In figure 2, we apply three different dimensionality reduction methods (COPFCproj, PCA, SSDR) to the original images, reducing the dimensionality to 15 and 20, respectively, and then use Kmeans to cluster the reduced-dimension data. The curves in figure 2 show that the clustering performance of PCA is independent of the number of constraints, while the performance of SSDR changes only slightly. For COPFCproj, clustering performance improves considerably as the number of constraints increases. When only a small number of constraints is available, the clustering performance of COPFCproj is the worst of the three methods. In general, however, COPFCproj outperforms PCA and SSDR for reducing dimensionality.

Figure 2. Clustering performance with different numbers of constraints

Figure 3 shows the clustering performance of three methods on the Iris and Wine datasets. On all datasets, COPFCfuzzy obtained the best performance, and Kmeans the worst of the three methods. Though the clustering performance of PCKmeans is effectively improved over Kmeans, it is still worse than that of COPFCfuzzy.

Figure 3. Clustering performance on UCI datasets: (a) Iris dataset, (b) Wine dataset

VI. CONCLUSION AND FUTURE WORK

We propose a semi-supervised fuzzy clustering via orthogonal projection to handle high-dimensional sparse data in image feature space. The method reduces the dimensionality of the images via orthogonal projection, and clusters the reduced-dimension data with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right number for the reduced dimensionality from the background knowledge, rather than relying on a pre-specified value. Second, we plan to explore alternative methods of employing supervision to guide the unsupervised clustering.
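The CRI of equation (13) can be computed directly from the definitions in the text. The sketch below is our own illustration; the pair-agreement count A is taken over all unordered instance pairs, as the text's n × (n − 1)/2 denominator implies:

```python
from itertools import combinations

def cri(pred, true, n_constraints):
    """Corrected Rand Index, equation (13)."""
    n = len(pred)
    # A: pairs on which predicted and actual partitions agree
    # (grouped together in both, or separated in both).
    a = sum((pred[i] == pred[j]) == (true[i] == true[j])
            for i, j in combinations(range(n), 2))
    return (a - n_constraints) / (n * (n - 1) / 2 - n_constraints)

# A perfect clustering (up to label renaming) with no constraints scores 1.0.
print(cri([0, 0, 1, 1], [1, 1, 0, 0], 0))  # → 1.0
```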


REFERENCES

[1] "…Dimensionality Reduction". In Proc. of the 23rd Intl. Conf. on Machine Learning, 2006.
[2] C. Ding and X. He. "K-Means Clustering via Principal Component Analysis". In Proc. of the 21st Intl. Conf. on Machine Learning, 2004.
[3] D. Cai and X. F. He. "Orthogonal Locality Preserving Projection". In Proc. of the 28th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005.
[4] X. F. He and P. Niyogi. "Locality Preserving Projections". Neural Information Processing Systems (NIPS '03), 2003.
[5] H. Cheng, K. Hua, and K. Vu. "Semi-Supervised Dimensionality Reduction in Image Feature Space". Technical Report, University of Central Florida, 2007.
[6] K. Wagstaff and C. Cardie. "Clustering with instance-level constraints". Proc. of the 17th Int'l Conf. on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 2000.
[7] S. Basu. "Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments". Austin: The University of Texas, 2005.
[8] S. Basu, A. Banerjee and R. J. Mooney. "Semi-supervised clustering by seeding". Proc. of the 19th Int'l Conf. on Machine Learning (ICML 2002), pp. 19−26.
[9] K. Wagstaff, C. Cardie and S. Rogers. "Constrained K-means clustering with background knowledge". Proc. of the 18th Int'l Conf. on Machine Learning. Williamstown: Williams College, Morgan Kaufmann Publishers, 2001, pp. 577−584.
[10] D. Klein, S. D. Kamvar and C. D. Manning. "From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering". Proc. of the 19th Int'l Conf. on Machine Learning. University of New South Wales, Sydney: Morgan Kaufmann Publishers, 2002, pp. 307−314.
[11] T. Hertz, N. Shental and A. Bar-Hillel. "Enhancing image and video retrieval: Learning via equivalence constraints". Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. Madison: IEEE Computer Society, 2003, pp. 668−674.
[12] T. Deselaers, D. Keysers, and H. Ney. "Features for Image Retrieval – a Quantitative Comparison". In Pattern Recognition, 26th DAGM Symposium, 2004.
[13] D. Zhang, Z. H. Zhou, and S. Chen. "Semi-Supervised Dimensionality Reduction". In Proc. of the 2007 SIAM Intl. Conf. on Data Mining (SDM '07), 2007.
[14] N. Grira, M. Crucianu and N. Boujemaa. "Semi-supervised fuzzy clustering with pairwise-constrained competitive agglomeration". IEEE International Conference on Fuzzy Systems, 2005.
[15] H. Frigui and R. Krishnapuram. "Clustering by competitive agglomeration". Pattern Recognition 30(7), 1997, pp. 1109–1119.
[16] M. Bilenko and R. J. Mooney. "Adaptive duplicate detection using learnable string similarity measures". International Conference on Knowledge Discovery and Data Mining, Washington, DC, 2003, pp. 39–48.
