Xin Rui
HeBei Electric Power Research Institute, Shijiazhuang 050021, China

Duo Chunhong
Institute of Computer, North China Electric Power University, Baoding 071003, China
duochunhong@163.com
Authorized licensed use limited to: Jabalpur Engineering College. Downloaded on December 25, 2008 at 06:17 from IEEE Xplore. Restrictions apply.
to obtain new clustering centers. The larger the number of data points is, the more running time is spent. The time complexity should be analyzed in order to widen the application scope of the k-means algorithm. This paper improves the k-means algorithm with sampling: the choice of initial points and the adjustment after each iteration are performed on the sample data, so the convergence speed is greatly improved.

2.2. DBSCAN Algorithm

First, the DBSCAN algorithm selects an arbitrary object o from the database, then retrieves all objects that are density-reachable from o with respect to Eps and MinPts. If the number of objects in the Eps-neighborhood of o is larger than MinPts, then o is marked as a core object and the objects in its Eps-neighborhood are expanded in the next step; if not, o is marked as "noise". If o is a core object, there is a cluster C with respect to Eps and MinPts, which is determined by any core object in it. Region querying and cluster expansion are executed repeatedly until the cluster is complete [6].

The DBSCAN algorithm treats each high-density area as a cluster, discovers clusters of arbitrary shape, and effectively handles noise in spatial databases [7-8]. However, it has the following disadvantages:

① It requires large volumes of memory and a lot of I/O costs when dealing with large-scale databases, because it operates directly on the entire database;
② It is sensitive to its input parameters, and different parameter values may lead to different clustering results;
③ Because Eps and MinPts are global variables in the DBSCAN algorithm, clustering quality degrades when the cluster densities and the distances between clusters are uneven. If we choose a small value of Eps to suit the dense clusters, the number of data points in the neighborhoods of sparse clusters may be smaller than MinPts, so those points are wrongly marked as noise.

3. The improved DBSK Algorithm

DBSCAN operates directly on the whole data set. When the number of data points is very large, it requires large volumes of memory and incurs substantial I/O costs [9]. If the dataset is partitioned according to some rules before processing, the I/O costs are decreased and the consumed time is reduced. Because the data distribution may be uneven, several local datasets are obtained. The number of data points in each local dataset is smaller than in the initial dataset, and local parameter values can be chosen according to each local dataset's characteristics, so the clustering result is better. However, one large cluster may be partitioned into two different local datasets, or a data point that belongs to one local dataset may be placed into another and marked as "noise", so the clustering results should be post-processed in order to clear up the influence of the partition.

The DBSK algorithm works as follows. Firstly, it optimizes the K-means algorithm with sampling and partitions the dataset. Secondly, it computes the value of MinPtsi and applies the DBSCAN algorithm to each local dataset. Thirdly, it merges the clustering results of the local datasets and obtains the clustering result of the entire dataset.

3.1. Determining the value of the local parameter MinPtsi

The following method is used to determine each local parameter: choose a fixed value of Eps, then calculate MinPtsi for each local data set:

MinPtsi = (EpsVolumei / totalVolumei) × Ni

where Ni is the number of data points in the local data set whose cluster center is Ii, and totalVolumei represents the volume of the super-cuboid whose cluster center is Ii [10]. EpsVolumei takes different values according to the dimensionality of the data:

One-dimensional data: EpsVolumei = 2 × Eps
Two-dimensional data: EpsVolumei = π × Eps²
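As an illustration of the formula above, the local parameter can be computed directly from a local data set. The following is a minimal sketch; the function names and the use of the axis-aligned bounding box to estimate the super-cuboid volume are assumptions, not taken from the paper:

```python
import math

def eps_volume(eps, dim):
    # EpsVolume for the dimensionalities covered in the paper:
    # 1-D: 2 * Eps (an interval), 2-D: pi * Eps^2 (a disc)
    if dim == 1:
        return 2.0 * eps
    if dim == 2:
        return math.pi * eps ** 2
    raise ValueError("the paper only covers 1-D and 2-D data")

def total_volume(points):
    # Volume of the axis-aligned super-cuboid enclosing the local data set
    vol = 1.0
    for d in range(len(points[0])):
        coords = [p[d] for p in points]
        vol *= max(coords) - min(coords)
    return vol

def min_pts_i(points, eps):
    # MinPtsi = (EpsVolumei / totalVolumei) * Ni
    n_i = len(points)
    dim = len(points[0])
    return eps_volume(eps, dim) / total_volume(points) * n_i
```

For example, 121 points spread uniformly over a 10 × 10 square with Eps = 1 give MinPtsi = π × 121/100 ≈ 3.8, roughly the expected number of neighbors inside one Eps-disc.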
395
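The partitioning phase described in Section 3 (sample the data, cluster the sample with K-means, then join the remaining points to the nearest center so that each cluster becomes a local data set) might be sketched as follows. The 30% default sampling rate matches the experiment in Section 5; the random initialization and the Euclidean distance are assumptions:

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two points given as tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100):
    # Standard K-means on the sample: assign each point to the nearest
    # center, then recompute each center as the mean of its cluster.
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist(p, centers[j]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

def partition(points, k, rate=0.3, seed=0):
    # Cluster a sample, then join every point to the nearest center;
    # each resulting group is one local data set.
    random.seed(seed)
    sample = random.sample(points, max(k, int(rate * len(points))))
    centers = kmeans(sample, k)
    local_sets = [[] for _ in range(k)]
    for p in points:
        j = min(range(k), key=lambda j: dist(p, centers[j]))
        local_sets[j].append(p)
    return centers, local_sets
```

Because the expensive center adjustments run only on the sample, the cost per iteration drops roughly in proportion to the sampling rate, which is the speed-up claimed in Section 2.1.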
Authorized licensed use limited to: Jabalpur Engineering College. Downloaded on December 25, 2008 at 06:17 from IEEE Xplore. Restrictions apply.
3.2. The merger of each local data set's clustering results

For two clusters A and B from different local data sets, with local parameters MinPtsA and MinPtsB, and with a a point of cluster A adjacent to a point b of cluster B, the merger is decided as follows:
(1) If b is density-connected from a with respect to Eps and MinPtsA, and a is not density-connected from b with respect to Eps and MinPtsB, then MinPts = MinPtsA; unite cluster A and cluster B. This is shown in Figure 1 and Figure 2;
(2) If a is density-connected from b with respect to Eps and MinPtsB, and b is not density-connected from a with respect to Eps and MinPtsA, then MinPts = MinPtsB; unite cluster A and cluster B;
(3) If b is density-connected from a with respect to Eps and MinPtsA, and a is density-connected from b with respect to Eps and MinPtsB, then MinPts = min{MinPtsA, MinPtsB}; unite cluster A and cluster B.

4. The description of the DBSK algorithm

… if D(xi, Zk(I)) = min{ D(xi, Zj(I)), j = 1, 2, 3, …, k } is satisfied, then xi ∈ wk; compute the convergence function

Jc(I) = Σ(j=1..k) Σ(i=1..nj) || xi(j) − Zj(I) ||²

and the k new cluster centers

Zj(I+1) = (1/nj) Σ(i=1..nj) xi(j), j = 1, 2, 3, …, k,

then return to (2); join the rest of the points into the nearest cluster, so that each cluster is a local data set;
Step4 do the following operations for each local data set: calculate each local parameter MinPtsi, with EpsVolumei = π × Eps².
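The three merger cases of Section 3.2 reduce to a single selection rule for the MinPts of the united cluster. This minimal sketch illustrates it; the boolean flags are assumptions standing in for the density-connectivity tests, which the paper performs with DBSCAN region queries:

```python
def merged_minpts(conn_under_a, conn_under_b, minpts_a, minpts_b):
    # Decide the MinPts of the union of clusters A and B, following
    # the three cases of Section 3.2:
    #   conn_under_a: the clusters are density-connected w.r.t. Eps and MinPtsA
    #   conn_under_b: the clusters are density-connected w.r.t. Eps and MinPtsB
    if conn_under_a and not conn_under_b:
        return minpts_a            # case (1): only A's parameter supports the link
    if conn_under_b and not conn_under_a:
        return minpts_b            # case (2): only B's parameter supports the link
    if conn_under_a and conn_under_b:
        return min(minpts_a, minpts_b)  # case (3): both directions hold
    return None                    # not density-connected: do not merge
```

Choosing the smaller value in case (3) keeps the union reachable under whichever local parameter was the more permissive one.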
Figure 3 The time complexity of the three algorithms

5. The simulative experiment

In order to test the feasibility and effectiveness of the DBSK algorithm, a simulative experiment is carried out.

Figure 4 The clustering result of the DBSCAN algorithm, Eps = 18, MinPts = 4

Figure 5 The clustering result of the DBSK algorithm, K = 3, Eps = 18, sampling rate = 30%

In Figure 4 and Figure 5, different shapes represent different clusters, and hollow circles indicate noise points. From the clustering results we can see that the DBSCAN algorithm does not handle the uneven data set well: it performs poorly on the sparse data points on the left, and a few data points that should belong to some cluster are treated as noise. The DBSK algorithm partitions the data set and selects different parameter values for each local data set. It recognizes more small, relatively sparse clusters than the DBSCAN algorithm does. For the sparse data points on the left, the number of data points treated as noise is distinctly reduced.

6. Conclusions

Combining the sampling technique, the K-means algorithm and the DBSCAN algorithm, an improved DBSK algorithm has been proposed. How to determine the size of the sample (also called the sampling complexity) is a question that needs further study. In this paper the DBSK algorithm is applied only to two-dimensional data sets; three-dimensional and multi-dimensional complex data are not discussed, but massive data sets often have multi-dimensional attributes, so more work should be carried out on this data mining problem.

7. References

[1] Chen Lei-da, "Data mining method, application, tools", Information Systems Management, 2001, 7(1), 65-70.
[2] David Hand, Heikki Mannila, and Padhraic Smyth, "Principles of Data Mining", The MIT Press, 2001, 162-195.
[3] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "Selection of K in K-means clustering", Mechanical Engineering Science, 2004, 219(C), 103-119.
[4] Kuo R. J., Ho L. M., and Hu C. M., "Integration of Self-organizing Feature Map and K-means Algorithm for Market Segmentation", Computers and Operations Research, 2002, 29(11), 1475-1493.
[5] D. S. Modha and W. S. Spangler, "Feature Weighting in K-means Clustering", Machine Learning, 2003, 52(3), 217-237.
[6] Martin Ester, Hans-Peter Kriegel, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996, 226-231.
[7] Chen Ning, Chen An, and Zhou Longxiang, "An Effective Clustering Algorithm in Large Transaction Databases", Journal of Software, 2001, 12(7), 476-484.
[8] Rong Qiusheng, Yan Junbiao, and Guo Guoqiang, "Research and Implementation of Clustering Algorithm Based on DBSCAN", Computer Applications, 2004, 24(4), 45-47.
[9] He Zhong-sheng, Liu Zong-tian, and Zhuang Yan-bin, "Data-partition-based Parallel DBSCAN Algorithm", MINI-MICRO Systems, 2006, 27(1), 114-116.
[10] Sun Si, "Study on data partition DBSCAN using genetic algorithm", College of Computer Science, Chongqing University, 2005, 38-43.