4, November 2014
I.
METHODOLOGY
INTRODUCTION
DOI: 10.9756/BIJDM.6140
ISSN: 2277-5048| 2014 Bonfring
For resampled data X^(h), let I^(h) denote the indicator matrix whose (i,j)-th entry is equal to 1 if both genes i and j are selected in X^(h), and 0 otherwise. With C^(h) denoting the connectivity matrix of the h-th resampled data set, the combined connectivity matrix for all the resampled data, known as the consensus matrix M, is defined as

$$ M(i,j) = \frac{\sum_{h} C^{(h)}(i,j)}{\sum_{h} I^{(h)}(i,j)} $$
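The definition of the consensus matrix can be illustrated with a minimal sketch. It assumes that each resampled run is given as a mapping from the selected item index to its cluster label, and that C^(h)(i,j) equals 1 when items i and j fall in the same cluster of run h. The function name `consensus_matrix` and the toy runs are hypothetical and only serve the illustration.

```python
from itertools import combinations


def consensus_matrix(n_items, runs):
    """Return M(i, j) = sum_h C^(h)(i, j) / sum_h I^(h)(i, j).

    runs: list of dicts, one per resampled data set, mapping the index of
    each selected item to its cluster label (assumed input format).
    """
    together = [[0] * n_items for _ in range(n_items)]  # sum of C^(h)
    selected = [[0] * n_items for _ in range(n_items)]  # sum of I^(h)
    for labels in runs:
        items = sorted(labels)
        for i in items:
            # An item is always selected and clustered with itself.
            selected[i][i] += 1
            together[i][i] += 1
        for i, j in combinations(items, 2):
            selected[i][j] += 1
            selected[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    # Pairs that were never selected together get 0.0 by convention here.
    return [
        [together[i][j] / selected[i][j] if selected[i][j] else 0.0
         for j in range(n_items)]
        for i in range(n_items)
    ]


# Four toy resamples of four items; each run leaves one item out.
runs = [
    {0: "x", 1: "x", 2: "y"},
    {0: "x", 1: "x", 3: "y"},
    {0: "x", 2: "y", 3: "y"},
    {1: "x", 2: "y", 3: "x"},
]
M = consensus_matrix(4, runs)
for row in M:
    print(row)
```

Items 0 and 1 are clustered together in both runs in which they are jointly selected, so M(0,1) = 1, whereas items 2 and 3 share a cluster in only one of their two joint runs, so M(2,3) = 0.5.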
B. Consensus Clustering
We propose to use the consensus matrix directly to obtain
a cluster solution, instead of using this matrix as input data
for another clustering algorithm. If we set a threshold
on the proportion of times that pairs of genes appear together
among the set of cluster solutions, then it is possible to obtain a
cluster solution based on this threshold parameter, delta.
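One possible reading of this thresholding step is sketched below. The text does not spell out how clusters are formed once the threshold is applied, so the sketch assumes that two genes are linked when their consensus value is at least delta and that clusters are the connected components of the resulting graph. The function name `clusters_from_consensus` and the small matrix are hypothetical.

```python
def clusters_from_consensus(M, delta):
    """Label items by connected components of the graph M[i][j] >= delta.

    Assumption: linking by connected components is one plausible way to
    turn the thresholded consensus matrix into a cluster solution.
    """
    n = len(M)
    labels = [-1] * n          # -1 marks an item not yet assigned
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:           # depth-first walk over linked items
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and M[i][j] >= delta:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy symmetric consensus matrix for four genes (illustrative values).
M = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
labels_low = clusters_from_consensus(M, 0.5)   # [0, 0, 1, 1]
labels_high = clusters_from_consensus(M, 0.8)  # [0, 0, 1, 2]
print(labels_low, labels_high)
```

With delta = 0.5 the last two genes stay together, while delta = 0.8 separates them, since their consensus value of 0.6 falls below the threshold.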
     a      b      c      d      e      f      g      h      i
a    1      1      0      0      0      0      1      0      1
b    1      1      0.57   0.17   0.14   0      1      0.17   1
c    0      0.57   1      0.71   0.57   0.11   0.38   0      0.29
d    0      0.17   0.71   1      1      0.43   0.14   0      0
e    0      0.14   0.57   1      1      0.43   0.14   0      0
f    0      0      0.11   0.43   0.43   1      0      0      0
g    1      1      0.38   0.14   0.14   0      1      0.17   1
h    0      0.17   0      0      0      0      0.17   1      0.2
i    1      1      0.29   0      0      0      1      0.2    1
         v1        v2        ...   vC        vC+1          Total
1        n11       n12       ...   n1C       n1(C+1)       n1.
2        n21       n22       ...   n2C       n2(C+1)       n2.
.        .         .               .         .             .
R        nR1       nR2       ...   nRC       nR(C+1)       nR.
R+1      n(R+1)1   n(R+1)2   ...   n(R+1)C   n(R+1)(C+1)   n(R+1).
Total    n.1       n.2       ...   n.C       n.(C+1)       n.. = n
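The standardized Rand index proposed by Hubert and
Arabie (1985) is given by

$$ \mathrm{Rand}(R,C) = \frac{\sum_{i=1}^{R}\sum_{j=1}^{C}\binom{n_{ij}}{2} - \left[\sum_{i=1}^{R}\binom{n_{i.}}{2}\sum_{j=1}^{C}\binom{n_{.j}}{2}\right]\Big/\binom{n}{2}}{0.5\left[\sum_{i=1}^{R}\binom{n_{i.}}{2} + \sum_{j=1}^{C}\binom{n_{.j}}{2}\right] - \left[\sum_{i=1}^{R}\binom{n_{i.}}{2}\sum_{j=1}^{C}\binom{n_{.j}}{2}\right]\Big/\binom{n}{2}} \tag{1} $$

When the sums run over all R+1 rows and C+1 columns of the contingency table, the index is computed as

$$ \frac{\sum_{i=1}^{R+1}\sum_{j=1}^{C+1}\binom{n_{ij}}{2} - \left[\sum_{i=1}^{R+1}\binom{n_{i.}}{2}\sum_{j=1}^{C+1}\binom{n_{.j}}{2}\right]\Big/\binom{n}{2}}{0.5\left[\sum_{i=1}^{R+1}\binom{n_{i.}}{2} + \sum_{j=1}^{C+1}\binom{n_{.j}}{2}\right] - \left[\sum_{i=1}^{R+1}\binom{n_{i.}}{2}\sum_{j=1}^{C+1}\binom{n_{.j}}{2}\right]\Big/\binom{n}{2}} $$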
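The standardized Rand index of Hubert and Arabie can be computed directly from two label vectors, as the following minimal sketch shows. It builds the contingency counts n_ij and the marginal totals n_i. and n_.j, then evaluates the ratio term by term. The function name `adjusted_rand_index` and the toy labels are hypothetical; treating a noise group as one more label is an assumption that corresponds to summing over R+1 rows and C+1 columns.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standardized (adjusted) Rand index of two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two labelings of the same length >= 2")
    cells = Counter(zip(labels_a, labels_b))  # n_ij
    rows = Counter(labels_a)                  # n_i.
    cols = Counter(labels_b)                  # n_.j
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Both partitions are trivial and identical (one cluster, or all
        # singletons); agreement is perfect by convention.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
found = [0, 0, 1, 2]
ari_same = adjusted_rand_index(truth, truth)
ari_split = adjusted_rand_index(truth, found)
print(ari_same, round(ari_split, 4))
```

For the second pair of labelings, the numerator is 1 - 1/3 and the denominator is 1.5 - 1/3, which gives 4/7, or about 0.571.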
SIMULATED DATASETS
V.
Datasets         Reference                  K (classes)   d (samples)   n (genes)
Leukemia         Golub et al. (1999)        3             38            999
Novartis         Su et al. (2002)           4             103           1000
Lung Cancer      Bhattacharjee (2001)       4             197           1000
Normal tissues   Ramaswamy et al. (2001)    13            90            1277
For the real datasets (Table 4), our estimates of the number
of clusters are very close to the number of classes available in
these datasets. Note that the delta values for these datasets are
very small because the sample sizes in these
datasets are small. A high value of delta splits the samples
into many groups and increases the level of noise, which may
not be optimal; however, setting a high value of delta makes it possible
to identify closely related samples. In general, there is a trade-off
between the level of noise and the number of clusters.
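The effect of the threshold on the number of groups can be illustrated with a small sweep over delta. The sketch below assumes, as one plausible reading of the method, that items are linked when their consensus value is at least delta and that clusters are the connected components of the resulting graph. The names `count_clusters` and `sweep_delta` and the toy matrix are hypothetical.

```python
def count_clusters(M, delta):
    """Count connected components of the graph with edges M[i][j] >= delta."""
    n = len(M)
    seen = [False] * n
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1             # a new component starts here
        seen[start] = True
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if not seen[j] and M[i][j] >= delta:
                    seen[j] = True
                    stack.append(j)
    return count


def sweep_delta(M, deltas):
    """Return (delta, number of clusters) pairs for each candidate delta."""
    return [(d, count_clusters(M, d)) for d in deltas]


# Toy symmetric consensus matrix for four samples (illustrative values).
M = [
    [1.0, 0.9, 0.3, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.3, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
]
result = sweep_delta(M, [0.25, 0.5, 0.7, 0.95])
for delta, k in result:
    print(delta, k)
```

In this toy case the number of clusters grows from one to four as delta increases, which mirrors the behaviour described above: a high threshold keeps only the most closely related samples together.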
The delta for which the estimated number of clusters is
close to the true number of clusters is presented in Table 6. The
other values of delta, and the Rand index comparing the cluster
labels with the original labels of the real datasets, are presented
in Figures 1 to 4. As expected, the pattern of the estimated
clusters and the Rand indices is not uniform across all
datasets. The consensus solution obtained using hierarchical
clustering (Diana) is not much affected by the delta values, and
hence the solution is stable across various values of delta.
Parameter
Rand
Mod Rand
N clusters
Parameter
Rand
Mod Rand
N clusters
Type I
combined
Diana
kmeans
Mclust
%Noise = 0; SD = 0; delta = 0.5725; Ncluster = 15
1
1
1
1
1
1
1
1
15
15
15
16
%Noise = 5; SD = 0; delta = 0.7125; Ncluster = 16
1
0.346
0.944
0.997
1
0.805
1
0.949
16
7
17
17
Parameter
Rand
Mod Rand
0.972
N clusters
Parameter
16
6
19
17
17
%Noise = 20; SD = 0; delta = 0.6860; Ncluster = 16
15
15
15
15
15
%Noise = 0; SD = 0.2; delta = 0.5700; Ncluster=16
Rand
Mod Rand
N clusters
1
1
16
1
1
15
Parameter
Rand
Mod Rand
0.879
0.646
0.87
0.959
0.894
0.98
0.968
0.979
0.98
0.995
N clusters
15
16
25
13
19
13
13
13
13
15
Parameter
Rand
Mod Rand
0.362
0.802
0.025
0.704
0.206
0.793
0.512
0.851
0.269
0.815
0.581
0.923
0.668
0.942
0.835
0.977
0.526
0.893
0.62
0.943
N clusters
10
26
22
12
16
10
13
13
11
14
Parameter
Rand
.06
.215
.36
.972
.057
0.373
0.475
0.664
0.298
0.376
Mod Rand
N clusters
.0.512
49
0.648
55
0.72
52
0.988
15
0.61
48
0.835
7
0.897
11
0.928
11
0.78
6
0.893
13
0.799
0.401
0.815
7
0.977
0.93
0.995
19
0.999
0.969
0.998
18
Pam
1
1
16
0.943
0.896
18
0.993
0.747
0.98
19
Type II        combined   Diana   kmeans   Mclust   Pam

%Noise = 0; SD = 0; delta = 0.5625; Ncluster = 15
Rand           1          1       1        1        1
Mod Rand       1          1       1        1        1
N clusters     15         15      15       16       16

%Noise = 0; SD = 0.05; delta = 0.5300; Ncluster = 16
Rand           1          1       1        1        1
Mod Rand       1          1       1        1        1
N clusters     15         15      15       15       15
0.979
0.989
1
16
0.993
1
1
15
1
1
15
0.973
0.992
0.999
16
Table 3: Summary of Results for Type III Simulations (100% noise) and
Type IV Simulation (200% Noise)

              Type III Simulations                      |  Type IV Simulation
              Combined  Diana   kmeans  Mclust  Pam     |  Combined  Diana   kmeans  Mclust  Pam

Rand          0.362     0.025   0.206   0.512   0.269   |  0.06      0.205   0.365   0.972   0.057
Mod Rand      0.802     0.704   0.793   0.851   0.815   |  0.612     0.648   0.724   0.988   0.61
N clusters    10        26      22      12      16      |  49        55      52      15      48

Rand          0.493     0.031   0.237   0.63    0.221   |  0.25      0.589   0.817   0.991   0.086
Mod Rand      0.822     0.705   0.798   0.882   0.811   |  0.685     0.811   0.916   0.996   0.623
N clusters    28  19  12  17  67  25  24  17  50

Rand          0.308     0.042   0.5     0.709   0.302   |  0.061     0.175   0.613   0.977   0.054
Mod Rand      0.721     0.68    0.828   0.905   0.808   |  0.614     0.647   0.704   0.996   0.611
N clusters    12  12  50  47  60  15  48

Rand          0.802     0.055   0.185   0.853   0.244   |  0.174     0.0108  0.036   0.625   0.049
Mod Rand      0.832     0.707   0.79    0.944   0.805   |  0.65      0.581   0.597   0.835   0.609
N clusters    11        23      32      11      19      |  22        38      62      15      37

Rand          0.405     0.002   0.247   0.726   0.244   |  0.249     0.006   0.063   0.935   0.111
Mod Rand      0.82      0.695   0.81    0.911   0.805   |  0.671     0.576   0.611   0.968   0.628
N clusters    11        25      21      12      20      |  13        23      34      12      21

Parameter     SD = 0.8; delta = 0.24; N clusters = 16   |  SD = 0.8; delta = 0.20; N clusters = 16
Rand          0.28      0.08    0.375   0.422   0.095   |  0.139     0.073   0.183   0.547   0.085
Mod Rand      0.78      0.731   0.815   0.722   0.764   |  0.616     0.6     0.651   0.766   0.606
N clusters    10  11  13  16  12  17  13

Rand
Mod Rand      0.776     0.725   0.777   0.704   0.744   |  0.603     0.615   0.81    0.665   0.583
N clusters    11  17  15  87  66  35  10  77
Data sets
St. jude Leukemia
N True Clusters
6
Delta
0.280
Rand
Mod Rand
N clusters
Lung Cancer
13
Mclust
Pam
Diana
0.918
0.958
7
0.945
0.974
6
0.946
0.971
7
0.918
0.974
6
0.457
0.786
6
0.553
0.813
5
0.426
0.732
6
0.395
0.744
4
0.375
0.650
5
0.544
0.751
4
0.306
0.803
5
0.301
0.839
7
0.233
0.767
6
0.373
0.858
7
0.213
0.744
6
0.260
Rand
Mod Rand
N clusters
Leukemia
kmeans
0.200
Rand
Mod Rand
N clusters
Normal Tissues
Combined
0.333
31
Rand
Mod Rand
N clusters
CNS Tumors
0.527
0.822
4
0.742
0.902
3
0.496
0.792
5
0.516
0.764
4
0.443
0.839
5
0.368
0.813
5
0.525
0.856
5
0.581
0.894
5
0.311
0.667
3
0.89
0.972
5
0.881
0.953
6
0.890
0.972
6
0.862
0.95
7
0.778
0.927
6
0.300
Rand
Mod Rand
N clusters
Novartis
0.588
0.823
4
0.260
Rand
Mod Rand
N clusters
Figure 1: Lung cancer. Rand index (vertical axis) versus delta (horizontal axis) for Kmeans, PAM, diana, Mclust and the combined solution.
Figure 2: Leukemia. Rand index (vertical axis) versus delta (horizontal axis) for Kmeans, PAM, diana, Mclust and the combined solution.
Figure 3: Normal tissue. Rand index (vertical axis) versus delta (horizontal axis) for Kmeans, PAM, diana, Mclust and the combined solution.
Figure 4: Novartis. Rand index (vertical axis) versus delta (horizontal axis) for Kmeans, PAM, diana, Mclust and the combined solution, with a linear trend line for diana.
REFERENCES