data t;
input cid $ 1-2 income educ;
cards;
c1 5 5
c2 6 6
c3 15 14
c4 16 15
c5 25 20
c6 30 19
run;
The NONORM option prevents the distances from being normalized to unit mean or
unit root mean square with most methods.
The VAR statement lists numeric variables to be used in the cluster analysis.
If you omit the VAR statement, all numeric variables not listed in other
statements are used.
The TREE procedure produces a tree diagram, also known as a dendrogram or
phenogram, using a data set created by the CLUSTER procedure. The CLUSTER
procedure creates output data sets that contain the results of hierarchical
clustering as a tree structure. The TREE procedure uses the output data set to
produce a diagram of the tree structure.
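To make the tree structure concrete, here is a minimal Python sketch (not SAS, and not the procedure's actual implementation) that agglomerates the six hypothetical observations by repeatedly merging the two clusters whose centroids are closest. It assumes the centroid method and unnormalized Euclidean distances, which is what the Centroid Distance column and the NONORM option suggest. The names DATA, centroid, and centroid_linkage are made up for this example.

```python
import math

# Hypothetical data from the DATA step: cid -> (income, educ).
DATA = {
    "c1": (5, 5), "c2": (6, 6), "c3": (15, 14),
    "c4": (16, 15), "c5": (25, 20), "c6": (30, 19),
}

def centroid(data, members):
    """Mean of each variable over the given observation ids."""
    n = len(members)
    return tuple(sum(data[m][j] for m in members) / n for j in range(2))

def centroid_linkage(data):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the merge history as (members_a, members_b, distance) tuples.
    Ties are broken by keeping the first closest pair found.
    """
    clusters = [[cid] for cid in data]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(data, clusters[i]),
                              centroid(data, clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

history = centroid_linkage(DATA)
for a, b, d in history:
    print(a, "+", b, "at centroid distance", round(d, 4))
```

Stopping after the third merge leaves the groups {c1, c2}, {c3, c4}, and {c5, c6}, and each merge distance is the height at which the corresponding branches of a dendrogram would join.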
The NCLUSTERS= option specifies the number of clusters desired in the OUT=
data set.
The ID variable is used to identify the objects (leaves) in the tree on the
output. The ID variable can be a character or numeric variable of any length.
Cluster History
NCL   --Clusters Joined--   FREQ   RMS STD   SPRSQ   RSQ   Centroid Distance
The statistics above provide information about the cluster solution. RMSSTD
is the pooled standard deviation of all the variables forming the cluster.
Since the objective of cluster analysis is to form homogeneous groups, the
RMSSTD of a cluster should be as small as possible. SPRSQ (semipartial
R-squared) is the loss of homogeneity that results from combining two groups
or clusters to form a new group or cluster. Thus, the SPRSQ value should be
small to imply that we are merging two homogeneous groups. RSQ (R-squared)
measures the extent to which groups or clusters are different from each other
(so, when there is just one cluster, the RSQ value is zero). Thus, the RSQ
value should be high. Centroid Distance is simply the Euclidean distance
between the centroids of the two clusters that are to be joined or merged. So,
Centroid Distance is a measure of the homogeneity of merged clusters, and the
value should be small.
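The definitions above can be checked by hand on the hypothetical data. The Python sketch below is an illustration rather than SAS output. It assumes the usual sum-of-squares definitions: RMSSTD is the square root of the within-cluster sum of squares divided by (number of variables) x (cluster size - 1), RSQ is 1 minus the ratio of within-cluster to total sum of squares, and SPRSQ is the increase in within-cluster sum of squares caused by a merge, divided by the total. The helper names are hypothetical.

```python
import math

# Hypothetical data: (income, educ) for c1..c6, in order.
POINTS = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]

def sum_sq(points):
    """Sum of squared deviations from the centroid, over all variables."""
    n = len(points)
    p = len(points[0])
    means = [sum(pt[j] for pt in points) / n for j in range(p)]
    return sum((pt[j] - means[j]) ** 2 for pt in points for j in range(p))

def rmsstd(points):
    """Pooled standard deviation of all variables within one cluster."""
    p = len(points[0])
    return math.sqrt(sum_sq(points) / (p * (len(points) - 1)))

total_ss = sum_sq(POINTS)

# Three-cluster solution: {c1, c2}, {c3, c4}, {c5, c6}.
clusters = [POINTS[0:2], POINTS[2:4], POINTS[4:6]]
within_ss = sum(sum_sq(c) for c in clusters)
rsq = 1 - within_ss / total_ss

# SPRSQ for merging {c3, c4} with {c5, c6}: the increase in within-cluster
# sum of squares caused by the merge, as a fraction of the total.
merged = clusters[1] + clusters[2]
sprsq = (sum_sq(merged) - sum_sq(clusters[1]) - sum_sq(clusters[2])) / total_ss

print("RMSSTD per cluster:", [round(rmsstd(c), 4) for c in clusters])
print("RSQ with 3 clusters:", round(rsq, 4))
print("SPRSQ for the next merge:", round(sprsq, 4))
```

Under these assumptions the three-cluster solution keeps RSQ close to 1, while merging {c3, c4} with {c5, c6} would give up roughly a quarter of the total variation, which is the kind of jump in SPRSQ that argues for stopping at three clusters.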
CLUSTER=1
1 c1 5 5
2 c2 6 6
CLUSTER=2
3 c3 15 14
4 c4 16 15
CLUSTER=3
5 c5 25 20
6 c6 30 19
[Figure: dendrogram of the cluster tree. Vertical axis: Distance Between
Cluster Centroids (ticks at 0, 5, 10, 15, 20). Horizontal axis: cid (c1, c2,
c3, c4, c5, c6).]
Title Non-Hierarchical Cluster Analysis of Hypothetical Data;
data t2;
input cid $ 1-2 income educ;
cards;
c1 5 5
c2 6 6
c3 15 14
c4 16 15
c5 25 20
c6 30 19
run;
You must specify either the MAXCLUSTERS= or the RADIUS= option in the PROC
FASTCLUS statement.
The RADIUS= option establishes the minimum distance criterion for selecting
new seeds. No observation is considered as a new seed unless its minimum
distance to previous seeds exceeds the value given by the RADIUS= option. The
default value is 0.
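The radius criterion can be illustrated with a deliberately simplified Python sketch. It scans the observations in order and accepts one as a new seed only if its minimum distance to the seeds chosen so far exceeds the radius, stopping once the maximum number of clusters is reached. It ignores seed replacement entirely, so it is not expected to reproduce the initial seeds that PROC FASTCLUS prints under its default settings. The function select_seeds and its parameters are hypothetical names for this example.

```python
import math

# Hypothetical data from the DATA step: cid -> (income, educ).
DATA = {
    "c1": (5, 5), "c2": (6, 6), "c3": (15, 14),
    "c4": (16, 15), "c5": (25, 20), "c6": (30, 19),
}

def select_seeds(data, max_clusters, radius=0.0):
    """Pick seeds in data order using only the minimum-distance rule.

    Simplification: no seed replacement is attempted once a seed is chosen.
    """
    seeds = []
    for cid, point in data.items():
        if len(seeds) == max_clusters:
            break
        # The first observation always becomes a seed; later ones must be
        # farther than `radius` from every seed selected so far.
        if not seeds or min(math.dist(point, data[s]) for s in seeds) > radius:
            seeds.append(cid)
    return seeds

print(select_seeds(DATA, max_clusters=3, radius=0))
print(select_seeds(DATA, max_clusters=3, radius=5))
```

With a radius of 0 the first three observations are accepted, two of which lie in the same natural group. A radius of 5 skips c2 and c4 because each is within 5 units of an existing seed, so the seeds spread across the three groups.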
The REPLACE= option specifies how seed replacement is performed:
FULL
   requests default seed replacement.
PART
   requests seed replacement only when the distance between the observation
   and the closest seed is greater than the minimum distance between seeds.
NONE
   suppresses seed replacement.
RANDOM
   selects a simple pseudo-random sample of complete observations as
   initial cluster seeds.
The MAXITER= option specifies the maximum number of iterations for recomputing
cluster seeds. When the value of the MAXITER= option is greater than 0, each
observation is assigned to the nearest seed, and the seeds are recomputed as
the means of the clusters.
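The assign-and-recompute cycle described above can be sketched in Python. This is an illustration only, not the FASTCLUS implementation. It starts from the initial seeds listed in the output for this example and stops when no seed moves, which is a simplification of the procedure's convergence criterion. The limit of 10 iterations and all function names are assumptions made for the example.

```python
import math

# Hypothetical data from the DATA step: cid -> (income, educ).
DATA = {
    "c1": (5, 5), "c2": (6, 6), "c3": (15, 14),
    "c4": (16, 15), "c5": (25, 20), "c6": (30, 19),
}

# Initial seeds as printed in the "Initial Seeds" table of the output.
seeds = {1: (5.0, 5.0), 2: (30.0, 19.0), 3: (16.0, 15.0)}

def assign(data, seeds):
    """Assign every observation to the cluster of its nearest seed."""
    return {cid: min(seeds, key=lambda k: math.dist(pt, seeds[k]))
            for cid, pt in data.items()}

def recompute(data, assignment, seeds):
    """Replace each seed by the mean of the observations assigned to it."""
    new_seeds = {}
    for k in seeds:
        members = [data[cid] for cid, a in assignment.items() if a == k]
        if members:
            new_seeds[k] = tuple(sum(m[j] for m in members) / len(members)
                                 for j in range(2))
        else:
            new_seeds[k] = seeds[k]  # keep an empty cluster's seed unchanged
    return new_seeds

MAX_ITER = 10  # assumed limit for this sketch
for iteration in range(1, MAX_ITER + 1):
    assignment = assign(DATA, seeds)
    new_seeds = recompute(DATA, assignment, seeds)
    change = max(math.dist(seeds[k], new_seeds[k]) for k in seeds)
    seeds = new_seeds
    print("iteration", iteration, "largest seed change", round(change, 4))
    if change == 0:
        break

print(assignment)
print(seeds)
```

On this data the seeds move in the first pass and stay put in the second, so the loop stops after two iterations with the seeds at the cluster means.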
The LIST option lists all observations, giving the value of the ID variable
(if any), the number of the cluster to which the observation is assigned, and
the distance between the observation and the final cluster seed.
The DISTANCE option computes distances between the cluster means.
The VAR statement lists the numeric variables to be used in the cluster
analysis. If you omit the VAR statement, all numeric variables not listed in
other statements are used.
Initial Seeds
Cluster         income           educ
      1     5.00000000     5.00000000
      2    30.00000000    19.00000000
      3    16.00000000    15.00000000
Iteration History
Here, the cluster solution at the second iteration is the final cluster
solution because the change in cluster seeds at the second iteration is less
than the convergence criterion. Note that a zero change in the centroid of
the cluster seeds for the second iteration implies that the reallocation did
not result in any reassignment of observations.
Cluster Listing
                        Distance
Obs   cid   Cluster    from Seed
  1   c1          1       0.7071
  2   c2          1       0.7071
  3   c3          3       0.7071
  4   c4          3       0.7071
  5   c5          2       2.5495
  6   c6          2       2.5495
Cluster Summary
                                   Maximum Distance                          Distance Between
                        RMS Std           from Seed     Radius    Nearest             Cluster
Cluster   Frequency   Deviation      to Observation   Exceeded    Cluster           Centroids
      1           2      0.7071              0.7071                     3             13.4536
      2           2      2.5495              2.5495                     3             13.0000
      3           2      0.7071              0.7071                     2             13.0000
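The Cluster Summary figures can be recomputed from the final assignment. The Python sketch below is illustrative and not SAS output. It assumes that the RMS standard deviation pools both variables with (cluster size - 1) degrees of freedom per variable and that each final seed equals its cluster mean. The variable names are made up for the example.

```python
import math

# Final clusters from the Cluster Listing: cluster -> list of (income, educ).
CLUSTERS = {
    1: [(5, 5), (6, 6)],
    2: [(25, 20), (30, 19)],
    3: [(15, 14), (16, 15)],
}

def mean(points):
    """Mean of each variable over a list of points."""
    return tuple(sum(pt[j] for pt in points) / len(points)
                 for j in range(len(points[0])))

means = {k: mean(pts) for k, pts in CLUSTERS.items()}

summary = {}
for k, pts in CLUSTERS.items():
    n, p = len(pts), len(pts[0])
    ss = sum((pt[j] - means[k][j]) ** 2 for pt in pts for j in range(p))
    rms_std = math.sqrt(ss / (p * (n - 1)))
    max_dist = max(math.dist(pt, means[k]) for pt in pts)
    # Nearest cluster: the other cluster whose centroid is closest.
    nearest = min((o for o in means if o != k),
                  key=lambda o: math.dist(means[k], means[o]))
    summary[k] = (n, rms_std, max_dist, nearest,
                  math.dist(means[k], means[nearest]))

for k, (n, rms_std, max_dist, nearest, dist) in summary.items():
    print(k, n, round(rms_std, 4), round(max_dist, 4), nearest, round(dist, 4))
```

With only two observations per cluster, the RMS standard deviation and the maximum distance to the seed coincide, which is why the two columns in the summary are identical.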
The statistics used for the evaluation of the cluster solution are the same as
in the hierarchical cluster analysis.
The cluster solution can also be evaluated with respect to each clustering
variable. If the measurement scales are not the same, then for each variable
one should obtain the ratio of the respective Within STD to the Total STD,
and compare this ratio across the variables.
WARNING: The two values above are invalid for correlated variables.
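The per-variable comparison of Within STD to Total STD described above can be sketched in Python as follows. This is an illustration, not SAS output. It assumes that Total STD uses n - 1 degrees of freedom and that Within STD pools the within-cluster sums of squares with n - c degrees of freedom, where c is the number of clusters. The variable names are hypothetical.

```python
import math

# Final clusters from the Cluster Listing: cluster -> list of (income, educ).
CLUSTERS = {
    1: [(5, 5), (6, 6)],
    2: [(25, 20), (30, 19)],
    3: [(15, 14), (16, 15)],
}
VARIABLES = ("income", "educ")

all_points = [pt for pts in CLUSTERS.values() for pt in pts]
n = len(all_points)
c = len(CLUSTERS)

ratios = {}
for j, name in enumerate(VARIABLES):
    grand_mean = sum(pt[j] for pt in all_points) / n
    total_ss = sum((pt[j] - grand_mean) ** 2 for pt in all_points)
    within_ss = 0.0
    for pts in CLUSTERS.values():
        m = sum(pt[j] for pt in pts) / len(pts)
        within_ss += sum((pt[j] - m) ** 2 for pt in pts)
    total_std = math.sqrt(total_ss / (n - 1))    # assumed df: n - 1
    within_std = math.sqrt(within_ss / (n - c))  # assumed df: n - c
    ratios[name] = within_std / total_std
    print(name, round(within_std, 4), round(total_std, 4),
          round(ratios[name], 4))
```

A smaller ratio means the variable is more homogeneous within clusters relative to its overall spread. Under these assumptions educ has the smaller ratio here, so the clusters are tighter on educ than on income.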
Cluster Means
Cluster         income           educ
      1     5.50000000     5.50000000
      2    27.50000000    19.50000000
      3    15.50000000    14.50000000
Cluster Standard Deviations
Cluster         income           educ
      1    0.707106781    0.707106781
      2    3.535533906    0.707106781
      3    0.707106781    0.707106781
Nearest
Cluster              1              2              3
      1              .    26.07680962    13.45362405
      2    26.07680962              .    13.00000000
      3    13.45362405    13.00000000              .