CLUSTER ANALYSIS
Hal Hagood
u06a1
“Clustering or cluster analysis is a generic name for a group of related techniques (such as
constructions, Q-analysis, and so on) that automatically try to find natural groupings in the data. One
crucial difference between clustering and a typical classification model is the absence of any target
variable (where classes or groups are known a priori) in the data. In the context of textual data, this
means that no labeled training examples are needed before documents can be clustered into groups.
As a conceptual activity, the assignment of objects into groups is something humans do routinely
all through their lives to reduce the complexity of the environment that they have to work with. The natural
grouping of objects and observations is extremely important to many disciplines (such as statistics,
psychology, sociology, biology, engineering, economics, and business). Each of these disciplines, in turn,
has used its own label to describe cluster analysis. Although the names might differ across disciplines, all
disciplines share the fundamental concept of separating data suggested by the natural groupings in the
data. In essence, cluster analysis attempts to group objects so that each object in a cluster is similar to
the other objects in the same cluster. However, objects in different clusters are dissimilar to each other. In
the context of textual data, objects are the documents that must be assigned to clusters so that within a
cluster, documents are similar, but between clusters, documents are different.
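The idea of "similar within a cluster, different between clusters" can be made concrete with a small sketch. The example below assumes a plain bag-of-words representation and cosine similarity, which is one common choice and not necessarily the measure the SAS tooling uses. The three comments are invented for illustration and do not come from the survey data.

```python
import math
from collections import Counter

def term_vector(text):
    """Lowercase, split on whitespace, and count terms (a bag-of-words vector)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical survey comments, invented for illustration only.
docs = [
    "nurse staff were helpful at discharge",
    "discharge process and nurse staff were slow",
    "data model shows bed measure",
]
vectors = [term_vector(d) for d in docs]

# Documents 0 and 1 share several terms; documents 0 and 2 share none.
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(f"doc0 vs doc1: {sim_01:.2f}")
print(f"doc0 vs doc2: {sim_02:.2f}")
```

A clustering method would tend to place the first two comments in the same group and the third in a different one, because the first pair scores much higher on the similarity measure.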
The basic idea is that documents within a cluster should be similar to each other, and documents
in different clusters should be dissimilar to each other. The similarity between two documents is based on
the similarity of features (such as terms or words) between documents in the vector space model. In this
context, we discuss latent semantic indexing (LSI), which provides a method for determining the similarity
of words and passages by the analysis of large text corpora. Then, we discuss the concept of topic
extraction from a collection of documents. A topic is conceptualized as a collection of terms that capture
the main themes or ideas in the document. Unlike cluster groups, where each document is assigned to
only one cluster, the same document can be assigned to multiple topics, depending on how many ideas
Uses text mining statistical software to set up a cluster analysis that accurately provides
meaningful insight into a business question
Survey_text_numeric
(Text Cluster)
Identifies text clusters for a given text mining context and the meaning of the text clusters
applications. Clustering partitions records in a data set into groups so that the subjects within a group are
similar and the subjects between the groups are dissimilar. The goal of cluster analysis is to derive
clusters that have value with respect to the problem being addressed, but this goal is not always
achieved. As a result, there are many competing clustering algorithms. The analyst often compares the
quality of derived clusters, and then selects the method that produces the most useful groups. The
clustering process arranges documents into nonoverlapping groups. Each document can fall into more
than one topic area after classification. This is the key difference between clustering and the general text
classification processes, although clustering provides a solution to text classification when groups must
In this particular analysis, the cluster IDs show the following terms …
(SAS, 2017)
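The difference between nonoverlapping clusters and overlapping topics can be sketched in a few lines. This is only an illustration under stated assumptions: the group names, the term lists, and the simple keyword-overlap scoring are all hypothetical stand-ins, and they are not how SAS Text Miner derives its clusters or topics.

```python
# Hypothetical term lists standing in for cluster/topic descriptions.
groups = {
    "care": {"nurse", "staff", "discharge"},
    "capacity": {"bed", "block", "site"},
}

def overlap_scores(text):
    """Count how many of each group's terms appear in the text."""
    words = set(text.lower().split())
    return {name: len(words & terms) for name, terms in groups.items()}

def assign_cluster(text):
    """Hard assignment: the single best-scoring group (ties go to the first)."""
    scores = overlap_scores(text)
    return max(scores, key=scores.get)

def assign_topics(text, min_terms=1):
    """Soft assignment: every group with at least min_terms matching terms."""
    scores = overlap_scores(text)
    return [name for name, score in scores.items() if score >= min_terms]

# Invented comment that touches both themes.
comment = "nurse staff could not find a bed"
cluster = assign_cluster(comment)
topics = assign_topics(comment)
print("cluster:", cluster)
print("topics:", topics)
```

The same comment receives exactly one cluster label but can carry more than one topic label, which mirrors the distinction drawn above.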
(Text Topic)
Uses a survey data set to illustrate how text can generate deep and meaningful insights into
customers' perceptions and expectations
survey_textual
(Text Cluster)
In this particular analysis, the cluster IDs show the following terms …
(Text Topic)
(Topic Viewer)
After experimenting, I found it was possible to merge the two data files, survey_text_numeric and
survey_textual. The combined results are very similar to the separate results, with slight differences, and can be seen below.
In this particular analysis, the cluster IDs show the following terms … culture, discharge, nurse,
block, bed, site, staff, too, multiple, measure, set, show, model, and order.
In the filter viewer, the term “Data” has a frequency of 2143 and is present in 36 documents.
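A term's frequency and its document count are two different measures, which is why the first can be far larger than the second. The sketch below shows how both could be computed. It assumes simple whitespace tokenization and uses invented comments, not the survey data; the function name term_stats is hypothetical.

```python
from collections import Counter

def term_stats(documents, term):
    """Return (total occurrences, number of documents containing the term)."""
    term = term.lower()
    frequency = 0
    doc_count = 0
    for text in documents:
        count = Counter(text.lower().split())[term]
        frequency += count
        if count > 0:
            doc_count += 1
    return frequency, doc_count

# Invented comments for illustration only.
docs = [
    "data quality and data access",
    "staff need better data",
    "bed availability at the site",
]
freq, n_docs = term_stats(docs, "Data")
print("frequency:", freq, "documents:", n_docs)
```

Here the term appears more often than the number of documents that contain it, because one comment uses it twice.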
The results of expanding the term “Data” can be seen below. The possibilities for further
exploration are almost limitless. This illustrates how text can generate deep and meaningful insights
Reference
Text Mining and Analysis. (2017). Text Mining and Analysis: Practical Methods, Examples, and Case
Studies Using SAS. Chapter 6 - Clustering and Topic Extraction. Retrieved August 10, 2017, from
http://viewer.books24x7.com/assetviewer.aspx?bookid=59026&chunkid=342485391&resume=yes&resumebookmarkid=dc367bed-ce7d-e711-a9c3-00505686029c