
Running head: CLUSTER ANALYSIS 1

CLUSTER ANALYSIS

Hal Hagood

u06a1

“Clustering or cluster analysis is a generic name for a group of related techniques (such as

unsupervised pattern recognition, unsupervised classification analysis, numerical taxonomy, typology

constructions, Q-analysis, and so on) that automatically try to find natural groupings in the data. One

crucial difference between clustering and a typical classification model is the absence of any target

variable (where classes or groups are known a priori) in the data. In the context of textual data, this

means that no labeled training examples are needed before documents can be clustered into groups.

This is why clustering is often referred to as unsupervised classification.

As a conceptual activity, the assignment of objects into groups is something humans do routinely

all through their lives to reduce the complexity of the environment that they have to work with. The natural

grouping of objects and observations is extremely important to many disciplines (such as statistics,

psychology, sociology, biology, engineering, economics, and business). Each of these disciplines, in turn,

has used its own label to describe cluster analysis. Although the names might differ across disciplines, all

disciplines share the fundamental concept of separating data suggested by the natural groupings in the

data. In essence, cluster analysis attempts to group objects so that each object in a cluster is similar to

the other objects in the same cluster. However, objects in different clusters are dissimilar to each other. In

the context of textual data, objects are the documents that must be assigned to clusters so that within a

cluster, documents are similar, but between clusters, documents are different.

The basic idea is that documents within a cluster should be similar to each other, and documents

in different clusters should be dissimilar to each other. The similarity between two documents is based on

the similarity of features (such as terms or words) between documents in the vector space model. In this

context, we discuss latent semantic indexing (LSI), which provides a method for determining the similarity

of words and passages by the analysis of large text corpora. Then, we discuss the concept of topic

extraction from a collection of documents. A topic is conceptualized as a collection of terms that capture

the main themes or ideas in the document. Unlike cluster groups, where each document is assigned to

only one cluster, the same document can be assigned to multiple topics, depending on how many ideas

are represented in a document” (Text Mining and Analysis, 2017).
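
The idea of document similarity in the vector space model can be illustrated with a minimal sketch. It assumes a plain bag-of-words representation with raw term counts and cosine similarity; this is a simplification and not latent semantic indexing itself, which would additionally apply a singular value decomposition to the term-document matrix. The function names and the three sample comments are hypothetical and invented for illustration only.

```python
import math
from collections import Counter


def term_vector(text):
    """Return a bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical survey comments, invented for illustration only.
docs = [
    "nurse staff were helpful at discharge",
    "discharge staff and nurse were helpful",
    "the bed order model was delayed",
]
vectors = [term_vector(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # many shared terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # no shared terms
print(round(sim_01, 3), round(sim_02, 3))
```

Documents that share most of their terms score close to 1 and documents with no terms in common score 0, which is the sense in which a clustering method treats the first two comments as belonging together.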



Uses text mining statistical software to set up a cluster analysis that accurately provides
meaningful insight into a business question

(Ignore parts of speech)



Survey_text_numeric
(Text Cluster)

Identifies text clusters for a given text mining context and the meaning of the text clusters

“Cluster analysis is a popular technique used by data analysts in numerous business

applications. Clustering partitions records in a data set into groups so that the subjects within a group are

similar and the subjects between the groups are dissimilar. The goal of cluster analysis is to derive

clusters that have value with respect to the problem being addressed, but this goal is not always

achieved. As a result, there are many competing clustering algorithms. The analyst often compares the

quality of derived clusters, and then selects the method that produces the most useful groups. The

clustering process arranges documents into nonoverlapping groups. Each document can fall into more

than one topic area after classification. This is the key difference between clustering and the general text

classification processes, although clustering provides a solution to text classification when groups must

be mutually exclusive, as in the classified ads example” (SAS, 2017).
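
The contrast between mutually exclusive clusters and overlapping topics can be shown with a toy sketch. It is not the algorithm used by the SAS Text Cluster or Text Topic nodes; it only assumes that each group is described by a small set of terms and scores a document by how many of those terms it contains. The group names, term sets, threshold, and sample sentence are all hypothetical.

```python
# Hypothetical group descriptors: each group is a small set of terms.
groups = {
    "care": {"nurse", "staff", "discharge"},
    "facility": {"bed", "site", "block"},
}


def overlap_scores(text):
    """Count how many descriptor terms of each group occur in the text."""
    words = set(text.lower().split())
    return {name: len(words & terms) for name, terms in groups.items()}


def assign_cluster(text):
    """Hard assignment: exactly one group, the one with the highest overlap.

    Ties go to the group listed first in `groups`.
    """
    scores = overlap_scores(text)
    return max(scores, key=scores.get)


def assign_topics(text, min_overlap=1):
    """Soft assignment: every group whose overlap reaches the threshold."""
    scores = overlap_scores(text)
    return [name for name, s in scores.items() if s >= min_overlap]


doc = "nurse and staff moved the bed"  # invented example sentence
cluster = assign_cluster(doc)
topics = assign_topics(doc)
print(cluster, topics)
```

The same sentence lands in a single cluster but in two topics, which mirrors the distinction drawn in the quoted passage.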

In this particular analysis, the cluster IDs show the following terms …

(SAS, 2017)

(Text Topic)

Uses a survey data set to illustrate how text can generate deep and meaningful insights into
customers' perceptions and expectations

survey_textual
(Text Cluster)

In this particular analysis, the cluster IDs show the following terms …

(Text Topic)

(Topic Viewer)

After experimenting, I found it was possible to merge the two data files, survey_text_numeric and

survey_textual. The combined results are very similar to those of the separate analyses but differ slightly, as can be seen below.

In this particular analysis, the cluster IDs show the following terms: culture, discharge, nurse,

block, bed, site, staff, too, multiple, measure, set, show, model, and order.

Filtering the viewer on the term “Data” shows that it has a frequency of 2143 and is present in 36 documents.

The results of expanding the term “Data” can be seen below. The range of related terms and documents

that can be explored this way is very wide. This illustrates how text can generate deep and meaningful insights

into customers' perceptions and expectations.
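
The two numbers reported for a term in the filter viewer, its total frequency and the number of documents containing it, can be reproduced in principle with a short sketch. This assumes simple whitespace tokenization and case-insensitive matching, which may differ from how the SAS text parsing node counts terms; the function name and sample documents are hypothetical.

```python
from collections import Counter


def term_stats(documents, term):
    """Return (total occurrences of term, number of documents containing it)."""
    term = term.lower()
    # Per-document count of the term; Counter returns 0 for absent terms.
    counts = [Counter(doc.lower().split())[term] for doc in documents]
    frequency = sum(counts)
    num_docs = sum(1 for c in counts if c > 0)
    return frequency, num_docs


# Hypothetical survey comments, invented for illustration only.
docs = [
    "data quality and data access",
    "staff asked for more data",
    "the bed order was late",
]
frequency, num_docs = term_stats(docs, "Data")
print(frequency, num_docs)
```

A frequency much larger than the document count, as reported above for “Data”, indicates that the term is repeated many times within the documents that contain it.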



Reference

Text Mining and Analysis. (2017). Text Mining and Analysis: Practical Methods, Examples, and Case

Studies Using SAS, Chapter 6 - Clustering and Topic Extraction. Retrieved August 10, 2017, from

http://viewer.books24x7.com/assetviewer.aspx?bookid=59026&chunkid=342485391&resume=yes&resumebookmarkid=dc367bed-ce7d-e711-a9c3-00505686029c
