
Cluster analysis and categorical data

Hana Řezanková
Vysoká škola ekonomická v Praze, Praha

1. Introduction
Methods of cluster analysis lie between statistics and informatics, and they play an important role in data mining. The main aim of cluster analysis is to assign objects into groups (clusters) in such a way that two objects from the same cluster are more similar than two objects from different clusters. The objects may be respondents in market research, firms, states or products. Similarity is investigated on the basis of certain features (variables), which can be quantitative (e.g. the length of a certain activity) or qualitative (e.g. an evaluation of the respondent's relationship to the employer, or qualitative features of products). Such variables are often denoted as categorical, see below. The aim of this paper is to present some approaches to clustering categorical data. Whereas methods for cluster analysis of quantitative data are currently implemented in all software packages for statistical analysis and data mining, and the differences among the packages in this area are small, the differences in the implementation of methods for clustering qualitative data are substantial. Both special methods designed for clustering this type of data and the capabilities of some statistical software packages (S-PLUS, SPSS, STATISTICA, SYSTAT) in this area are presented.

2. Methods of cluster analysis


For the methods of multivariate statistical analysis, vectors of observations (vectors of values of individual variables) form the base. Clustering of observation vectors (objects) is the most frequent application; however, clusters of variables can also be created, or objects and variables can be clustered simultaneously. Moreover, clustering of the categories of a qualitative variable can be applied on the basis of a contingency table. There are different ways of classifying the cluster analysis methods. We can distinguish partitioning methods (denoted as flat), which optimize the assignment of the objects into a certain number of clusters, and methods of hierarchical cluster analysis with graphical outputs, which make the assignment of objects into different numbers of clusters possible. In the first group, the k-centroids and k-medoids methods are used for disjunctive clustering. The former is based on an initial assignment of the objects into k clusters. For this purpose, k initial centroids are selected which are the centers of the clusters. Different approaches are applied for the selection of the initial centroids; for example, the first k objects can be used. After that, the distances of each object from all centers are calculated, and each object is assigned to the closest centroid.


Further, the elements of the new centroids are computed; usually they are the average values of the individual variables. Then the distances of each object from all centroids are calculated again, and if an object is closer to the centroid of another cluster, it is moved to that cluster. This process is repeated as long as any object can be moved. If the centroid consists of the average values of the individual variables, the method is called k-means; if it consists of the medians, the method is called k-medians. In the first case, the Euclidean distance (see [13]) is used, although some software systems (SYSTAT) also offer further measures (see below). In the k-medoids method, a certain vector of observations is taken as the center of the cluster.

Methods of hierarchical cluster analysis can be agglomerative (step-by-step clustering of objects and groups into larger groups) or divisive (step-by-step splitting of the whole set of objects into smaller subsets and individual objects). Further, we can distinguish monothetic clustering (only one variable is considered in each step) and polythetic clustering (all variables are considered in each step). Methods of hierarchical cluster analysis (as well as some other methods) are based on the proximity matrix. This matrix is symmetric with zeros on the diagonal. The values outside the diagonal express dissimilarities for the corresponding pairs of objects, variables or categories. These dissimilarities are the values of certain coefficients, or they are derived from the values of similarity coefficients; for example, if we consider a similarity measure S with values from 0 to 1, the corresponding dissimilarity measure is obtained by subtracting it from one, i.e. D = 1 − S. More information and examples of the methods of cluster analysis can be found in the books [2], [5], [9] and [13]. Implementation in the SAS system is described in [14].
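As an illustration of the iterative relocation scheme described above, the following minimal sketch implements the k-means variant with Euclidean distance (all function and variable names are ours, not taken from any package discussed in this paper):

```python
def k_means(objects, k, max_iter=100):
    """Minimal k-means: objects is a list of numeric vectors."""
    # Initial centroids: the first k objects (one of the simple choices mentioned above).
    centroids = [list(obj) for obj in objects[:k]]
    assignment = [0] * len(objects)
    for _ in range(max_iter):
        changed = False
        # Assign each object to the closest centroid (squared Euclidean distance).
        for i, obj in enumerate(objects):
            dists = [sum((x - c) ** 2 for x, c in zip(obj, cen)) for cen in centroids]
            best = dists.index(min(dists))
            if best != assignment[i]:
                assignment[i] = best
                changed = True
        # Recompute each centroid as the averages of the assigned objects.
        for h in range(k):
            members = [obj for i, obj in enumerate(objects) if assignment[i] == h]
            if members:
                centroids[h] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:  # repeat as long as any object can be moved
            break
    return assignment, centroids

data = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]]
print(k_means(data, 2))
```

Replacing the averages by medians in the recomputation step yields the k-medians variant.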

3. Categorical data
Categorical variables are characterized by values which are categories. Two main types of these variables can be distinguished: dichotomous, which have only two categories, and multi-categorical. Dichotomous variables are often coded by the values zero and one. For similarity measurement it is necessary to take into account whether the variables are symmetric or asymmetric. In the first case, both categories have the same importance (male, female). In the second case, one category is more important (the presence of a word in a textual document is more important than its absence). Multi-categorical variables can be classified into three types: nominal, ordinal and quantitative. Unlike the other types, the categories of nominal variables cannot be ordered (from the point of view of intensity etc.). The categories of ordinal variables can be ordered, but we usually cannot do arithmetic operations with them (it depends on the relations among the categories, see below). We can do arithmetic operations with quantitative variables (e.g. number of children); traditional distance measures can be applied in this case, and so this type will not be considered in the paper. For this reason, we will further denote nominal, ordinal and dichotomous variables as categorical. These variables are also called qualitative. We will suppose that dichotomous variables are binary with the categories zero and one. The same similarity measures are usually used for clustering both objects and variables in the case of binary data.

If binary variables are symmetric, one can apply the same measures as for quantitative data. Moreover, many specific coefficients have been proposed for this kind of data, as well as for data files with asymmetric binary variables. If a software package has no special means for clustering multi-categorical data, a transformation of the data file to a file with binary data is usually needed. The distinction between the nominal and ordinal types is then necessary. First, consider a data file with nominal variables. In contrast to classification tasks with a target variable (regression and discriminant analyses, decision trees), where one dummy variable fewer would suffice, the number of dummy variables must here be equal to the number of categories, see Table 1. In this way it is guaranteed that only two possible values of similarity can be obtained: one for matched categories and another for unmatched categories.

Table 1: Recoding of the nominal variable School into three binary variables P1 to P3

There are two processes for transforming ordinal data. The first consists of transforming the data file to a binary data file. In contrast to the case of nominal variables, k possible values of similarity should be obtained, where k is the number of categories. This is guaranteed by the coding shown in Table 2.

Table 2: Recoding of the ordinal variable Reaction into three binary variables P1 to P3

The second process makes use of the fact that the values of an ordinal variable can be ordered. Under the assumption of equal distances between the categories, arithmetic operations can be done. It is recommended to code the categories from 1 to k and divide these codes by the maximum value; the resulting values then lie in the interval from 0 to 1, and the techniques designed for quantitative data can be applied. Both recoding schemes are sketched below.
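The following minimal sketch illustrates the two recoding schemes; the category lists and variable values are hypothetical, not taken from Tables 1 and 2:

```python
def dummy_code(value, categories):
    """Nominal recoding: one binary variable per category (cf. Table 1)."""
    return [1 if value == c else 0 for c in categories]

def thermometer_code(value, categories):
    """Ordinal recoding: ones up to and including the category's rank (cf. Table 2)."""
    rank = categories.index(value) + 1
    return [1 if i < rank else 0 for i in range(len(categories))]

def normalized_score(value, categories):
    """Ordinal recoding to a number in (0, 1]: codes 1..k divided by k."""
    return (categories.index(value) + 1) / len(categories)

schools = ["basic", "secondary", "university"]   # hypothetical categories
print(dummy_code("secondary", schools))          # [0, 1, 0]
reactions = ["slow", "medium", "fast"]           # hypothetical categories
print(thermometer_code("medium", reactions))     # [1, 1, 0]
print(normalized_score("medium", reactions))     # 0.666...
```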

4. Object clustering
In the following text, we will consider the simpler case in which all variables are of the same type (for the mixed case see [13]).

Binary variables

If objects are characterized only by binary variables, then the usual process consists of creating the proximity matrix, followed by the application of hierarchical cluster analysis.


Some software systems offer special measures for this purpose (the SPSS system offers 26 measures including general ones; the SYSTAT system offers 5 measures). The formulas of these measures are usually expressed by means of frequencies from the contingency table. Let the symbols from Table 3 be given.

Table 3: Two-way frequency table for objects x_i and x_j

              x_j = 1    x_j = 0
  x_i = 1        a          b
  x_i = 0        c          d

(Here a, b, c and d are the numbers of variables with the corresponding combination of values in the two objects.)

In the case of symmetric variables, Sokal and Michener's simple matching coefficient is used, for example. For two objects, it is the ratio of the number of variables with the same values (0 or 1) in both objects to the total number of variables:

$$S = \frac{a + d}{a + b + c + d}. \qquad (1)$$

The similarity between two objects characterized by asymmetric variables can be measured by Jaccard's coefficient. Its value is the ratio of the number of variables which have the value 1 for both objects to the number of variables with at least one value equal to 1:

$$S = \frac{a}{a + b + c}. \qquad (2)$$

Further, we can apply Yule's Q, which is calculated by the formula

$$Q = \frac{ad - bc}{ad + bc}. \qquad (3)$$

The publications [11] and [13] provide a more detailed treatment of these measures for binary variables. However, general measures can also be applied. For example, the Euclidean distance and the coefficient of disagreement (designed for data files with nominal variables, see below) can be used; the latter is the complement of the simple matching coefficient to the value 1. Further, the gamma coefficient (designed for clustering ordinal variables, see below) is a measure suitable for this purpose; in the case of binary variables it coincides with Yule's Q, see formula (3). The coefficient of disagreement is provided by the SYSTAT and STATISTICA systems, the gamma coefficient by SYSTAT. In addition, a proximity matrix created by other means can serve as an input for hierarchical cluster analysis; the SYSTAT system provides the possibility to create such matrices on the basis of 13 measures applicable to binary variables.
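A small sketch of how the three coefficients can be computed from two binary vectors; the 2×2 frequencies a, b, c, d are those of Table 3 (a minimal illustration, not code from any of the packages mentioned):

```python
def two_way_frequencies(x, y):
    """Frequencies a, b, c, d of Table 3 for two binary vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):          # formula (1)
    a, b, c, d = two_way_frequencies(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):                  # formula (2)
    a, b, c, d = two_way_frequencies(x, y)
    return a / (a + b + c)

def yule_q(x, y):                   # formula (3)
    a, b, c, d = two_way_frequencies(x, y)
    return (a * d - b * c) / (a * d + b * c)

x = [1, 0, 1, 1, 0, 0]
y = [1, 0, 0, 1, 0, 1]
print(simple_matching(x, y), jaccard(x, y), yule_q(x, y))
```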

Monothetic divisive cluster analysis can be applied to objects characterized by symmetric binary variables. It starts from one cluster, which is split into two clusters. Any variable can serve for this purpose (one group will contain the ones in this variable, the second group the zeros). If we denote the number of variables as m, then m possibilities exist for splitting the data file into two groups of objects; for the next split, m − 1 possibilities exist, etc. The criterion for splitting is based on a measurement of the dependency of two variables. This method is called MONA (MONothetic Analysis) in [8] and in the S-PLUS system. In this algorithm, the measure

$$|a_{kl} d_{kl} - b_{kl} c_{kl}|$$

is used for the evaluation of the dependency between the k-th and l-th variables, where a_kl, b_kl, c_kl and d_kl are the frequencies in the contingency table created for these variables. For each l-th variable, the value

$$\sum_{k \neq l} |a_{kl} d_{kl} - b_{kl} c_{kl}|$$

is calculated, and the objects are split according to the variable for which the maximum of these values is achieved; one splitting step is sketched below. Further, the k-means and k-medians methods with Yule's Q, see formula (3), can be applied to data files with asymmetric dichotomous variables in SYSTAT.
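A sketch of one splitting step following the MONA principle, assuming the |ad − bc| association measure given above (function and variable names are ours):

```python
def association(data, k, l):
    """|a*d - b*c| for binary variables k and l over the data matrix."""
    a = sum(1 for row in data if row[k] == 1 and row[l] == 1)
    b = sum(1 for row in data if row[k] == 1 and row[l] == 0)
    c = sum(1 for row in data if row[k] == 0 and row[l] == 1)
    d = sum(1 for row in data if row[k] == 0 and row[l] == 0)
    return abs(a * d - b * c)

def mona_split(data):
    """Split the objects on the variable with the largest total association."""
    m = len(data[0])
    totals = [sum(association(data, k, l) for k in range(m) if k != l)
              for l in range(m)]
    best = totals.index(max(totals))
    ones = [row for row in data if row[best] == 1]
    zeros = [row for row in data if row[best] == 0]
    return best, ones, zeros

data = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
print(mona_split(data))
```

In the full divisive method, the same step is applied recursively to each of the two resulting groups.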

Nominal variables

The typical process for data files with nominal variables is the creation of the proximity matrix on the basis of the simple matching coefficient and the application of hierarchical cluster analysis. The simple matching coefficient is the ratio of the number of pairs with the same values in both elements to the total number of variables (when objects are clustered); the Sokal–Michener coefficient (1) is a special case of it. For the i-th and j-th objects it can be written as

$$S_{ij} = \frac{1}{m} \sum_{l=1}^{m} S_{ijl},$$

where m is the number of variables, S_ijl = 1 if x_il = x_jl (the values of the l-th variable are equal for the i-th and j-th objects), and S_ijl = 0 otherwise. Dissimilarity is the complement of the simple matching coefficient to the value 1, i.e. D_ij = 1 − S_ij. This coefficient of disagreement expresses the ratio of the number of pairs with distinct values to the total number of variables (it is implemented in the STATISTICA and SYSTAT systems).

Another measure of the relationship between two objects (and also between two clusters) is the log-likelihood distance measure. Its implementation in software systems is linked with two-step cluster analysis in SPSS. This method has been designed for clustering a large number of objects and is based on the BIRCH method, which uses the principle of trees, see [15] and [16]. The log-likelihood distance measure is determined for data files with combinations of quantitative and qualitative variables. Dissimilarity is expressed on the basis of variability, whereby the entropy is applied to categorical variables. For the l-th variable in the g-th cluster, the entropy can be written as

$$H_{gl} = -\sum_{u=1}^{K_l} \frac{n_{glu}}{n_g} \ln \frac{n_{glu}}{n_g},$$

where K_l is the number of categories of the l-th variable, n_glu represents the frequency of the u-th category of the l-th variable in the g-th cluster, and n_g is the number of objects in the g-th cluster. Two objects are the most similar if the cluster composed of them has the smallest entropy.

Other specific methods exist in addition to the techniques mentioned above for clustering objects characterized by nominal variables. There are both k-clustering methods and modifications of the hierarchical approaches. The k-means and k-medians methods are the basis for the former. Assume that the l-th variable takes the values v_lu (u = 1, 2, ..., K_l). Each cluster is represented by an m-dimensional vector which contains either the categories with the highest frequencies (in the k-modes method, see [6] and [7]) or the frequencies of all individual categories of the variables (in the k-histograms method, see [4]). These vectors are special types of centroids, and specific dissimilarity measures are applied. In the case of the k-modes algorithm, a measure based on the simple matching coefficient is used. However, we obtain only a locally optimal solution, which depends on the order of the objects in the data file, as in the case of clustering by the k-means algorithm.
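A minimal sketch of the k-modes step, with simple-matching dissimilarity and modes as centroids; the names are ours and the code follows [6] and [7] only loosely:

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of positions in which two categorical vectors differ."""
    return sum(1 for u, v in zip(x, y) if u != v)

def mode_vector(members):
    """Centroid of a cluster: the most frequent category of each variable."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*members)]

def k_modes(objects, k, max_iter=100):
    modes = [list(obj) for obj in objects[:k]]   # simple initialization
    assignment = [0] * len(objects)
    for _ in range(max_iter):
        changed = False
        for i, obj in enumerate(objects):
            dists = [matching_dissimilarity(obj, m) for m in modes]
            best = dists.index(min(dists))
            if best != assignment[i]:
                assignment[i], changed = best, True
        for h in range(k):
            members = [o for i, o in enumerate(objects) if assignment[i] == h]
            if members:
                modes[h] = mode_vector(members)
        if not changed:
            break
    return assignment, modes

data = [["a", "x"], ["a", "y"], ["b", "y"], ["b", "y"]]
print(k_modes(data, 2))
```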


ROCK and CACTUS are additional special methods. The ROCK (RObust Clustering using linKs) algorithm, see [3], is based on the principle of hierarchical clustering. First, a random sample of objects is chosen; these objects are clustered into the desired number of clusters, and then the remaining objects are assigned to the created clusters. The method uses graph concepts, the main terms being neighbors and links. A neighbor of a certain object is an object whose similarity to the investigated object is equal to or greater than a predefined threshold. A link between two objects is the number of common neighbors of these objects. The principle of the ROCK method lies in the maximization of a function which takes into account both the maximization of the sums of links for objects from the same cluster and the minimization of the sums of links for objects from different clusters. Let us denote by S(x_i, x_j) the similarity measure between objects x_i and x_j; this measure can achieve values between 0 and 1. If we define the threshold T so that T ∈ ⟨0; 1⟩, then the objects x_i and x_j are neighbors if the condition S(x_i, x_j) ≥ T is satisfied. For binary data, Jaccard's similarity coefficient, see formula (2), is used in the algorithm; the similarity in the case of data files with multi-categorical variables is investigated on the same principle. If a value is missing, the corresponding variable is omitted from the comparison. The second means to be used is the link, i.e. the number of common neighbors of objects x_i and x_j, denoted as link(x_i, x_j) in the text that follows. The greater the value of the link, the greater the probability that objects x_i and x_j belong to the same cluster. The resulting clusters are determined by maximization of the function

$$E = \sum_{h=1}^{k} n_h \sum_{x_i, x_j \in C_h} \frac{\mathrm{link}(x_i, x_j)}{n_h^{1 + 2f(T)}},$$

where n_h is the size of cluster C_h. Each object belonging to the h-th cluster has approximately n_h^{f(T)} neighbors in this cluster, whereas for binary data the f(T) function is determined by the formula

$$f(T) = \frac{1 - T}{1 + T}.$$


The value n_h^{1+2f(T)} is the expected number of links between pairs of objects in the h-th cluster. The merging of clusters C_h and C_h' is realized by means of the measure

$$g(C_h, C_{h'}) = \frac{\mathrm{link}[C_h, C_{h'}]}{(n_h + n_{h'})^{1+2f(T)} - n_h^{1+2f(T)} - n_{h'}^{1+2f(T)}},$$

where link[C_h, C_h'] is the number of links between the objects of the two clusters. The pair most suitable for merging is the pair of clusters for which this measure attains its maximum value. In the final phase, the remaining objects are assigned to the created clusters. From each h-th cluster, a set of objects is selected according to which the remaining objects are assigned (this set will be denoted as L_h and the number of objects in it as |L_h|). Each remaining object is assigned to the cluster in which it has the most neighbors from this set, after normalization: if we denote the number of its neighbors in the L_h set as N_h, then the object is assigned to the cluster for which the value of the expression

$$\frac{N_h}{(|L_h| + 1)^{f(T)}}$$

is maximal, whereas (|L_h| + 1)^{f(T)} is the expected number of neighbors of the object with respect to the L_h set.
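The neighbor and link computations at the core of ROCK can be sketched as follows (Jaccard similarity with threshold T; a simplified illustration of the definitions above, not the full sampling algorithm of [3]):

```python
def jaccard(x, y):
    """Jaccard similarity of two binary vectors, formula (2)."""
    a = sum(1 for u, v in zip(x, y) if u == v == 1)
    bc = sum(1 for u, v in zip(x, y) if u != v)
    return a / (a + bc) if a + bc else 0.0

def neighbors(objects, T):
    """neigh[i] = set of objects whose similarity to object i is at least T."""
    n = len(objects)
    return [{j for j in range(n) if j != i and jaccard(objects[i], objects[j]) >= T}
            for i in range(n)]

def link(neigh, i, j):
    """Number of common neighbors of objects i and j."""
    return len(neigh[i] & neigh[j])

objects = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
neigh = neighbors(objects, T=0.5)
print([[link(neigh, i, j) for j in range(4)] for i in range(4)])
```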

The CACTUS (CAtegorical ClusTering Using Summaries) algorithm, see [1], is based on the idea of common occurrences of certain categories of different variables. If the difference between the number of occurrences of the categories v_kt and v_lu of the k-th and l-th variables and the expected frequency (under the assumption of a uniform distribution within the categories of the remaining variables, and the assumption of independence) is greater than a user-defined threshold, the categories are strongly connected. The algorithm has three phases: summarization, clustering and verification. During clustering, candidates for clusters are chosen, from which the final clusters are determined in the verification phase.

Ordinal variables

Among the specialized methods we can mention the k-median method, in which the vectors of the medians of the individual variables are used as centroids. The application of the Manhattan distance (city block distance) is recommended, which is defined as

$$D(x_i, x_j) = \sum_{l=1}^{m} |x_{il} - x_{jl}|$$

for the vectors x_i and x_j.
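A sketch of the Manhattan distance and the median centroid used by the k-median method, assuming ordinal categories coded 1..k as in Section 3 (names are ours):

```python
import statistics

def manhattan(x, y):
    """City block distance between two vectors of ordinal codes."""
    return sum(abs(u - v) for u, v in zip(x, y))

def median_centroid(members):
    """Centroid of a cluster: the median of each variable."""
    return [statistics.median(col) for col in zip(*members)]

cluster = [[1, 2, 3], [2, 2, 4], [1, 3, 4]]
print(median_centroid(cluster))                    # [1, 2, 4]
print(manhattan([1, 2, 3], median_centroid(cluster)))
```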

In the SYSTAT system, the gamma coefficient can be used. It is described in the following section in connection with the measurement of similarities between ordinal variables.


5. Variable clustering
Clustering of categorical variables is usually realized by the application of hierarchical cluster analysis to a proximity matrix created on the basis of suitable similarity measures.

Binary variables

If variables are binary and symmetric, then one can use both the simple matching coefficient, see formula (1), and Pearson's correlation coefficient, which can be expressed (with the symbols from Table 3) as

$$r = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}. \qquad (4)$$

For asymmetric variables, the gamma coefficient can be applied, for example; in binary data analysis it is called Yule's Q, see formula (3). Moreover, some other specific coefficients, as well as proximity matrices created by different means, can be used.

Nominal variables

For the determination of the dissimilarity of nominal variables, the coefficient of disagreement is offered in some software packages (STATISTICA, SYSTAT). It expresses the ratio of the number of pairs of different values to the total number of objects, and it can be calculated from the simple matching coefficient by subtracting it from one. For the k-th and l-th variables, the simple matching coefficient can be expressed as

$$S_{kl} = \frac{1}{n} \sum_{i=1}^{n} S_{kli},$$

where n is the number of objects, S_kli = 1 if x_ik = x_il (the values of the k-th and l-th variables are the same for the i-th object), and S_kli = 0 otherwise. The disagreement coefficient is then calculated according to the formula D_kl = 1 − S_kl. Theoretically, there is a wider range of possibilities because symmetric measures of dependency can be used. They do not occur in the procedures for cluster analysis, but a proximity matrix created by different means can serve as a basis for the analysis. The well-known measures are derived from Pearson's chi-square statistic, which is calculated according to the formula

$$\chi^2 = \sum_{r=1}^{K_k} \sum_{s=1}^{K_l} \frac{(n_{rs} - M_{rs})^2}{M_{rs}}, \qquad (5)$$

where K_k is the number of categories of the k-th variable, K_l is the number of categories of the l-th variable, n_rs is the frequency in the contingency table (in the r-th row and the s-th column), and M_rs is the expected frequency under the assumption of independence, i.e.

$$M_{rs} = \frac{n_{r+} \, n_{+s}}{n},$$


where n_r+ and n_+s are the marginal frequencies, expressed as

$$n_{r+} = \sum_{s=1}^{K_l} n_{rs}, \qquad n_{+s} = \sum_{r=1}^{K_k} n_{rs}.$$

This statistic is the basis for the phi coefficient, which is calculated by the formula

$$\varphi = \sqrt{\frac{\chi^2}{n}}.$$

Further, we can mention Pearson's coefficient of contingency, calculated as

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}.$$

Cramér's V is another example of this type of similarity coefficient. It is expressed as

$$V = \sqrt{\frac{\chi^2}{n(q - 1)}},$$

where q = min{K_k, K_l}. For two binary variables, the value is the same as the value of the phi coefficient. More symmetric dependency measures can be derived from pairs of asymmetric measures; the contingency table is again the basis. We can distinguish the row variable X_k and the column variable X_l. If we investigate the dependency of the column variable X_l on the row variable X_k, two situations can occur: {1} the columns are independent of the rows, {2} the columns depend on the values of variable X_k. Let us have a new object for which we know the value of variable X_k but not the value of variable X_l. If we suppose situation {1}, we estimate the value of variable X_l as the category with the maximal column subtotal of relative frequencies, p_+Mo = max_s(p_+s), where p_+s = n_+s/n; the probability of error can be expressed as P{1} = 1 − p_+Mo. If we suppose situation {2}, we estimate the value of variable X_l according to the row maximum corresponding to the known value of variable X_k. Let us denote this maximum as p_rMo = max_s(p_rs), where p_rs = n_rs/n is the relative frequency in the r-th row and the s-th column; the probability of error then equals P{2} = 1 − p_rMo. The proportional reduction in error can be calculated according to the scheme

$$PRE = \frac{P\{1\} - P\{2\}}{P\{1\}}.$$

Goodman and Kruskal's lambda coefficient is based on this formula. The asymmetric coefficient can be written as

$$\lambda(X_l \mid X_k) = \frac{\sum_{r=1}^{K_k} p_{rMo} - p_{+Mo}}{1 - p_{+Mo}}.$$


For the symmetric coefficient, the following probabilities are considered:

$$P\{1\} = (1 - p_{+Mo}) + (1 - p_{Mo+}), \qquad P\{2\} = \Bigl(1 - \sum_{r=1}^{K_k} p_{rMo}\Bigr) + \Bigl(1 - \sum_{s=1}^{K_l} p_{Mos}\Bigr),$$

where p_Mo+ = max_r(p_r+) and p_Mos = max_r(p_rs).

The final formula is either

$$\lambda = \frac{\sum_{r=1}^{K_k} p_{rMo} + \sum_{s=1}^{K_l} p_{Mos} - p_{+Mo} - p_{Mo+}}{2 - p_{+Mo} - p_{Mo+}}$$

or, expressed with absolute frequencies,

$$\lambda = \frac{\sum_{r=1}^{K_k} n_{rMo} + \sum_{s=1}^{K_l} n_{Mos} - n_{+Mo} - n_{Mo+}}{2n - n_{+Mo} - n_{Mo+}}.$$

The uncertainty coefficient investigates the dependency in more detail. It is based on the principle of the analysis of variance. If variability is expressed by characteristics other than the variance, then the measure of the dependency of variable X_l on variable X_k can be written as the ratio

$$\frac{\mathrm{var}(X_l) - \sum_{r=1}^{K_k} p_{r+} \, \mathrm{var}(X_l \mid X_k = v_{kr})}{\mathrm{var}(X_l)},$$

where var(X_l) is the variability of the dependent variable, var(X_l | X_k = v_kr) is the variability within a group, and v_kr is the r-th category of the independent (explanatory) variable X_k. The variability of a nominal variable can be measured by different means. The uncertainty coefficient is based on the entropy, which can be written as

$$H(X_l) = -\sum_{u=1}^{K_l} p_{lu} \ln p_{lu},$$

where p_lu is the relative frequency of the u-th category of the l-th variable. The symmetric measure is calculated as the harmonic mean of both asymmetric measures. The final formula is usually written in the simplified form

$$U = \frac{2 \, [H(X_k) + H(X_l) - H(X_k X_l)]}{H(X_k) + H(X_l)},$$

where H(X_k X_l) is the entropy of the joint distribution of X_k and X_l.
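A sketch computing a few of the chi-square-based and PRE measures above from a contingency table (plain Python; a minimal illustration under the formulas as reconstructed here, with names of our own choosing):

```python
from math import sqrt

def chi_square(table):
    """Pearson's chi-square statistic (5) for a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = sum((table[r][s] - row_tot[r] * col_tot[s] / n) ** 2
               / (row_tot[r] * col_tot[s] / n)
               for r in range(len(table)) for s in range(len(col_tot)))
    return n, chi2

def cramers_v(table):
    n, chi2 = chi_square(table)
    q = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (q - 1)))

def lambda_asym(table):
    """Goodman and Kruskal's lambda of the column variable on the row variable."""
    n = sum(sum(row) for row in table)
    col_tot = [sum(col) for col in zip(*table)]
    return (sum(max(row) for row in table) - max(col_tot)) / (n - max(col_tot))

table = [[30, 10, 5], [8, 25, 12]]
print(cramers_v(table), lambda_asym(table))
```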

Ordinal variables

The dependency of ordinal variables is denoted as rank correlation, and its intensity is expressed by correlation coefficients. The best known among them is Spearman's correlation coefficient. If the investigated ordinal variables express an unambiguous ranking, the following formula can be used:

$$\rho_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$

where d_i is the difference between the ranks of the i-th object in the two variables.


If this assumption is not satisfied, the process described in [12] must be applied. Further measures investigate pairs of objects. If, in a pair of objects, the values of both investigated variables are greater (or both less) for one of the objects, the pair is denoted as concordant. If for one variable the value is greater and for the second one it is less, the pair is denoted as discordant. In the other cases (the same values for both objects exist for at least one variable), the pairs are tied. For the sake of simplification, we will use the following symbols:

P – the number of concordant pairs,
Q – the number of discordant pairs,
T_k – the number of pairs with the same values of variable X_k but distinct values of variable X_l,
T_l – the number of pairs with the same values of variable X_l but distinct values of variable X_k.

Goodman and Kruskal's gamma is a symmetric measure. It is expressed as

$$\gamma = \frac{P - Q}{P + Q}.$$

For two binary variables, it can be written as

$$\gamma = \frac{ad - bc}{ad + bc}$$

and it coincides with Yule's Q, see formula (3). Another symmetric measure is Kendall's tau-b (Kendall's rank correlation coefficient). It is expressed as

$$\tau_b = \frac{P - Q}{\sqrt{(P + Q + T_k)(P + Q + T_l)}}.$$

For two binary variables, the value of this coefficient is the same as the value of Pearson's correlation coefficient, see formula (4). Another correlation coefficient is the tau-c coefficient, which is denoted either Kendall's tau-c (SPSS) or Stuart's tau-c (SYSTAT, SAS). The formula is the following:

$$\tau_c = \frac{2q(P - Q)}{n^2(q - 1)},$$

where q = min{K_k, K_l}. Further, Somers' d is used. Both symmetric and asymmetric types of this measure exist. The asymmetric one is expressed as

$$d(X_l \mid X_k) = \frac{P - Q}{P + Q + T_l}.$$

The symmetric measure is calculated as the harmonic mean of both asymmetric measures, i.e. the final formula is

$$d = \frac{2(P - Q)}{2(P + Q) + T_k + T_l}.$$

The features of the measures mentioned in this chapter are described in [10] and [12]; a sketch of the pair counting follows below.
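A direct, O(n²) sketch of the pair counts P, Q, T_k, T_l and of two of the measures built from them (the names are ours):

```python
from math import sqrt
from itertools import combinations

def pair_counts(xk, xl):
    """Concordant (P), discordant (Q) and tied (T_k, T_l) pair counts."""
    P = Q = Tk = Tl = 0
    for (a1, b1), (a2, b2) in combinations(zip(xk, xl), 2):
        if a1 == a2 and b1 != b2:
            Tk += 1
        elif b1 == b2 and a1 != a2:
            Tl += 1
        elif (a1 - a2) * (b1 - b2) > 0:
            P += 1
        elif (a1 - a2) * (b1 - b2) < 0:
            Q += 1
    return P, Q, Tk, Tl

def gamma(xk, xl):
    P, Q, _, _ = pair_counts(xk, xl)
    return (P - Q) / (P + Q)

def tau_b(xk, xl):
    P, Q, Tk, Tl = pair_counts(xk, xl)
    return (P - Q) / sqrt((P + Q + Tk) * (P + Q + Tl))

xk = [1, 1, 2, 2, 3, 3]
xl = [1, 2, 2, 3, 3, 3]
print(gamma(xk, xl), tau_b(xk, xl))
```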


As concerns the possibilities of software packages in the area of the creation of dependency matrices that can be used as input matrices for cluster analysis, the SPSS system offers Pearson's and Spearman's coefficients and Kendall's tau-b. The offer of the SYSTAT system is larger: it includes the phi coefficient, Goodman and Kruskal's lambda, the uncertainty coefficient, Pearson's and Spearman's correlation coefficients, Kendall's tau-b, Stuart's tau-c, and Goodman and Kruskal's gamma.

6. Category clustering
In the case of clustering the categories of a nominal variable, hierarchical cluster analysis is usually applied to a proximity matrix created on the basis of a suitable measure. The contingency table for two categorical variables is the input for the corresponding procedure in a software package. These processes can be applied in SPSS and SYSTAT; moreover, in the SYSTAT system the suitable similarity measures can also be used in the k-means and k-medians methods. The relationships between categories can be measured by means of special coefficients based on the chi-square statistic, see formula (5). For the determination of the dissimilarity between the categories v_ki and v_kj of the k-th (row) variable, we consider the contingency table of dimension 2 × K_l, where K_l is the number of categories of the l-th (column) variable. We can use the chi-square dissimilarity measure, which is written as

$$D(v_{ki}, v_{kj}) = \sqrt{\sum_{s=1}^{K_l} \left[ \frac{(n_{is} - m_{is})^2}{m_{is}} + \frac{(n_{js} - m_{js})^2}{m_{js}} \right]},$$

where n_is and n_js are the frequencies in the two rows of this table and m_is and m_js are the corresponding expected frequencies under the assumption of independence. Further, the phi coefficient can be used. It is calculated according to the formula

$$D_\varphi(v_{ki}, v_{kj}) = \sqrt{\frac{\chi^2}{n_{i+} + n_{j+}}},$$

where χ² is the statistic under the square root above and n_i+ + n_j+ is the total frequency of the 2 × K_l table.

In both cases, the coefficients measure dissimilarity and D_rr = 0; a sketch follows below.
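A sketch of the chi-square dissimilarity between two categories, computed from the corresponding two rows of the contingency table as reconstructed above (the data are hypothetical):

```python
from math import sqrt

def chi_square_dissimilarity(row_i, row_j):
    """Square root of the chi-square statistic of the 2 x Kl table
    formed by the two category rows."""
    n = sum(row_i) + sum(row_j)
    ni, nj = sum(row_i), sum(row_j)
    chi2 = 0.0
    for s in range(len(row_i)):
        col = row_i[s] + row_j[s]
        if col == 0:
            continue
        e_i, e_j = ni * col / n, nj * col / n   # expected frequencies
        chi2 += (row_i[s] - e_i) ** 2 / e_i + (row_j[s] - e_j) ** 2 / e_j
    return sqrt(chi2)

# Hypothetical rows of a contingency table (two categories of the row variable).
print(chi_square_dissimilarity([20, 5, 3], [4, 18, 9]))
```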

7. Examples of applications
In this chapter, two examples will be presented: variable clustering and category clustering. The data file comes from the survey "Male and female with university diploma", No. 0136, Institute of Sociology of the Academy of Sciences of the Czech Republic. The author of this survey is the Gender in Sociology team; the data collection was performed by Sofres-Factum (Praha, 1998).


Example 1: variable clustering

For this purpose, 13 variables expressing the respondent's satisfaction with different aspects of his or her job were analyzed. Respondents evaluated their satisfaction on a scale from 1 (very satisfied) to 4 (very dissatisfied). A similarity matrix based on Kendall's tau-b was created in the SPSS system. This matrix was transformed to a dissimilarity matrix by subtracting the values from 1 in Microsoft Excel. The transformed matrix was analyzed by the complete linkage method of hierarchical cluster analysis (the distance between two clusters is determined by the greatest distance between two objects from these clusters) in the STATISTICA system (for the reason of better quality of the graphs); a rough equivalent of this pipeline is sketched below. The resulting dendrogram is shown in Figure 1. If we cut the dendrogram at the distance 0.6, we obtain 6 clusters. The first cluster represents satisfaction with salary, remuneration and the evaluation of work performance. The further clusters represent the following groups of variables: satisfaction with the perspective with the company and the possibility of promotion; satisfaction with relationships in the company and relationships between males and females; satisfaction with the scope of employment and the use of the respondent's degree of education; satisfaction with the management of the company, the respondent's supervisor and the possibility to express one's own opinion; and satisfaction with the working burden.

Figure 1: Dendrogram of relationships among variables
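The pipeline of Example 1 can be reproduced roughly as follows, using pandas and SciPy instead of the SPSS + Excel + STATISTICA combination; the file name and data loading are placeholders, not part of the original analysis:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Placeholder: load the 13 satisfaction items (scale 1-4) into a DataFrame.
data = pd.read_csv("satisfaction.csv")      # hypothetical file name

sim = data.corr(method="kendall")           # pandas computes Kendall's tau-b
dis = 1 - sim                               # similarity -> dissimilarity

# Complete linkage on the condensed form of the dissimilarity matrix.
Z = linkage(squareform(dis.values, checks=False), method="complete")
dendrogram(Z, labels=dis.columns.tolist())
plt.show()
```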


Example 2: category clustering

In this case, the categories of the variable expressing the specialization of the university studies were clustered on the basis of the categories of the variable containing information about the obtained university diploma: magister (Mgr.), engineer (Ing.) and doctor (Dr. – RNDr., MUDr., JUDr. etc.). The respondents with a bachelor diploma (Bc.) were omitted from the analysis. The contingency table for these two variables is in Table 4.

Table 4: Contingency table for the variables Diploma and Specialization

In Table 5, the proximity matrix based on the chi-square dissimilarity measure is displayed. It was created using the SPSS system.

Table 5: Proximity matrix for the categories of the variable Specialization

The proximity matrix was analyzed by the complete linkage method of hierarchical cluster analysis in the STATISTICA system. The resulting dendrogram is shown in Figure 2.


Figure 2: Dendrogram of relationships among categories

This example is only illustrative. It is well known that graduates of certain faculties and universities obtain a specific diploma. Graduates of faculties specializing in the natural and social sciences (including law) usually obtain the Mgr. diploma first, but some of them continue their studies for a doctoral diploma (the RNDr. diploma in the natural sciences and the JUDr. diploma in law); physicians obtain the MUDr. diploma. Graduates of faculties specializing in pedagogy and art usually obtain the Mgr. diploma. Graduates of universities specializing in the technical sciences, economics and the agricultural sciences usually obtain the Ing. diploma. We obtain these three clusters if we cut the dendrogram at the distance 10.

8. Further directions of development


Although a lot of approaches and methods for clustering categorical data have been proposed in the literature, the capabilities of statistical software packages are limited. One expected direction of development is the implementation of more algorithms into software products. Besides clustering itself, the programs should offer further processing and analyses: imputation of missing values, choice of variables for object clustering and dimensionality reduction, identification of outliers, and determination of the optimal number of clusters. Researchers are presently focusing on two areas: clustering of large data files, and online clustering in which additional objects arrive during the analysis (e.g. web pages).


Another area to be addressed is the clustering of data files with mixed types of variables; among commercial software packages, only the two-step cluster analysis in the SPSS system makes such clustering possible.

References
[1] Ganti, V., Gehrke, J., Ramakrishnan, R. CACTUS – Clustering categorical data using summaries. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego: ACM Press, 1999, 73–83.
[2] Gordon, A. D. Classification, 2nd ed. Boca Raton: Chapman & Hall/CRC, 1999.
[3] Guha, S., Rastogi, R., Shim, K. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25 (5), 2000, 345–366.
[4] He, X., Ding, C. H. Q., Zha, H., Simon, H. D. Automatic topic identification using webpage clustering. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM '01), 2001, 195–203.
[5] Hebák, P., Hustopecký, J., Pecáková, I., Plašil, M., Průša, M., Řezanková, H., Svobodová, A., Vlach, P. Vícerozměrné statistické metody (3), 2nd ed. Praha: Informatorium, 2007.
[6] Huang, Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, University of British Columbia, 1997, 1–8.
[7] Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2, 1998, 283–304.
[8] Kaufman, L., Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley, 2005.
[9] Mirkin, B. Clustering for Data Mining: A Data Recovery Approach. Boca Raton: Chapman & Hall/CRC, 2005.
[10] Pecáková, I. Statistika v terénních průzkumech. Praha: Professional Publishing, 2008.
[11] Řezanková, H. Measurement of binary variables similarities. Acta Oeconomica Pragensia, 9 (3), 2001, 129–136.
[12] Řezanková, H. Analýza dat z dotazníkových šetření. Praha: Professional Publishing, 2007.
[13] Řezanková, H., Húsek, D., Snášel, V. Shluková analýza dat, 2nd ed. Praha: Professional Publishing, 2009.
[14] Stankovičová, I., Vojtková, M. Viacrozmerné štatistické metódy s aplikáciami. Bratislava: Iura Edition, 2007.
[15] Zhang, T., Ramakrishnan, R., Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25 (2), 1996, 103–114.
[16] Žambochová, M. Algoritmus BIRCH a jeho varianty pro shlukování velkých souborů dat. Mezinárodní statisticko-ekonomické dny [CD-ROM]. Praha: VŠE, 2008.
Hana Řezanková, Fakulta informatiky a statistiky VŠE v Praze, nám. W. Churchilla 4, 130 67 Praha 3 – Žižkov, e-mail: hana.rezankova@vse.cz


Abstract
This paper deals with specific techniques proposed for cluster analysis when a data file includes categorical variables. Nominal, ordinal and dichotomous variables are considered as categorical. Three types of clustering are described: object clustering, variable clustering and category clustering. Both specific coefficients for the measurement of similarity and specific methods are mentioned. Two illustrative examples are included in the paper. One of them shows variable clustering (the variables express the respondent's satisfaction with different aspects of his or her job) and the second one concerns category clustering (the specializations of the respondents are clustered according to the type of university diploma); a combination of the SPSS and STATISTICA software systems is applied in both examples.

Key words: cluster analysis, categorical data analysis, similarity measures, dissimilarity measures.

