Categorical Similarity Matrix

This summary calculates pairwise similarity measures between cases, using one or more selected categorical variables. The similarity values are presented in the form of a matrix with heatmap coloring to represent the strength of each relationship.

The dialog for displaying a similarity matrix for one or more categorical variables

Categorical similarities are calculated between rows of the data table. Each row must therefore have a unique identifier (a candidate key). If the data table does not have a candidate key, one can be created with the Table/Add unique row IDs menu item.
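The candidate key requirement can be illustrated with a minimal Python sketch. It is not the program's own code: the function names, the `row_id` column name, and the representation of the table as a list of dictionaries are all assumptions made for illustration. A column qualifies as a candidate key here if every row has a non-missing value and no value repeats.

```python
def is_candidate_key(rows, column):
    """Return True if `column` has a non-missing, unique value on every row."""
    values = [row.get(column) for row in rows]
    if any(v is None or v == "" for v in values):
        return False
    return len(set(values)) == len(values)


def add_row_ids(rows, column="row_id"):
    """Return copies of the rows with a sequential text ID added.

    The column name and the ID format are hypothetical; the IDs that the
    Table/Add unique row IDs menu item generates may look different.
    """
    return [{**row, column: str(i)} for i, row in enumerate(rows, start=1)]


rows = [
    {"site": "A", "habitat": "forest"},
    {"site": "B", "habitat": "marsh"},
    {"site": "B", "habitat": "forest"},
]

# "site" is not a candidate key because the value "B" repeats.
site_is_key = is_candidate_key(rows, "site")

# After sequential IDs are added, the new column is a candidate key.
keyed = add_row_ids(rows)
row_id_is_key = is_candidate_key(keyed, "row_id")
print(site_is_key, row_id_is_key)
```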

Five different similarity measures can be calculated. These are described and contrasted by Boriah et al. (2008). The similarity measures are:

  • Lin – A measure based on information theory that gives greater weight to matches on frequently-occurring categorical values, and less weight to mismatches on infrequently-occurring categorical values.

  • Goodall3 – A measure based on the probability that a similarity value would be observed in a random sample of two rows, given the frequency distribution of category values over all rows. Matches of infrequently-occurring categorical values are given relatively high weight.

  • OF – The Occurrence Frequency measure gives greater weight to mismatches on frequently-occurring categorical values than to mismatches on infrequently-occurring categorical values.

  • IOF – The Inverse Occurrence Frequency measure gives greater weight to mismatches on infrequently-occurring categorical values than to mismatches on frequently-occurring categorical values; this is the opposite of the OF measure.

  • Overlap – A measure that considers the values of a categorical variable to be similar only if the values are identical on the two rows being compared.
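As an illustration of how three of these measures differ, the following Python sketch implements the Overlap, OF, and IOF per-variable formulas as given by Boriah et al. (2008) and averages them over the selected variables. Whether the program computes the measures in exactly this way is an assumption, the function names are hypothetical, and "weight" in the descriptions above is read here as the similarity value assigned to a mismatch.

```python
import math
from collections import Counter


def overlap(x, y, freq, n):
    """1 for identical values, 0 otherwise; frequencies are ignored."""
    return 1.0 if x == y else 0.0


def occurrence_frequency(x, y, freq, n):
    """OF: a mismatch scores higher when both values are frequent."""
    if x == y:
        return 1.0
    return 1.0 / (1.0 + math.log(n / freq[x]) * math.log(n / freq[y]))


def inverse_occurrence_frequency(x, y, freq, n):
    """IOF: a mismatch scores higher when both values are infrequent."""
    if x == y:
        return 1.0
    return 1.0 / (1.0 + math.log(freq[x]) * math.log(freq[y]))


def similarity(row_a, row_b, columns, rows, measure):
    """Average the per-variable similarity over the selected columns."""
    n = len(rows)
    total = 0.0
    for col in columns:
        freq = Counter(row[col] for row in rows)  # value counts over all rows
        total += measure(row_a[col], row_b[col], freq, n)
    return total / len(columns)


# Two frequent values (forest, marsh) and two infrequent ones (dune, bog).
habitats = ["forest"] * 4 + ["marsh"] * 4 + ["dune"] * 2 + ["bog"] * 2
rows = [{"habitat": h} for h in habitats]
cols = ["habitat"]

of_frequent = similarity(rows[0], rows[4], cols, rows, occurrence_frequency)
of_rare = similarity(rows[8], rows[10], cols, rows, occurrence_frequency)
iof_frequent = similarity(rows[0], rows[4], cols, rows, inverse_occurrence_frequency)
iof_rare = similarity(rows[8], rows[10], cols, rows, inverse_occurrence_frequency)

print("OF : forest/marsh", of_frequent, " dune/bog", of_rare)
print("IOF: forest/marsh", iof_frequent, " dune/bog", iof_rare)
```

With these made-up data, OF rates the forest/marsh mismatch as more similar than the dune/bog mismatch, and IOF ranks the two pairs the other way around. Overlap treats both mismatches identically.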

Different similarity measures may be most appropriate for different data sets. Boriah et al. (2008) found that the Lin, Goodall3, and OF measures performed well for outlier detection across a variety of data sets.

All text variables are considered to be potential categorical variables, and can be used to calculate categorical similarities.

At least two cases and one categorical variable must be specified to calculate categorical similarities. Any row that has a missing value in one or more of the selected variables is excluded from the calculation.
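The handling of missing values described here amounts to listwise deletion, which can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and treating `None` and the empty string as missing is an assumption about how missing values are represented.

```python
def drop_incomplete_rows(rows, columns, missing=(None, "")):
    """Keep only the rows that have a value for every selected column."""
    return [
        row for row in rows
        if all(row.get(col) not in missing for col in columns)
    ]


def check_inputs(rows, columns):
    """Raise ValueError unless there are at least 2 rows and 1 column."""
    if len(columns) < 1:
        raise ValueError("at least one categorical variable is required")
    if len(rows) < 2:
        raise ValueError("at least two cases are required")


rows = [
    {"id": "S1", "habitat": "forest", "soil": "clay"},
    {"id": "S2", "habitat": "", "soil": "sand"},      # habitat missing
    {"id": "S3", "habitat": "marsh", "soil": None},   # soil missing
    {"id": "S4", "habitat": "marsh", "soil": "sand"},
]
selected = ["habitat", "soil"]

complete = drop_incomplete_rows(rows, selected)
check_inputs(complete, selected)  # two complete rows remain, so this passes
print([row["id"] for row in complete])
```

Because a row is dropped if any selected variable is missing, adding a sparsely populated variable to the selection can reduce the number of cases in the matrix.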

A candidate key variable must also be specified, and it must be a text variable. If there are multiple candidate keys in the data set, the categorical similarity dialog will prompt for the one to use; if there is only one candidate key, it will be selected automatically. If the data set contains only a numeric candidate key, a corresponding text variable can be created using the Table/Recode data menu item. The axes of the categorical similarity matrix will be labeled with the values of the candidate key.

The categorical similarity dialog prompts for:

  • Whether to use all data in the data table or just the subset that has been selected (e.g., by clicking on the table or map).

  • The candidate key column to use, if there is more than one.

  • One or more categorical variables, from a list that is displayed at the left side of the dialog.

  • The type of categorical similarity to calculate.

A matrix of categorical similarity values is then displayed on the right side of the dialog. The matrix is updated immediately whenever any of the selections on this dialog is changed or, if only selected data are being used, whenever the selected data in the data table change.

The numeric value of the categorical similarity is shown in each cell of the matrix by default. The Alt-L hotkey will toggle display of the numeric values on or off. When the numeric values are not shown, similarities will be represented by only the heatmap coloring.

The values in the matrix are symmetric about the main diagonal.
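The symmetry follows from the fact that the similarity of row A to row B is the same as that of row B to row A, so each pair needs to be computed only once. A minimal Python sketch, using the Overlap measure and a hypothetical nested-dictionary layout keyed by candidate key values (not the program's internal representation), is:

```python
def overlap_similarity(row_a, row_b, columns):
    """Fraction of the selected columns on which the two rows agree."""
    matches = sum(row_a[col] == row_b[col] for col in columns)
    return matches / len(columns)


def similarity_matrix(rows, key, columns):
    """Build matrix[key_a][key_b], computing each pair of rows once."""
    keys = [row[key] for row in rows]
    matrix = {k: {} for k in keys}
    for i, row_a in enumerate(rows):
        for row_b in rows[i:]:
            s = overlap_similarity(row_a, row_b, columns)
            matrix[row_a[key]][row_b[key]] = s
            matrix[row_b[key]][row_a[key]] = s  # mirror across the diagonal
    return keys, matrix


rows = [
    {"id": "S1", "habitat": "forest", "soil": "clay"},
    {"id": "S2", "habitat": "forest", "soil": "sand"},
    {"id": "S3", "habitat": "marsh", "soil": "sand"},
]
keys, matrix = similarity_matrix(rows, "id", ["habitat", "soil"])

# Print the matrix with the candidate key values as row and column labels.
print("    " + "  ".join(keys))
for a in keys:
    print(a, "  ".join(f"{matrix[a][b]:.1f}" for b in keys))
```

In this sketch the diagonal is always 1 because a row agrees with itself on every variable; that property belongs to the Overlap measure and is not claimed here for the other measures.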

The “Source Data” button will display a table of the selected data after removal of missing values. This table can be saved using the Ctrl-S hotkey.

The “Similarities” button will display a table of the categorical similarity values in the matrix. This table can be saved using the Ctrl-S hotkey.