t-SNE Analysis

This analysis tool evaluates the similarity of data rows in the main data table, using the values of three or more numeric variables. The method used is t-Distributed Stochastic Neighbor Embedding (t-SNE; van der Maaten and Hinton, 2008), which reduces the dimensionality of a multivariate data set, in this case to two dimensions. Points that lie close together in the two-dimensional space are more similar to each other in the original multi-dimensional space than they are to more distant points.

The dialog for carrying out a t-SNE analysis

The t-SNE dialog prompts for:

  • Whether to carry out this analysis for all data in the data table or only for the subset that is selected on the map and in the data table.

  • Whether to remove missing values from the data set by eliminating variables (columns) with missing values, eliminating cases (rows) with missing values, or replacing missing values with zeroes. If missing values are replaced with zeroes, any real zeroes in the data set will also be treated as missing values.

  • Whether and how to standardize data values. Standardized values may be Z scores for each variable or L1-norm-standardized values for each row (i.e., the proportion of the row sum).

  • Three or more variables from the list displayed at the left of the dialog.

  • A value for perplexity, which affects the separation of points. Lower values of perplexity show finer structure in the data, but may also show random variation. Higher values produce output that shows only the major groupings of points. If the perplexity value entered is greater than the number of selected data rows, it will automatically be set to half the number of selected data rows.

  • The distance metric to be used to evaluate the multivariate similarity between data points. Available metrics are: Bray-Curtis, Canberra, Chebyshev, Cosine-theta, Correlation, Euclidean, Manhattan, and Minkowski. Euclidean distance is the default.

  • Optionally, a grouping variable. This will not affect the t-SNE calculation, but will be used to color points by group in the scatter plot that is produced.
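
The two standardization options and the perplexity adjustment described in the list above can be illustrated with a short Python sketch. This is not the tool's own code: the function names are hypothetical, the use of the sample standard deviation for Z scores is an assumption, and so is the handling of constant columns and all-zero rows.

```python
from statistics import mean, stdev


def z_scores_by_column(rows):
    """Standardize each variable (column) to mean 0 and unit standard deviation."""
    scaled_columns = []
    for col in zip(*rows):
        m = mean(col)
        s = stdev(col)  # sample SD (assumption); needs at least two rows
        # A constant column has s == 0; map it to zeros instead of dividing by zero.
        scaled_columns.append([(v - m) / s if s else 0.0 for v in col])
    # Transpose back so the result is again a list of rows.
    return [list(row) for row in zip(*scaled_columns)]


def l1_norm_by_row(rows):
    """Rescale each case (row) so its values are proportions of the row total."""
    result = []
    for row in rows:
        # For non-negative data the L1 norm is simply the row sum.
        total = sum(abs(v) for v in row)
        result.append([v / total if total else 0.0 for v in row])
    return result


def adjusted_perplexity(requested, n_rows):
    """A perplexity greater than the number of rows becomes half the number of rows."""
    return n_rows / 2 if requested > n_rows else requested


data = [[1.0, 2.0, 7.0], [2.0, 4.0, 4.0], [3.0, 6.0, 1.0]]
print(z_scores_by_column(data))
print(l1_norm_by_row(data))
print(adjusted_perplexity(50, 20))
```

Z scores put variables measured in different units on a common scale, whereas row-wise L1 standardization compares the relative composition of each row regardless of its total.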

After these values have been set, the “Calculate” button will initiate the t-SNE calculation.

Note that even if three or more variables are selected, removal of missing values by variable deletion may reduce the number of variables to fewer than three. If that occurs, the calculation will not be carried out.
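
The three ways of handling missing values, and the check on the number of remaining variables, can be sketched in Python as follows. This is only an illustration under stated assumptions: missing values are represented by None, and the function names, variable names, and sample values are hypothetical, not taken from the tool.

```python
MIN_VARIABLES = 3  # t-SNE in this tool requires three or more numeric variables


def drop_columns_with_missing(names, rows):
    """Eliminate every variable (column) that has a missing value in any row."""
    keep = [j for j in range(len(names))
            if all(row[j] is not None for row in rows)]
    return [names[j] for j in keep], [[row[j] for j in keep] for row in rows]


def drop_rows_with_missing(rows):
    """Eliminate every case (row) that contains a missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def replace_missing_with_zero(rows):
    """Replace each missing value with zero."""
    return [[0.0 if v is None else v for v in row] for row in rows]


def enough_variables(names):
    """True if enough variables remain for the calculation to be carried out."""
    return len(names) >= MIN_VARIABLES


# Four variables are selected, but two of them contain a missing value.
names = ["var_a", "var_b", "var_c", "var_d"]
rows = [
    [1.2, None, 3.4, 0.5],
    [2.1, 0.7, 2.9, None],
    [1.8, 0.9, 3.1, 0.4],
]
kept_names, kept_rows = drop_columns_with_missing(names, rows)
print(kept_names, enough_variables(kept_names))
```

In this example variable deletion leaves only two variables, so the calculation would not be carried out, whereas row deletion would keep all four variables but only one data row.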

The results of the t-SNE calculation will be displayed in a scatter plot on the right side of the dialog. If points are labeled on the map, hovering over a point on the scatter plot with the mouse will display the label for that point.

The transparency (alpha value) of the points can be modified with the Alt-A keystroke.

The “Data Table” button below the plot will display a table of the coordinates of every point in the two dimensions shown in the scatter plot. Labels and grouping values will be included in this table if they have been specified. The table can be saved using the Ctrl-S keystroke.

The “k-Means clusters / Create” button will perform a k-means cluster analysis (MacQueen, 1967) of the t-SNE coordinate data. A prompt will be presented to select the number of clusters. Uniquely colored wedge symbols will be displayed on the scatter plot to identify the cluster for each data point, and the plot legend will display cluster identifiers.
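
To show what a k-means analysis of two-dimensional coordinates involves, here is a minimal Python sketch. It uses a simple batch (Lloyd-style) iteration with randomly chosen starting centers. That choice is an assumption made for illustration and is not necessarily the variant the tool uses; the function name, parameters, and sample points are hypothetical.

```python
import random


def kmeans_2d(points, k, seed=0, max_iter=100):
    """Assign each (x, y) point to one of k clusters; return (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance; ties go to the lowest cluster index).
        new_labels = [
            min(range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centers


# Two well-separated groups of points standing in for t-SNE coordinates.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans_2d(coords, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the starting centers, so only the grouping of points is meaningful.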

The “k-Means clusters / Add Column” button will prompt for the name of a new or existing column in the data table, and then will add or update that column with the cluster identifiers for each row. This allows the spatial locations of clusters to be highlighted on the map.

The output of the t-SNE analysis is not deterministic, so multiple runs with the same data and the same perplexity value may produce different sets of reduced (two-dimensional) coordinates.