UMAP Analysis

This analysis tool evaluates the similarity of data rows in the main data table, using values of three or more numeric variables. The method used is Uniform Manifold Approximation and Projection (UMAP; McInnes *et al.*, 2020), which represents a multivariate data set in two dimensions, where points that are closer to one another in two-dimensional space are more similar to one another in multidimensional space.

The dialog for carrying out a UMAP analysis

The UMAP dialog prompts for:

  • Whether to carry out this analysis for all data in the data table or only for the subset that is selected on the map and in the data table.

  • Whether to remove missing values from the data set by eliminating variables (columns) with missing values, eliminating cases (rows) with missing values, or replacing missing values with zeroes. If missing values are replaced with zeroes, any real zeroes in the data set will also be treated as missing values.

  • Whether and how to standardize data values. Standardized values may be Z scores for each variable or L1-norm-standardized values for each row (i.e., the proportion of the row sum).

  • Three or more variables from the list displayed at the left of the dialog.

  • The number of neighbors around each point, in multivariate space, to be used when evaluating the structure of the data. Values may range from 2 to one-quarter of the total number of data rows. Smaller values emphasize fine, local structure, and larger values emphasize more global structure.

  • The minimum distance between adjacent points in the projected (two-dimensional) space. Values range from 0 to 0.99. Smaller values cause similar points to be displayed closer together.

  • The distance metric to be used to evaluate the multivariate similarity between data points. Available metrics are: Bray-Curtis, Canberra, Chebyshev, Cosine-theta, Correlation, Euclidean, Manhattan, and Minkowski. Euclidean distance is the default.

  • Optionally, a grouping variable. This will not affect the UMAP calculation, but will be used to color points by group in the scatter plot that is produced.
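
The two standardization options listed above can be illustrated with a short sketch. This is not the tool's own code: the function names are hypothetical, the use of the sample standard deviation for Z scores is an assumption, and the L1 norm is taken as the sum of absolute values, which equals the row sum when all values are non-negative.

```python
from statistics import mean, stdev


def z_scores(rows):
    """Standardize each variable (column) to mean 0 and unit spread.

    Assumption: the sample standard deviation is used. A constant
    column has zero spread and would raise ZeroDivisionError here.
    """
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, centers, spreads)]
        for row in rows
    ]


def l1_proportions(rows):
    """Divide each value by the L1 norm of its row.

    For non-negative data this is the proportion of the row sum.
    A row of all zeroes would raise ZeroDivisionError here.
    """
    result = []
    for row in rows:
        total = sum(abs(v) for v in row)
        result.append([v / total for v in row])
    return result


rows = [
    [1.0, 10.0, 4.0],
    [2.0, 20.0, 4.0],
    [3.0, 30.0, 12.0],
]
by_row = l1_proportions(rows)      # each row now sums to 1
by_column = z_scores([r[:2] for r in rows])
```

Z scores put variables with different units on a common scale, whereas row proportions compare the composition of each row regardless of its total.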

After these values have been set, the ‘Calculate’ button will initiate the UMAP calculation.
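
For readers who want to reproduce a comparable projection outside the dialog, the settings can be mapped onto the open-source umap-learn package. This is a sketch under assumptions: nothing here confirms which library the tool itself uses, the helper names `umap_settings` and `project` are hypothetical, and the upper limit on neighbors is taken to be the row count divided by four, rounded down.

```python
# Dialog metric names mapped to the metric strings accepted by umap-learn.
METRIC_NAMES = {
    "Bray-Curtis": "braycurtis",
    "Canberra": "canberra",
    "Chebyshev": "chebyshev",
    "Cosine-theta": "cosine",
    "Correlation": "correlation",
    "Euclidean": "euclidean",
    "Manhattan": "manhattan",
    "Minkowski": "minkowski",
}


def umap_settings(n_rows, n_neighbors, min_dist, metric="Euclidean"):
    """Validate the dialog values and return keyword arguments for umap.UMAP."""
    max_neighbors = n_rows // 4  # assumption: one-quarter, rounded down
    if not 2 <= n_neighbors <= max_neighbors:
        raise ValueError(f"neighbors must be between 2 and {max_neighbors}")
    if not 0.0 <= min_dist <= 0.99:
        raise ValueError("minimum distance must be between 0 and 0.99")
    return {
        "n_neighbors": n_neighbors,
        "min_dist": min_dist,
        "metric": METRIC_NAMES[metric],
        "n_components": 2,  # two-dimensional projection
    }


def project(rows, settings):
    """Run UMAP on a list of numeric rows; requires umap-learn to be installed."""
    import umap  # third-party package, imported only when actually needed

    return umap.UMAP(**settings).fit_transform(rows)


settings = umap_settings(n_rows=100, n_neighbors=15, min_dist=0.1)
```

Calling `project(rows, settings)` would then return one pair of coordinates per data row. UMAP is stochastic, so coordinates will differ between runs unless a random seed is fixed.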

Note that even if three or more variables are selected, removal of missing values by variable deletion may reduce the number of variables to fewer than three.
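
The effect described in this note can be seen in a minimal sketch of the three missing-value options. The function names are hypothetical and missing values are represented by `None`; the tool's internal representation may differ.

```python
def drop_columns(rows):
    """Keep only the variables (columns) that have no missing values."""
    n_columns = len(rows[0])
    kept = [j for j in range(n_columns) if all(r[j] is not None for r in rows)]
    return [[r[j] for j in kept] for r in rows], kept


def drop_rows(rows):
    """Keep only the cases (rows) that have no missing values."""
    return [r for r in rows if all(v is not None for v in r)]


def fill_zero(rows):
    """Replace every missing value with zero."""
    return [[0.0 if v is None else v for v in r] for r in rows]


# Four variables are selected, but two of them contain a missing value.
rows = [
    [1.0, 2.0, None, 4.0],
    [2.0, None, 1.0, 3.0],
    [3.0, 1.0, 2.0, 5.0],
]

by_column, kept = drop_columns(rows)
print(len(kept))             # 2 variables remain, fewer than the required three
print(len(drop_rows(rows)))  # 1 row remains when deleting by case instead
```

In this example, deleting by variable leaves too few variables for the analysis, and deleting by case leaves a single row, so the choice of method depends on where the missing values are concentrated.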

The results of the dimensionality reduction calculation will be displayed in a scatter plot on the right side of the dialog. If points are labeled on the map, hovering over a point on the scatter plot with the mouse will display the label for that point. The transparency (alpha value) of the points can be modified with the Alt-A keystroke.

The “Data Table” button below the plot will display a table of the coordinates of every point in the two dimensions shown in the scatter plot. Labels and grouping values will be included in this table if they have been specified. The table can be saved using the Ctrl-S keystroke.

The “k-Means clusters / Create” button will perform a k-means cluster analysis (MacQueen, 1967) of the UMAP coordinate data. A prompt will be presented to select the number of clusters. Uniquely colored wedge symbols will be displayed on the scatter plot to identify the cluster for each data point, and the plot legend will display cluster identifiers.
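
As an illustration of what a k-means analysis of two-dimensional coordinates involves, the following is a minimal sketch of the common iterative (Lloyd-style) procedure. It is not necessarily the variant the tool implements; the function name, the random choice of starting centers, and the fixed seed are all assumptions.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Assign each 2-D point to one of k clusters.

    Returns (labels, centers), where labels[i] is the cluster index
    of points[i].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k random points
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Hypothetical UMAP coordinates forming two well-separated groups.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(coords, k=2)
```

Because distances in a UMAP projection are only loosely related to distances in the original multivariate space, clusters found this way describe the projection and should be interpreted with that in mind.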

The “k-Means clusters / Add Column” button will prompt for the name of a new or existing column, and then add or update that column with the cluster identifiers for each row.