Statistics

Several types of statistical summaries are available from the Stats menu. These allow exploration of the distributions of individual numeric variables, of the relationships between pairs of numeric variables, and of the co-occurrence frequency of categorical variables. The different types of summaries are briefly described in the documentation for the Stats menu. The following sections provide more detail on each of these options.

Univariate Statistics

This summary provides descriptive statistics to characterize the univariate distribution of selected numeric variables. In addition, two tests of normality and two estimates of the number of outliers are shown for each variable.

The dialog for displaying descriptive statistics for selected variables

Missing values are excluded from all of these summaries.

These descriptive statistics and test results are shown for both the un-transformed data values and for log10-transformed values. The statistics for untransformed and log10-transformed data are shown in two separate tables, each on a different tab of the dialog box.

The complete set of univariate statistics is:

  • Minimum

  • Maximum

  • Mean

  • Median

  • Mode

  • Geometric mean, for untransformed data only

  • Sample standard deviation

  • Coefficient of Variation (C.V.)

  • Sum

  • 5th percentile

  • 95th percentile

  • The p value for the Anderson-Darling test of normality. Low p values (e.g., less than 0.05) indicate that the distribution is non-normal.

  • The p value for the Lilliefors test of normality. Low p values (e.g., less than 0.05) indicate that the distribution is non-normal.

  • The number of outliers found by Rosner’s test, also known as the generalized extreme Studentized deviate (GESD) test (Rosner 1983). This test is carried out only when there are at least 15 observations; it tests for a maximum of 5 outliers when there are 15-99 observations and a maximum of 10 outliers when there are 100 or more. The test assumes that the distribution is approximately normal, so its results are not reliable for non-normal distributions. An alpha value of 0.05 is used with the t distribution to calculate the critical values for the test.

  • The number of values (outliers) that lie beyond Tukey’s fences, which are 1.5 times the interquartile range below the first quartile and above the third quartile. This value should match the number of outliers shown on the box plot. It is calculated only when there are at least 5 observations.
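The Tukey-fence outlier count described above can be sketched as follows. This is a minimal illustration of the standard calculation, not the application's own code; the function name is hypothetical.

```python
import numpy as np

def tukey_outlier_count(values, k=1.5):
    """Count values beyond Tukey's fences: Q1 - k*IQR and Q3 + k*IQR."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]          # missing values are excluded
    if x.size < 5:
        return None              # mirrors the documented 5-observation minimum
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(np.sum((x < lower) | (x > upper)))
```

For example, `tukey_outlier_count([1, 2, 3, 4, 5, 100])` flags the single extreme value 100.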

The table of statistics for untransformed data can be exported to a file using the keystroke Ctrl-S, and the table of statistics for log10-transformed data can be exported using the keystroke Ctrl-Z.

Fit Univariate Distribution

This summary displays graphical and tabular results of the fitting of one or more selected distributions to the values of a single numeric variable. All variables are considered to have been drawn from a continuous distribution, even if the values are all integers.

The dialog for evaluating fits of parametric distributions to numeric data

The distributions that may be fit to a variable depend on the range of the variable. If the variable includes values less than zero, the following distributions may be fit:

  • Laplace

  • Logistic

  • Normal

  • von Mises

If the values of the variable are all positive and at least one exceeds 1.0, then all of the preceding distributions may be fit, as well as the following distributions:

  • Exponential

  • Gamma

  • Gompertz

  • Lognormal

  • Pareto

  • Rayleigh

If the values of the variable are all positive and all within the open interval (0, 1), then the following distributions may be fit:

  • Beta

  • Uniform.

These range-based constraints do not guarantee that any of the allowable distributions can successfully (or reasonably) be fit to a specific data set. The user should choose distributions that are plausible for the data.

Goodness-of-fit statistics are calculated using Monte Carlo sampling from the fitted distribution. This process will take longer for some distributions (e.g., Gamma) than for others.
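This kind of Monte Carlo goodness-of-fit evaluation can be sketched with scipy (version 1.10 or later); the sample data and distribution below are illustrative, and the application's actual implementation may differ.

```python
import numpy as np
from scipy import stats

# Illustrative sample; in the dialog this is the selected numeric variable.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)

# Fit a normal distribution and compute a Monte Carlo Anderson-Darling
# goodness-of-fit statistic and approximate p value.
res = stats.goodness_of_fit(stats.norm, data, statistic="ad",
                            n_mc_samples=500, random_state=rng)
print(res.fit_result.params, res.statistic, res.pvalue)
```

The `statistic` argument also accepts `"ks"` (Kolmogorov-Smirnov) and `"cvm"` (Cramer-von Mises) for cases where the Anderson-Darling statistic is not appropriate.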

The statistics that are displayed for each fitted distribution include parameters of the fitted distribution and measures of goodness of fit. These are:

  • The location or mean of the fitted distribution

  • The scale or standard deviation of the fitted distribution

  • Shape parameter(s) of the fitted distribution, if applicable

  • The sum of squared errors (SSE) between the scaled probability density function of the fitted distribution and that of the data values. The SSE is most useful for comparing the fits of different distributions, with lower values indicating a better fit.

  • The Anderson-Darling (AD) goodness of fit statistic, or, if the AD statistic cannot be computed, the Kolmogorov-Smirnov or Cramer-von Mises statistic.

  • An estimate of the p-value for the goodness-of-fit statistic. This is the approximate probability of observing a value of the goodness-of-fit statistic as large as the one reported, if the data were drawn from the fitted distribution. Smaller values indicate poorer fits.

  • Akaike’s Information Criterion (AIC). The AIC value is most useful for comparing the fits of different distributions, with lower values indicating a better fit.
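For a fitted distribution with k estimated parameters and maximized log-likelihood ln(L), AIC = 2k - 2 ln(L). A sketch of the calculation using scipy (the gamma sample is illustrative, and the location parameter is fixed at zero here for simplicity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=300)   # illustrative sample

# Fit shape and scale by maximum likelihood, holding loc fixed at 0.
params = stats.gamma.fit(x, floc=0)             # (shape, loc, scale)
loglik = np.sum(stats.gamma.logpdf(x, *params))
k = 2                                           # shape and scale were fitted
aic = 2 * k - 2 * loglik
```

When comparing distributions fitted to the same data, the one with the lower AIC is preferred.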

Bivariate Statistics

This summary provides statistics describing the relationship between two numeric variables. It presents several measures of the presence, or strength, of a linear or monotonic relationship between two selected variables. The relationship can be assessed for the untransformed variables, or one or both variables can be log10-transformed.

The dialog for displaying statistics about the relationship between two numerical variables

The summary statistics can be displayed for either all data in the data table or for only the data rows that have been selected (e.g., by clicking on the map or table). Data rows where either variable is missing are not included in the summary, and there must be at least three rows of data for the calculations to be carried out.

The summary includes a table of statistics on the left and a scatter plot, with a linear regression line, on the right. The statistics that are displayed in the table are:

  • The name of the X variable

  • The name of the Y variable

  • N, the number of data points for which neither variable is missing

  • Covariance

  • Pearson’s correlation coefficient r (Wright 1921)

  • p value for the hypothesis that r is zero

  • Spearman’s correlation coefficient rho (Spearman 1904)

  • p value for the hypothesis that rho is zero

  • Kendall’s correlation coefficient tau (Kendall 1938)

  • p value for the hypothesis that tau is zero

  • Chatterjee’s correlation coefficient xi (Chatterjee 2020)

  • Slope of an ordinary least-squares (OLS) linear regression

  • Intercept of the OLS linear regression

  • R-squared (R2) for the regression

  • Adjusted R-squared for the regression

  • Total sum of squares (SS) for the regression

  • Sum of squares explained by the regression

  • Residual sum of squares, not explained by the regression

  • p value for the hypothesis that the regression slope is zero

  • p value for the hypothesis that the regression intercept is zero

  • Akaike’s Information Criterion (AIC) for the regression

  • Bayesian Information Criterion (BIC) for the regression

  • p value for a Mann-Kendall Trend Test of the Y variable. There must be at least four rows of data for this to be evaluated.

  • p value for a Runs Test of the Y variable, using the median as a cutoff and with a small-sample correction for N < 50 (per NIST)

  • Theil-Sen slope

  • 95% confidence interval on the Theil-Sen slope

  • Theil-Sen intercept, calculated using the median and the Theil-Sen slope.
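Several of the statistics above map directly onto scipy functions. The sketch below, with illustrative data, shows the three correlation coefficients with their p values and the Theil-Sen slope with its 95% confidence interval (the intercept is computed from the medians, as described above):

```python
import numpy as np
from scipy import stats

# Illustrative data; in the dialog these come from the two selected columns.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, p_r = stats.pearsonr(x, y)          # Pearson's r
rho, p_rho = stats.spearmanr(x, y)     # Spearman's rho
tau, p_tau = stats.kendalltau(x, y)    # Kendall's tau

# Theil-Sen slope, intercept, and 95% confidence interval on the slope.
slope, intercept, lo, hi = stats.theilslopes(y, x, alpha=0.95)
```

`stats.theilslopes` computes the intercept as median(y) - slope * median(x), matching the description above.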

If a date or date/time column is chosen for the X variable, only the number of points and the results of the Mann-Kendall and Runs tests will be shown.

The plot to the right of the data table includes two panes on separate tabs. The first tab contains a scatter plot showing all of the individual data points with the OLS regression line and 95% confidence bands about the regression line. The second tab shows the distribution of residuals for the regression.

The table of statistics can be exported to a file using the keystroke Ctrl-S.

The scatter plot can be modified using the following hotkeys:

  • The opacity (alpha value) of the symbols on the plot can be changed using the Alt-A keystroke.

  • Display of a Theil-Sen line on the plot can be toggled on and off using the Alt-S keystroke.

  • The Alt-T, Alt-X, and Alt-Y keystrokes can be used to modify the plot title, X axis label, and Y axis label, respectively.

The scatter plot is similar to the one that can be created using the Plot/New dialog. The differences are that this plot does not display multiple groups, natural breaks, or a LOESS line, but does show the confidence band of the OLS regression line and the regression residuals.

Correlation Matrix

This summary displays all pairwise correlation coefficients for a set of two or more selected numeric variables. The values are presented in the form of a correlation matrix with heatmap coloring to represent the direction and strength of each relationship.

The dialog for displaying a correlation matrix for two or more numerical variables

Two or more variables should be selected from the list at the left side of the dialog. Multiple values can be selected by shift-clicking or control-clicking.

The type of correlation coefficient to display can be selected from the dropdown box that is below the list of variables. Available types of correlation coefficients are:

  • Pearson’s r for parametric data (Wright 1921)

  • Spearman’s rho for non-parametric data (Spearman 1904)

  • Kendall’s tau for non-parametric data (Kendall 1938)

  • Chatterjee’s xi for non-monotonic data (Chatterjee 2020).

Pearson, Spearman, and Kendall correlation coefficients can all range from -1.0 to 1.0. Chatterjee’s correlation coefficient only ranges from 0.0 to 1.0, but the same range of heatmap colors is used for all correlation coefficients.

The correlation coefficients can be displayed for either all data in the data table or for only the data rows that have been selected (e.g., by clicking on the map or table).

Data rows where any of the selected variables are missing are not included in the summary.

Data may be log10-transformed for the correlation calculations, but if any value for any of the selected variables cannot be log-transformed, a warning will be displayed and the correlation matrix will not be calculated.

The values in the matrix are symmetric about the main diagonal.
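A correlation matrix like the one displayed in this dialog can be sketched with pandas (an illustration only; the application may compute the matrix differently, and the heatmap coloring is not shown):

```python
import numpy as np
import pandas as pd

# Illustrative data: "b" is correlated with "a", "c" is independent.
rng = np.random.default_rng(2)
df = pd.DataFrame({"a": rng.normal(size=50)})
df["b"] = 0.5 * df["a"] + 0.1 * rng.normal(size=50)
df["c"] = rng.normal(size=50)

# Pairwise Spearman correlations; "pearson" and "kendall" are also available.
corr = df.corr(method="spearman")
```

The result is symmetric about the main diagonal, with 1.0 on the diagonal.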

Contingency Table

This summary provides a graphic representation of a 2x2 contingency table for two variables whose values have been subdivided into two groups (nominally positive and negative). The summary includes a table of values with the results of statistical tests for independence of the two groups, the risk ratio, the odds ratio and related statistics, and total and conditional probabilities.

The dialog for showing a contingency table and associated statistics

Either numeric or categorical variables may be used. For numeric variables, a threshold must be specified. All values either above or below this threshold (as specified on the form) will be in the ‘positive’ group, and all other values will be in the ‘negative’ group. For categorical variables, all values of the variable will be listed. Multiple values can be selected by left-clicking on them. All of the selected values will be in the ‘positive’ group and the un-selected values will be in the ‘negative’ group.

The contingency table shows the number of co-occurrences of all combinations of ‘positive’ and ‘negative’ values for the two variables.

The complete set of statistics that may be shown is listed below. Not all of these values will be shown if there are empty cells (zeroes) in the contingency table.

  • The Chi-square statistic.

  • The p value for the Chi-square statistic.

  • The degrees of freedom for the Chi-square test.

  • The Fisher exact test statistic.

  • The p value for the Fisher exact test.

  • The Barnard exact test statistic.

  • The p value for the Barnard exact test.

  • The Boschloo exact test statistic.

  • The p value for the Boschloo exact test.

  • The risk ratio (Ranganathan et al. 2015).

  • The odds ratio (Bland 2000).

  • The log odds ratio: the natural log of the odds ratio.

  • The standard error (SE) of the log odds ratio.

  • The 95% confidence interval (CI) for the log odds ratio.

  • Yule’s (1900) measure of association (Q).

  • Yule’s (1912) coefficient of colligation (Y).

  • Yule’s (1912) phi coefficient, which is equivalent to the Pearson correlation coefficient for binary data.

  • The probability (frequency) of a positive value of the row variable.

  • The probability of a positive value of the row variable given that the column variable is positive.

  • The probability of a positive value of the row variable given that the column variable is negative.

  • The probability (frequency) of a positive value of the column variable.

  • The probability of a positive value of the column variable given that the row variable is positive.

  • The probability of a positive value of the column variable given that the row variable is negative.

  • The probability of positive values of both row and column variables.

  • The probability of a positive value of either the row variable or the column variable.
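Several of these quantities can be sketched with scipy and plain arithmetic. The 2x2 table below is hypothetical, and the standard error formula for the log odds ratio is the usual large-sample approximation:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows are the row variable (+/-),
# columns are the column variable (+/-).
table = np.array([[20, 10],
                  [5, 25]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds, p_fisher = stats.fisher_exact(table)

log_or = np.log(odds)
se = np.sqrt((1.0 / table).sum())            # SE of the log odds ratio
ci = (log_or - 1.96 * se, log_or + 1.96 * se)

a, b, c, d = table.ravel()
q = (a * d - b * c) / (a * d + b * c)        # Yule's Q
```

For this table the sample odds ratio is (20 x 25) / (10 x 5) = 10, and a strong association is indicated by both exact and Chi-square tests.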

Tests of independence will not be carried out when there are fewer than 20 total observations.

The table of statistics can be exported to a CSV file or spreadsheet with the Ctrl-S keystroke.

Receiver Operating Characteristics

This summary displays a Receiver Operating Characteristics (ROC) curve (Fawcett 2006) and a table of related statistics for a user-specified condition variable and predictor variable.

The dialog for displaying Receiver Operating Characteristics

The condition variable may be either numeric or categorical. The values of this variable must be divided into ‘positive’ and ‘negative’ conditions. For numeric variables, a threshold value must be specified, and all values above or below the threshold, as specified on the form, will represent the positive condition. For categorical variables, all the values are displayed in a list, and the values representing the positive condition must be specified by clicking on them.

The predictor variable must be numeric. The ROC curve and statistics will be displayed as soon as the condition variable, positive condition, and predictor variable are specified. The statistics that are initially displayed will be for the default predictor threshold value of 0.0. Modifying the predictor threshold will change the ROC statistics but not affect the ROC curve.

Terminology used for ROC curves and associated statistics varies depending on the field and author. The senses of condition and predictor variables may be reversed, for instance, and the statistics listed below may be known by other names. For example, sensitivity is also known as the true positive rate, hit rate, probability of detection, and power.

The ROC curve and statistics can be computed either for all data in the data table or for a subset that has been selected on the map or table.

The ROC statistics that are displayed are:

  • The total number of observations.

  • The number of actual positive values.

  • The number of actual negative values.

  • The name of the predictor variable.

  • The minimum value of the predictor variable in the data set.

  • The maximum value of the predictor variable in the data set.

  • The prediction threshold.

  • The number of predicted positive values.

  • The number of predicted negative values.

  • The number of correctly predicted positive values.

  • The number of correctly predicted negative values.

  • The number of false positives.

  • The number of false negatives.

  • The sensitivity, or the number of correctly predicted positive values as a fraction of the number of actual positive values.

  • The specificity, or the number of correctly predicted negative values as a fraction of the number of actual negative values.

  • The precision, or the number of correctly predicted positive values as a fraction of the total number of predicted positive values.

  • The positive likelihood ratio (LR+) for the threshold value (Hajian-Tilaki 2013, Nahm 2022).

  • The negative likelihood ratio (LR-) for the threshold value (Hajian-Tilaki 2013, Nahm 2022).

  • The maximum value of LR+ for all data points.

  • The false positive rate, or the number of false positives as a fraction of the number of actual negative values.

  • The false negative rate, or the number of false negatives as fraction of the number of actual positive values.

  • The critical success index, or the number of correctly predicted positive values as a fraction of the sum of correctly predicted positive values, false positives, and false negatives.

  • The accuracy, or the total number of correct predictions as a fraction of the total number of observations.

  • Youden’s J statistic (Nahm 2022): the difference between the sensitivity at the threshold value and the sensitivity of the diagonal line that represents random predictive ability.

  • The maximum value of Youden’s J statistic that is found for all data points.

  • The Euclidean distance (ED; Nahm 2022) between the ROC point represented by the threshold value and the upper-left corner of the ROC plot.

  • The minimum value of ED for all data points.
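The threshold-dependent statistics above follow from the four confusion-matrix counts. A minimal sketch (the function name is hypothetical, and values above the threshold are treated as predicted positive, matching one of the form's two options):

```python
import numpy as np

def roc_stats(actual, score, threshold):
    """Confusion-matrix statistics at one predictor threshold.

    actual is a boolean array of the condition variable; score is the
    numeric predictor; values above threshold are predicted positive.
    """
    actual = np.asarray(actual, dtype=bool)
    pred = np.asarray(score, dtype=float) > threshold
    tp = np.sum(pred & actual)       # correctly predicted positives
    tn = np.sum(~pred & ~actual)     # correctly predicted negatives
    fp = np.sum(pred & ~actual)      # false positives
    fn = np.sum(~pred & actual)      # false negatives
    sens = tp / (tp + fn)            # sensitivity (true positive rate)
    spec = tn / (tn + fp)            # specificity (true negative rate)
    j = sens + spec - 1.0            # Youden's J
    ed = np.hypot(1.0 - sens, 1.0 - spec)  # distance to the (0, 1) corner
    return sens, spec, j, ed
```

Scanning the threshold across the range of the predictor traces out the ROC curve itself.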

Some of these statistics will not be displayed if there are no actual positives or no actual negatives in the selected data.

The table of ROC statistics can be exported to a CSV file or spreadsheet with the Ctrl-S keystroke.

t-SNE Analysis

This tool evaluates the similarity of three or more numeric variables using the t-Distributed Stochastic Neighbor Embedding method (van der Maaten and Hinton 2008).

The dialog for conducting a t-SNE analysis

After appropriate data selections have been made on the left side of the dialog and the “Calculate” button pressed, the calculation is performed and a two-dimensional scatter plot of the results is shown on the right side of the dialog. If points are labeled on the map, hovering over a point on the scatter plot will pop up the label for that point. The transparency (alpha value) of the points can be modified with the Alt-A keystroke.
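The underlying embedding can be sketched with scikit-learn's TSNE (an illustration only; the application's implementation and parameter choices may differ, and the data below are random):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))   # 100 rows of four numeric variables

# Reduce to two dimensions for a scatter plot like the one in the dialog.
emb = TSNE(n_components=2, perplexity=20.0, random_state=0).fit_transform(X)
```

Each row of `emb` gives the two plotted coordinates for the corresponding data row.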

The “Data Table” button below the plot will display a table of the coordinates of every point in the two dimensions shown in the scatter plot. Labels and grouping values will be included in this table if they have been specified. The table can be saved using the Ctrl-S keystroke.

UMAP Analysis

This tool evaluates the similarity of three or more numeric variables using the Uniform Manifold Approximation and Projection method (McInnes et al. 2018).

The dialog for conducting a UMAP analysis

At least three numeric variables must be selected from the list at the left side of the dialog. The UMAP analysis can use a sparse data matrix; missing values will be replaced with zeroes for the analysis.

Additional input parameters may be selected to alter the focus of the analysis to emphasize either fine structure or overall structure. A grouping variable may also be selected to distinguish different groups in the output.

After all input values have been chosen, the “Calculate” button will start the calculation. The results will be shown as a scatter plot on the right side of the dialog. The “Data Table” button below the plot will display a table of the coordinates of every point in the two dimensions shown in the scatter plot. Labels and grouping values will be included in this table if they have been specified. The table can be saved using the Ctrl-S keystroke.

Categorical Correspondence

This summary shows the number of occurrences in the data set, and the frequency of occurrences as a percentage, of each unique combination of values for two categorical variables.

The dialog for displaying the correspondence between two categorical variables

Missing values are included.

This summary can be produced either for all data in the data table or for only a selected subset (e.g., that has been selected by clicking on the map or table).
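The counts and percentages in this summary correspond to a simple cross-tabulation, sketched here with pandas (the application may not use pandas; the two categorical columns are hypothetical):

```python
import pandas as pd

# Hypothetical categorical variables.
df = pd.DataFrame({"soil": ["clay", "clay", "sand", "sand", "sand"],
                   "site": ["A", "B", "A", "A", "B"]})

# Counts of each unique combination; dropna=False keeps missing values,
# matching the note above that missing values are included.
counts = pd.crosstab(df["soil"], df["site"], dropna=False)
percents = 100.0 * counts / counts.to_numpy().sum()
```

Each cell of `percents` is that combination's frequency as a percentage of all rows.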

The table of statistics can be exported to a file using the keystroke Ctrl-S.