Fit Univariate Distribution

This summary displays graphical and tabular results of the fitting of one or more selected distributions to the values of a single numeric variable. All variables are considered to have been drawn from a continuous distribution, even if the values are all integers.

The dialog for fitting distributions to a numeric variable.

The univariate fitting dialog prompts for:

  • The numeric variable to be fit (the X variable). Integer variables are treated as if they are continuous variables.

  • Whether to use log-transformed values of the variable.

  • Whether to use all data in the data table or only the subset that has been selected.

  • The type of distribution to be fit to the data.

As soon as a numeric variable has been selected, a histogram of its values is displayed in the lower left. After a distribution has been selected, the probability density curve of the fitted distribution is overlaid on the histogram, and the parameters and goodness-of-fit statistics for that distribution are shown to the right of the plot.

Multiple distributions can be fitted and visualized on the same histogram. Any change to the data will cause the histogram to be re-created without any of the previously fitted distributions or statistics.

The types of distributions that can be selected depend on the range of the data. For example, if the data values are all within the (0, 1) interval, only the Beta and Uniform distributions can be fitted.

The number of bins displayed on the histogram can be altered with the Alt-B keystroke, and the table of statistics can be saved to a CSV file with the Ctrl-S keystroke.

The distributions available for each data range are listed below. If the variable includes values less than zero, the following distributions may be fit:

  • Laplace

  • Logistic

  • Normal

  • von Mises

If the values of the variable are all positive, and at least one exceeds 1.0, then all of the preceding distributions may be fit, and also the following distributions:

  • Exponential

  • Gamma

  • Gompertz

  • Lognormal

  • Pareto

  • Rayleigh

If the values of the variable are all within the open interval (0, 1), then the following distributions may be fit:

  • Beta

  • Uniform

These range-based constraints do not guarantee that any of the allowable distributions can successfully (or reasonably) be fit to a specific data set. The user should choose distributions that are plausible for the data.
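
The range rules above can be sketched as a small lookup. This is only an illustration, not the application's own code: the function name `candidate_distributions` is hypothetical, and returning an empty list for ranges the text does not describe (for example, data containing zeros, or a maximum of exactly 1.0) is an assumption made here.

```python
def candidate_distributions(values):
    """Return the distribution names allowed by the range rules in the text."""
    general = ["Laplace", "Logistic", "Normal", "von Mises"]
    positive_only = ["Exponential", "Gamma", "Gompertz",
                     "Lognormal", "Pareto", "Rayleigh"]
    unit_interval = ["Beta", "Uniform"]

    if not values:
        raise ValueError("at least one value is required")

    lo, hi = min(values), max(values)
    if lo < 0:
        # Some values are below zero.
        return general
    if lo > 0 and hi > 1.0:
        # All values positive and at least one exceeds 1.0.
        return general + positive_only
    if lo > 0 and hi < 1.0:
        # All values strictly inside (0, 1).
        return unit_interval
    # Assumption: ranges not described in the text yield no candidates.
    return []
```

For example, `candidate_distributions([0.2, 0.8])` yields only Beta and Uniform, while data such as `[0.5, 3.0]` yields the four general distributions plus the six positive-only ones.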

Goodness-of-fit statistics are calculated using Monte Carlo sampling from the fitted distribution. This process takes longer for some distributions (e.g., Gamma) than for others.
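
One common way to obtain a Monte Carlo p-value is a parametric bootstrap: simulate samples from the fitted distribution, recompute the statistic for each, and count how often it is at least as large as the observed one. The sketch below illustrates the idea using only the Python standard library. It makes several assumptions that are not taken from the application: it uses the Kolmogorov-Smirnov statistic rather than Anderson-Darling for brevity, it fits only a Normal distribution, it refits the parameters to every simulated sample, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean, pstdev

def ks_statistic(values, dist):
    """Kolmogorov-Smirnov distance between the empirical CDF and dist.cdf."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = dist.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - c, c - i / n)
    return d

def fit_normal(values):
    """Maximum-likelihood Normal fit (mean and population standard deviation)."""
    return NormalDist(fmean(values), pstdev(values))

def monte_carlo_p_value(values, n_sim=200, seed=0):
    """Return (observed statistic, estimated p-value) for a Normal fit."""
    rng = random.Random(seed)
    fitted = fit_normal(values)
    observed = ks_statistic(values, fitted)
    n = len(values)
    exceed = 0
    for _ in range(n_sim):
        sample = [rng.gauss(fitted.mean, fitted.stdev) for _ in range(n)]
        # Assumption: refit each simulated sample so that parameter
        # estimation is reflected in the simulated statistics.
        if ks_statistic(sample, fit_normal(sample)) >= observed:
            exceed += 1
    # The +1 terms keep the estimate away from an exact zero.
    return observed, (exceed + 1) / (n_sim + 1)
```

The running time grows with both the number of simulations and the cost of fitting and sampling, which is consistent with some distributions taking longer than others.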

The statistics that are displayed for each fitted distribution include parameters of the fitted distribution and measures of goodness of fit. These are:

  • The location or mean of the fitted distribution

  • The scale or standard deviation of the fitted distribution

  • Shape parameter(s) of the fitted distribution, if applicable

  • The sum of squared errors (SSE) between the scaled probability density function of the fitted distribution and that of the data values. The SSE is most useful for comparing the fits of different distributions to the same data (the same number of data points). Lower values indicate a better fit.

  • The root mean squared error (RMSE) between the scaled probability density function of the fitted distribution and that of the data values. The RMSE is most useful for comparing fits when the number of data points may vary. Lower values indicate a better fit.

  • The Anderson-Darling (AD) goodness-of-fit statistic, or, if the AD statistic cannot be computed, the Kolmogorov-Smirnov or Cramér-von Mises statistic.

  • An estimate of the p-value for the goodness-of-fit statistic. This is the approximate probability of observing a value of the goodness-of-fit statistic at least as large as the one reported, if the data were drawn from the fitted distribution. Smaller values indicate poorer fits.

  • Akaike’s Information Criterion (AIC). The AIC value is useful for comparing the fits of different types of distributions, particularly distributions with different numbers of parameters. Lower values indicate a better fit.
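
As a rough illustration of how the SSE, RMSE, and AIC relate to one another, the sketch below computes them for a Normal fit using only the Python standard library. The details are assumptions rather than a description of the application: the empirical density is taken from a histogram evaluated at bin centers, the RMSE averages over bins, the AIC uses the usual form 2k - 2 ln(L) with k = 2 parameters, and the function names are hypothetical. The data must contain at least two distinct values.

```python
import math
from statistics import NormalDist, fmean, pstdev

def histogram_density(values, n_bins):
    """Return bin centers and density-scaled bin heights (area sums to 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(values)
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    density = [c / (n * width) for c in counts]
    return centers, density

def fit_summary(values, n_bins=10):
    """Fit a Normal distribution and return SSE, RMSE, and AIC."""
    dist = NormalDist(fmean(values), pstdev(values))
    centers, density = histogram_density(values, n_bins)
    # Squared differences between histogram density and fitted density.
    sse = sum((d - dist.pdf(c)) ** 2 for c, d in zip(centers, density))
    rmse = math.sqrt(sse / n_bins)
    log_lik = sum(math.log(dist.pdf(v)) for v in values)
    k = 2  # Normal distribution: location and scale
    aic = 2 * k - 2 * log_lik
    return {"sse": sse, "rmse": rmse, "aic": aic}
```

Because the AIC adds a penalty of 2 per parameter, a distribution with extra shape parameters must improve the log-likelihood enough to offset that penalty before it is preferred.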