NMF Unmixing
In some cases, observations in the data table may have arisen through a mixing process. A mixing process requires two or more distinct sources of material. Each of the sources should contain a unique combination of values of different constituents or variables–that is, a source pattern. Different sources may contain the same constituents, but in different amounts. For example, different paints may contain different kinds and amounts of colorants, solvents, binders, and other additives.
A mixing process combines material from two or more sources in varying proportions, producing one or more mixtures. The result of such a mixing process can be represented in a table format with the mixtures as rows (cases), and the source constituents as columns (variables). A data table used with mapdata may have that form.
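The mixing process described above can be illustrated with a minimal sketch. The source patterns, constituent names, and proportions below are invented for illustration only; they are not taken from mapdata or from any real data set.

```python
# Hypothetical source patterns: one row per source, one column per constituent.
constituents = ["pigment", "solvent", "binder"]
sources = [
    [30.0, 50.0, 20.0],   # source A
    [10.0, 60.0, 30.0],   # source B
]

# Proportion of each source in each mixture; each row sums to 1.0.
proportions = [
    [1.0, 0.0],     # pure source A
    [0.5, 0.5],     # equal parts of A and B
    [0.25, 0.75],   # mostly source B
]

def mix(proportions, sources):
    """Return the mixtures table: rows are mixtures, columns are constituents."""
    n_vars = len(sources[0])
    return [
        [sum(p * src[j] for p, src in zip(row, sources)) for j in range(n_vars)]
        for row in proportions
    ]

mixtures = mix(proportions, sources)
for row in mixtures:
    print(dict(zip(constituents, row)))
```

The resulting table has the same layout as a data table with cases in rows and variables in columns. Unmixing attempts to go in the opposite direction, starting from a table like `mixtures` alone.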
After mixtures have been created, the source patterns are unknown (or at least, no longer evident). Unmixing is a mathematical process that produces an estimated reconstruction of the source patterns and the proportion of each source in each mixture. The process used by mapdata to unmix a data set is non-negative matrix factorization (NMF; Lee and Seung 1999). This process takes the data table (or some subset thereof) as input and produces two tables, or matrices, as output. Each of the output matrices shares one dimension with the data table (i.e., either rows or columns); the other dimension of each output matrix corresponds to the estimated sources. For consistency with the mixing paradigm, the estimated sources are referred to here as end members.
The two matrices produced by the unmixing process are:
A composition matrix. This shows the quantity of each variable (constituent) in each end member–that is, the composition of each end member. There is one row for each end member and one column for each variable (variables are in columns, just as in the data table). This matrix defines a characteristic pattern of values for each end member.
A contribution matrix. This shows the proportion of each end member that makes up each case (or mixture)–that is, the contributions of each end member to each case. There is one row for each case and one column for each end member (cases are in rows, just as in the data table). The values in each row of this matrix sum to 1.0.
If the contribution matrix is combined with the composition matrix by matrix multiplication, the result will be approximately the same as the original data matrix. The result may be only approximate because:
The original data matrix may not actually be the result of mixing of end members, so both the matrix factorization and reconstruction will be imprecise.
The unmixing process does not inherently identify the number of end members; the user must do so, and if an inappropriate value is chosen, the mixing process will be inaccurately modeled.
There is no simple formula for factoring a matrix into two constituent matrices. The unmixing process is iterative, and the best solution may not be an exact solution. There may also be multiple different factorizations possible.
Matrix factorization by itself does not guarantee that the rows of one factor (here, the contribution matrix) sum to 1.0. That additional constraint is what establishes these results as a mixing model and not just a pair of matrix factors.
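To illustrate the iterative nature of the factorization, the following is a minimal sketch of non-negative matrix factorization using the widely used multiplicative update rules for a least-squares objective. It is not necessarily the algorithm that mapdata runs, and it does not enforce the sum-to-one constraint on the contribution matrix. The function names, the iteration count, and the example data are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def rss(x, y):
    """Sum of squared differences between corresponding cells."""
    return sum((p - q) ** 2 for rx, ry in zip(x, y) for p, q in zip(rx, ry))

def nmf(data, n_end_members, n_iter=500, seed=0, eps=1e-12):
    """Factor data (cases x variables) into W (cases x k) and H (k x variables)."""
    rng = random.Random(seed)
    n_cases, n_vars = len(data), len(data[0])
    # Start from small positive random values.
    w = [[rng.random() + 0.1 for _ in range(n_end_members)] for _ in range(n_cases)]
    h = [[rng.random() + 0.1 for _ in range(n_vars)] for _ in range(n_end_members)]
    history = [rss(data, matmul(w, h))]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element by element
        wt = transpose(w)
        num, den = matmul(wt, data), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_vars)]
             for k in range(n_end_members)]
        # W <- W * (X H^T) / (W H H^T), element by element
        ht = transpose(h)
        num, den = matmul(data, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_end_members)]
             for i in range(n_cases)]
        history.append(rss(data, matmul(w, h)))
    return w, h, history

# Invented example: four mixtures of two hypothetical end members.
data = [[30.0, 50.0, 20.0], [20.0, 55.0, 25.0], [15.0, 57.5, 27.5], [10.0, 60.0, 30.0]]
w, h, history = nmf(data, 2)
print("RSS before and after:", history[0], history[-1])
print("Row sums of W:", [sum(row) for row in w])
```

The row sums printed at the end are generally not 1.0, which is the point made in the last item above: an unconstrained factorization yields two non-negative factors, and a further constraint is needed before the first factor can be read as mixing proportions. A different random seed can also give a different pair of factors with a similar fit.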
An unmixing analysis can be launched from the Statistics/NMF unmixing menu item. The following dialog is then displayed, initially showing only a list of variables at the left side.
The unmixing analysis requires that the data set have a candidate key column (which must be a text variable). If there are multiple candidate keys in the data set, the dialog will prompt for the one to use; if there is only one candidate key, it will be automatically selected. If the data set contains a numeric candidate key, a corresponding text variable can be created using the Table/Recode data menu item.
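As a rough illustration of the candidate key requirement, the sketch below treats a candidate key as a text column whose values are all present and all distinct. That definition, the `table` layout (column names mapped to lists of values), and the function name are assumptions for this example and do not describe mapdata's internal checks.

```python
def text_candidate_keys(table):
    """Return names of text columns whose values are all present and unique.

    `table` maps column names to lists of values (a hypothetical layout).
    """
    keys = []
    for name, values in table.items():
        all_text = all(isinstance(v, str) and v != "" for v in values)
        if all_text and len(set(values)) == len(values):
            keys.append(name)
    return keys

table = {
    "sample_id": ["S1", "S2", "S3"],
    "site": ["north", "north", "south"],   # repeated values, so not a key
    "station_no": [101, 102, 103],          # unique, but numeric
    "lead": [12.0, 8.5, 30.1],
}
before = text_candidate_keys(table)
print(before)

# A numeric key can be turned into a text variable, comparable in spirit
# to creating a recoded column.
table["station_txt"] = [str(v) for v in table["station_no"]]
after = text_candidate_keys(table)
print(after)
```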
This dialog prompts for:
Whether to carry out this analysis for all data in the data table or only for the subset that is selected on the map and in the data table.
Whether to remove missing values from the data set by eliminating variables (columns) with missing values or by eliminating cases (rows) with missing values.
The candidate key column to use, if there is more than one.
Three or more variables from the list displayed at the left of the dialog.
The number of end members to evaluate. This input option cannot be selected until after the first unmixing operation is carried out. The first unmixing operation will evaluate up to 12 end members and compile diagnostics to assist in the identification of the most appropriate number of end members.
No transformation or standardization of the data can be done because a) mixing operates on original data values, not transformed values, and b) there is no requirement that data conform to any statistical distribution and so there is no need to transform data for that reason.
After these values have been set, the ‘Unmix’ button will initiate the unmixing calculation. The first time that unmixing is carried out for a set of selected data (i.e., when the number of end members cannot be chosen), unmixing will be performed for up to 12 end members. Fewer end members may be evaluated if there is a smaller number of cases or variables in the selected data.
Before unmixing is carried out, missing values will be eliminated from the data set by deleting either rows or columns that contain missing values. If this results in a data set with fewer than three cases or three variables, unmixing will not be conducted.
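The two ways of eliminating missing values, and the minimum size check, can be sketched as follows. Missing values are represented here by `None`, and the function names and example values are hypothetical.

```python
def drop_missing(rows, columns, by="rows"):
    """Remove rows or columns that contain None. Returns (rows, columns)."""
    if by == "rows":
        kept = [r for r in rows if all(v is not None for v in r)]
        return kept, list(columns)
    if by == "columns":
        keep_idx = [j for j in range(len(columns))
                    if all(r[j] is not None for r in rows)]
        return ([[r[j] for j in keep_idx] for r in rows],
                [columns[j] for j in keep_idx])
    raise ValueError("by must be 'rows' or 'columns'")

def can_unmix(rows, columns):
    """Apply the minimum of three cases and three variables described above."""
    return len(rows) >= 3 and len(columns) >= 3

columns = ["cu", "pb", "zn", "ni"]
rows = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, None, 8.0],    # one missing zn value
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

by_rows = drop_missing(rows, columns, by="rows")
by_cols = drop_missing(rows, columns, by="columns")
print(len(by_rows[0]), "cases x", len(by_rows[1]), "variables")
print(len(by_cols[0]), "cases x", len(by_cols[1]), "variables")
```

With a single missing value, removing rows costs one case and removing columns costs one variable. Which choice preserves more information depends on how the missing values are distributed in the data table.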
The results of each unmixing run will be shown in the set of six tabs on the right side of the dialog. The information on these tabs is as follows:
EM compositions – This is the end member composition matrix. Each row represents a single end member. The first column contains end member identifiers, which take the form “EM1”, “EM2”, etc. Following columns are for the data set variables. Table cells contain the quantity of each variable in the corresponding end member.
EM contributions – This is the end member contribution matrix. There is one row for each case, and one column for each end member. Table cells contain the proportion of each end member in the corresponding case. The values in each row sum to (approximately) 1.0. If any values greater than 1.0 appear in this table, this is an indication that an unmixing solution could not be found for the specified number of end members, and the results should not be used.
EM values – This table is derived from both the contribution and composition matrices. For each case and end member, it shows the end member's contribution to that case multiplied by each value in that end member's composition, summed across variables. The sum for each case, across end members, should be approximately equal to the sum of all variable values for that case. The last column of this table (“Residual”) shows the difference between the sum across variables in the data table and the sum across end members in this table. Residuals should be relatively small if the data set is well represented by the composition and contribution matrices.
Profile plots – This tab contains vertically stacked bar plots, one for each end member, that depict the composition matrix graphically. Variables are shown on the X axis, and the value in each end member is shown on the Y axis. The Y axes of all plots are scaled identically so that bar heights can be compared across end members.
Diagnostics – The first run of the unmixing calculation for a data set compiles several diagnostic values for each number of end members evaluated. These are intended to assist the user in choosing the most appropriate number of end members (which may be none, if the data set is not actually derived from mixing of end member patterns). After unmixing is completed for each number of end members, the composition and contribution matrices are multiplied to produce an estimated reconstruction of the original data matrix. The diagnostics are based on comparison of the original and reconstructed data matrices. The diagnostics are:
Total residuals (Tot. resid.) – The difference between the sums of all values in these two matrices.
Residual Sum of Squares (RSS) – The sum of the squared differences between corresponding cells of the two matrices.
Akaike’s Information Criterion (AIC) – An estimation of prediction error based on information theory, taking account of the number of parameters (end members) used in the prediction. Calculated from the RSS.
Root Mean Squared Error (RMSE) – The square root of the mean of the squared differences between corresponding cells of the two matrices.
Frobenius norm (Frob. norm) – The Euclidean distance between the two matrices in multivariate space.
Kullback-Leibler divergence (K-L div.) – The change in entropy between the original and reconstructed matrices.
Diagnostic plots – A figure that can show a plot of any of the diagnostic measures against the number of end members.
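The EM values table and its “Residual” column, as described in the list above, can be sketched as follows. This is one reading of that description: each EM value is the case's contribution for an end member multiplied by the sum of that end member's composition, and the residual is taken as the data sum minus the EM sum. The sign convention, the function name, and the numbers are assumptions.

```python
def em_values(data, contributions, compositions):
    """Return one row per case: an EM value per end member, then a residual."""
    table = []
    for case, contrib in zip(data, contributions):
        # Contribution times each composition entry, summed over variables.
        values = [sum(c * v for v in comp) for c, comp in zip(contrib, compositions)]
        residual = sum(case) - sum(values)   # sign convention is an assumption
        table.append(values + [residual])
    return table

# Invented matrices: two end members, three variables, three cases.
compositions = [[30.0, 50.0, 20.0], [10.0, 60.0, 30.0]]
contributions = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75]]
data = [[30.0, 50.0, 20.0], [20.0, 55.0, 26.0], [15.0, 57.5, 27.5]]

em_table = em_values(data, contributions, compositions)
for row in em_table:
    print(row)
```

In this example only the second case has a nonzero residual, because its third value was altered slightly so that it is not an exact mixture of the two end members.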
The first unmixing operation will compile diagnostics for 1 to 12 possible end members. For all of these diagnostics, smaller values indicate a better match between the original and reconstructed matrices. Most of these metrics get smaller as the number of end members increases because more fitting parameters (i.e., end members) generally allow a better fit. These diagnostics will not necessarily all reach their smallest value at the same number of end members, and the smallest value is not necessarily indicative of the true number of end members.
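Four of the diagnostics follow directly from their definitions above and can be sketched as shown below. The sign of the total residual (original minus reconstructed) is an assumption, and the function name and example matrices are hypothetical. AIC and the Kullback-Leibler divergence are omitted because the exact formulas used are not stated here.

```python
import math

def diagnostics(original, reconstructed):
    """Compare two equally shaped matrices stored as lists of rows."""
    diffs = [o - r for ro, rr in zip(original, reconstructed)
             for o, r in zip(ro, rr)]
    rss = sum(d * d for d in diffs)
    return {
        "tot_resid": sum(diffs),              # assumed sign: original minus reconstructed
        "rss": rss,                           # sum of squared cell differences
        "rmse": math.sqrt(rss / len(diffs)),  # root of the mean squared difference
        "frob_norm": math.sqrt(rss),          # Euclidean distance between the matrices
    }

original = [[4.0, 1.0], [2.0, 7.0]]
reconstructed = [[1.0, 1.0], [2.0, 3.0]]
result = diagnostics(original, reconstructed)
print(result)
```

Under these definitions the Frobenius norm is the square root of the RSS, and the RMSE is the Frobenius norm divided by the square root of the number of cells. For a fixed data matrix, these three measures therefore rank the candidate numbers of end members in the same order.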
After the initial set of unmixing calculations is completed, mapdata will populate the first three tables with the results for the number of end members at which there is an elbow in the sequence of RMSE or Frobenius norm values. The results for just one end member are never selected because: a) the composition matrix for one end member represents an average pattern for the entire data set, and not the components of an actual mixture; and b) the rate of change in the RMSE value can’t be calculated for one end member. As a consequence, mapdata will always identify the presence of at least two end members, even if the data set was not produced by (or cannot be characterized as the result of) a mixing process. Users must determine whether the unmixing results make sense by using domain knowledge and possibly by using additional and follow-on analyses (e.g., by mapping the distributions of dominant end members).
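One simple way to locate an elbow is to find the point where the decrease in RMSE slows most abruptly. The sketch below does this and is offered only as an illustration; the rule that mapdata actually applies is not specified here, and the function name and RMSE values are invented.

```python
def elbow(n_end_members, rmse):
    """Return the number of end members where the RMSE curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    # The first and last points are skipped: each lacks a neighbour on one side.
    for i in range(1, len(rmse) - 1):
        # Drop gained on reaching this point minus the drop gained by the next step.
        bend = (rmse[i - 1] - rmse[i]) - (rmse[i] - rmse[i + 1])
        if bend > best_bend:
            best_k, best_bend = n_end_members[i], bend
    return best_k

ks = [1, 2, 3, 4, 5, 6]
rmse = [9.0, 5.0, 1.5, 1.2, 1.0, 0.9]
chosen = elbow(ks, rmse)
print(chosen)
```

Because the bend at a point depends on the change from the previous point, this approach cannot select the first entry in the sequence, which is consistent with one end member never being chosen.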
The accuracy of automatic end member identification is greatest when there are fewer than five actual end members, a large number of samples, and little noise (or uncertainty) in the data set. The profile plots and the diagnostic plots can be used to evaluate whether a different number of end members may be more accurate.
After the initial set of unmixing calculations is completed, the end member selection dropdown box that is under the list of variables will be enabled. The user can then choose to perform the unmixing calculation for alternate numbers of end members.
Each of the output data tables can be saved to a file using the Ctrl-S keystroke.
The “Source Data” button at the bottom of the dialog will display a table of all of the data used for unmixing, after elimination of missing values.
The “Add Columns” button at the bottom of the dialog is only active when the EM values table is displayed. This button will append those EM values to the data table. The names of the new data table columns will be “EM1”, “EM2”, etc., prefixed with additional text that can be used to distinguish the results of different unmixing calculations. Mapdata will prompt for this prefix, and will separate this prefix from the rest of each column name with an underscore. The results of the unmixing analysis can then be used to select and highlight data based on end member values, change map symbology, create data plots, or carry out other statistical analyses supported by mapdata.
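The column naming pattern described above can be sketched in a line or two. The helper name and the example prefix are hypothetical.

```python
def em_column_names(prefix, n_end_members):
    """Build names such as 'run1_EM1', 'run1_EM2' for the appended columns."""
    return [f"{prefix}_EM{k}" for k in range(1, n_end_members + 1)]

names = em_column_names("run1", 3)
print(names)
```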