Data Plots

The following sections describe how to produce and customize data plots. The data plotting features supported by mapdata provide univariate and bivariate representations of data that can reveal features and relationships that are not easily discerned by viewing just the map or the data table.

The dialog for choosing and displaying a data plot

A new plot can be created by using the Plot/New menu item, the Alt-G keystroke from the main application, or the Clone button on an existing plot dialog. These actions will create a new plot dialog. If the Plot/New menu item has been used, the plotting area of the new plot will be empty. If the Clone button has been used, the new plot will have the same settings and appearance as the plot that was cloned. Any number of plots may be open at the same time, so that you can easily compare and contrast different data selections or different plot types.

Working message

When it starts, mapdata.py will begin evaluating data types and compiling statistics for every column in the data source. This information is needed to ensure that only appropriate types of columns can be selected for the X and Y axes of the plot. If this process has not completed before the Plot/New menu option is selected, a notice will be displayed indicating that evaluation of data types is still ongoing. When this evaluation is complete, the notice will disappear and the plot dialog will appear.

Types of Data Plots

All of the types of plots that can be created from the Plot / New menu are described and illustrated in the following subsections. Pair plots are created from a separate menu item, and use a different approach to selecting variables to be plotted, but otherwise the following descriptions of scatter plots and kernel-density plots apply to pair plots also.

Box plot

Boxplot

The box plot, or box-and-whisker plot, displays the distribution(s) of a single variable, for one or more categories. The central box spans the range of values from the first quartile to the third quartile (the inter-quartile range, or IQR); the colored bar in the middle of the box is the median; the whiskers that extend beyond the box show the data range exclusive of outliers; and any points (dots) beyond the whiskers represent outliers. Points are identified as outliers if they are more than 1.5 times the IQR above the third quartile or below the first quartile. The number of outliers shown on a box plot is also shown in the Bivariate statistics dialog.
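
The outlier rule described above can be sketched in a few lines of Python. This is an illustration only, not mapdata's own code; the function name `boxplot_outliers` is hypothetical, and the quartile interpolation method (`method="inclusive"`) is an assumption, since different tools compute quartiles slightly differently.

```python
import statistics

def boxplot_outliers(values):
    """Return (q1, q3, outliers) using the 1.5 * IQR rule.

    Assumption: quartiles are computed by linear interpolation over the
    data, including the end points (the 'inclusive' method).
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points beyond either fence are drawn as individual dots.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, outliers

q1, q3, outliers = boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(q1, q3, outliers)
```

In this example only the value 100 lies beyond the upper fence, so it is the only point that would be drawn as an outlier.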

Creation of a boxplot requires specification of a quantitative X variable. If a grouping variable is also specified, a separate box plot will be displayed for each unique value of the grouping variable.

By default, box plots are oriented vertically. They can be rotated horizontally with the Alt-R keystroke.

Other types of plots that can be used to visualize data distributions are histograms, stripcharts, kernel density plots, and violin plots.

Breaks groups

Groups defined by Jenks Natural Breaks

The Jenks Natural Breaks algorithm is a deterministic method of clustering one-dimensional data (i.e., values of a single quantitative variable) into two or more groups. Clusters are defined by sets of contiguous samples for which the average deviation of each sample from the mean of the group is less than the deviation from the means of other groups. Clusters, or groups, of values for a single variable may arise because of the presence of outliers, because the data are derived from a mixture of distributions, and for other reasons. The method does not assume or require that the data conform to any statistical distribution. Logarithmically-transformed data may have a different number of breaks than the un-transformed data.
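
To make the idea concrete, the sketch below finds natural breaks for a small data set by brute force: it tries every way of cutting the sorted values into contiguous groups and keeps the split with the smallest total squared deviation from the group means, which is the quantity that Jenks-style methods minimize. This is not the algorithm mapdata uses internally (practical implementations use dynamic programming), and the function names are hypothetical. It is only practical for a handful of values.

```python
from itertools import combinations

def within_group_ss(group):
    """Sum of squared deviations of a group's values from the group mean."""
    mean = sum(group) / len(group)
    return sum((v - mean) ** 2 for v in group)

def natural_breaks_bruteforce(values, n_groups):
    """Exhaustively search for the best split of sorted values into n_groups."""
    xs = sorted(values)
    best_groups, best_ss = None, float("inf")
    # Each combination of cut positions defines one set of contiguous groups.
    for cuts in combinations(range(1, len(xs)), n_groups - 1):
        bounds = (0, *cuts, len(xs))
        groups = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        total_ss = sum(within_group_ss(g) for g in groups)
        if total_ss < best_ss:
            best_groups, best_ss = groups, total_ss
    return best_groups

print(natural_breaks_bruteforce([1, 2, 3, 10, 11, 12, 50, 51], 3))
```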

Mapdata allows the groups defined by the Jenks Natural Breaks algorithm to be viewed in several different ways. The ‘Breaks groups’ plot shows the different groups on the X axis, and the values within each group on the Y axis. Creation of a ‘Breaks groups’ plot requires specification of a single quantitative variable, as the X variable. The result shows a scatter plot of the data divided into two or more groups. The number of groups shown is the optimum in the range from 2 to 8. The method of determining the optimum number of groups can be visualized with a ‘Breaks optimum’ plot.

At least two groups will always be identified and shown on a ‘Breaks groups’ plot. If only two groups are shown, the user must decide whether subdivision of the data into at least two groups is warranted. The ‘Breaks optimum’ plot and other plots that show data divided into Natural Breaks groups can assist with this determination.

Other types of plots that can show groups defined by Jenks Natural Breaks are Normal Q-Q plots, scatter plots, and line plots.

Breaks optimum

Optimum number of groups identified by Jenks Natural Breaks

The Jenks Natural Breaks method does not itself identify how many clusters, or groups, are present within a data set. Mapdata uses a simple heuristic to estimate the optimum number of breaks in a data set. This heuristic is based on the goodness of variance fit (GVF). The GVF is 1.0 minus the ratio of the sum of squared deviations from the group means to the sum of squared deviations from the overall mean, expressed as a percentage. The GVF increases with the number of groups, reaching 100 when the number of groups equals the number of data points.
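
The GVF calculation can be written out as a short Python sketch. The function name `gvf` and the example data are illustrative only; the groups are assumed to have been determined already.

```python
def gvf(groups):
    """Goodness of variance fit, as a percentage, for a list of groups."""
    all_values = [v for group in groups for v in group]
    overall_mean = sum(all_values) / len(all_values)
    # Sum of squared deviations from the overall mean.
    total_ss = sum((v - overall_mean) ** 2 for v in all_values)
    # Sum of squared deviations from each group's own mean.
    within_ss = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        within_ss += sum((v - group_mean) ** 2 for v in group)
    return 100.0 * (1.0 - within_ss / total_ss)

print(round(gvf([[1, 2, 3], [10, 11, 12]]), 2))  # two well-separated groups
print(gvf([[1], [2], [3]]))                      # one group per data point
```

When every data point is its own group, the within-group sum of squares is zero and the GVF is 100.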

The heuristic used by mapdata is to select the number of groups where there is the largest relative change in slope in a plot of GVF versus the number of groups. The plot produced shows the GVF values for each number of groups, with a dot superimposed on the curve at the point where there is the greatest relative change in slope. Such a point is commonly referred to as the ‘knee of the curve’.
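
One possible reading of this heuristic is sketched below. The exact formula mapdata uses for "relative change in slope" is not given here, so the definition in this sketch (the drop in slope at a point, divided by the slope leading up to it) is an assumption, as are the function name `knee_of_curve` and the made-up GVF values.

```python
def knee_of_curve(gvf_by_groups):
    """Pick the number of groups with the largest relative drop in slope.

    gvf_by_groups maps a number of groups to its GVF (percent).
    Assumption: relative change = (slope before - slope after) / slope before.
    """
    ks = sorted(gvf_by_groups)
    best_k, best_change = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        slope_before = (gvf_by_groups[k] - gvf_by_groups[prev_k]) / (k - prev_k)
        slope_after = (gvf_by_groups[next_k] - gvf_by_groups[k]) / (next_k - k)
        if slope_before <= 0:
            continue  # no gain leading up to this point, so no knee here
        change = (slope_before - slope_after) / slope_before
        if change > best_change:
            best_k, best_change = k, change
    return best_k

# Illustrative GVF values only; with one group the GVF is 0 by definition.
example_gvf = {1: 0.0, 2: 60.0, 3: 90.0, 4: 93.0, 5: 95.0, 6: 96.0}
print(knee_of_curve(example_gvf))
```

With these values the curve rises steeply up to three groups and flattens afterwards, so three groups is selected.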

Creation of a ‘Breaks optimum’ plot requires specification of a single quantitative variable as the X variable, just as for the ‘Breaks groups’ plot.

Because it is not possible to find a change in slope when there is just one group, the minimum number of groups that can be identified by this heuristic is two. Thus, even a data set with a smooth continuous distribution will be assessed as having two groups. When only two groups are identified, the user should use the ‘Breaks optimum’ plot and other plots that separate data by Jenks Natural Breaks groups, to decide whether this heuristic produces a reasonable result. Other statistical methods not supported by mapdata may also be useful for this purpose.

Bubble plot

This plot type shows the relationships between three numeric variables, and optionally, a fourth categorical variable. The relationship between two of the numeric variables, X and Y, is shown as on a scatter plot. The third numeric variable, Z, is represented by the size of the symbol on the plot. The area of each symbol is proportional to the value of the Z variable. The maximum symbol size can be adjusted using the “Max. size” setting on the plot dialog.
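
Scaling symbols by area rather than by radius can be illustrated as follows. This is a sketch only: the function names, the `max_area` parameter (standing in for the “Max. size” setting), and the assumption that Z values are non-negative with a positive maximum are all illustrative.

```python
import math

def bubble_areas(z_values, max_area=500.0):
    """Scale Z values so the largest one is drawn with area max_area."""
    z_max = max(z_values)
    return [max_area * z / z_max for z in z_values]

def bubble_radii(areas):
    """Radius of a circle with each given area."""
    return [math.sqrt(area / math.pi) for area in areas]

areas = bubble_areas([1, 2, 4], max_area=400.0)
radii = bubble_radii(areas)
print(areas)
print(radii[2] / radii[0])
```

Because area grows with the square of the radius, a Z value four times as large is drawn with a symbol only twice as wide.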

If a grouping variable is also specified, then each symbol will be colored according to the value of the grouping variable.

Symbols will be partially transparent by default. The transparency (alpha value) can be adjusted with the Alt-A keystroke. The Y axis can be flipped with the Alt-F keystroke.

Count by category

Plot of category counts

This plot type shows the number of rows in the data table for each unique value of a categorical variable. This data summary is shown as a histogram, with the categorical values on the X axis and the counts on the Y axis.

Missing values are not shown on this plot. The number of missing values for each variable can be viewed using the Table/Data types menu option.

Production of this plot requires specification of only a single categorical variable (as X).

The axes of this plot can be rotated with the Alt-R keystroke. The Y axis can be flipped with the Alt-F keystroke when the axes are rotated, that is, when the categories are listed on the Y axis.

Empirical CDF

The empirical cumulative distribution function (CDF) plot shows the cumulative frequency, on the Y axis, of the values of a quantitative variable, on the X axis. The left end of the function is at 0,0, and the right end has a Y value of 1.0 at the maximum X value. Each point on the curve identifies the fraction of all data points that are less than the corresponding X value.
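
A minimal sketch of how the points of an empirical CDF can be computed is shown below. The function name `ecdf` is hypothetical, and the sketch uses the common convention of counting values less than or equal to each X value; mapdata's exact convention may differ.

```python
def ecdf(values):
    """Return (x, cumulative fraction) pairs for an empirical CDF.

    Assumption: the fraction at each x counts values <= x.
    """
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

for x, fraction in ecdf([3, 1, 2, 4]):
    print(x, fraction)
```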

The empirical CDF is commonly convex upward for distributions with a central mode. There is no requirement, or assumption, that the data follow any particular statistical distribution (hence, “empirical”). For data that do conform to a statistical distribution (such as the Gaussian or Normal), the curve will be smooth. Irregularity in the curve may indicate the presence of a mixture distribution.

Creation of an empirical CDF plot requires specification of a single quantitative variable as X.

Histogram

The histogram shows the number of data points (on the Y axis) within certain ranges of values of a quantitative variable (on the X axis).

Creation of a histogram requires specification of a single quantitative variable (as X). By default, the range of X values is subdivided into a number of bins determined by Doane’s rule (Doane 1976). The number of bins can be changed using the Alt-B keystroke. (The number of bins selected also affects Min-max by bin plots.)
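
Doane's rule extends Sturges' rule with a term that adds bins for skewed data: k = 1 + log2(n) + log2(1 + |g1| / s), where g1 is the sample skewness and s = sqrt(6(n - 2) / ((n + 1)(n + 3))). The sketch below works through that formula. The function name `doane_bins` and the choice to round up to a whole number are assumptions; mapdata's own rounding may differ.

```python
import math

def doane_bins(values):
    """Number of histogram bins suggested by Doane's rule (rounded up)."""
    n = len(values)
    if n < 3:
        return 1
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 1  # all values identical
    skewness = m3 / m2 ** 1.5
    sigma_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    k = 1 + math.log2(n) + math.log2(1 + abs(skewness) / sigma_g1)
    return math.ceil(k)

symmetric = list(range(1, 65))          # 64 evenly spaced values
skewed = [1] * 60 + [50, 60, 70, 80]    # 64 values with a long right tail
print(doane_bins(symmetric), doane_bins(skewed))
```

For the symmetric data the skewness term vanishes and the result matches Sturges' rule; the skewed data receive additional bins.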

A grouping variable may also be specified. If it is, stacked bars will be created, with each value of the grouping variable colored differently.

Other types of plots that can be used to visualize distributions of a variable for different categories are box plots, stripcharts, kernel density plots, and violin plots.

Kernel density (KD) plot

Kernel-density plot

A kernel density (KD) plot shows a smoothed representation of the probability density of a quantitative variable. The smoothing method does not assume, or require, that the data conform to any statistical distribution. The smooth representation of the data density may, however, help to reveal whether a particular statistical distribution might apply.

This plot type produces a KD plot of a single quantitative variable, which is specified as the X variable.

If a grouping variable is also specified, each category corresponding to a different value of the grouping variable is shown in a different color, and the plots for individual categories are partially transparent so that areas of overlaps can be distinguished. The opacity (alpha value) can be changed using the Alt-A keystroke. The area under the curve for each category is proportional to the number of observations in that category. Consequently, the KD plot represents the relative sizes of all the categorical subsets, as well as their distributions.
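
The smoothing behind a KD plot can be sketched with a Gaussian kernel, as below. This is an illustration rather than mapdata's implementation: the function name `gaussian_kde`, the choice of a Gaussian kernel, and the normal-reference rule of thumb used for the default bandwidth are all assumptions.

```python
import math
import statistics

def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(values)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb for the bandwidth.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]

values = [1, 2, 3, 4, 5]
step = 0.1
grid = [-10 + step * i for i in range(261)]   # -10 to 16
density = gaussian_kde(values, grid)
area = sum(density) * step                    # approximately 1.0
# Multiplying by the group size makes the area proportional to the count.
scaled = [d * len(values) for d in density]
print(round(area, 3), round(sum(scaled) * step, 3))
```

The unscaled density integrates to approximately 1; multiplying each category's density by its number of observations is one way to make the areas reflect the relative sizes of the categories.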

Other types of plots that can be used to visualize distributions of a variable for different categories are box plots, stripcharts, histograms, and violin plots.

Line plot

Line plot (temporal) Line plot with regression line

This plot shows the relationship between two quantitative variables, X and Y. Successive data points, ordered by the X variable, are connected by a line. Creation of this plot requires, at a minimum, specification of quantitative X and Y variables. Date and date/time values can be used as the X variable. A grouping variable may also be specified; if it is used, a separate line, uniquely colored, will be produced for each value of the grouping variable. Data points for which the X value, the Y value, or the grouping value is missing will not be included in the plot.

In addition to grouping points based on columns of categorical data in the data table, points can be grouped by Jenks Natural Breaks of the X variable, if the X variable is quantitative (not a date or date/time). This grouping option is listed as “* Breaks in X”. Vertical lines delineating Jenks Natural Breaks in the X value can be toggled on and off with Alt-B.

Regression statistics

In addition, if the X variable is numeric (not a date or date/time), a least-squares linear regression line can be fitted to the data and displayed on the plot; this can be toggled on and off with Alt-R. The fitted regression is shown as an orange line. A dialog box showing the regression slope, intercept, and R-squared value can be displayed when the regression is calculated; display of this dialog box can be turned on or off using the Plot / Configure menu item or a configuration file setting. Regression statistics can also be seen by using the Stats / Bivariate menu item.
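
The three statistics shown in the regression dialog (slope, intercept, and R-squared) can be computed for a simple least-squares fit as in the following sketch. The function name `linear_regression` and the example data are illustrative only.

```python
def linear_regression(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

print(linear_regression([0, 1, 2, 3], [1, 3, 5, 7]))
```

The example data lie exactly on the line y = 2x + 1, so the fit recovers that slope and intercept with an R-squared of 1.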

Further, if the X variable is numeric, a local polynomial regression (LOESS) line can be fitted to the data points and displayed on the plot. Fitting and display of the LOESS line can be toggled on and off with Alt-L. The LOESS line is displayed as a black line. With large data sets, there may be a perceptible delay while LOESS fitting is done.

Also, if the X variable is numeric, a Theil-Sen line can be displayed. The Theil-Sen slope is calculated and drawn through the median X and Y values. The Theil-Sen line is shown in green.
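
The Theil-Sen slope is the median of the slopes between all pairs of points, which makes it much less sensitive to outliers than a least-squares fit. The sketch below follows the description above by passing the line through the median X and Y values; the function name `theil_sen` is hypothetical and the code is not taken from mapdata.

```python
import statistics
from itertools import combinations

def theil_sen(xs, ys):
    """Return (slope, intercept) of a Theil-Sen line.

    The slope is the median of all pairwise slopes; the line is drawn
    through the point (median of X, median of Y).
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip pairs with identical X values
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(ys) - slope * statistics.median(xs)
    return slope, intercept

# The last point is an outlier; the other four lie on y = 2x.
print(theil_sen([1, 2, 3, 4, 5], [2, 4, 6, 8, 100]))
```

Despite the outlier, the fitted line matches the trend of the other four points.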

The opacity (alpha value) of the plotted data lines can be adjusted using the Alt-A keystroke. This may be useful when a grouping variable is used and there are multiple categories, or when greater emphasis is to be given to the regression or LOESS lines.

Mean by category

Plot of means by category

This option produces a bar chart showing the mean, or average, of a numeric variable for each value of a categorical, or grouping, variable. To produce this plot, the numeric variable must be specified as X. Bars are vertically oriented by default.

The orientation of the chart can be rotated with the Alt-R keystroke. The Y axis can be flipped with the Alt-F keystroke when the plot is rotated, that is, when the categories are shown on the Y axis.

Min-max by bin

Min-max range by bin

This plot shows the largest and smallest, or first and last, values of a numeric variable, on the X axis by default, for distinct ranges (bins) of another numeric variable (on the Y axis by default). The minimum and maximum values are connected by a solid line.
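
The summary behind this plot can be sketched as follows: the Y variable is divided into bins, and the smallest and largest X values are recorded for each bin. The function name `min_max_by_bin` is hypothetical, and the use of equal-width bins is an assumption made for illustration.

```python
def min_max_by_bin(x_values, y_values, n_bins):
    """Return {bin index: (min x, max x)} for equal-width bins of y."""
    y_min, y_max = min(y_values), max(y_values)
    width = (y_max - y_min) / n_bins
    ranges = {}
    for x, y in zip(x_values, y_values):
        if width > 0:
            # The largest y value falls in the last bin, not a new one.
            index = min(int((y - y_min) / width), n_bins - 1)
        else:
            index = 0
        low, high = ranges.get(index, (x, x))
        ranges[index] = (min(low, x), max(high, x))
    return dict(sorted(ranges.items()))

print(min_max_by_bin([5, 1, 9, 3, 7, 2], [0, 1, 2, 8, 9, 10], n_bins=2))
```

Bins that contain no data points are simply absent from the result.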

This type of plot is related to a boxplot or categorical stripchart, but:

  • emphasizes only the minimum and maximum values of the X variable,

  • allows alternate binning of the Y variable, instead of using fixed categories, and

  • by default is rotated 90 degrees.

The number of bins used for the Y variable can be changed with the Alt-B keystroke. The number of bins selected is also used for histogram plots.

The X and Y axes of the plot can be reversed with the Alt-R keystroke, and the order of values on the Y axis can be flipped with the Alt-F keystroke.

Min-max by category

Min-max

This plot shows the largest and smallest, or first and last, values of a quantitative variable, on the X axis by default, for each unique value of a categorical variable on the Y axis (by default). The minimum and maximum values are connected by a solid line.

This type of plot is related to a boxplot or categorical stripchart, but emphasizes only the minimum and maximum values of the quantitative X variable and by default is rotated 90 degrees.

Both date and date/time variables can be used on the X axis, and date variables can be used on the Y axis. This type of plot is therefore better suited than some others for showing temporal limits or ranges.

The X and Y axes of the plot can be reversed with the Alt-R keystroke, and the order of values on the Y axis can be flipped with the Alt-F keystroke.

Normal Q-Q plot

Q-Q plot

The Normal quantile-quantile, or Q-Q, plot plots the actual quantiles of a quantitative variable against the theoretical quantiles if the data followed a Normal (Gaussian) distribution. Each data point is shown as a separate symbol (a circle). The plot includes a 1:1 line that represents the relationship that is expected if the data set is indeed Normally distributed.
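
The points of a Normal Q-Q plot can be computed as in the sketch below. Two details are assumptions made for illustration: the plotting positions (i + 0.5) / n, and the use of a Normal distribution with the same mean and standard deviation as the data, which is what allows comparison against a 1:1 line. The function name `normal_qq_points` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Return (theoretical quantile, observed value) pairs."""
    observed = sorted(values)
    n = len(observed)
    # Normal distribution matched to the data's mean and standard deviation.
    dist = NormalDist(mean(observed), stdev(observed))
    theoretical = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

for theoretical, observed in normal_qq_points([2, 4, 4, 5, 7, 9, 11]):
    print(round(theoretical, 2), observed)
```

Points that fall close to the line where the two values are equal are consistent with a Normal distribution.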

Creation of a Q-Q plot requires only the specification of a quantitative X variable.

Deviations of the plotted points from the 1:1 line indicate deviations from Normality. These deviations may arise because the data set is not Normally distributed, contains outliers, or consists of a mixture distribution. Subgroups of samples that are defined by Jenks Natural Breaks can be shown in different colors by using the Alt-G keystroke.

Scatter plot

Scatter plot (grouped)

The scatter plot shows the relationship between two quantitative variables (X and Y). Each data point is represented by a symbol (a circle).

A grouping variable may also be specified; if it is used, a separate set of dots, uniquely colored, will be displayed for each value of the grouping variable. Data points for which the X value, the Y value, or the grouping value is missing will not be included in the plot. In addition to grouping points based on columns of categorical data in the data table, points can be grouped by Jenks Natural Breaks. These grouping options are listed as “* Breaks in X” and “* Breaks in Y”.

Dots on the scatter plot will be partially transparent by default. The opacity (alpha value) can be changed using Alt-A.

Regression statistics

A least-squares linear regression line can also be displayed on the plot; this can be toggled on and off with Alt-R. The regression line is shown as an orange line. A dialog box showing the regression slope, intercept, and R-squared value can be displayed when the regression is calculated; display of this dialog box can be turned on or off using the Plot / Configure menu item or a configuration file setting. The Stats / Bivariate menu option can also be used to display a similar scatter plot with additional regression details and other bivariate statistics.

In addition, a local polynomial regression (LOESS) line can be fitted to the data points and displayed on the plot; this can be toggled on and off with Alt-L. The LOESS line is shown as a black line.

Also, if the X variable is numeric, a Theil-Sen line can be displayed. This line can be toggled on and off with the Alt-S keystroke. The Theil-Sen slope is calculated and drawn through the median X and Y values. The Theil-Sen line is shown in green.

Lines delineating Jenks Natural Breaks can be toggled on and off with Alt-B (at least one line, or two groups, will always be shown; the user must decide if these are reasonable).

The Y axis can be flipped with the Alt-F keystroke.

Stripchart

This option displays the distribution of a single quantitative variable as a jittered stripchart. A symbol (a circle) is shown at each value of the quantitative variable. The variable’s range of values is shown on the Y axis by default. To help visualize data density when there are many identical or similar values, each data point is jittered slightly on the X axis (by default) to spread them out and reduce overplotting. In addition, symbols are partially transparent so that overplotted symbols are darker than isolated single symbols. The opacity (alpha value) can be changed using the Alt-A keystroke.
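
Jittering can be sketched as adding a small random offset to the position of each point along the axis that does not carry the data values. The function name `jitter_positions`, the uniform distribution, and the default width are assumptions for illustration.

```python
import random

def jitter_positions(n_points, center=0.0, width=0.2, seed=None):
    """Return n_points positions spread uniformly around center.

    Each position lies within width / 2 of center.
    """
    rng = random.Random(seed)
    return [center + rng.uniform(-width / 2, width / 2) for _ in range(n_points)]

values = [3.1, 3.1, 3.1, 3.2, 7.5]
positions = jitter_positions(len(values), center=1.0, width=0.2, seed=42)
# Each (position, value) pair is one symbol on a vertical stripchart.
for position, value in zip(positions, values):
    print(round(position, 3), value)
```

The three identical values of 3.1 are drawn at different horizontal positions, so they no longer hide one another.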

If a grouping variable is also specified, a separate stripchart is displayed for each unique value of the grouping variable.

The orientation of the chart can be rotated with the Alt-R keystroke.

Other types of plots that can be used to visualize distributions of a variable for different categories are box plots, histograms, kernel density plots, and violin plots.

Total by category

Plot of totals by category

This option produces a bar chart showing the total, or sum, of a numeric variable for each value of a categorical, or grouping, variable. To produce this plot, the numeric variable must be specified as X. Bars are vertically oriented by default.

The orientation of the chart can be rotated with the Alt-R keystroke. The Y axis can be flipped with the Alt-F keystroke when the plot is rotated, that is, when the categories are shown on the Y axis.

Violin plot

The violin plot shows the distribution of a quantitative variable as a symmetrical smoothed density distribution throughout the range of the data. A single quantitative variable must be specified. On the vertical (by default) central axis of the violin plot is a small box-and-whisker plot.

If a grouping variable is also specified, a separate violin plot will be displayed for each unique value of the grouping variable.

The orientation of the chart can be rotated with the Alt-R keystroke.

Other types of plots that can be used to visualize distributions of a variable for different categories are box plots, stripcharts, kernel density plots, and histograms.

Y range plot

Y-range plot

This plot shows the minimum and maximum values of one quantitative variable, on the Y axis, over a range of different values of another quantitative variable on the X axis. The area between minimum and maximum Y values, for a range of contiguous X values, is filled with a solid color.

Creating a Y-range plot requires specification of quantitative variables for both X and Y axes. Dates and date/time variables can be used in addition to numerical variables.

The order of values on the Y axis can be flipped with the Alt-F keystroke.

Working With Data Plots

Selecting Data for Plotting

By default, all data in the data table will be used for a plot. If the “Selected data only” checkbox on the plot dialog is checked, only the data points that are highlighted in the table will be used for plotting.

By default, plots are updated automatically with any changes made to data selections in the table or map. If the “Auto-update” checkbox is un-checked, a plot of selected data only will no longer be updated when changes are made to selections on the map and table.

Missing data are not included on any plots. The Table / Data types menu option can be used to view the number of missing values for each column of the data table.

Transforming Data Used for Plotting

The checkboxes to the right of the X and Y variable selectors on the plotting dialog will transform the data by calculating the base-10 logarithm of the data values; the transformed values will be used for plotting instead of the original values. If the logarithm cannot be computed for any X or Y value, the corresponding checkbox will be disabled and the un-transformed data will be used. Logarithms cannot be computed for non-positive numbers or for date and date/time variables.
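
The rule that a transformation is only available when every value is positive can be sketched as follows. The function name `log10_transform` and the use of `None` to signal that the transformation is unavailable are illustrative choices, not mapdata's interface.

```python
import math

def log10_transform(values):
    """Return base-10 logarithms of values, or None if any value is <= 0."""
    if any(v <= 0 for v in values):
        return None  # the transformation cannot be applied to this variable
    return [math.log10(v) for v in values]

print(log10_transform([1, 10, 100]))
print(log10_transform([1, 0, 5]))
```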

Viewing Data Used for Plotting

After a plot type and the required data have been selected, the “Source Data” and “Plot Data” buttons at the bottom of the dialog can be used to view the data. The “Source Data” button produces a table of the data to be used, with missing values eliminated and any log-transformation applied. The “Plot Data” button will show these same data as they have been summarized for plotting. For example, for box plots, the “Plot Data” button will display a table showing all of the data values in each of the chosen categories. The Ctrl-S keystroke will allow the data to be saved to a file.

Adding and Modifying the Plot Title and Axis Labels

The following keyboard shortcuts can be used to modify the plot title and axis labels:

  • Alt-T – Add or change the title.

  • Alt-X – Change the label on the X axis.

  • Alt-Y – Change the label on the Y axis.

Hotkeys for Customizing Specific Plot Types

Some types of plots can be customized by changing the opacity of lines or symbols or by adding additional information to the plots. These customizations are carried out using keystroke commands (hotkeys). The keystroke commands should be used when a plot window has the focus.

The hotkeys that can be used to customize plots are:

  • Alt-A – Change the opacity (alpha value) of symbols on bubble plots, scatter plots, line plots, stripcharts, and kernel density plots.

  • Alt-B – For histograms and binned min-max plots: change the number of bins used. For scatter plots: toggle on or off the display of vertical and horizontal lines delineating the Jenks Natural Breaks in X and Y variables, respectively. For line plots: toggle on or off the display of vertical lines delineating the Jenks Natural Breaks in the X variable.

  • Alt-F – Flip the Y axis values on min-max plots; bar plots of counts, totals, and means; bubble plots; scatter plots; and Y-range plots. The Y axis on bar plots can only be flipped when the categories have been rotated to be on the Y axis.

  • Alt-G – Toggle the coloring of points on the Normal Q-Q plot to correspond to groups defined by the Jenks Natural Breaks method.

  • Alt-L – Toggle the display of a LOESS smoothing line on line plots and scatter plots.

  • Alt-R – For line plots and scatter plots: toggle the display of an ordinary least-squares linear regression line. For box plots; bar plots of counts, totals, and means; min-max plots; stripcharts; and violin plots: rotate the X and Y axes.

  • Alt-S – For scatter plots, toggle the display of a Theil-Sen line.

Additional plot customization can be carried out by modifying settings using the plot configuration dialog.

Modifying the Range of X and Y Values Displayed

There are two ways to control the range of X and Y values that are displayed on a plot.

The first way is to select only the data that falls within the desired range(s) and then use only the selected data for the plot. The Selections / Set by query dialog is a more reliable way of selecting data by range than clicking in the table or on the map.

The second way is to use the ‘zoom’ control (the magnifying glass) on the control bar at the bottom of the plot window. After this control is selected by clicking on it, clicking and dragging a rectangle on the plot will zoom the plot to the selected range of data.