This vignette has been stripped down to comply with CRAN package size policies. To view the complete vignette, including graphics, see the package website.
This is an example of exploratory LCA with continuous indicators, or finite Gaussian mixture modeling, using tidySEM. The present example uses data collected by Alkema as part of a study on ocean microplastics. To view its documentation, run the command `?tidySEM::alkema_microplastics` in the R console. The original analyses are available at https://github.com/cjvanlissa/lise_microplastics; in this vignette, we take a different approach to the analysis to showcase other possibilities.
To load the data, simply attach the tidySEM package. For convenience, we assign the variables used for analysis to an object called `df`. As explained in the paper, the classes are quite different for lines, films, and fragments. For this reason, we here only use data from fragments. The indicators are fragments' length and width, as in the sketch below.
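A minimal sketch of this step; the filter assumes the fragment type is recorded in a column named `category` with the value `"Fragment"`, which may differ in the actual data (see `?tidySEM::alkema_microplastics`):

```r
library(tidySEM)
df <- alkema_microplastics
# Keep only fragments; 'category' and "Fragment" are assumed names
df <- df[df$category == "Fragment", ]
# Retain only the indicator variables for the analysis
df <- df[, c("length", "width")]
```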
As per best practices, the first step in LCA is examining the observed data. We use `tidySEM::descriptives()` to describe the data numerically. Because all items are continuous, we remove columns pertaining to categorical data to de-clutter the table:
```r
desc <- tidySEM::descriptives(df)
desc <- desc[, c("name", "type", "n", "missing", "unique",
                 "mean", "median", "sd", "min", "max", "skew_2se", "kurt_2se")]
knitr::kable(desc, caption = "Descriptive statistics")
```
name | type | n | missing | unique | mean | median | sd | min | max | skew_2se | kurt_2se |
---|---|---|---|---|---|---|---|---|---|---|---|
length | numeric | 5605 | 0 | 2086 | 2.9 | 2.4 | 1.9 | 1.0 | 69.2 | 137 | 2116 |
width | numeric | 5605 | 0 | 2079 | 2.0 | 1.6 | 1.1 | 0.2 | 6.8 | 22 | 37 |
Additionally, we can plot the data. The ggplot2 function `geom_density()` is useful for continuous data:
```r
library(ggplot2)
# Reshape the data to long format for faceted plotting
df_plot <- df
names(df_plot) <- paste0("Value.", names(df_plot))
df_plot <- reshape(df_plot, varying = names(df_plot), direction = "long",
                   timevar = "Variable")
ggplot(df_plot, aes(x = Value)) +
  geom_density() +
  facet_wrap(~Variable) +
  theme_bw()
```
The data are correctly coded as `numeric`. There are no missing values; if any variables had missing values, we would report an MCAR test with `mice::mcar()`, and explain that missing data are accounted for using FIML.
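Although no data are missing here, such a test could hypothetically be run as follows; this sketch assumes a version of the mice package that provides `mcar()` (3.15 or later):

```r
# Jamshidian & Jalal's non-parametric MCAR test
# (shown for illustration only; df contains no missing values)
mice::mcar(df)
```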
Note that the data are extremely right-skewed and kurtotic, as is also evident from the plot. With this in mind, it can be useful to transform and rescale the data. We will use a log transformation:
```r
df_plot$Value <- log(df_plot$Value)
ggplot(df_plot, aes(x = Value)) +
  geom_density() +
  facet_wrap(~Variable) +
  theme_bw()
```
The log transformation addresses all aforementioned concerns regarding skew and kurtosis. Let's also apply it to the wide-format analysis data and examine a scatterplot:
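A minimal sketch of this step, assuming the log transform is applied directly to the wide-format `df` used for the mixture models below:

```r
# Log-transform the indicators in the wide-format analysis data
df$length <- log(df$length)
df$width <- log(df$width)
ggplot(df, aes(x = length, y = width)) +
  geom_point(alpha = 0.1) +
  theme_bw()
```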
As all variables are continuous, we can use the convenience function `tidySEM::mx_profiles()`, which is a wrapper for the generic function `mx_mixture()`, optimized for continuous indicators. Its default settings are appropriate for LPA, assuming fixed variances across classes and zero covariances. Its main arguments are `data` and the number of `classes`. All variables in `data` are included in the analysis, which is why we first selected the indicator variables.
As this is an exploratory LCA, we will conduct a rather extensive search across model specifications and numbers of classes. We will set the maximum number of classes \(K\) to four; depending on the results, we can always choose to increase it later. We set a seed to ensure replicable results. As the analysis takes a long time to compute, it is prudent to save the results to disk immediately, so as not to lose them. For this, we use the function `saveRDS()`. We can later use `res <- readRDS("res_gmm.RData")` to load the analysis from the file.
```r
set.seed(123)
res <- mx_profiles(data = df,
                   classes = 1:4,
                   variances = c("equal", "varying"),
                   covariances = c("zero", "equal", "varying"),
                   expand_grid = TRUE)
saveRDS(res, "res_gmm.RData")
```
To compare the fit of the estimated models, we create a model fit table using `table_fit()` and retain relevant columns. We also determine whether any models can be disqualified.
```r
fit <- table_fit(res)
```
There were no indications of convergence problems during estimation. Next, we check for local identifiability. The sample size is 5605. We can calculate the ratio of observations to parameters and append it to the fit table as follows:
```r
# Observations in the smallest class, relative to parameters per class
fit$par_ratio <- (5605 * fit$n_min) / (fit$Parameters / fit$Classes)
```
As can be seen from the fit table below, the lowest ratio of observations to parameters is 15, which is no cause for concern. Note, however, that the smallest classes in these models comprise only a very small percentage of the total sample size.
There are, however, concerns about the interpretability of all solutions, as many of the entropies and minimum classification probabilities are low. Only a few models have acceptable entropies around .86 and minimum classification probabilities around .94. Note that the BIC and the entropy are strongly correlated. If we omit the 1-class models, for which entropy is technically not defined, we see that `cor(fit$BIC[!fit$Classes == 1], fit$Entropy[!fit$Classes == 1])` returns 0.85. This strong positive correlation indicates that an increase in fit comes with a decrease in class separability, which illustrates why entropy should not be treated as a model fit criterion.
fit[ , c("Name", "LL", "Parameters", "par_ratio",
"BIC", "Entropy",
"prob_min", "prob_max",
"n_min", "n_max",
"lmr_p")]
Name | LL | Parameters | par_ratio | BIC | Entropy | prob_min | prob_max | n_min | n_max | lmr_p |
---|---|---|---|---|---|---|---|---|---|---|
equal var 1 | -8107 | 4 | 1401 | 16249 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA |
equal var 2 | -5211 | 7 | 564 | 10483 | 0.87 | 0.94 | 0.97 | 0.35 | 0.65 | 0 |
equal var 3 | -4138 | 10 | 342 | 8363 | 0.83 | 0.88 | 0.95 | 0.20 | 0.46 | 0 |
equal var 4 | -3500 | 13 | 246 | 7113 | 0.83 | 0.89 | 0.93 | 0.14 | 0.35 | 0 |
free var 1 | -8107 | 4 | 1401 | 16249 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA |
free var 2 | -5138 | 9 | 501 | 10353 | 0.85 | 0.94 | 0.97 | 0.40 | 0.60 | 0 |
free var 3 | -4005 | 14 | 374 | 8131 | 0.83 | 0.89 | 0.95 | 0.31 | 0.35 | 0 |
free var 4 | -3331 | 19 | 245 | 6826 | 0.84 | 0.88 | 0.93 | 0.21 | 0.33 | 0 |
equal var, equal cov 1 | -3389 | 5 | 1121 | 6820 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA |
equal var, equal cov 2 | -3082 | 8 | 432 | 6234 | 0.72 | 0.86 | 0.95 | 0.31 | 0.69 | 0 |
equal var, equal cov 3 | -3030 | 11 | 256 | 6155 | 0.67 | 0.71 | 0.93 | 0.17 | 0.56 | 0 |
equal var, equal cov 4 | -3021 | 14 | 220 | 6162 | 0.61 | 0.68 | 0.83 | 0.14 | 0.34 | 0 |
free var, equal cov 1 | -3389 | 5 | 1121 | 6820 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA |
free var, equal cov 2 | -2545 | 10 | 97 | 5176 | 0.64 | 0.51 | 0.98 | 0.09 | 0.91 | 0 |
free var, equal cov 3 | -2257 | 15 | 72 | 4643 | 0.68 | 0.54 | 0.94 | 0.06 | 0.64 | 0 |
free var, equal cov 4 | -2069 | 20 | 26 | 4310 | 0.63 | 0.52 | 0.90 | 0.02 | 0.50 | 0 |
equal var, free cov 1 | -3389 | 5 | 1121 | 6820 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA |
equal var, free cov 2 | -2552 | 9 | 108 | 5181 | 0.65 | 0.52 | 0.98 | 0.09 | 0.91 | 0 |
equal var, free cov 3 | -2359 | 13 | 84 | 4831 | 0.68 | 0.56 | 0.93 | 0.07 | 0.65 | 0 |
equal var, free cov 4 | -2174 | 17 | 28 | 4494 | 0.63 | 0.60 | 0.88 | 0.02 | 0.50 | 0 |
free var, free cov 1 | -3389 | 5 | 1121 | 6820 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | NA |
free var, free cov 2 | -2575 | 11 | 407 | 5245 | 0.56 | 0.81 | 0.91 | 0.40 | 0.60 | 0 |
free var, free cov 3 | -2111 | 17 | 36 | 4370 | 0.65 | 0.51 | 0.88 | 0.04 | 0.56 | 0 |
free var, free cov 4 | -2024 | 23 | 15 | 4247 | 0.62 | 0.50 | 0.91 | 0.02 | 0.48 | 0 |
Next, we plot a scree plot for the BIC by calling `plot(fit)`:
```r
plot(fit) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```
It is not immediately clear which solution to prefer. Looking at the blocks of 1-4 class models for each model specification, it appears that the BIC keeps decreasing with the addition of more classes. Across the blocks, the BIC keeps decreasing with increasingly complex model specifications. Similarly, all LMR tests are significant. The function `ic_weights(fit)` allows us to compute IC weights for all models in the set; these weights prefer the most complex model with a posterior model probability of nearly 100%. However, the smallest class in this model contains only 85 cases, about 1.6% of the sample. It is questionable whether such a small class has a meaningful interpretation.
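For reference, a sketch of this comparison; `ic_weights()` is part of tidySEM, though the exact shape of its return value may differ across versions:

```r
# Posterior model probabilities (IC weights) for all models in the fit table
w <- ic_weights(fit)
w
```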
The analysis thus indicates support for increasingly complex models, but those complex models are ever less interpretable and relevant, as indicated by low entropy and small class sizes, respectively. This suggests a potential risk of overfitting. We may instead choose the most parsimonious model that fits the data well. To aid in this decision, we plot the BIC and the entropy, omitting the 1-class solutions because there is clear evidence that more classes are needed:
```r
df_plot <- fit
df_plot <- df_plot[!df_plot$Classes == 1, ]
ggplot(df_plot, aes(x = BIC, y = Entropy, label = Name)) +
  geom_point() +
  geom_label() +
  theme_bw()
```
It appears that the 2-class model with equal variances and covariances has above-average fit and relatively high entropy. We thus proceed with this model.
For convenience, we assign the final model to a separate object:

```r
res_final <- res[["equal var, equal cov 2"]]
```
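The class sizes and classification diagnostics discussed below can be extracted with tidySEM's `class_prob()`; a minimal sketch, assuming its default output includes class proportions and classification probabilities:

```r
# Class proportions and posterior classification probabilities
class_prob(res_final)
```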
The 2-class model yielded classes of reasonable size; the largest class comprised 68% and the smallest comprised 32% of cases. The entropy was relatively low, \(S = .72\), indicating poor class separability. Furthermore, the posterior classification probabilities fell in the range \([.86, .95]\), which means that classification error was non-negligible. We produce a table of the results below.
```r
table_results(res_final, columns = c("label", "est", "se", "confint", "class"))
```
label | est | se | confint | class |
---|---|---|---|---|
mix2.weights[1,2] | 0.46 | 0.02 | [0.42, 0.50] | NA |
Variances.length | 0.11 | 0.00 | [0.10, 0.11] | class1 |
Covariances.length.WITH.width | 0.08 | 0.00 | [0.07, 0.08] | class1 |
Variances.width | 0.10 | 0.00 | [0.09, 0.10] | class1 |
Means.length | 0.68 | 0.01 | [0.66, 0.69] | class1 |
Means.width | 0.30 | 0.01 | [0.29, 0.32] | class1 |
Means.length | 1.51 | 0.01 | [1.49, 1.53] | class2 |
Means.width | 1.12 | 0.01 | [1.10, 1.15] | class2 |
However, the results are best interpreted by examining a plot of the model and data. Relevant plot functions are `plot_bivariate()`, `plot_density()`, and `plot_profiles()`. We omit the density plots here, because `plot_bivariate()` also includes them.
```r
plot_bivariate(res_final)
```
On the diagonal of the bivariate plot are weighted density plots: normal approximations of the density function of observed data, weighted by class probability. On the off-diagonal are plots for each pair of indicators, with the class means indicated by a point, class standard deviations indicated by lines, and covariances indicated by circles.
The bivariate and marginal plots show that the classes are not clearly separable, as also evident from the low entropy. At the same time, however, it is clear that the distributions are non-normal, and the second class accounts for some of this non-normality. The first class (68%) accounts for smaller fragments, and the second class (32%) accounts for some of the right-skew in fragments' length and width. We can simply label class 1 as small fragments and class 2 as larger fragments.
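The profile plot mentioned above offers a complementary view of the class-specific estimates; a minimal sketch:

```r
# Plot the estimated means and variances per class
plot_profiles(res_final)
```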
Finally, we may want to compare the different classes on auxiliary variables or models. The `BCH()` function applies three-step analysis, which compares the classes using a multi-group model, controlling for classification error. For example, we can test whether polymer type differs between the two classes:
```r
# df_analyze is assumed to be the fragments data from before we selected the
# indicator columns, so it still contains the polymer type variable
df_pt <- mx_dummies(df_analyze$poly_type)
aux_pt <- BCH(res_final, model = "poly_typeOther | t1
poly_typePE | t1
poly_typePP | t1", data = df_pt)
# Ordinal indicators can be difficult to estimate; use mxTryHardOrdinal()
aux_pt <- mxTryHardOrdinal(aux_pt)
```
To obtain an omnibus likelihood ratio test of the significance of the differences in polymer type across classes, use `lr_test(aux_pt)`. The results indicate that there are significant differences in polymer types across classes, \(\Delta LL(3) = 14.08, p = .003\). The results can be reported on the probability scale using `table_prob(aux_pt)`, for example:
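Both calls as a runnable sketch (the exact shape of the output may vary by tidySEM version):

```r
# Omnibus LR test of class differences in polymer type
lr_test(aux_pt)
# Thresholds converted to the probability scale, per class
table_prob(aux_pt)
```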
To test differences for specific polymer types, we can use Wald tests:
```r
wald_test(aux_pt, "class1.Thresholds[1,1] = class2.Thresholds[1,1];class1.Thresholds[1,2] = class2.Thresholds[1,2];class1.Thresholds[1,3] = class2.Thresholds[1,3]")
```
The results indicate that there is no significant difference in the prevalence of “Other” polymer types across classes. However, PE is significantly more prevalent in class 2, and PP is significantly more prevalent in class 1.