Statistical properties of L1- and L2-regularized NMF

An intuitive take on what exactly L1- and L2-regularized NMF actually does

Key Takeaways

For non-negative matrix factorization:

  • L1 and L2 regularization require diagonalization (factorization of the form \(A = wdh\))
  • L1 is a sparsifying, L2 is densifying
  • L1 increases angle between factors, L2 decreases angle between factors
  • L1 penalties cause factors to converge collectively towards a k-means clustering model, L2 penalties cause each factor to converge individually towards the first singular vector

Regularizing NMF

Regularizations are intended to improve the interpretability or identifiability of linear models. Consider the least squares problem \(ax = b\), for which common regularizations include:

  • L1/LASSO regularization: absolute shrinkage, penalty subtracted from \(b\)
  • L2/Ridge regularization: convex shrinkage, penalty added to diagonal of \(a\)

In a typical non-negative least squares (NNLS) fit, these regularizations behave usefully. For example, an L1 penalty equal to the maximum value in \(b\) will ensure complete sparsity of the solution.

Now consider NMF by alternating least squares. NMF differs from one-off least squares problems in several ways:

  • It is iterative
  • The initial distribution of the models are unknown (i.e. projection of random factors)
  • The distribution of a model at a given iteration is dependent on that of the models at all previous iterations

Thus, NMF regularizations have a chain effect: a change in one iteration will lead to a change in information and distribution in the next, and so forth. Thus, if the distribution of the model is not controlled after each update, penalties will cause the model to spiral out of control.

Controlling NMF model distributions during updates

NMF minimizes \(A = wh\). The least squares update of \(h\), column \(j\) given \(A\) and \(w\) is:

\[w^Twh_j = w^TA_j\]

Correspondingly, the least squares update of \(w\), row \(j\), given \(A\) and \(h\) is:

\[hh^Tw_j = hA^T_j\] These equations are in the form \(ax = b\). For instance, in the update of \(h\), \(a = w^Tw\) and \(b = w^TA_j\).

For a regularization penalty strictly in the range (0, 1], we want to guarantee that the penalty will be consistent across random NMF restarts, different datasets, and across alternating least squares updates. To guarantee consistent application of the penalty, we need to control the distribution of \(a\) and \(b\).

The distribution of a model can be controlled by diagonalizing the NMF model, such that \(A = wdh\), where columns in \(w\) and rows in \(h\) are scaled to sum to 1 by a scaling diagonal, \(d\). Factors need not scale to 1, it could be any constant value, but 1 provides nice interpretability.

Diagonalized NMF enables convex regularization

Let’s load the hawaiibirds dataset and factorize the data at several L1 and L2 penalties, with and without model diagonalization, also calculating various statistics such as sparsity, similarity to k-means clustering, and similarity to the first singular vector.

# devtools::install_github("zdebruine/RcppML")
A <- hawaiibirds$counts
alphas <- c(c(1, 3, 5, 9) %o% 10^(-3:-1)) # c(seq(0, 0.1, 0.005), seq(0.11, 0.5, 0.01)) # seq(0, 0.98, 0.02)
seeds <- c(123, 456, 789)
kmeans_centers <- t(kmeans(t(as.matrix(A)), 10)$centers)
svd1 <- nmf(A, 1)@w
df <- data.frame()
for(alpha in alphas){
  for(seed in seeds){
    for(diag in c(FALSE, TRUE)){
      m <- nmf(A, 10, seed = seed, diag = diag)
        for(penalty in c("L1", "L2")){
        m_ <- nmf(A, 10, seed = seed, diag = diag,
                   L1 = ifelse(penalty == "L1", alpha, 0), 
                   L2 = ifelse(penalty == "L2", alpha, 0),
        df <- rbind(df, data.frame(
          "alpha" = alpha,
          "seed" = seed,
          "diag" = diag,
          "penalty" = penalty,
          "sparsity" = sum(m_@w == 0) / prod(dim(m_@w)),
          "robustness" = 1 - bipartiteMatch(1 - cosine(m_@w, m@w))$cost/10,
          "mse" = evaluate(m_, A),
          "mean_angle" = mean(cosine(m_@w)),
          "kmeans" = bipartiteMatch(1 - cosine(kmeans_centers, m_@w))$cost/10,
          "svd1" = sum(cosine(m_@w, svd1))/10,
          "color" = ifelse(penalty == "L1", alpha^0.25, -alpha^0.25)
df$penalty <- factor(df$penalty)
df$seed <- factor(df$seed)


  • Diagonal scaling guarantees consistent regularization between independent replicates (compare a, c with b, d)
  • L1 regularization increases sparsity of factor models (b) while L2 regularization promotes density of the model (d)
  • L1 = 1 guarantees complete sparsity (b) while L2 = 1 guarantees complete density (d)

We might not have expected that L2 is a densifying factorization. Why is this? L2 convexly shrinks values towards zero, and as such decreases the condition number of \(a\). This means signals will be encouraged to “squash” together, and factors in the resulting model will begin to describe similar signal. As this occurs, the model naturally becomes denser until a point is reached that the objective is minimized (at convergence).

Properties of L1- and L2-regularized NMF

Let’s consider how L1 and L2 regularizations affect the robustness of information content of factor models relative to the unregularized equivalent, and how they affect the mean squared error loss of the models.

As a measure of the robustness of information content, we use the mean cost of bipartite matching between L1-regularized and unregularized \(w\) models on a cosine similarity matrix.

Notice how the L2 penalties tend to be much harsher than the L1 penalties. However, both penalties cause movement of the model away from the unregularized state.

Within the models themselves, we can examine how similar factors are to one another by measuring the mean cosine angle:

ggplot(subset(df, diag == TRUE & seed == 123), aes(x = alpha, y = mean_angle, color = penalty)) +
  geom_point() + labs(x = "alpha", y = "mean cosine angle\nbetween factors") +
  theme_classic() + theme(aspect.ratio = 1) + scale_x_continuous(trans = "sqrt") +
  stat_smooth(se = F)

We can see that L1 penalty increases the distance between factors, while L2 penalty increases the similarity between factors.

Now let’s take a look at how L1 and L2 penalties affect the sparsity of factors, and also calculate the similarity of these models to a k-means clustering or the first singular vector (given by a rank-1 NMF):

L1 is sparsifying while L2 is densifying.

Here, L1 promotes a k-means clustering model while L2 promotes convergence towards the first singular vector.

Interpreting L1- and L2-regularized factor models

We’ll select regularization parameters for further analysis based on a cosine angle of about 0.25 away from the original model:

model    <- nmf(A, 10, tol = 1e-6, seed = 123)
model_L1 <- nmf(A, 10, tol = 1e-6, seed = 123, L1 = 0.2)
model_L2 <- nmf(A, 10, tol = 1e-6, seed = 123, L2 = 0.02)

Take a look at the clustering of factors in the \(w\) models on UMAP coordinates:

Similar information is clearly being captured by each of the models, but let’s see in what way.

We’ll align factors in the regularized models to the unregularized models, and then compare specific factors.

biplot <- function(model1, model2, factor){
  df <- data.frame("model1" = model1$w[, factor], "model2" = model2$w[, factor], "label" = rownames(model1$w))
  ggplot(df, aes(x = model1, y = model2, label = label)) + geom_point() + theme_classic() + geom_text_repel(size = 2.5)

model_L1 <- align(model_L1, model)
model_L2 <- align(model_L2, model)

These are very harsh penalties, so notice how L1 can over-sparsify things, while L2 can generate factors that are so dense the information is hardly specific or informative.

A happy medium for sparsifying (or densifying) regularization certainly exists, and this is an objective hyperparameter that must be determined against the objectives of the analysis. Unfortunately, there is nothing against which to optimize – this appears to be a matter of statistical taste.

Future directions

  • Effect of L1 and L2 regularizations on factorization rank
  • Intuition behind one-sided L1 and L2 regularization
  • Intuition behind combined L1/L2 or one-sided L1 vs. one-sided L2
Zach DeBruine
Zach DeBruine
Assistant Professor of Bioinformatics

Assistant Professor of Bioinformatics at Grand Valley State University. Interested in single-cell experiments and dimension reduction. I love simple, fast, and common sense machine learning.