Learning Optimal NMF Models from Random Restarts

Initializing NMF with NNDSVD, random uniform, or random Gaussian models

NMF Initialization

Non-negative matrix factorization (NMF) is NP-hard (Vavasis, 2007). As such, the best NMF can do in practice is find the best discoverable local minimum from some set of initializations.

Non-negative Double SVD (NNDSVD) has previously been proposed as a “head-start” for NMF (Boutsidis, 2008). However, SVD and NMF models are usually nothing alike: SVD factors are sequentially interdependent, while NMF factors are collinearly interdependent. Thus, whether “non-negative” SVD is useful remains unclear.

Random initializations are the most popular and promising method for NMF initialization. It is generally useful to attempt many random initializations to discover the best possible solution.

In this post I explore a number of initializations on the hawaiibirds, aml, and movielens datasets, and a small single-cell dataset.


  • SVD-based initializations (such as NNDSVD) are slower than random initializations, sometimes do worse, and are never better.
  • Multiple random initializations are useful for recovering the best discoverable NMF solution.
  • Normal random distributions (e.g. rnorm(mean = 2, sd = 1)) slightly outperform uniform random distributions (e.g. runif(min = 1, max = 2)) at finding the best NMF solution.

Non-negative Double SVD

The following is an implementation of NNDSVD, adapted from the NMF package. In this function, the use of irlba is a key performance improvement, and we do not do any form of zero-filling as I have found that this does not affect the outcome of RcppML NMF:

nndsvd <- function(data, k) {

  .pos <- function(x) { as.numeric(x >= 0) * x }
  .neg <- function(x) { -as.numeric(x < 0) * x }
  .norm <- function(x) { sqrt(drop(crossprod(x))) }

  w <- matrix(0, nrow(data), k)
  s <- irlba::irlba(data, k)
  w[, 1] <- sqrt(s$d[1]) * abs(s$u[, 1])

  # for the remaining factors, split each singular vector pair into its
  # positive and negative parts and keep whichever part dominates
  for (i in 2:k) {
    uu <- s$u[, i]
    vv <- s$v[, i]
    uup <- .pos(uu)
    uun <- .neg(uu)
    vvp <- .pos(vv)
    vvn <- .neg(vv)
    n_uup <- .norm(uup)
    n_vvp <- .norm(vvp)
    n_uun <- .norm(uun)
    n_vvn <- .norm(vvn)
    termp <- n_uup * n_vvp
    termn <- n_uun * n_vvn
    if (termp >= termn) {
      w[, i] <- sqrt(s$d[i] * termp) * uup / n_uup
    } else {
      w[, i] <- sqrt(s$d[i] * termn) * uun / n_uun
    }
  }
  w
}

We can compare NNDSVD to normal SVD:

A <- hawaiibirds$counts
m1 <- nndsvd(A, 2)
m2 <- irlba::irlba(A, 2)
df <- data.frame("svd2" = m2$u[, 2], "nndsvd2" = m1[, 2])
ggplot(df, aes(x = svd2, y = nndsvd2)) +
  geom_point() +
  labs(x = "second singular vector", y = "second NNDSVD vector")

We might also derive a much simpler form of NNDSVD which simply sets negative values in $u to zero:

nndsvd2 <- function(data, k) {
  w <- irlba::irlba(data, k)$u
  svd1 <- abs(w[, 1])
  w[w < 0] <- 0
  w[, 1] <- svd1
  w
}

Finally, we could simply initialize with the signed SVD, and let NMF take care of imposing the non-negativity constraints:

w_svd <- function(data, k) {
  irlba::irlba(data, k)$u
}

Random Initializations

We can test different random initializations using runif and rnorm. Hyperparameters to runif are min and max, while hyperparameters to rnorm are mean and sd. In both cases, our matrix must be non-negative.

w_runif <- function(nrow, k, min, max, seed) {
  set.seed(seed)
  matrix(runif(nrow * k, min, max), nrow, k)
}

w_rnorm <- function(nrow, k, mean, sd, seed) {
  set.seed(seed)
  abs(matrix(rnorm(nrow * k, mean, sd), nrow, k))
}

Generate some initial w matrices using these functions:

w1 <- w_runif(nrow(A), 10, 0, 1, 123)
w2 <- w_runif(nrow(A), 10, 1, 2, 123)
w3 <- w_rnorm(nrow(A), 10, 0, 1, 123)
w4 <- w_rnorm(nrow(A), 10, 2, 1, 123)

See how the distributions of these different models differ:
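The figure is omitted here, but the differences are easy to see numerically. A self-contained sketch (fixed 100 × 10 matrices rather than the w1–w4 above, so the snippet runs on its own):

```r
set.seed(123)
draw <- list(
  runif_0_1 = matrix(runif(1000, 0, 1), 100, 10),
  runif_1_2 = matrix(runif(1000, 1, 2), 100, 10),
  rnorm_0_1 = abs(matrix(rnorm(1000, 0, 1), 100, 10)),
  rnorm_2_1 = abs(matrix(rnorm(1000, 2, 1), 100, 10)))
# all four are non-negative, but their means and spreads differ substantially
round(sapply(draw, function(w) c(mean = mean(w), sd = sd(w), min = min(w))), 2)
```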

Evaluating initialization methods

We’ll use Mean Squared Error as a simple evaluation metric. We will compare results across several different datasets, as signal complexity can have a profound effect on recoverable NMF solution minima.
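For reference, MSE for an NMF model is just the mean squared difference between the data and its reconstruction; a minimal sketch (not the exact evaluation code used below — RcppML models also carry a scaling diagonal d, in which case the reconstruction is w %*% diag(d) %*% h):

```r
# mean squared error of a rank-k NMF model (w, h) against data A
mse <- function(A, w, h) mean((A - w %*% h)^2)

# sanity check: a perfect factorization has zero error
set.seed(1)
w <- matrix(runif(20), 10, 2)
h <- matrix(runif(10), 2, 5)
mse(w %*% h, w, h)  # effectively 0
```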

hawaiibirds dataset

First, we’ll look at the hawaiibirds dataset. Since this is a small dataset, we will run 50 replicates of each random initialization to 100 iterations.

results <- eval_initializations(
  hawaiibirds$counts, k = 10, n_reps = 50, tol = 1e-10, maxit = 100)
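eval_initializations is a helper that is not defined in this post. A rough, self-contained sketch of the idea, using a simple Lee–Seung multiplicative-update solver in place of RcppML's much faster one (names, arguments, and the set of initializations here are illustrative, not the actual helper):

```r
# toy multiplicative-update NMF, used only to illustrate the evaluation loop
nmf_mu <- function(A, w, maxit = 100, eps = 1e-9) {
  k <- ncol(w)
  h <- matrix(runif(k * ncol(A)), k, ncol(A))
  for (i in seq_len(maxit)) {
    h <- h * (t(w) %*% A) / (t(w) %*% w %*% h + eps)
    w <- w * (A %*% t(h)) / (w %*% h %*% t(h) + eps)
  }
  list(w = w, h = h, mse = mean((A - w %*% h)^2))
}

# sketch: run NMF from each initialization n_reps times, record final loss
eval_inits_sketch <- function(A, k, n_reps, maxit = 100) {
  inits <- list(
    runif_0_1 = function() matrix(runif(nrow(A) * k, 0, 1), nrow(A), k),
    rnorm_2_1 = function() abs(matrix(rnorm(nrow(A) * k, 2, 1), nrow(A), k)))
  do.call(rbind, lapply(names(inits), function(name) {
    data.frame(init = name, rep = seq_len(n_reps),
               mse = sapply(seq_len(n_reps),
                            function(r) nmf_mu(A, inits[[name]](), maxit)$mse))
  }))
}
```

The real helper additionally tracks runtime and iterations to convergence, and passes tol, maxit, and mask through to RcppML::nmf.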

UMAP plot of all models learned for each initialization:

Clearly, rnorm(mean = 2, sd = 1) has discovered a local minimum that was not discovered by any other initialization method. Strikingly, it has done so while also running faster than the other methods.

movielens dataset

For this dataset, we will mask zeros, because 0’s indicate movies that have not been rated by the corresponding users.

We will stop factorizations at tol = 1e-5 and also track the number of iterations needed to get to that point.

results <- eval_initializations(
  movielens$ratings, k = 7, n_reps = 10, tol = 1e-5, maxit = 1000, mask = "zeros")

UMAP plot of the learned models:

Models here are much more similar, but rnorm still does surprisingly well: it requires surprisingly few iterations and is quite fast. nndsvd is almost on par with this initialization.

aml dataset

results <- eval_initializations(aml, k = 10, n_reps = 25, tol = 1e-5)

and a UMAP plot of the learned models:

Single-cell data

Let’s have a look at the pbmc3k dataset made available in the SeuratData package. This dataset is an example of complex signal with significant dropout and noise.

pbmc3k
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features, 0 variable features)
pbmc <- pbmc3k@assays$RNA@counts
results_pbmc3k <- eval_initializations(pbmc, k = 7, n_reps = 20, tol = 1e-5)

Normalized Single-Cell Data

Log-normalize single cell data and see how these changes in the distribution affect the ideal initialization method:

pbmc_norm <- LogNormalize(pbmc)
results_pbmc_norm <- eval_initializations(pbmc_norm, k = 7, n_reps = 20, tol = 1e-5)

Takeaways so far

Runtime:

  • rnorm and runif are consistently faster than SVD-based initializations. There is no convincing difference between rnorm and runif.

Loss:

  • With multiple starts, rnorm(2, 1) never does worse than any other method, but performs worse on average than runif in single-cell data.
  • nndsvd performs as well as runif in aml and single-cell data, but takes longer. It performs much worse than runif in movielens data, and better than runif in hawaiibirds (though not as well as rnorm).

Iterations:

  • runif does at least as well as, or better than, all other methods.

Spectral decompositions such as nndsvd do not out-perform random initialization-based methods such as rnorm or runif consistently. In addition, they require that an SVD be run, which increases the total runtime.

Optimizing runif

It is possible that changing the bounds of the uniform distribution may affect the results.

We will address whether the width of the bounds matters, and the proximity of the lower-bound to zero. We will look at bounds in the range (0, 1), (0, 2), (0, 10), (1, 2), (1, 10), and (2, 10):
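The candidate bounds are easy to lay out directly (eval_runif itself is a helper not shown in this post); this sketch tabulates each pair to show that both the interval width and its distance from zero are being varied:

```r
# the six (min, max) bound pairs to test for runif initialization
bounds <- list(c(0, 1), c(0, 2), c(0, 10), c(1, 2), c(1, 10), c(2, 10))
tab <- t(sapply(bounds, function(b)
  c(min = b[1], max = b[2], mean = mean(b), width = diff(b))))
tab
```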

results_hibirds <- eval_runif(hawaiibirds$counts, k = 10, n_reps = 20, tol = 1e-6)
results_aml <- eval_runif(aml, k = 12, n_reps = 20)
results_movielens <- eval_runif(movielens$ratings, k = 7, n_reps = 20, mask = "zeros")
results_pbmc <- eval_runif(pbmc, k = 7, n_reps = 20)

These results show no consistent recipe for finding the best minimum; rather, there is considerable dataset-specific variation.

However, it is clear that varying the lower and upper bounds of runif across restarts is likely to be useful.

Optimizing rnorm

Changing the mean and standard deviation of the absolute value of a normal distribution can generate non-normal distributions; in fact, it can generate distributions quite like a gamma distribution. Thus, we will investigate some different combinations of mean and standard deviation: (0, 0.5), (0, 1), (0, 2), (1, 0.5), (1, 1), and (2, 1):
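To see why: after abs(), the shape depends on how much of the normal's mass lies below zero, i.e. on the mean-to-sd ratio. A quick check of the folded fraction (fold_frac is an illustrative helper, not part of the post's code):

```r
set.seed(1)
n <- 1e5
# fraction of draws that abs() folds from negative to positive
fold_frac <- function(mean, sd) mean(rnorm(n, mean, sd) < 0)
fold_frac(2, 1)    # ~0.02: almost nothing folds, so |rnorm(2, 1)| stays bell-shaped
fold_frac(0, 0.5)  # ~0.5: half the mass folds, giving a half-normal (gamma-like) shape
```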

results_hibirds <- eval_rnorm(hawaiibirds$counts, k = 10, n_reps = 20, tol = 1e-6)
results_aml <- eval_rnorm(aml, k = 12, n_reps = 20)
results_movielens <- eval_rnorm(movielens$ratings, k = 7, n_reps = 20, mask = "zeros")
results_pbmc <- eval_rnorm(pbmc, k = 7, n_reps = 20)

Here it’s more difficult to pick a winner; they all perform similarly. For the pbmc3k dataset, however, rnorm(2, 1) is probably the best choice. This distribution is largely normal, as opposed to gamma-like (i.e. rnorm(0, 0.5), which could be seen as the “loser”) or a lopsided bell curve (i.e. rnorm(1, 1)).

Zach DeBruine
Assistant Professor of Bioinformatics

Assistant Professor of Bioinformatics at Grand Valley State University. Interested in single-cell experiments and dimension reduction. I love simple, fast, and common sense machine learning.