# Learning Optimal NMF Models from Random Restarts

Initializing NMF with NNDSVD, random uniform, or random gaussian models

## NMF Initialization

Non-negative matrix factorization (NMF) is NP-hard (Vavasis, 2007). As such, the best that NMF can do, in practice, is find the best discoverable local minima from some set of initializations.

Non-negative Double SVD (NNDSVD) has previously been proposed as a “head-start” for NMF (Boutsidis, 2008). However, SVD and NMF are usually nothing alike, as SVD factors are sequentially interdependent while NMF factors are colinearly interdependent. Thus, whether “non-negative” SVD is useful remains unclear.

Random initializations are the most popular and promising method for NMF initialization. It is generally useful to attempt many random initializations to discover the best possible solution.

In this post I explore a number of initializations on the `hawaiibirds`

, `aml`

, and `movielens`

datasets, and a small single-cell dataset.

## Takeaways

- SVD-based initializations (such as NNDSVD) are slower than random initializations, sometimes do worse, and are never better.
- Multiple random initializations are useful for recovering the best discoverable NMF solution.
- Normal random distributions (i.e.
`rnorm(mean = 2, sd = 1)`

) slightly outperform uniform random distributions (i.e.`runif(min = 1, max = 2)`

) at finding the best NMF solution.

## Non-negative Double SVD

The following is an implementation of NNDSVD, adapted from the NMF package. In this function, the use of `irlba`

is a key performance improvement, and we do not do any form of zero-filling as I have found that this does not affect the outcome of RcppML NMF:

```
nndsvd <- function(data, k) {
.pos <- function(x) { as.numeric(x >= 0) * x }
.neg <- function(x) {-as.numeric(x < 0) * x }
.norm <- function(x) { sqrt(drop(crossprod(x))) }
w = matrix(0, nrow(data), k)
s = irlba::irlba(data, k)
w[, 1] = sqrt(s$d[1]) * abs(s$u[, 1])
# second SVD for the other factors
for (i in 2:k) {
uu = s$u[, i]
vv = s$v[, i]
uup = .pos(uu)
uun = .neg(uu)
vvp = .pos(vv)
vvn = .neg(vv)
n_uup = .norm(uup)
n_vvp = .norm(vvp)
n_uun = .norm(uun)
n_vvn = .norm(vvn)
termp = as.double(n_uup %*% n_vvp)
termn = as.double(n_uun %*% n_vvn)
if (termp >= termn) {
w[, i] = (s$d[i] * termp)^0.5 * uup / n_uup
} else {
w[, i] = (s$d[i] * termn)^0.5 * uun / n_uun
}
}
w
}
```

We can compare NNDSVD to normal SVD:

```
library(irlba)
library(RcppML)
library(ggplot2)
data(hawaiibirds)
A <- hawaiibirds$counts
m1 <- nndsvd(A, 2)
m2 <- irlba(A, 2)
df <- data.frame("svd2" = m2$u[,2], "nndsvd2" = m1[,2])
ggplot(df, aes(x = svd2, y = nndsvd2)) +
geom_point() +
labs(x = "second singular vector", y = "second NNDSVD vector") +
theme_classic()
```

We might also derive a much simpler form of NNDSVD which simply sets negative values in $u to zero:

```
nndsvd2 <- function(data, k){
w <- irlba(data, k)$u
svd1 <- abs(w[,1])
w[w < 0] <- 0
w[,1] <- svd1
w
}
```

Finally, we could simply initialize with the signed SVD, and let NMF take care of imposing the non-negativity constraints:

```
w_svd <- function(data, k){
irlba(data, k)$u
}
```

## Random Initializations

We can test different random initializations using `runif`

and `rnorm`

. Hyperparameters to `runif`

are `min`

and `max`

, while hyperparameters to `rnorm`

are `mean`

and `sd`

. In both cases, our matrix must be non-negative.

```
w_runif <- function(nrow, k, min, max, seed){
set.seed(seed)
matrix(runif(nrow * k, min, max), nrow, k)
}
w_rnorm <- function(nrow, k, mean, sd, seed){
set.seed(seed)
abs(matrix(rnorm(nrow * k, mean, sd), nrow, k))
}
```

Generate some initial `w`

matrices using these functions:

```
library(cowplot)
w1 <- w_runif(nrow(A), 10, 0, 1, 123)
w2 <- w_runif(nrow(A), 10, 1, 2, 123)
w3 <- w_rnorm(nrow(A), 10, 0, 1, 123)
w4 <- w_rnorm(nrow(A), 10, 2, 1, 123)
```

See how the distributions of these different models differ:

## Evaluating initialization methods

We’ll use Mean Squared Error as a simple evaluation metric. We will compare results across several different datasets, as signal complexity can have a profound effect on recoverable NMF solution minima.

`hawaiibirds`

dataset

First, we’ll look at the hawaii birds dataset. Since this is a small dataset, we will run 50 replicates of each random initialization to 100 iterations.

```
data(hawaiibirds)
results <- eval_initializations(
hawaiibirds$counts, k = 10, n_reps = 50, tol = 1e-10, maxit = 100)
```

UMAP plot of all models learned for each initialization:

Clearly, `rnorm(mean = 2, sd = 1)`

has discovered a local minima that was not discovered by any other initialization method. Strikingly, it has done so while running faster than other methods.

`movielens`

dataset

For this dataset, we will mask zeros, because 0’s indicate movies that have not been rated by the corresponding users.

We will stop factorizations at `tol = 1e-5`

and also track the number of iterations needed to get to that point.

```
data(movielens)
results <- eval_initializations(
movielens$ratings, k = 7, n_reps = 10, tol = 1e-5, maxit = 1000, mask = "zeros")
```

UMAP plot of the learned models:

Models here are much more similar, but `rnorm`

still does surprisingly well, requires surprisingly few iterations, and is quite fast. Almost entirely on-par with this initialization is `nndsvd`

.

`aml`

dataset

```
data(aml)
results <- eval_initializations(aml, k = 10, n_reps = 25, tol = 1e-5)
```

and a UMAP plot of the learned models:

### Single-cell data

Let’s have a look at the pbmc3k dataset made available in the `SeuratData`

package. This dataset is an example of complex signal with significant dropout and noise.

```
library(Seurat)
library(SeuratData)
pbmc3k
```

```
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
```

```
pbmc <- pbmc3k@assays$RNA@counts
results_pbmc3k <- eval_initializations(pbmc, k = 7, n_reps = 20, tol = 1e-5)
```

### Normalized Single-Cell Data

Log-normalize single cell data and see how these changes in the distribution affect the ideal initialization method:

```
pbmc_norm <- LogNormalize(pbmc)
results_pbmc_norm <- eval_initializations(pbmc_norm, k = 7, n_reps = 20, tol = 1e-5)
```

## Takeaways so far

**Runtime:**
* `rnorm`

and `runif`

. Consistently faster than SVD-based initializations. There is no convincing difference between `rnorm`

and `runif`

.

**Loss:**
* with multiple starts, `rnorm(2, 1)`

never does worse than any other method, but performs worse on average than `runif`

in single-cell data.
* `nndsvd`

performs as well as `runif`

in `aml`

and single-cell data, but takes longer. It performs worse than `runif`

in `movielens`

data (by a lot), and better than `runif`

in hawaiibirds (but not as well as `rnorm`

)

**Iterations:**
* `runif`

does at least as well as, or better than, all other methods.

Spectral decompositions such as `nndsvd`

do not out-perform random initialization-based methods such as `rnorm`

or `runif`

consistently. In addition, they require that an SVD be run, which increases the total runtime.

## Optimizing runif

It is possible that changing the bounds of the uniform distribution may affect the results.

We will address whether the width of the bounds matters, and the proximity of the lower-bound to zero. We will look at bounds in the range (0, 1), (0, 2), (0, 10), (1, 2), (1, 10), and (2, 10):

```
results_hibirds <- eval_runif(hawaiibirds$counts, k = 10, n_reps = 20, tol = 1e-6)
results_aml <- eval_runif(aml, k = 12, n_reps = 20)
results_movielens <- eval_runif(movielens$ratings, k = 7, n_reps = 20, mask = "zeros")
results_pbmc <- eval_runif(pbmc, k = 7, n_reps = 20)
```

These results show no consistent recipe for finding the best minima, but that there is considerable dataset-specific variation.

However, it is clear that varying the lower and upper bounds of `runif`

across restarts is likely to be useful.

## Optimizing rnorm

Changing the mean and standard deviation of the absolute value of a normal distribution can generate non-normal distributions, in fact, it can generate distributions quite like a gamma distribution. Thus, we will investigate some different combinations of mean and standard deviation: (0, 0.5), (0, 1), (0, 2), (1, 0.5), (1, 1), and (2, 1):

```
results_hibirds <- eval_rnorm(hawaiibirds$counts, k = 10, n_reps = 20, tol = 1e-6)
results_aml <- eval_rnorm(aml, k = 12, n_reps = 20)
results_movielens <- eval_rnorm(movielens$ratings, k = 7, n_reps = 20, mask = "zeros")
results_pbmc <- eval_rnorm(pbmc, k = 7, n_reps = 20)
```

Here it’s more difficult to pick a winner, they really perform similarly. For the `pbmc3k`

dataset, however, `rnorm(2,1)`

is probably the best choice. This distribution is largely normal, as opposed to gamma (i.e. `rnorm(0, 0.5)`

, which could be seen as the “loser”) or a lopsided bell-curve shaped (i.e. `rnorm(1, 1)).