# Integrating Heterogenous Samples with NMF

Aligning cancer methylation signatures with healthy cell-of-origin signatures

## NMF for source separation

One of the many applications of NMF is source separation, aka blind signal separation, where a mixture of signals are resolved in a factor model. Different samples will contain different signals, some unique, and some shared. The goal might be to visualize samples based on signals they share, or to identify discriminating signals.

## Integrative NMF?

Integrative NMF (iNMF) has been proposed for source separation and integration of heterogenous datasets (see LIGER). However, iNMF requires a regularization hyperparameter to enforce integration, and fitting is inherently slow.

Instead, we can simply run NMF on all signals and then annotate what factors are specific to metadata of interest.

## Cancer vs. healthy cell signatures

Classification of cancer cell-of-origin is a great example of source separation. Here, the challenge is to tease out signatures shared by cancer and healthy cell types to discover the cell type from which the cancer originated.

We’ll use the `aml`

dataset from the `MLdata`

package:

```
devtools::install_github("zdebruine/MLdata")
devtools::install_RcppML("zdebruine/RcppML")
```

```
library(RcppML)
library(MLdata)
library(ggplot2)
library(cowplot)
library(umap)
data(aml)
```

The `MLdata::aml`

dataset contains samples from 123 patients with Acute Myelogenous Leukemia (AML) and 5 samples each for putative cells of origin (GMP, LMPP, or MEP cells) from healthy patients. Each sample contains information on ~800 differentially methylated regions (DMRs), a measure of gene expression signatures.

`table(colnames(aml))`

```
##
## AML sample GMP LMPP MEP
## 123 5 5 5
```

Since we have three cell types and cancer, we’ll choose a low factorization rank (`k = 5`

). We’ll fit to machine-tolerances and input ten random seeds so that `RcppML::nmf`

runs factorizations from ten unique random initializations, and returns the best model of the ten:

`nmf_model <- RcppML::nmf(aml, k = 5, tol = 1e-10, maxit = 1000, seed = 1:10, verbose = F)`

## Annotating signal sources

We can see which sample types are represented in each NMF factor:

`plot(summary(nmf_model, group_by = colnames(aml), stat = "mean"))`

Notice how factor 3 almost exclusively describes methylation signal in healthy cells.

Let’s plot factor 3 vs. factor 5:

`biplot(nmf_model, factors = c(3, 5), matrix = "h", group_by = colnames(aml))`

Clearly if we want to “integrate” cancer and healthy cells for the purposes of classifying cell-of-origin, we do not want to be including factor 3 in that analysis.

## UMAP on the NMF embedding

Let’s learn a UMAP embedding of all samples on NMF coordinates using the full NMF model.

```
plot_umap <- function(nmf_model){
set.seed(123)
u <- uwot::umap(t(nmf_model$h), n_neighbors = 10, metric = "cosine", min_dist = 0.3, spread = 1)
df <- data.frame("umap1" = u[, 1], "umap2" = u[, 2], "group" = colnames(nmf_model$h))
ggplot(df, aes(x = umap1, y = umap2, color = group)) + geom_point() + theme_void()
}
plot_umap(nmf_model)
```

Clearly there are fundamental differences between cancer and healthy cells.

## Integrating by source separation

Let’s do the same as we did above, but now excluding factor 3:

`plot_umap(nmf_model[-3])`

Bingo! We are able to classify cancer cells based on healthy cell-of-origin!

In conclusion, we were able to integrate cancer and healthy cell methylation signatures by finding factors describing variation they shared in common.