Guided Clustering with NMF

This vignette complements the Seurat Guided Clustering tutorial, but using NMF and singlet instead.

Set up the Seurat object

This vignette introduces guided clustering and basic gene set enrichment analysis using singlet and Seurat. It uses the SeuratData::pbmc3k dataset and overlaps with the Seurat introductory tutorial on [guided clustering] (satijalab.org/seurat/articles/pbmc3k_tutorial.html) using the SeuratData::pbmc3k dataset.

library(singlet)
library(Seurat)
library(dplyr)
library(ggplot2)
pbmc3k <- singlet::get_pbmc3k_data()

Run NMF

NMF can be run on all features using normalized counts. Here we apply standard log-normalization, which works very well, but any form of approximate variance stabilizing transformation is suitable for helping NMF find meaningful solutions. Raw counts are not suitable for NMF because the model pays too much attention to features with very high counts.

RunNMF will automatically run cross-validation on an array of ranks that you provide, identify the best rank, and learn a model at that rank.

Cross-validation can take some time. For large datasets, it is useful to start with a coarse-grained scan of a wide range of ranks and then focus on a narrow window of ranks with several replicates. Remember that the optimal number of NMF factors in your analysis may be significantly greater than the number of PCA components that might be used. This is a great step to run on a High Performance Computing node.

First off, basic quality control:

pbmc3k[["percent.mt"]] <- PercentageFeatureSet(pbmc3k, pattern = "^MT-")
pbmc3k <- subset(pbmc3k, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Now run NMF cross-validation and learn the model at the (automatically determined) best rank:

set.seed(123)
pbmc3k <- NormalizeData(pbmc3k) %>%
  singlet::RunNMF()

Plot the cross-validation results:

RankPlot(pbmc3k)

Remember to set the seed, as above, to guarantee reproducibility of your NMF model.

Visualization

Plot the representation of various metadata (i.e. Seurat-annotated cell types) in each factor:

MetadataPlot(pbmc3k, "cell_type", reduction = "nmf")

Here we simply do what the Seurat Guided Clustering tutorial does, but for NMF:

VizDimLoadings(pbmc3k, dims = 1:2, reduction = "nmf")

Gene Set Enrichment Analysis

singlet makes GSEA with fgsea and msigdbr easy. Simply, the weights in the NMF “w” matrix are used as rankings for terms in the enrichment analysis.

pbmc3k <- RunGSEA(pbmc3k, category = "C7", verbose = FALSE)

See how GSEA results are stored in the NMF model:

## Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
##   ..@ cell.embeddings           : num [1:2638, 1:15] 0.000484 0 0.000691 0 0 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2638] "AAACATACAACCAC" "AAACATTGAGCTAC" "AAACATTGATCAGC" "AAACCGTGCTTCCG" ...
##   .. .. ..$ : chr [1:15] "NMF_1" "NMF_2" "NMF_3" "NMF_4" ...
##   ..@ feature.loadings          : num [1:13714, 1:15] 0 0 0 0 0 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:13714] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9" ...
##   .. .. ..$ : chr [1:15] "NMF_1" "NMF_2" "NMF_3" "NMF_4" ...
##   ..@ feature.loadings.projected: num[0 , 0 ] 
##   ..@ assay.used                : chr "RNA"
##   ..@ global                    : logi(0) 
##   ..@ stdev                     : num [1:15] 541314 413514 361714 342022 307180 ...
##   ..@ key                       : chr "NMF_"
##   ..@ jackstraw                 :Formal class 'JackStrawData' [package "SeuratObject"] with 4 slots
##   .. .. ..@ empirical.p.values     : num[0 , 0 ] 
##   .. .. ..@ fake.reduction.scores  : num[0 , 0 ] 
##   .. .. ..@ empirical.p.values.full: num[0 , 0 ] 
##   .. .. ..@ overall.p.values       : num[0 , 0 ] 
##   ..@ misc                      :List of 2
##   .. ..$ cv_data:Classes 'cross_validate_nmf_data' and 'data.frame': 22 obs. of  3 variables:
##   .. .. ..$ k         : int [1:22] 2 4 8 10 12 13 14 15 16 24 ...
##   .. .. ..$ rep       : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##   .. .. ..$ test_error: num [1:22] 0.136 0.133 0.131 0.131 0.131 ...
##   .. ..$ gsea   :List of 4
##   .. .. ..$ pval: num [1:5084, 1:15] 0.00921 0.35403 0.42873 0.0013 0.05104 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:5084] "GSE41867_NAIVE_VS_DAY8_LCMV_CLONE13_EFFECTOR_CD8_TCELL_DN" "GSE27786_CD4_TCELL_VS_NKTCELL_DN" "GSE27786_LSK_VS_NKTCELL_UP" "GSE14769_UNSTIM_VS_240MIN_LPS_BMDM_DN" ...
##   .. .. .. .. ..$ : chr [1:15] "nmf2" "nmf1" "nmf7" "nmf4" ...
##   .. .. ..$ padj: num [1:5084, 1:15] 0 0.00652 0.04927 0 0 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:5084] "GSE41867_NAIVE_VS_DAY8_LCMV_CLONE13_EFFECTOR_CD8_TCELL_DN" "GSE27786_CD4_TCELL_VS_NKTCELL_DN" "GSE27786_LSK_VS_NKTCELL_UP" "GSE14769_UNSTIM_VS_240MIN_LPS_BMDM_DN" ...
##   .. .. .. .. ..$ : chr [1:15] "nmf2" "nmf1" "nmf7" "nmf4" ...
##   .. .. ..$ es  : num [1:5084, 1:15] 0.626 0.691 0.698 0.602 0.648 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:5084] "GSE41867_NAIVE_VS_DAY8_LCMV_CLONE13_EFFECTOR_CD8_TCELL_DN" "GSE27786_CD4_TCELL_VS_NKTCELL_DN" "GSE27786_LSK_VS_NKTCELL_UP" "GSE14769_UNSTIM_VS_240MIN_LPS_BMDM_DN" ...
##   .. .. .. .. ..$ : chr [1:15] "nmf2" "nmf1" "nmf7" "nmf4" ...
##   .. .. ..$ nes : num [1:5084, 1:15] 0.908 1.007 1.015 0.878 0.939 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:5084] "GSE41867_NAIVE_VS_DAY8_LCMV_CLONE13_EFFECTOR_CD8_TCELL_DN" "GSE27786_CD4_TCELL_VS_NKTCELL_DN" "GSE27786_LSK_VS_NKTCELL_UP" "GSE14769_UNSTIM_VS_240MIN_LPS_BMDM_DN" ...
##   .. .. .. .. ..$ : chr [1:15] "nmf2" "nmf1" "nmf7" "nmf4" ...

singlet provides a helper function for quick exploration of GSEA results:

GSEAHeatmap(pbmc3k, reduction = "nmf", max.terms.per.factor = 3)

Cell Clustering

The pbmc3k dataset ships with clusters that have been determined by graph-based clustering on a PCA embedding, followed by annotation based on cluster marker genes.

We will determine clusters by graph-based clustering on an NMF embedding, and then compare them to the PCA-guided clustering.

pbmc3k <- FindNeighbors(pbmc3k, dims = 1:ncol(pbmc3k@reductions$nmf), reduction = "nmf") %>%
  FindClusters(resolution = 0.5, verbose = FALSE) %>%
  RunUMAP(reduction = "nmf", dims = 1:ncol(pbmc3k@reductions$nmf), verbose = FALSE)

Because NMF factors are additive signals, we can also visualize their representation on UMAP coordinates:

FeaturePlot(pbmc3k, features = paste0("NMF_", 1:6))

Compare the composition of NMF clusters to Seurat PCA-guided clustering:

df <- data.frame(
  "nmf_clusters" = pbmc3k@meta.data$seurat_clusters,
  "pca_clusters" = pbmc3k@meta.data$cell_type)

df <- df[!is.na(df$pca_clusters), ]

df <- df %>% 
  group_by(nmf_clusters) %>% 
  count(pca_clusters) %>% 
  mutate(freq = n / sum(n))

ggplot(df, aes(nmf_clusters, pca_clusters, size = freq, color = n)) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "NMF cluster", 
       y = "PCA cluster", 
       size = "proportion\nof cluster", 
       color = "cells in\nNMF cluster") + 
  scale_color_viridis_c(option = "D")

Since there is significant correspondence between PCA- and NMF-guided clusters, we can just transfer the labels based on majority overlap:

cluster_names <- df %>% 
  slice(which.max(n)) %>% 
  pull(pca_clusters)

levels(pbmc3k@meta.data$seurat_clusters) <- make.unique(as.vector(cluster_names))

DimPlot(pbmc3k, 
        reduction = "umap", 
        label = TRUE, 
        group.by = "seurat_clusters", 
        pt.size = 0.5) + NoLegend()

Compared to the PCA clustering (see the Seurat vignette), the NMF clustering more completely resolves NK cells and better resolves CD8 T-cells from Memory CD4 T-cells, but does not identify a small cluster of Dendritic Cells annotated in the Seurat vignette.

These results show how PCA and NMF can be used to achieve very similar results for cell clustering, while NMF models are interpretable and capture more information and thus better cluster resolution.

Zach DeBruine

2022-09-09

Set up the Seurat object

Run NMF

Visualization

Gene Set Enrichment Analysis

Cell Clustering