Consensus Clustering for NMF

Runs multiple NMF replicates and computes a consensus matrix showing how frequently pairs of samples co-cluster.

Usage

consensus_nmf(
  data,
  k,
  reps = 50,
  method = c("hard", "knn_jaccard"),
  knn = 10,
  seed = NULL,
  threads = 0,
  verbose = FALSE,
  ...
)

Arguments

data: input matrix (samples x features for clustering samples)
k: rank of factorization
reps: number of replicates (default 50)
method: consensus method: "hard" for hard cluster assignments (default), or "knn_jaccard" for KNN-based Jaccard overlap of factor loadings
knn: number of nearest neighbors to use for KNN Jaccard method (default 10)
seed: random seed for reproducibility
threads: number of threads for OpenMP parallelization (default 0 = all available)
verbose: print progress information (default FALSE)
...: additional arguments passed to nmf

Value

A list with the following elements:

consensus - consensus matrix (samples x samples)
models - list of fitted nmf objects
clusters - final cluster assignments
cophenetic - cophenetic correlation coefficient
method - consensus method used

Details

Consensus clustering runs NMF multiple times with different random initializations.

**Hard clustering method** (method = "hard"): For each run, samples are clustered based on their dominant factor in W. The consensus matrix C[i,j] gives the proportion of runs where samples i and j were assigned to the same cluster. This is the traditional consensus clustering approach.
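The hard consensus computation described above can be sketched in base R as follows. This is an illustration only, not the package's internal code: `hard_consensus` is a hypothetical helper, and it assumes each replicate supplies a loading matrix with samples in rows and the `k` factors in columns.

```r
# Hypothetical helper (not part of the package API): build a co-clustering
# consensus matrix from a list of per-replicate loading matrices.
# Assumption: each W has samples in rows and factors in columns.
hard_consensus <- function(W_list) {
  n <- nrow(W_list[[1]])
  C <- matrix(0, n, n)
  for (W in W_list) {
    # Hard assignment: index of the dominant factor for each sample
    labels <- max.col(W, ties.method = "first")
    # Add 1 wherever samples i and j share a cluster in this replicate
    C <- C + outer(labels, labels, "==")
  }
  # Proportion of replicates in which each pair co-clustered
  C / length(W_list)
}

# Simulated loadings stand in for real NMF fits: 20 replicates, 30 samples, k = 3
set.seed(1)
W_list <- replicate(20, matrix(runif(30 * 3), 30, 3), simplify = FALSE)
C <- hard_consensus(W_list)
```

Matching labels directly within each replicate avoids having to align factor order across replicates, since only co-membership matters.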

**KNN Jaccard method** (method = "knn_jaccard"): For each run, the k-nearest neighbors of each sample are computed based on factor loadings (W matrix). The consensus matrix C[i,j] is the average Jaccard similarity between the KNN sets of samples i and j across all replicates. This approach is more robust to ambiguous cluster assignments and captures neighborhood structure rather than hard cluster membership.
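A minimal sketch of the KNN Jaccard idea, under stated assumptions: the helpers `knn_sets`, `jaccard_matrix`, and `knn_jaccard_consensus` are hypothetical names, neighbors are found by Euclidean distance on the rows of W, and a sample is excluded from its own neighbor set. The package may make different choices on each of these points.

```r
# Hypothetical helper: indices of the knn nearest neighbors of each sample.
# Assumptions: W is samples x factors; Euclidean distance; self excluded.
knn_sets <- function(W, knn) {
  D <- as.matrix(dist(W))
  diag(D) <- Inf
  lapply(seq_len(nrow(D)), function(i) order(D[i, ])[seq_len(knn)])
}

# Jaccard similarity |A and B| / |A or B| between neighbor sets, for all pairs
jaccard_matrix <- function(sets) {
  n <- length(sets)
  J <- diag(n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      inter <- length(intersect(sets[[i]], sets[[j]]))
      union_size <- length(sets[[i]]) + length(sets[[j]]) - inter
      J[i, j] <- J[j, i] <- inter / union_size
    }
  }
  J
}

# Average the per-replicate Jaccard matrices to obtain the consensus
knn_jaccard_consensus <- function(W_list, knn = 10) {
  mats <- lapply(W_list, function(W) jaccard_matrix(knn_sets(W, knn)))
  Reduce(`+`, mats) / length(mats)
}

# Simulated loadings stand in for real NMF fits
set.seed(1)
W_list <- replicate(10, matrix(runif(30 * 3), 30, 3), simplify = FALSE)
C_knn <- knn_jaccard_consensus(W_list, knn = 5)
```

Because each entry is an average of overlaps rather than a 0/1 indicator, samples near a cluster boundary contribute graded values instead of flipping between clusters across replicates.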

High consensus values (near 1) indicate stable co-clustering or neighborhood overlap, and values near 0 indicate samples that are consistently kept apart. Intermediate values suggest ambiguous relationships.

The cophenetic correlation coefficient measures cluster stability: values closer to 1 indicate more stable, reproducible clustering.
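One common way to obtain a cophenetic correlation from a consensus matrix is sketched below with base R's `hclust` and `cophenetic`. The helper name `cophenetic_coef` is hypothetical, and the choices of `1 - C` as the dissimilarity and of average linkage are assumptions; the package's own computation may differ.

```r
# Hypothetical helper: cophenetic correlation of a consensus matrix C.
# Assumptions: dissimilarity is 1 - C, and average linkage is used.
cophenetic_coef <- function(C) {
  d <- as.dist(1 - C)                    # consensus -> dissimilarity
  hc <- hclust(d, method = "average")    # hierarchical clustering
  # Correlate original dissimilarities with dendrogram (cophenetic) distances
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

# A perfectly stable consensus: three clean blocks of four samples each
labels <- rep(1:3, each = 4)
C_stable <- outer(labels, labels, "==") * 1

# A noisier consensus: blend the blocks with symmetric random values
set.seed(1)
noise <- matrix(runif(144), 12, 12)
noise <- (noise + t(noise)) / 2
C_noisy <- 0.5 * C_stable + 0.5 * noise
diag(C_noisy) <- 1

coph_stable <- cophenetic_coef(C_stable)
coph_noisy <- cophenetic_coef(C_noisy)
```

In this toy setup the clean block matrix is reproduced exactly by the dendrogram, so its coefficient is at the upper limit, while the blended matrix scores lower.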

Examples

# \donttest{
library(Matrix)
A <- rsparsematrix(100, 50, 0.3)

# Traditional hard clustering consensus
cons_hard <- consensus_nmf(A, k = 5, reps = 10, method = "hard", seed = 123)

# KNN Jaccard consensus (more robust)
cons_knn <- consensus_nmf(A, k = 5, reps = 10, method = "knn_jaccard", knn = 15, seed = 123)

# Plot consensus heatmaps
plot(cons_hard)
plot(cons_knn)

# Check cophenetic coefficient (higher = more stable)
print(cons_hard$cophenetic)
#> [1] 0.8231326
print(cons_knn$cophenetic)
#> [1] 0.8317164

# Get cluster assignments
print(table(cons_hard$clusters))
#> 
#>  1  2  3  4  5 
#> 18 19 30 17 16 
# }
