Recursive bipartitioning by rank-2 matrix factorization with an efficient modularity-approximate stopping criteria
Arguments
- A
matrix of features-by-samples in sparse format (preferred class is "Matrix::dgCMatrix")
- min_samples
stopping criteria giving the minimum number of samples permitted in a cluster
- min_dist
stopping criteria giving the minimum cosine distance of samples within a cluster to the center of their assigned vs. unassigned cluster. If
0
, neither this distance nor cluster centroids will be calculated.- tol
in rank-2 NMF, the correlation distance (\(1 - R^2\)) between \(w\) across consecutive iterations at which to stop factorization
- maxit
maximum number of fitting iterations
- nonneg
in rank-2 NMF, enforce non-negativity
- seed
random seed for rank-2 NMF model initialization
Value
A list of lists corresponding to individual clusters:
id : character sequence of "0" and "1" giving position of clusters along splitting hierarchy
samples : indices of samples in the cluster
center : mean feature expression of all samples in the cluster
dist : if applicable, relative cosine distance of samples in cluster to assigned/unassigned cluster center.
leaf : is cluster a leaf node
Details
Divisive clustering is a sensitive and fast method for sample classification. Samples are recursively partitioned into two groups until a stopping criteria is satisfied and prevents successful partitioning.
See nmf
and bipartition
for technical considerations and optimizations relevant to bipartitioning.
Stopping criteria. Two stopping criteria are used to prevent indefinite division of clusters and tune the clustering resolution to a desirable range:
min_samples
: Minimum number of samples permitted in a clustermin_dist
: Minimum cosine distance of samples to their cluster center relative to their unassigned cluster center (an approximation of Newman-Girvan modularity)
Newman-Girvan modularity (\(Q\)) is an interpretable and widely used measure of modularity for a bipartition. However, it requires the calculation of distance between all within-cluster and between-cluster sample pairs. This is computationally intensive, especially for large sample sets.
dclust
uses a measure which linearly approximates Newman-Girvan modularity, and simply requires the calculation of distance between all samples in a cluster and both cluster centers (the assigned and unassigned center), which is orders of magnitude faster to compute. Cosine distance is used instead of Euclidean distance since it handles outliers and sparsity well.
A bipartition is rejected if either of the two clusters contains fewer than min_samples
or if the mean relative cosine distance of the bipartition is less than min_dist
.
A bipartition will only be attempted if there are more than 2 * min_samples
samples in the cluster, meaning that dist
may not be calculated for some clusters.
Reproducibility. Because rank-2 NMF is approximate and requires random initialization, results may vary slightly across restarts. Therefore, specify a seed
to guarantee absolute reproducibility.
Other than setting the seed, reproducibility may be improved by setting tol
to a smaller number to increase the exactness of each bipartition.
References
Schwartz, G. et al. "TooManyCells identifies and visualizes relationships of single-cell clades". Nature Methods (2020).
Newman, MEJ. "Modularity and community structure in networks". PNAS (2006)
Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.