Spectral biparitioning by rank-2 matrix factorization
Arguments
- data
dense or sparse matrix of features in rows and samples in columns. Prefer
matrix
orMatrix::dgCMatrix
, respectively- tol
tolerance of the fit
- nonneg
enforce non-negativity of the rank-2 factorization used for bipartitioning
- ...
development parameters
Value
A list giving the bipartition and useful statistics:
v : vector giving difference between sample loadings between factors in a rank-2 factorization
dist : relative cosine distance of samples within a cluster to centroids of assigned vs. not-assigned cluster
size1 : number of samples in first cluster (positive loadings in 'v')
size2 : number of samples in second cluster (negative loadings in 'v')
samples1: indices of samples in first cluster
samples2: indices of samples in second cluster
center1 : mean feature loadings across samples in first cluster
center2 : mean feature loadings across samples in second cluster
Details
Spectral bipartitioning is a popular subroutine in divisive clustering. The sign of the difference between sample loadings in factors of a rank-2 matrix factorization gives a bipartition that is nearly identical to an SVD.
Rank-2 matrix factorization by alternating least squares is faster than rank-2-truncated SVD (i.e. irlba).
This function is a specialization of rank-2 nmf
with support for factorization of only a subset of samples, and with additional calculations on the factorization model relevant to bipartitioning. See nmf
for details regarding rank-2 factorization.
Advanced Parameters
Several parameters may be specified in the ...
argument:
diag = TRUE
: scale factors in \(w\) and \(h\) to sum to 1 by introducing a diagonal, \(d\). This should generally never be set toFALSE
. Diagonalization enables symmetry of models in factorization of symmetric matrices, convex L1 regularization, and consistent factor scalings.samples = 1:ncol(A)
: samples to include in bipartition, numbered from 1 toncol(A)
. Default is all samples.calc_dist = TRUE
: calculate the relative cosine distance of samples within a cluster to either cluster centroid. IfTRUE
, centers for clusters will also be calculated.seed = NULL
: random seed for model initialization, generally not needed for rank-2 factorizations because robust solutions are recovered whendiag = TRUE
maxit = 100
: maximum number of alternating updates of \(w\) and \(h\). Generally, rank-2 factorizations converge quickly and this should not need to be adjusted.