Score-test distribution diagnostic

Given a baseline MSE-fitted NMF model and the original data, computes score-test statistics for the power-variance family ($V(\mu) = \mu^p$) to determine the best-fitting distribution without refitting. Optionally tests for NB overdispersion.

Usage

score_test_distribution(
  data,
  model,
  powers = c(0, 1, 2, 3),
  test_nb = TRUE,
  min_mu = 1e-06
)

Arguments

data: Original data matrix (sparse or dense)
model: A fitted NMF object (from nmf() with loss="mse")
powers: Numeric vector of variance powers to test. Default c(0, 1, 2, 3) covers Gaussian, Poisson, Gamma, Inverse Gaussian.
test_nb: Logical; if TRUE and data is integer-valued, also test the NB overdispersion diagnostic.
min_mu: Floor for predicted values to avoid division by zero. Default 1e-6.

Value

A list with:

scores: Data frame with columns power, T_stat, abs_T, distribution (label)
best_power: Numeric: the power p with smallest |T_p|
best_distribution: Character: name of the best-matching distribution
nb_diagnostic: If test_nb=TRUE: list with T_NB and overdispersed (logical)

Details

The score test statistic for variance power $p$ is: $$T_p = \text{mean}\left(\frac{r_{ij}^2}{\mu_{ij}^p} - 1\right)$$ where $r_{ij} = x_{ij} - \mu_{ij}$ are residuals and $\mu_{ij} = (WH)_{ij}$ are predicted values.

Under the correct model, $E[T_p] = 0$. The power minimizing $|T_p|$ best matches the observed variance-mean relationship.

The NB diagnostic tests for quadratic overdispersion: $$T_{NB} = \text{mean}\left(\frac{r_{ij}^2 - \mu_{ij}}{\mu_{ij}^2}\right)$$ If $T_{NB} > 0.1$, there is substantial overdispersion beyond Poisson, suggesting NB may be preferable to GP.

Examples

# \donttest{
A <- abs(rsparsematrix(200, 100, 0.3))
model <- nmf(A, k = 5, loss = "mse")
diag <- score_test_distribution(A, model)
print(diag$scores)
#>   power        T_stat        abs_T     distribution
#> 1     0 -4.365022e-01 4.365022e-01         gaussian
#> 2     1  5.237193e+02 5.237193e+02               gp
#> 3     2  5.222982e+08 5.222982e+08            gamma
#> 4     3  5.222981e+14 5.222981e+14 inverse_gaussian
cat("Best distribution:", diag$best_distribution, "\n")
#> Best distribution: gaussian 
# }

Usage

Arguments

Value

Details

See also

Examples