Search over the number of topics K — search

Fits the model across a range of K and reports diagnostics for choosing K: held-out likelihood (document completion), semantic coherence, exclusivity, and the variational bound. Unlike stm::searchK, the per-K fits run in parallel across K (a long-standing request, bstewart/stm#262), and each fit is itself fast (Rust), so a sweep that took minutes can take seconds.

Usage

search_k(
  corpus,
  K,
  prevalence = NULL,
  content = NULL,
  heldout = TRUE,
  proportion = 0.5,
  residuals = FALSE,
  cores = 1L,
  M = 10L,
  seed = 1L,
  measure = c("mimno", "npmi", "c_v"),
  verbose = FALSE,
  ...
)

Arguments

corpus: A faSTM_corpus (from as_corpus()).
K: Integer vector of topic counts to try.
prevalence, content: Optional covariate formulas (see stm()).
heldout: Logical; compute held-out likelihood via document completion.
proportion: Held-out token fraction (passed to make_heldout()).
cores: Number of K-fits to run in parallel (forked; 1 = sequential). When cores > 1 each fit runs single-threaded to avoid oversubscription; when cores == 1 each fit uses all cores.
M: Number of top words per topic used for coherence/exclusivity.
seed: RNG seed (held-out split + fits).
...: Passed to stm() (e.g. max.em.its, init.type).
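
The threading rule described under cores can be sketched in base R as below. This is only an illustration of the stated policy, not the package's code: threads_per_fit() and its available argument are hypothetical names introduced here.

```r
# Hypothetical helper (not part of the package): how many threads each
# individual fit would get under the policy described for `cores`.
threads_per_fit <- function(cores, available = parallel::detectCores()) {
  # detectCores() may return NA on some platforms; fall back to one thread.
  if (is.na(available)) available <- 1L
  # Several fits in parallel: keep each fit single-threaded so the total
  # thread count does not exceed the machine. One fit at a time: use all cores.
  if (cores > 1L) 1L else as.integer(available)
}

threads_per_fit(cores = 4L, available = 8L)  # 1
threads_per_fit(cores = 1L, available = 8L)  # 8
```

Under this rule, raising cores helps most when K contains several values; for a single K, cores = 1 leaves all threads to that one fit.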

Value

A faSTM_searchk object wrapping a tidy data.frame, results, with one row per K and the columns K, heldout, semcoh, exclusivity, and bound.
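
As a rough sketch of how such a table might be read, the base R snippet below builds a stand-in data.frame with the documented columns and picks the K with the highest held-out likelihood. The numbers are invented purely for illustration, and treating heldout as a log-likelihood where higher is better is an assumption; how the data.frame is extracted from the faSTM_searchk object is not shown here.

```r
# Stand-in for the `results` data.frame; every value below is made up.
results <- data.frame(
  K           = c(5L, 10L, 20L, 40L),
  heldout     = c(-7.90, -7.62, -7.48, -7.55),
  semcoh      = c(-60.1, -68.4, -79.3, -95.0),
  exclusivity = c(8.9, 9.3, 9.6, 9.7),
  bound       = c(-5.10e5, -4.95e5, -4.83e5, -4.78e5)
)

# One simple heuristic (an assumption, not a package recommendation):
# take the K whose held-out likelihood is highest.
best_heldout <- results$K[which.max(results$heldout)]

# Sort candidates by held-out likelihood to compare the other diagnostics.
ranked <- results[order(-results$heldout),
                  c("K", "heldout", "semcoh", "exclusivity")]
print(ranked)
```

Held-out likelihood alone rarely settles the choice; in the toy values above, coherence falls as exclusivity rises, which is the kind of trade-off the other columns are there to expose.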