Fits the model at each value in a range of K and reports diagnostics for choosing K: held-out likelihood (document completion), semantic coherence, exclusivity, and the variational bound. Unlike stm::searchK, the per-K fits can run in parallel across K (a long-standing request, bstewart/stm#262), and each fit is itself fast (implemented in Rust), so a sweep that took minutes can take seconds.

Usage

search_k(
  corpus,
  K,
  prevalence = NULL,
  content = NULL,
  heldout = TRUE,
  proportion = 0.5,
  residuals = FALSE,
  cores = 1L,
  M = 10L,
  seed = 1L,
  measure = c("mimno", "npmi", "c_v"),
  verbose = FALSE,
  ...
)
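A minimal call might look like the sketch below. It assumes the package is attached under the name faSTM (inferred from the class names on this page) and that `corpus` is a faSTM_corpus already built with as_corpus(); the K values and core count are illustrative only.

```r
# Assumption: the package name is faSTM, and `corpus` already exists as a
# faSTM_corpus created with as_corpus().
library(faSTM)

sweep <- search_k(
  corpus,
  K = c(5L, 10L, 20L),  # candidate topic counts (illustrative values)
  heldout = TRUE,       # score each K by document completion
  cores = 3L,           # one forked worker per K
  seed = 1L
)
```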

Arguments

corpus

A faSTM_corpus (from as_corpus()).

K

Integer vector of topic counts to try.

prevalence, content

Optional covariate formulas (see stm()).

heldout

Logical; compute held-out likelihood via document completion.

proportion

Held-out token fraction (passed to make_heldout()).

cores

Number of K-fits to run in parallel (forked; 1 = sequential). When cores > 1, each fit runs single-threaded to avoid oversubscription; when cores == 1, the single running fit uses all available CPU cores.
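The threading rule above can be illustrated with base R's parallel package. This is only a sketch of the scheduling pattern, not the package's implementation: `fit_one()` and `sweep_k()` are hypothetical stand-ins, and `mclapply()` relies on forking, so `mc.cores > 1` is not available on Windows.

```r
library(parallel)

# Hypothetical stand-in for one per-K fit; it only records its settings.
fit_one <- function(k, threads) {
  list(K = k, threads = threads)
}

# Hypothetical driver showing the rule: many workers -> one thread each,
# a single worker -> every available core.
sweep_k <- function(K, cores = 1L) {
  threads <- if (cores > 1L) 1L else max(1L, detectCores(), na.rm = TRUE)
  if (cores > 1L) {
    # Forked workers, one K per task.
    mclapply(K, fit_one, threads = threads, mc.cores = cores)
  } else {
    # Sequential sweep.
    lapply(K, fit_one, threads = threads)
  }
}

fits <- sweep_k(c(5L, 10L, 20L), cores = 2L)
```

Keeping the product of workers and threads per worker close to the number of physical cores is the point of the rule; running many multi-threaded fits at once would have them compete for the same CPUs.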

M

Number of top words per topic used for coherence/exclusivity.

seed

RNG seed (held-out split + fits).

...

Passed to stm() (e.g. max.em.its, init.type).

Value

A faSTM_searchk object wrapping a tidy data.frame, results, with one row per K and the columns K, heldout, semcoh, exclusivity, and bound.
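One way to read the diagnostics is sketched below. It assumes `sweep` is an object returned by search_k() with heldout = TRUE, and that the data.frame is reachable as `sweep$results`; that accessor is an assumption based on the description above, not a confirmed part of the API.

```r
# Assumption: `sweep` came from search_k(), and the per-K table is stored
# in an element named `results`.
res <- sweep$results

# Higher held-out likelihood is better; take the row that maximises it.
best <- res[which.max(res$heldout), ]
best$K

# Plot the held-out curve across K to check for a plateau.
plot(res$K, res$heldout, type = "b",
     xlab = "K", ylab = "Held-out likelihood")
```

Held-out likelihood alone rarely settles the choice; comparing semcoh and exclusivity for the same rows is a common complement.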