keyATM toolkit

keyATM-specific workflow helpers live in topica.keyatm. The general post-hoc diagnostics (top words, representative documents, coherence, pyLDAvis, covariate effects, …) are model-agnostic and work directly on a fitted KeyATM; see the Diagnostics and STM toolkit pages.

The fitted model also exposes the convergence trace that keyATM's plot_modelfit reports, as KeyATM.log_likelihood_history: a list of (iteration, per-token log-likelihood) pairs, where perplexity is exp(-log_likelihood). The same trace is available in cross-model form as KeyATM.fit_history. Passing convergence_tol to fit (default 0.0, which disables it) opts into early stopping on that trace: the Gibbs run halts once the relative change in the recorded log-likelihood drops below the tolerance, and KeyATM.converged reports whether it did. The base, covariate, and dynamic Gibbs backends support early stopping; the CVB0 backend (sampler="cvb0") keeps no trace and never stops early.
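As a rough illustration of the stopping rule, the sketch below scans a trace shaped like log_likelihood_history and reports the first iteration at which the relative change falls below a tolerance. The helper names and the exact relative-change formula (absolute difference divided by the magnitude of the previous value) are assumptions made for illustration, not topica's implementation, and the trace values are invented.

```python
import math


def first_converged_iteration(history, tol):
    """Return the first iteration whose relative change is below tol, else None.

    history: list of (iteration, per_token_log_likelihood) pairs.
    Assumption: relative change = |ll - prev| / |prev|.
    """
    if tol <= 0.0:  # a tolerance of 0.0 means early stopping is disabled
        return None
    for (_, prev), (iteration, ll) in zip(history, history[1:]):
        if prev != 0.0 and abs(ll - prev) / abs(prev) < tol:
            return iteration
    return None


def perplexity(per_token_log_likelihood):
    """Perplexity implied by a per-token log-likelihood."""
    return math.exp(-per_token_log_likelihood)


# Invented trace: the log-likelihood flattens out between iterations 30 and 40.
trace = [(10, -8.0), (20, -7.5), (30, -7.4), (40, -7.3999)]
stop_at = first_converged_iteration(trace, tol=1e-3)
```

With this trace the change between iterations 30 and 40 is the first to fall under the tolerance, so stop_at is 40; with tol=0.0 the helper returns None.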

topica.keyatm.top_topics

top_topics(model_or_theta, *, n=2, topic_names=None)

The n most prevalent topics in each document (≈ keyATM::top_topics).

Returns a list (one per document) of (topic_name, proportion) pairs, sorted by descending document-topic proportion. Pass a fitted topica.KeyATM (topic names are taken from it) or a raw theta array.
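The ranking itself is simple enough to sketch on a raw theta given as nested lists. The function name top_topics_sketch and the fallback topic names are hypothetical; this shows the idea, not the library's code.

```python
def top_topics_sketch(theta, n=2, topic_names=None):
    """Rank topics per document.

    theta: list of rows, each a list of K document-topic proportions.
    Returns one list per document of (topic_name, proportion) pairs.
    """
    n_topics = len(theta[0]) if theta else 0
    if topic_names is None:
        # Hypothetical default names, used only when none are supplied.
        names = [f"topic_{k}" for k in range(n_topics)]
    else:
        names = list(topic_names)
    result = []
    for row in theta:
        # Sort (name, proportion) pairs by descending proportion, keep the top n.
        ranked = sorted(zip(names, row), key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:n])
    return result


theta = [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]
top = top_topics_sketch(theta, n=2, topic_names=["economy", "health", "other"])
```

Here the first document is led by "economy" then "health", and the second by "other" then "economy".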

topica.keyatm.by_strata

by_strata(model_or_theta, strata, *, ci=0.95, topic_names=None, corpus=None, nsims=None, seed=0)

Mean topic prevalence within each level of a document covariate (≈ keyATM::by_strata_DocTopic).

Splits documents by their value in strata (one label per document) and, for each level, reports the mean of each topic's proportion with a normal-approximation confidence interval on that mean. This is keyATM's descriptive answer to "how does topic prevalence differ across groups".
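A minimal sketch of that descriptive summary, assuming theta is a nested list and the interval is mean ± z · (sample standard deviation / √n). The function name and the use of the sample standard deviation are assumptions; the source only says the interval is a normal approximation on the mean.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def strata_means_sketch(theta, strata, ci=0.95):
    """Per-stratum (mean, low, high) for each topic.

    theta: list of rows of K topic proportions; strata: one label per row.
    """
    z = NormalDist().inv_cdf(0.5 + ci / 2.0)  # about 1.96 for ci=0.95
    groups = defaultdict(list)
    for row, label in zip(theta, strata):
        groups[label].append(row)
    out = {}
    for label in sorted(groups):
        rows = groups[label]
        per_topic = []
        for k in range(len(rows[0])):
            values = [row[k] for row in rows]
            m = mean(values)
            # Standard error of the mean; a single document gives no spread,
            # so the interval collapses to the point estimate in that case.
            se = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
            per_topic.append((m, m - z * se, m + z * se))
        out[label] = per_topic
    return out


theta = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]
strata = ["a", "a", "b", "b"]
summary = strata_means_sketch(theta, strata)
```

The result maps each stratum label to one (mean, low, high) triple per topic, in sorted label order.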

With nsims (and a fitted model as the first argument), the interval is widened by the method of composition: the model's θ posterior is drawn for you (logistic-normal for STM/CTM, Dirichlet for the Gibbs models; pass corpus= so document lengths are available) and the per-stratum means are pooled by Rubin's rules. The interval then propagates the topic-estimation uncertainty, not just the across-document spread. For a regression with the same propagation, use topica.stm.estimate_effect.
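For orientation, the textbook form of Rubin's rules combines m per-draw estimates with their within-draw variances: the pooled estimate is the average of the estimates, and the total variance is W + (1 + 1/m)·B, where W is the mean within-draw variance and B is the between-draw variance. The sketch below implements that formula; whether topica applies exactly this variant (including the 1/m correction) is an assumption, and the numbers are invented.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool per-draw estimates by the standard Rubin's rules formula.

    estimates: one stratum mean per posterior draw.
    within_variances: the squared standard error of that mean in each draw.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)                   # W: across-document spread
    between = variance(estimates) if m > 1 else 0.0   # B: draw-to-draw spread
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Invented example: three draws of one stratum's mean prevalence for one topic.
pooled, total = pool_rubin([0.30, 0.34, 0.32], [0.0004, 0.0004, 0.0004])
```

The total variance exceeds the within-draw variance whenever the draws disagree, which is why the pooled interval is wider than the purely descriptive one.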

Returns a list of StrataPrevalence, one per unique stratum (sorted). [s.as_dict() for s in result] builds a table.

topica.keyatm.visualize_keywords

visualize_keywords(docs, keywords)

Corpus frequency of each keyword (≈ keyATM::visualize_keywords).

For every keyword in every set, reports how common it is in docs, so you can catch keywords that are too rare to anchor a topic or so frequent that they dominate it. This is the diagnostic keyATM asks you to run before fitting.

Returns a dict mapping each keyword-set name to a list of dicts {"keyword", "count", "proportion", "doc_freq"} sorted by descending proportion, where proportion is the keyword's share of all corpus tokens and doc_freq is the number of documents containing it.
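The three statistics can be reproduced by hand on a toy corpus. This sketch assumes docs is a list of pre-tokenized documents (lists of strings) and keywords is a dict from set name to keyword list; the function name and the input format are assumptions, since the expected document format is not spelled out here.

```python
from collections import Counter


def keyword_frequencies_sketch(docs, keywords):
    """Count, corpus-token share, and document frequency for each keyword."""
    counts = Counter(token for doc in docs for token in doc)
    # set(doc) ensures each document contributes at most once per token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    total = sum(counts.values())
    report = {}
    for name, words in keywords.items():
        rows = [
            {
                "keyword": w,
                "count": counts[w],
                "proportion": counts[w] / total if total else 0.0,
                "doc_freq": doc_freq[w],
            }
            for w in words
        ]
        report[name] = sorted(rows, key=lambda r: r["proportion"], reverse=True)
    return report


docs = [["tax", "budget", "tax"], ["vaccine", "budget"], ["tax", "clinic"]]
keywords = {"economy": ["budget", "tax", "tariff"], "health": ["vaccine"]}
report = keyword_frequencies_sketch(docs, keywords)
```

In this toy corpus "tariff" never occurs, so it shows up with count 0 at the bottom of the "economy" list, which is the kind of keyword the diagnostic is meant to surface.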

topica.keyatm.refine_keywords

refine_keywords(docs, keywords, *, min_count=2, min_doc_freq=1, verbose=False)

Drop keywords too rare to anchor a topic (≈ keyATM::refine_keywords).

Removes any keyword whose corpus count is below min_count or whose document frequency is below min_doc_freq (so out-of-vocabulary keywords, with count 0, always go). Keyword sets that end up empty are dropped, since a keyword topic needs at least one surviving keyword.

Returns (refined, dropped) where refined is the cleaned keyword dict and dropped maps each set name to the list of removed keywords. Set verbose=True to print a short report.
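The filtering rule can be sketched as follows, again assuming pre-tokenized docs. The function name is hypothetical, and recording an empty list in dropped for sets that lose nothing is an assumption about the return shape.

```python
from collections import Counter


def refine_keywords_sketch(docs, keywords, min_count=2, min_doc_freq=1):
    """Drop keywords below the count or document-frequency thresholds."""
    counts = Counter(token for doc in docs for token in doc)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    refined, dropped = {}, {}
    for name, words in keywords.items():
        kept = [
            w for w in words
            if counts[w] >= min_count and doc_freq[w] >= min_doc_freq
        ]
        dropped[name] = [w for w in words if w not in kept]
        if kept:  # a set with no surviving keyword is removed entirely
            refined[name] = kept
    return refined, dropped


docs = [["tax", "budget", "tax"], ["vaccine", "budget"], ["tax", "clinic"]]
keywords = {"economy": ["budget", "tax", "tariff"], "health": ["vaccine"]}
refined, dropped = refine_keywords_sketch(docs, keywords)
```

With the default min_count=2, the out-of-vocabulary "tariff" is removed and the "health" set disappears because its only keyword occurs once.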

Dynamic model: time-trend credible intervals

Per-period topic prevalence with credible bands from the dynamic keyATM posterior's retained MCMC theta draws.

topica.time_prevalence_ci

time_prevalence_ci(model, timestamps, *, ci=0.95, normalize=True)

Per-period topic prevalence with credible intervals from the dynamic keyATM posterior.

For a dynamic topica.KeyATM (fit with timestamps= and keep_theta_draws=True), this computes per-period prevalence uncertainty directly from the retained MCMC theta_draws. For each posterior draw and each time period, theta is averaged over the documents in that period, giving an (S, T, K) array of per-draw period-level prevalences (S draws, T periods, K topics). The point estimate is the posterior mean over draws; ci_low and ci_high are the empirical (1-ci)/2 and (1+ci)/2 quantiles; sd is the posterior standard deviation.

The periods are ordered to match model.time_labels exactly, so the result aligns with model.time_prevalence.

Parameters:

Name	Description	Default
`model`	A fitted dynamic `topica.KeyATM` with non-empty `time_labels` and non-`None` `theta_draws`. Refit with `keep_theta_draws=True` (the default) if draws are absent.	required
`timestamps`	One value per document — the same array passed to `fit`.	required
`ci`	Credible interval coverage (default 0.95 gives a 95 percent interval).	`0.95`
`normalize`	When `True` (default), each per-draw per-period prevalence row is normalized to sum to 1 before computing the summary statistics.	`True`

Returns:

Type	Description
`dict`	Keys: `labels`, list of period labels (equals `model.time_labels`); `mean`, ndarray of shape (T, K), posterior mean prevalence per period; `ci_low`, ndarray of shape (T, K), lower credible bound; `ci_high`, ndarray of shape (T, K), upper credible bound; `sd`, ndarray of shape (T, K), posterior standard deviation.
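The computation described for time_prevalence_ci can be sketched in plain Python on nested lists. Everything here is illustrative: the function and helper names are hypothetical, the labels are passed in explicitly instead of being read from a model, the quantile uses linear interpolation (one common convention, assumed here), and sd is left out to keep the example short.

```python
from statistics import mean


def quantile(sorted_values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def period_prevalence_ci_sketch(theta_draws, timestamps, labels, ci=0.95,
                                normalize=True):
    """theta_draws: S draws, each a list of D rows of K proportions.

    Assumes every label has at least one document.
    """
    n_topics = len(theta_draws[0][0])
    per_draw = []  # per_draw[s][t][k]: period-level prevalence in draw s
    for draw in theta_draws:
        periods = []
        for label in labels:
            rows = [row for row, ts in zip(draw, timestamps) if ts == label]
            avg = [mean(row[k] for row in rows) for k in range(n_topics)]
            if normalize:
                total = sum(avg)
                avg = [a / total for a in avg]
            periods.append(avg)
        per_draw.append(periods)
    out = {"labels": list(labels), "mean": [], "ci_low": [], "ci_high": []}
    for t in range(len(labels)):
        # One sorted list of S per-draw values for each topic in period t.
        cols = [sorted(d[t][k] for d in per_draw) for k in range(n_topics)]
        out["mean"].append([mean(c) for c in cols])
        out["ci_low"].append([quantile(c, (1 - ci) / 2) for c in cols])
        out["ci_high"].append([quantile(c, (1 + ci) / 2) for c in cols])
    return out


# Invented example: 3 draws, 3 documents, 2 topics, 2 periods.
theta_draws = [
    [[0.6, 0.4], [0.4, 0.6], [0.2, 0.8]],
    [[0.7, 0.3], [0.5, 0.5], [0.3, 0.7]],
    [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]],
]
result = period_prevalence_ci_sketch(theta_draws, [2000, 2000, 2001], [2000, 2001])
```

Each entry of result["mean"], result["ci_low"], and result["ci_high"] is a T-by-K nested list, mirroring the (T, K) arrays listed in the return table.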