Models¶

All models share the same shape of API: construct with hyperparameters and a seed, call fit(documents, ...), then read topic_word (φ), doc_topic (θ), top_words(n), coherence(n), and save / load.

This page covers the count-based models. The embedding-based models (BERTopic, Top2Vec, ETM, FASTopic) are on the Embedding models page.

topica.LDA ¶

SparseLDA topic model (the MALLET algorithm).

Construct with the hyperparameters, then call :meth:fit on a :class:Corpus or a list of token lists. After fitting, the estimated distributions are available as :attr:topic_word (φ) and :attr:doc_topic (θ).

alpha `property` ¶

alpha

Per-topic α after (optional) optimisation, shape (num_topics,).

beta `property` ¶

beta

The (optimised) symmetric β.

converged `property` ¶

converged

True if fit stopped early because the convergence tolerance criterion was met (convergence_tol > 0); False if the full iters sweeps ran (the default).

doc_lengths `property` ¶

doc_lengths

Per-document token counts (length D), in :attr:doc_topic row order. With these, :func:topica.composition_theta can recover the Dirichlet concentration N_d without being handed the original :class:Corpus again.

doc_names `property` ¶

doc_names

Document ids, parallel to the rows of :attr:doc_topic.

doc_topic `property` ¶

doc_topic

Document-topic probability matrix θ, shape (num_docs, num_topics).

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, log_likelihood) pairs, one per trace checkpoint. Equivalent to :attr:log_likelihood_history for LDA.

log_likelihood_history `property` ¶

log_likelihood_history

Per-iteration log-likelihood trace: (iteration, log_likelihood) pairs recorded every check_every sweeps during :meth:fit. Non-empty for the SparseLDA path; empty for the LightLDA path.

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). Internal flags are reported under their public names (sampler, init); values are the effective ones actually in force (e.g. num_threads after it has been floored at 1).

theta_draws `property` ¶

theta_draws

Thinned MCMC θ draws, shape (num_draws, num_docs, num_topics), or None when fit with keep_theta_draws=False. These are real cross-sweep posterior samples; :func:topica.composition_theta prefers them over the within-document Dirichlet approximation.

topic_divergence `property` ¶

topic_divergence

Pairwise Jensen-Shannon divergence between topic-word distributions, shape (num_topics, num_topics) (base 2, in [0, 1]; 0 on the diagonal). Low off-diagonal values flag near-duplicate topics.
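
As an illustration of the quantity described above, here is a minimal pure-Python sketch of base-2 Jensen-Shannon divergence between rows of a topic-word matrix. The `js_divergence` helper and the toy `phi` values are hypothetical and are not topica's implementation; they only show why near-duplicate topics produce small off-diagonal entries.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing (0 * log 0 is taken as 0).
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution halfway between p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical 3-topic model over a 4-word vocabulary.
phi = [
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],  # near-duplicate of topic 0
    [0.05, 0.05, 0.20, 0.70],
]

# Pairwise matrix, shape (num_topics, num_topics), zero on the diagonal.
divergence = [[js_divergence(a, b) for b in phi] for a in phi]
```

In this toy matrix, `divergence[0][1]` is close to 0 (topics 0 and 1 are nearly identical) while `divergence[0][2]` is much larger.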

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word probability matrix φ, shape (num_topics, num_words).

vocabulary `property` ¶

vocabulary

The vocabulary: word for each column of :attr:topic_word.

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence for each topic, shape (num_topics,).

Intrinsic (no external corpus): for each topic's top-n words, Σ_{i>j} log[(codoc(w_i,w_j)+1)/docfreq(w_j)] over the training corpus. Higher (closer to 0) is more coherent. numpy.mean(...) gives the usual single-number summary.
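
The formula above can be sketched in a few lines of pure Python. This is an illustrative re-implementation on a toy corpus, not the library's code; `umass_coherence`, `docs`, and the word lists are hypothetical names, and the top words are assumed to be ordered from most to least probable.

```python
import math

def umass_coherence(top_words, docs):
    """Sum over i > j of log[(codoc(w_i, w_j) + 1) / docfreq(w_j)]."""
    doc_sets = [set(d) for d in docs]

    def docfreq(w):
        # Number of documents containing w.
        return sum(w in s for s in doc_sets)

    def codoc(a, b):
        # Number of documents containing both a and b.
        return sum(a in s and b in s for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((codoc(w_i, w_j) + 1) / docfreq(w_j))
    return score

docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "bond"],
    ["stock", "bond", "market"],
    ["cat", "pet"],
]

coherent = umass_coherence(["cat", "dog", "pet"], docs)
mixed = umass_coherence(["cat", "stock", "dog"], docs)
```

Here `mixed` scores lower than `coherent` because "stock" never co-occurs with "cat" or "dog" in the toy corpus.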

diagnostics `method descriptor` ¶

diagnostics(n=10)

Per-topic diagnostics (MALLET-style), one dict per topic, suitable for pandas.DataFrame(model.diagnostics()).

Keys mirror MALLET's topic diagnostics: topic, tokens (assignments to the topic), coherence (UMass), exclusivity (mean top-word share of φ vs. other topics; higher = more distinctive), effective_words (exp(H(φ_t)), MALLET's eff_num_words; lower = more focused), document_entropy (entropy of the topic's token allocation across documents), uniform_dist (KL of φ_t from uniform) and corpus_dist (KL of φ_t from the corpus word distribution), rank1_docs (documents whose dominant topic is this one), alpha, and top_words. n is the number of top words per topic surfaced in top_words.
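
Two of these keys are simple functions of a single row of φ. The sketch below assumes natural logarithms and reads uniform_dist as KL(φ_t ‖ uniform); both are assumptions made for illustration, and the helper names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability entries are skipped."""
    return -sum(x * math.log(x) for x in p if x > 0)

def effective_words(phi_t):
    """exp(H(phi_t)): the number of equally likely words with the same entropy."""
    return math.exp(entropy(phi_t))

def kl_from_uniform(phi_t):
    """KL(phi_t || uniform) over the same vocabulary (assumed direction)."""
    u = 1.0 / len(phi_t)
    return sum(x * math.log(x / u) for x in phi_t if x > 0)

flat_topic = [0.25, 0.25, 0.25, 0.25]   # spreads mass over all 4 words
focused_topic = [1.0, 0.0, 0.0, 0.0]    # all mass on one word
```

A flat topic has `effective_words` equal to the vocabulary size and zero distance from uniform; a topic concentrated on one word has `effective_words` of 1 and the maximum distance, log(V).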

evaluate `method descriptor` ¶

evaluate(data, *, num_particles=10, seed=None)

Held-out evaluation via the Wallach et al. (2009) left-to-right estimator (the method MALLET's evaluate-topics uses).

data is a held-out :class:Corpus or list[list[str]]; its tokens are matched to the training vocabulary by string (out-of-vocabulary tokens are dropped). Returns a dict with log_likelihood (total held-out log P(data)), perplexity (exp(-LL / num_tokens), lower is better), num_tokens (scored), and num_oov (dropped). Cost grows with the square of document length, so keep num_particles modest. seed seeds the inference RNG (defaults to the model's seed).
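
The relationship between the returned log_likelihood, num_tokens, and perplexity can be checked by hand. The helper below is a hypothetical stand-alone function that applies exp(-LL / num_tokens); the numbers are a constructed example, not measured output.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, num_tokens):
    """exp(-LL / num_tokens); lower is better."""
    return math.exp(-log_likelihood / num_tokens)

# Constructed example: a model that assigns every token probability 1/1000
# (uniform over a 1000-word vocabulary), scored on 500 held-out tokens.
num_tokens = 500
log_likelihood = num_tokens * math.log(1 / 1000)

ppl = perplexity_from_log_likelihood(log_likelihood, num_tokens)
```

A uniform model over V words has perplexity V, so `ppl` comes out at 1000 up to floating-point error; a useful topic model should score well below the vocabulary size.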

fit `method descriptor` ¶

fit(data, *, iters=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10, num_threads=None, turbo_merge_every=1)

Run Gibbs sampling on data, then average num_samples snapshots (taken sample_interval iterations apart) into the final φ/θ estimates.

data may be a :class:Corpus or a list of token lists (each a list of strings). When a list of token lists is passed, an internal corpus is built with no frequency filtering; to apply frequency filtering, build a :class:Corpus explicitly and pass that instead.

progress, if given, is called as progress(iteration, ll_per_token) every progress_interval iterations during the main loop.

convergence_tol (default 0.0, disabled) enables early stopping: every check_every sweeps the relative change in a smoothed log-likelihood is computed; if it falls below convergence_tol the loop stops and :attr:converged is set to True. When convergence_tol is 0 (the default), the full iters sweeps always run and the results are bit-for-bit identical to a run without the check.

turbo_merge_every (default 1, exact) is an opt-in approximate-speed knob for multi-threaded runs only. The parallel sampler partitions documents across workers and reconciles the shared topic-word counts after every sweep; that per-sweep merge is the thread-scaling ceiling. Setting this to m > 1 lets each worker run m sweeps against its own counts before one merge, so the table is synchronized once per m sweeps.

This is approximate (workers sample against staler cross-partition counts the deeper into a batch they go), so results differ from the exact path and are not bit-reproducible against it; with m = 1 (or single-threaded, or the LightLDA/WarpLDA/CVB0 samplers) the exact per-sweep path runs and is unchanged.

We measured the tradeoff on a large wide-vocabulary corpus (30k docs, 30k vocabulary, K=400, 8 threads): m = 3 ran 1.55x faster for a 0.010 drop in c_npmi topic coherence. The win appears only when the merge actually dominates (large corpus, wide vocabulary, high K, many threads); on smaller corpora it does not help and can run slower, so leave it at the default unless profiling shows the merge is your bottleneck. Recommended range when it helps: 3 to 4.

keep_theta_draws (default True) retains the last num_theta_draws thinned MCMC θ snapshots in theta_draws for composition_theta standard errors; set it False to save memory. num_threads overrides the constructor's num_threads for this fit call only (None = constructor value).

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

load_state `staticmethod` ¶

load_state(path)

Reconstruct a fitted model from a MALLET-format Gibbs state file (the inverse of :meth:save_state; MALLET's --input-state). The file may be gzip-compressed or plain text. The vocabulary, documents, per-token topic assignments, and the #alpha/#beta hyperparameters are read back, so the loaded model supports the full read-only surface (topic_word, doc_topic, top_words, …) and transform on new documents, and can re-emit the state with :meth:save_state.

log_likelihood `method descriptor` ¶

log_likelihood()

MALLET-formula model log-likelihood of the final sampler state.

perplexity `method descriptor` ¶

perplexity(data, *, num_particles=10, seed=None)

Held-out perplexity (lower is better) — convenience wrapper over :meth:evaluate. See evaluate for data/num_particles semantics. seed seeds the inference RNG (defaults to the model's seed).

save `method descriptor` ¶

save(path)

Save the fitted model to path (compact binary). Reload with LDA.load.

save_doc_topic `method descriptor` ¶

save_doc_topic(path)

Write document-topic probabilities to a TSV file (the train CLI format).

save_state `method descriptor` ¶

save_state(path)

Write the token-level Gibbs state to a gzipped file in MALLET's --output-state format: a header, the #alpha/#beta hyperparameter lines, then one row per token — doc source pos typeindex type topic — giving the final topic assignment of every token in the training corpus. Researchers pipe this into custom visualizations (e.g. pyLDAvis) or corpus metrics. The file is gzip-compressed, as MALLET writes it.
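
A state file in this layout can be consumed with the standard library alone. The sketch below counts tokens per topic from an in-memory gzip sample; the exact header wording, the `NA` source value, and whitespace-separated columns in `SAMPLE` are assumptions modelled on the column order given above, and `topic_token_counts` is a hypothetical helper.

```python
import gzip
import io
from collections import Counter

# Assumed layout: header line, #alpha and #beta lines, then one row per token
# with columns doc, source, pos, typeindex, type, topic.
SAMPLE = (
    "#doc source pos typeindex type topic\n"
    "#alpha : 0.5 0.5\n"
    "#beta : 0.01\n"
    "0 NA 0 0 cat 0\n"
    "0 NA 1 1 dog 0\n"
    "1 NA 0 2 stock 1\n"
    "1 NA 1 0 cat 0\n"
)

def topic_token_counts(path_or_file):
    """Count token assignments per topic in a gzipped state file."""
    counts = Counter()
    with gzip.open(path_or_file, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip the header and hyperparameter lines
            doc, source, pos, typeindex, word, topic = line.split()
            counts[int(topic)] += 1
    return counts

# Stand-in for a file written by save_state.
buffer = io.BytesIO(gzip.compress(SAMPLE.encode("utf-8")))
counts = topic_token_counts(buffer)
```

The same function accepts a filesystem path in place of the in-memory buffer.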

save_topic_word `method descriptor` ¶

save_topic_word(path)

Write topic-word probabilities to a TSV file (the train CLI format).

similar_documents `method descriptor` ¶

similar_documents(doc, n=10)

The n training documents most similar to document doc (by index), as (doc_name, divergence) pairs sorted by ascending Jensen-Shannon divergence of their document-topic distributions.

top_documents `method descriptor` ¶

top_documents(topic, n=10)

The n training documents most strongly associated with topic, as (doc_name, weight) pairs sorted by descending θ for that topic.
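
The ranking described above amounts to sorting one column of θ. The following stand-alone sketch works on plain lists; the free function `top_documents` and the toy data are hypothetical and only mirror the documented return shape.

```python
def top_documents(doc_topic, doc_names, topic, n=10):
    """Return (doc_name, weight) pairs sorted by descending theta[:, topic]."""
    weights = [row[topic] for row in doc_topic]
    ranked = sorted(zip(doc_names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy theta with 3 documents and 2 topics; each row sums to 1.
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
doc_names = ["a", "b", "c"]

best_for_topic_1 = top_documents(doc_topic, doc_names, topic=1, n=2)
```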

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer document-topic distributions for new, unseen documents under the fitted model (sklearn-style transform). data is a :class:Corpus or list[list[str]]; tokens are matched to the training vocabulary by string (OOV dropped). A document with no in-vocabulary tokens gets the prior θ. Returns an array of shape (num_new_docs, num_topics) whose rows sum to 1.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.DMR ¶

Dirichlet-Multinomial Regression topic model (Mimno & McCallum, 2008).

Like :class:LDA, but the per-document topic prior is a log-linear function of document features: α_{d,t} = exp(λ_t · x_d). After fitting, the learned weights are available as :attr:feature_effects — how each covariate shifts each topic's prevalence.
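
To make the log-linear prior concrete, the sketch below evaluates α_{d,t} = exp(λ_t · x_d) for a single document, with an intercept entry prepended to x_d. The weights in `lambdas` are invented for illustration and `dmr_prior` is a hypothetical helper, not part of the library.

```python
import math

def dmr_prior(lambdas, x_d):
    """Per-document Dirichlet prior: alpha[t] = exp(lambda_t . [1, x_d])."""
    x = [1.0] + list(x_d)  # column 0 is the intercept
    return [math.exp(sum(l * v for l, v in zip(lambda_t, x))) for lambda_t in lambdas]

# Two topics, one covariate. Both intercepts are log(0.1); the covariate
# raises topic 0 and lowers topic 1.
lambdas = [
    [math.log(0.1), 1.0],
    [math.log(0.1), -1.0],
]

alpha_at_zero = dmr_prior(lambdas, [0.0])  # baseline: exp(intercept) per topic
alpha_at_one = dmr_prior(lambdas, [1.0])
```

With the covariate at 0 both topics share the baseline prior of about 0.1; at 1 the prior for topic 0 grows by a factor of e and the prior for topic 1 shrinks by the same factor.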

alpha `property` ¶

alpha

The baseline document-topic Dirichlet prior α, shape (num_topics,): exp(λ_intercept), the per-topic prior at covariates = 0. DMR's prior is per-document (α_{d,k} = exp(λ_k · x_d)), so this is the baseline; it marks DMR as a Dirichlet model for :func:topica.effects.composition_theta.

converged `property` ¶

converged

True if the relative-change convergence criterion was satisfied before all iterations completed. Always False when convergence_tol=0.

doc_lengths `property` ¶

doc_lengths

Per-document token counts (length D), in :attr:doc_topic row order.

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics).

feature_effect_se `property` ¶

feature_effect_se

Standard error of each feature weight λ, shape (num_topics, num_features), from the observed information of the penalized Dirichlet-multinomial likelihood at the fit — the curvature of the same objective L-BFGS maximizes to estimate :attr:feature_effects. Aligned to feature_effects; an effect more than ~2 SEs from zero is the usual significance cue. None for models saved before this was added.
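
The "about 2 SEs from zero" cue can be applied element-wise to the two aligned matrices. The sketch below uses invented numbers and a hypothetical helper, `significant_effects`; it is a rule-of-thumb screen, not a formal test.

```python
def significant_effects(effects, std_errors, threshold=2.0):
    """Flag entries where |effect| exceeds threshold * standard error."""
    flags = []
    for effect_row, se_row in zip(effects, std_errors):
        flags.append([abs(e) > threshold * s for e, s in zip(effect_row, se_row)])
    return flags

# Invented values, shape (num_topics, num_features); column 0 is the intercept.
feature_effects = [[-2.3, 0.80], [-2.1, -0.10]]
feature_effect_se = [[0.20, 0.25], [0.20, 0.30]]

flags = significant_effects(feature_effects, feature_effect_se)
```

In this example the covariate in column 1 is flagged for topic 0 (0.80 against an SE of 0.25) but not for topic 1 (-0.10 against an SE of 0.30).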

feature_effects `property` ¶

feature_effects

Learned feature weights λ, shape (num_topics, num_features) — how each feature (column 0 is the intercept) shifts each topic's log-prior. Positive ⇒ the feature raises that topic's prevalence.

feature_names `property` ¶

feature_names

Feature names aligned with the columns of :attr:feature_effects ("intercept" first).

fit_history `property` ¶

fit_history

Per-iteration log-likelihood trace: one (iteration, log_likelihood) pair for every check_every sweeps (empty when check_every=0).

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

theta_draws `property` ¶

theta_draws

Thinned MCMC θ draws, shape (num_draws, num_docs, num_topics), or None when fit with keep_theta_draws=False. These are real cross-sweep posterior samples; :func:topica.composition_theta prefers them over the within-document Dirichlet approximation.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word matrix φ, shape (num_topics, num_words).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, features=None, *, feature_names=None, iters=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10, covariates=None, offset=None)

Fit the model. data is a :class:Corpus or list[list[str]]; features is a (num_docs, F) numpy array or list of float lists (an intercept column is prepended automatically). feature_names (length F) names the columns; an "intercept" name is prepended. covariates is accepted as an alias for features and is not deprecated.

iters is the number of Gibbs sweeps. After burn-in, num_samples posterior snapshots are collected sample_interval sweeps apart for the retained draws.

progress toggles a progress display; progress_interval sets how often the model-fit/log-likelihood trace is recorded (0 = ~50 evenly spaced points); report_interval is a deprecated alias for progress_interval.

keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples composition_theta prefers over the Dirichlet approximation; set it False to save memory.

convergence_tol (default 0.0, disabled) enables opt-in early stopping: the run stops once the relative change in the recorded log-likelihood between the last two trace points, |ΔLL| / |LL|, falls below it, setting converged. The monitored quantity is the collapsed model-fit log-likelihood; the comparison window is the trace cadence (check_every / progress_interval), so a coarser cadence compares more widely spaced sweeps. This is a pragmatic early-stop heuristic on the log-likelihood trace, not a guarantee the Gibbs chain has mixed. check_every is how often, in sweeps, the log-likelihood is recorded and the convergence_tol test is applied.

offset is an optional fixed (num_docs, num_topics) term added inside the exponent of the per-document prior, α_{d,t} = exp(λ_t · x_d + offset[d,t]). A constant offset shifts the baseline Dirichlet concentration (e.g. GDMR passes log(alpha) to center the intercept prior at log(alpha)); None (default) leaves the prior unshifted.
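
The convergence_tol test on the last two trace points can be written out as a small predicate. This is a sketch of the |ΔLL| / |LL| rule as described; using the most recent value in the denominator is an assumption, and `should_stop` and the trace values are hypothetical.

```python
def should_stop(trace, convergence_tol):
    """Relative-change test on the last two (iteration, log_likelihood) pairs."""
    if convergence_tol <= 0 or len(trace) < 2:
        return False  # disabled, or not enough trace points yet
    (_, previous), (_, latest) = trace[-2], trace[-1]
    if latest == 0:
        return False  # avoid dividing by zero on a degenerate trace
    return abs(latest - previous) / abs(latest) < convergence_tol

# Invented trace recorded every 10 sweeps.
trace = [(10, -1000.0), (20, -900.0), (30, -899.9)]

early = should_stop(trace[:2], 1e-3)   # change of 100 on 900: keep going
late = should_stop(trace, 1e-3)        # change of 0.1 on 899.9: stop
```

Because the test compares adjacent trace points, a larger check_every makes the same tolerance harder to satisfy early in the run, when the log-likelihood is still moving quickly.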

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path (compact binary). Reload with DMR.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs (all topics, or one when topic is given).

transform `method descriptor` ¶

transform(data, features=None, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer topic proportions θ for new documents by collapsed Gibbs against the fitted topic-word matrix. data is a :class:Corpus or list[list[str]]; OOV tokens are dropped. features (optional, a (num_docs, F) covariate array matching training, no intercept) sets each document's Dirichlet prior α_{d,t} = exp(λ_t · x_d); if omitted, the intercept-only baseline prior is used. Returns an array of shape (num_docs, num_topics).

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.GDMR ¶

Generalized DMR topic model (g-DMR; Lee & Song 2020).

GDMR replaces the raw document covariates of DMR with a Legendre tensor-product polynomial basis over one or more continuous metadata variables. A decay prior progressively shrinks the higher-order basis terms, producing a smooth topic-distribution function (TDF) over the continuous metadata domain.

We implement GDMR as a thin wrapper around the compiled topica.DMR engine. The Legendre basis is realized in NumPy and passed to DMR as its feature matrix; the decay prior is realized via column scaling (the "scaling trick"), so no changes to the Rust core are required.

Parameters:

Name	Type	Description	Default
`num_topics`	`int`	Number of topics K.	required
`degrees`	`list[int]`	Per continuous-metadata-dimension maximum Legendre degree. Length must equal the number of metadata dimensions D. `degrees=[3]` gives a cubic TDF over a single continuous covariate.	required
`beta`	`float`	Dirichlet word smoothing parameter (passed through to DMR).	`0.01`
`optimize_interval`	`int`	How often (in Gibbs sweeps) to run the L-BFGS lambda-optimization.	`50`
`burn_in`	`int`	Sweeps before optimization begins.	`200`
`seed`	`int`	RNG seed.	`42`
`sigma`	`float`	Prior std on the non-constant (order >= 1) basis terms, matching tomotopy's `GDMRModel.sigma`.	`1.0`
`sigma0`	`float`	Prior std on the constant (order-0 / intercept) term, matching tomotopy's `GDMRModel.sigma0`.	`3.0`
`decay`	`float`	Per-dimension shrinkage of higher-order terms (tomotopy `GDMRModel.decay`): a non-constant term with per-dimension Legendre degrees `(p_0..p_{D-1})` has prior variance `sigma**2 / prod_d (p_d + 1)**(2*decay)`. Any `decay > 0` shrinks; `decay == 0` gives a uniform `sigma` over all non-constant terms (the original paper's decay-free prior).	`0.0`
`alpha`	`float`	Baseline Dirichlet concentration (tomotopy `GDMRModel.alpha`, default 0.1). Applied as a constant `log(alpha)` offset in the DMR predictor, so it sets the baseline topic concentration (smaller `alpha` -> sparser per-document topic mixtures) and centers the intercept prior at `log(alpha)`, matching both reference implementations. `alpha = 1` reproduces a zero-mean intercept prior (topica's pre-#426 behavior).	`0.1`
`metadata_range`	`list[tuple[float, float]] \| None`	Per-dimension `(lo, hi)` bounds for the [-1, 1] mapping. If None, we infer from the training data at fit time.	`None`
`lbfgs_iters`	`int`	L-BFGS step cap per optimization round.	`20`
`sampler`	`str`	Gibbs sampler variant: `"sparse"` (default), `"warp"`, or `"cvb0"`. See `topica.DMR` for details.	`'sparse'`
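
To see how `sigma`, `sigma0`, and `decay` interact, the sketch below evaluates the prior variance of a single basis term from its per-dimension Legendre degrees, using `sigma**2 / prod_d (p_d + 1)**(2*decay)` for non-constant terms and `sigma0**2` for the constant term. `prior_variance` is a hypothetical helper written for this page, not a topica function.

```python
import math

def prior_variance(term_degrees, sigma=1.0, sigma0=3.0, decay=0.0):
    """Prior variance of one basis term given its per-dimension degrees."""
    if all(p == 0 for p in term_degrees):
        return sigma0 ** 2  # constant (intercept) term
    shrink = math.prod((p + 1) ** (2 * decay) for p in term_degrees)
    return sigma ** 2 / shrink

intercept = prior_variance((0,))                 # sigma0**2
cubic_no_decay = prior_variance((3,), decay=0.0)  # sigma**2, no shrinkage
linear_decay = prior_variance((1,), decay=1.0)    # 1 / 2**2
mixed_decay = prior_variance((1, 2), decay=0.5)   # 1 / (2 * 3)
```

With `decay=0` every non-constant term keeps variance `sigma**2`; with `decay > 0` the variance falls as the degrees grow, which is what smooths the fitted surface.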

alpha `property` ¶

alpha

Intercept baseline topic prevalence, shape (num_topics,).

Equals exp(lambda_intercept) — the per-topic prior with every non-constant Legendre column set to zero. This is a formal baseline, not the TDF at the metadata midpoint: at the midpoint (mapped to Legendre t = 0) the even-degree columns are non-zero (P_2(0) = -1/2, P_4(0) = 3/8, ...), so for any degree >= 2 the midpoint TDF differs from this intercept-only value. Use :meth:tdf at the midpoint for the latter.
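
The midpoint values quoted above follow from the standard Legendre recurrence. The helper below is an independent pure-Python sketch using Bonnet's recursion; it is not how the library builds its basis (which is realized in NumPy).

```python
def legendre(n, t):
    """P_n(t) via Bonnet's recursion: (k+1) P_{k+1} = (2k+1) t P_k - k P_{k-1}."""
    previous, current = 1.0, t  # P_0 and P_1
    if n == 0:
        return previous
    for k in range(1, n):
        previous, current = current, ((2 * k + 1) * t * current - k * previous) / (k + 1)
    return current

# Even-degree terms are non-zero at the midpoint t = 0.
p2_mid = legendre(2, 0.0)  # -1/2
p4_mid = legendre(4, 0.0)  # 3/8
```

Odd-degree terms vanish at t = 0, so only the even-degree coefficients move the midpoint TDF away from the intercept-only baseline.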

converged `property` ¶

converged

True if fit early-stopped due to convergence.

decay `property` ¶

decay

Decay exponent for higher-order prior shrinkage (0 disables decay).

degrees `property` ¶

degrees

Maximum Legendre degree per metadata dimension (read-only).

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document.

doc_names `property` ¶

doc_names

Per-document identifiers in corpus order.

doc_topic `property` ¶

doc_topic

Document-topic matrix theta, shape (num_docs, num_topics), rows sum to 1.

feature_effect_se `property` ¶

feature_effect_se

Standard error of :attr:feature_effects, shape (num_topics, num_basis), from the underlying DMR's observed-information SE rescaled by the same per-column factor that undoes the basis standardization. Aligned to feature_effects; None for models saved before this was added.

feature_effects `property` ¶

feature_effects

Learned lambda over the Legendre basis, shape (num_topics, num_basis).

Column 0 is the intercept (order-0 Legendre product). Subsequent columns correspond to the remaining basis terms in the tensor-product enumeration order.

feature_names `property` ¶

feature_names

Labels for the Legendre basis terms, aligned with the columns of :attr:feature_effects.

Column 0 is "intercept"; the rest are :-joined "{name}^{k}" terms over the metadata dimensions (e.g. "year^2", "year^1:citations^1"), using :attr:metadata_names. The ^k marks the degree-k Legendre term, not a raw power. Because a continuous covariate's per-degree coefficients are rarely interpretable on their own, read the fitted surface with :meth:tdf / :meth:tdf_linspace rather than the individual basis coefficients.

fit_history `property` ¶

fit_history

Per-iteration (iteration, objective) trace.

metadata_names `property` ¶

metadata_names

Names of the D continuous metadata dimensions (the model's inputs).

These are distinct from :attr:feature_names: a metadata dimension (say "year") expands into several Legendre basis terms (year^1, year^2, ...), and it is those basis terms that :attr:feature_effects is indexed by. Set via metadata_names= on :meth:fit; defaults to ["x0", "x1", ...].

metadata_range `property` ¶

metadata_range

Per-dimension (lo, hi) bounds used for the [-1, 1] mapping.
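
A minimal sketch of such a mapping, assuming the usual affine rescaling of (lo, hi) onto [-1, 1] followed by clipping of out-of-range values; the function name and the year bounds are hypothetical.

```python
def to_legendre_domain(x, lo, hi):
    """Map x from [lo, hi] onto [-1, 1], clipping values outside the range."""
    t = 2.0 * (x - lo) / (hi - lo) - 1.0
    return max(-1.0, min(1.0, t))

# Hypothetical metadata_range entry for a "year" dimension.
lo, hi = 2000.0, 2020.0

mapped = [to_legendre_domain(year, lo, hi) for year in (2000.0, 2010.0, 2020.0, 2030.0)]
```

The last value, 2030, lies outside the range and is clipped to 1.0, so passing an explicit `metadata_range` matters when new data extends beyond the training bounds.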

num_topics `property` ¶

num_topics

Number of topics K.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

sigma `property` ¶

sigma

Prior std on order >= 1 (non-constant) basis terms, before decay scaling.

sigma0 `property` ¶

sigma0

Prior std on the order-0 (intercept) basis term.

theta_draws `property` ¶

theta_draws

Thinned MCMC theta draws (num_draws, num_docs, num_topics) or None.

topic_names `property` ¶

topic_names

Per-topic labels. Defaults to ["topic_0", ...] after fit.

topic_word `property` ¶

topic_word

Topic-word matrix phi, shape (num_topics, num_words), rows sum to 1.

vocabulary `property` ¶

vocabulary

Word vocabulary in corpus token-ID order.

coherence ¶

coherence(n: int = 10) -> np.ndarray

UMass topic coherence per topic, shape (num_topics,).

fit ¶

fit(data: Corpus | Sequence[Sequence[str]], features=None, *, metadata_names=None, iters: int = 1000, num_samples: int = 5, sample_interval: int = 25, keep_theta_draws: bool = True, convergence_tol: float = 0.0, check_every: int = 10, covariates=None, metadata=None) -> None

Fit GDMR by collapsed Gibbs with the Legendre-basis DMR prior.

We construct the Legendre tensor-product basis from the continuous covariates, apply column scaling to realize the decay prior, then hand off to the compiled topica.DMR engine for sampling and L-BFGS optimization. After fitting we recover the true lambda coefficients (feature_effects) by undoing the column scaling.

Parameters:

Name	Type	Description	Default
`data`	`Corpus \| Sequence[Sequence[str]]`	A `topica.Corpus` or a list of token lists.	required
`features`		Array-like of shape `(num_docs, D)` of continuous covariate values where D equals `len(degrees)`. Values outside `metadata_range` are clipped to [-1, 1] in Legendre space. As in :class:`DMR`, `covariates=` is an accepted alias; `metadata=` is also accepted for users porting from tomotopy's `GDMRModel`. Pass exactly one.	`None`
`metadata_names`		Optional human-readable labels for the `D` covariate columns, surfaced in the Legendre-basis `feature_names`. Defaults to `x0, x1, ...`.	`None`
`iters`	`int`	Total Gibbs sweeps.	`1000`
`num_samples`	`int`	Number of topic-word phi snapshots to average.	`5`
`sample_interval`	`int`	Sweeps between phi snapshots.	`25`
`keep_theta_draws`	`bool`	Whether to retain thinned MCMC theta draws.	`True`
`convergence_tol`	`float`	Relative-change early-stop threshold (0 disables).	`0.0`
`check_every`	`int`	Sweeps between convergence checks.	`10`
`covariates`		Alias for `features` (topica's DMR vocabulary).	`None`
`metadata`		Alias for `features` (tomotopy `GDMRModel` vocabulary).	`None`

load `staticmethod` ¶

load(path: str) -> GDMR

Load a GDMR model previously written by :meth:save.

Parameters:

Name	Type	Description	Default
`path`	`str`	File path passed to :meth:`save`.	required

Returns:

Type	Description
`GDMR`	A fitted `GDMR` instance.

save ¶

save(path: str) -> None

Persist the fitted GDMR model to path.

The GDMR wrapper state (degrees, metadata_range, sigma, sigma0, decay, recover_scales, constructor parameters) is saved alongside the inner DMR model: the wrapper envelope uses Python pickle, and the inner model uses the DMR native save format. Reload with :meth:GDMR.load.

tdf ¶

tdf(metadata, *, normalize: bool = True) -> np.ndarray

Topic-distribution function at one or more metadata points.

Evaluates the fitted surface at metadata and returns topic prevalences implied by the Legendre-basis DMR prior.

Parameters:

Name	Type	Description	Default
`metadata`		Array-like of shape `(D,)` for a single point or `(P, D)` for P points, in original metadata units (mapped internally via `metadata_range`).	required
`normalize`	`bool`	If True (default), normalize each row so topic prevalences sum to 1. If False, return the raw alpha = exp(lambda @ phi(metadata)).	`True`

Returns:

Type	Description
`np.ndarray`	Array of shape `(num_topics,)` for a single point, or `(P, num_topics)` for P points.
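
The `normalize` switch can be illustrated with a small sketch of the formula alpha = exp(lambda @ phi(metadata)). The function name `tdf_from_basis` and the coefficient values are hypothetical, and the basis vector is taken as given rather than computed from `metadata_range`.

```python
import math


def tdf_from_basis(lam, basis, normalize=True):
    """alpha_k = exp(sum_f lam[k][f] * basis[f]), optionally normalized."""
    alpha = [math.exp(sum(l * b for l, b in zip(row, basis))) for row in lam]
    if not normalize:
        return alpha
    total = sum(alpha)
    return [a / total for a in alpha]


# Hypothetical coefficients for 3 topics and a 2-column basis [P_0, P_1].
lam = [[0.0, 1.0], [0.0, -1.0], [0.5, 0.0]]
start = tdf_from_basis(lam, [1.0, -1.0])  # covariate at the lower bound
end = tdf_from_basis(lam, [1.0, 1.0])     # covariate at the upper bound
```

In this toy setup topic 0 gains prevalence as the covariate increases and topic 1 loses it, while topic 2 has no covariate dependence in its raw alpha.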

tdf_linspace ¶

tdf_linspace(start, stop, num: int, *, endpoint: bool = True, normalize: bool = True) -> np.ndarray

Evaluate the TDF on a regular grid over the metadata domain.

For D == 1: start and stop are scalars (or length-1 arrays). Returns an array of shape (num, num_topics).

For D > 1: start and stop have length D. Returns a tensor grid of shape (num, ..., num, num_topics) with D leading axes of size num. The 1-D case is the primary tested path.

Parameters:

Name	Type	Description	Default
`start`		Lower bound of the evaluation range. Scalar for D == 1, or length-D for D > 1.	required
`stop`		Upper bound of the evaluation range.	required
`num`	`int`	Number of grid points per dimension.	required
`endpoint`	`bool`	Include `stop` in the grid (default True, matching np.linspace).	`True`
`normalize`	`bool`	See :meth:`tdf`.	`True`

Returns:

Type	Description
`np.ndarray`	Array of shape `(num, num_topics)` for D == 1, or `(num, ..., num, num_topics)` for D > 1.
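
As a rough picture of the grid that is evaluated, the sketch below builds the evaluation points in plain Python. `linspace` and `grid_points` are hypothetical helpers that follow the np.linspace endpoint convention; how topica builds the grid internally is not shown in this reference.

```python
from itertools import product


def linspace(start, stop, num, endpoint=True):
    """Evenly spaced points, following np.linspace's endpoint convention."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1 if endpoint else num)
    return [start + i * step for i in range(num)]


def grid_points(start, stop, num, endpoint=True):
    """All D-dimensional grid points, one axis per covariate."""
    axes = [linspace(a, b, num, endpoint) for a, b in zip(start, stop)]
    return list(product(*axes))


years = linspace(2000.0, 2010.0, 5)                  # 1-D case
grid = grid_points([2000.0, 0.0], [2010.0, 1.0], 3)  # D == 2: 3 * 3 points
```

Evaluating the TDF at every point and reshaping gives the `(num, ..., num, num_topics)` layout, with one leading axis per covariate.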

top_words ¶

top_words(n: int = 10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Parameters:

Name	Type	Description	Default
`n`	`int`	Number of top words to return per topic.	`10`
`topic`		If given, return only the list for that topic index. If None (default), return a list of lists (one per topic).	`None`
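
The return shape can be pictured with a standalone sketch that ranks the entries of a topic-word matrix. The function below is a hypothetical re-implementation for illustration, not the library's code, and the toy matrix is made up.

```python
def top_words(topic_word, vocabulary, n=10, topic=None):
    """Top-n (word, probability) pairs per topic from a row-stochastic matrix."""
    def one(row):
        ranked = sorted(zip(vocabulary, row), key=lambda pair: pair[1], reverse=True)
        return ranked[:n]

    if topic is not None:
        return one(topic_word[topic])
    return [one(row) for row in topic_word]


vocabulary = ["court", "law", "goal", "team"]
topic_word = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.30, 0.60],
]
sports = top_words(topic_word, vocabulary, n=2, topic=1)
everything = top_words(topic_word, vocabulary, n=3)
```

With `topic` set, the result is a single list of pairs; without it, the result is one such list per topic.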

transform ¶

transform(data, features=None, *, iters: int = 100, burn_in: int = 10, num_samples: int = 10, sample_interval: int = 5, seed=None, covariates=None, metadata=None) -> np.ndarray

Infer document-topic theta for new documents.

Parameters:

Name	Type	Description	Default
`data`		A `topica.Corpus` or list of token lists.	required
`features`		Optional continuous covariate array-like of shape `(num_new_docs, D)`. If provided, the Legendre-basis DMR prior is used for each document; if None, the intercept-only baseline is used. `covariates=` and `metadata=` are accepted aliases (see :meth:`fit`).	`None`
`iters`	`int`	Inference sweeps.	`100`
`burn_in`	`int`	Sweeps before sampling begins.	`10`
`num_samples`	`int`	Number of theta snapshots to average.	`10`
`sample_interval`	`int`	Sweeps between snapshots.	`5`
`seed`		Optional RNG seed override.	`None`
`covariates`		Alias for `features` (topica's DMR vocabulary).	`None`
`metadata`		Alias for `features` (tomotopy `GDMRModel` vocabulary).	`None`

Returns:

Type	Description
`np.ndarray`	Array of shape `(num_new_docs, num_topics)`.

topica.RTM ¶

RTM: the Relational Topic Model (Chang & Blei, "Hierarchical Relational Models for Document Networks", AOAS 2010). LDA plus a link model: for each observed pair of documents a binary link is drawn from a function of the two documents' mean topic assignments, so the same topics explain both words and links. Fit with fit(docs, links=edges) on a document graph (citations, hyperlinks, co-sponsorship, adjacency); predict links from words for unseen documents with suggest_links. Undirected links; link="logistic" (default) or "exponential".

module `class-attribute` ¶

__module__ = 'topica'

converged `property` ¶

converged

Whether the objective met the convergence tolerance before reaching iters iterations.

doc_topic `property` ¶

doc_topic

eta `property` ¶

eta

Link-function coefficients eta (length K): how topic co-occurrence drives the log-odds (logistic) or log-rate (exponential) of a link.

fit_history `property` ¶

fit_history

Per-EM-iteration variational objective (word + z + link log-likelihood).

link `property` ¶

link

The link probability function in use ("logistic" or "exponential").

nu `property` ¶

nu

Link-function intercept nu.

num_topics `property` ¶

num_topics

phi_bar `property` ¶

phi_bar

Mean topic-assignment vectors phi_bar (D x K) — the quantity the link function reads. This is NOT doc_topic (the normalized Dirichlet mean).

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). alpha/rho are None when left to resolve at fit (alpha = 1/num_topics; rho = negative_ratio * #links).

topic_word `property` ¶

topic_word

vocabulary `property` ¶

vocabulary

fit `method descriptor` ¶

fit(data, links, *, iters=50, e_sweeps=3, e_inner=5)

Fit RTM on a document graph. data is a Corpus or list[list[str]]; links is a sequence of undirected (i, j) document-index pairs.
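
Because links are undirected index pairs, it can help to canonicalize an edge list before fitting. The sketch below is one way to do that; `canonical_links` is a hypothetical helper, and whether topica itself deduplicates edges or rejects self-loops is not stated here, so dropping them is an assumption.

```python
def canonical_links(edges, num_docs):
    """Deduplicate undirected edges into sorted (i, j) pairs with i < j."""
    seen = set()
    for i, j in edges:
        if i == j:
            continue  # drop self-loops (an assumption, not documented behaviour)
        if not (0 <= i < num_docs and 0 <= j < num_docs):
            raise ValueError(f"edge ({i}, {j}) references an unknown document")
        seen.add((min(i, j), max(i, j)))
    return sorted(seen)


docs = [["topic", "model"], ["graph", "link"], ["topic", "graph"]]
raw_edges = [(0, 2), (2, 0), (1, 2), (1, 1)]
links = canonical_links(raw_edges, num_docs=len(docs))
```

The resulting list has the form that `fit(docs, links=links)` expects: one `(i, j)` pair per undirected edge, indexed into `docs`.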

load `builtin` ¶

load(path)

Load a model from path.

predict_link `method descriptor` ¶

predict_link(i, j)

Plug-in link probability between two training documents, psi(phi_bar_i o phi_bar_j).
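
The plug-in formula can be sketched as follows, using the two link functions from Chang & Blei applied to the element-wise product of the two phi_bar vectors. `link_probability` and the parameter values are hypothetical; this is not the library's implementation.

```python
import math


def link_probability(phi_bar_i, phi_bar_j, eta, nu, link="logistic"):
    """psi(phi_bar_i o phi_bar_j) for the two link functions."""
    score = nu + sum(e * a * b for e, a, b in zip(eta, phi_bar_i, phi_bar_j))
    if link == "logistic":
        return 1.0 / (1.0 + math.exp(-score))  # score is the log-odds
    if link == "exponential":
        return math.exp(score)                 # score is the log-rate
    raise ValueError(f"unknown link: {link!r}")


eta, nu = [3.0, 3.0], -2.0
similar = link_probability([0.9, 0.1], [0.8, 0.2], eta, nu)
dissimilar = link_probability([0.9, 0.1], [0.1, 0.9], eta, nu)
```

Documents that concentrate on the same topics get a larger score and therefore a higher link probability. Under the exponential link the value is a valid probability only when the score is non-positive.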

suggest_links `method descriptor` ¶

suggest_links(doc, *, top_n=20, exclude=None, infer_iters=50)

Suggest links for a new document from its words alone. Infers phi_bar from the (in-vocabulary) tokens with the link term removed, then ranks training documents by plug-in link probability. Returns (doc_index, probability) pairs, highest first.

topica.NarrativeTM ¶

Intra-Document Narrative Trajectory Model (Experimental).

Segments documents into chunks (e.g., sentences or fixed token intervals) and fits a Generalized DMR (GDMR) model over their relative positions, capturing the average progression of topics from the beginning to the end of texts.
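
The segmentation step can be pictured with a short sketch that splits a token list into fixed-size chunks and tags each with a relative position in [0, 1]. `chunk_with_positions` is a hypothetical helper, and defining the position as chunk index divided by the last index is an assumption; the model may use a different convention, such as chunk midpoints.

```python
def chunk_with_positions(tokens, chunk_size):
    """Split tokens into fixed-size chunks tagged with a position in [0, 1]."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(chunks) <= 1:
        return [(chunk, 0.0) for chunk in chunks]
    last = len(chunks) - 1
    return [(chunk, index / last) for index, chunk in enumerate(chunks)]


story = "once upon a time the hero left home and later returned changed".split()
segments = chunk_with_positions(story, chunk_size=4)
```

Each chunk then acts as a short document whose single continuous covariate is its position, which is the input shape a GDMR fit expects.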

module `class-attribute` ¶

__module__ = 'topica.narrative'

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

global_trajectory ¶

global_trajectory(t: float | Sequence[float] | ndarray) -> np.ndarray

Evaluate the learned global narrative trajectory at relative position t in [0, 1].

Returns a numpy array of topic proportions at position t: shape (K,) for a scalar t, or (len(t), K) for a sequence.

load `staticmethod` ¶

load(path: str) -> NarrativeTM

Load a saved model state from path.

save ¶

save(path: str) -> None

Persist the model state using pickle.

topica.LabeledLDA ¶

Supervised topic model (Ramage et al., 2009): each document carries a set of labels, each label is a topic, and a document's tokens are constrained to its labels' topics. The number of topics is the number of distinct labels.

Documents with an empty label set are treated as unconstrained (all topics).

module `class-attribute` ¶

__module__ = 'topica'

alpha `property` ¶

alpha

The symmetric document-topic Dirichlet prior α, shape (num_topics,). Marks LabeledLDA as a Dirichlet model for :func:topica.effects.composition_theta.

converged `property` ¶

converged

True if the convergence criterion was met; False otherwise.

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document, shape (num_docs,).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); for each document only its label topics are non-zero, and rows sum to 1.

fit_history `property` ¶

fit_history

Per-iteration log-likelihood trace recorded every check_every sweeps.

labels `property` ¶

labels

The label name for each topic, in topic (column) order.

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

theta_draws `property` ¶

theta_draws

Thinned MCMC θ snapshots, shape (num_draws, num_docs, num_topics), dtype float32. None when fit with keep_theta_draws=False. These are real cross-sweep draws; use them with :func:topica.effects.composition_theta for uncertainty quantification.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word matrix φ, shape (num_topics, num_words).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.
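
For readers unfamiliar with the metric, the sketch below computes the standard UMass score (Mimno et al., 2011) for one topic from document co-occurrence counts. `umass_coherence` is a hypothetical standalone function; the library's exact smoothing and any normalization are not documented here and may differ.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass score for one topic's top words, ordered most probable first."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # For each pair, condition on the higher-ranked (earlier) word.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


documents = [["cat", "dog"], ["cat", "dog", "fish"], ["cat"], ["stock", "market"]]
coherent = umass_coherence(["cat", "dog"], documents)
incoherent = umass_coherence(["cat", "stock"], documents)
```

Scores closer to zero indicate top words that tend to appear in the same documents. The sketch assumes every top word occurs in at least one document, which holds for words drawn from the training vocabulary.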

fit `method descriptor` ¶

fit(data, labels, *, label_names=None, iters=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10)

Fit the model. data is a :class:Corpus or list[list[str]]; labels is a list (one per document) of label lists. The topic set is the union of all labels (or label_names, which also fixes topic order and must contain every non-empty observed label exactly once). An empty label list leaves that document unconstrained.
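
The label constraint can be sketched as a mapping from each document's labels to the topic indices its tokens may use. `label_masks` is a hypothetical helper; using the sorted union as the default topic order is an assumption, since the reference only specifies the order when `label_names` is given.

```python
def label_masks(labels, label_names=None):
    """Map per-document label lists to allowed topic indices."""
    if label_names is None:
        # Assumed ordering: sorted union of all observed labels.
        label_names = sorted({name for doc in labels for name in doc})
    index = {name: k for k, name in enumerate(label_names)}
    all_topics = list(range(len(label_names)))
    masks = []
    for doc in labels:
        # An empty label list leaves the document unconstrained.
        masks.append(sorted(index[name] for name in set(doc)) if doc else all_topics)
    return label_names, masks


labels = [["sports"], ["politics", "economy"], []]
names, masks = label_masks(labels)
```

The third document has no labels, so it may draw on every topic, matching the unconstrained behaviour described above.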

convergence_tol (default 0.0, disabled) enables early stopping based on the relative change in log-likelihood every check_every sweeps.

iters is the number of Gibbs sweeps. After burn-in, num_samples posterior snapshots are collected sample_interval sweeps apart for the retained draws. progress toggles a progress display; progress_interval sets how often the model-fit/log-likelihood trace is recorded (0 = ~50 evenly spaced points); report_interval is a deprecated alias for progress_interval. keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples composition_theta prefers over the Dirichlet approximation; set it False to save memory.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with LabeledLDA.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words for one topic (by label name or index) or all topics.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer label (topic) proportions θ for new documents by collapsed Gibbs against the fitted topic-word matrix, treating every label as available (unsupervised inference). data is a :class:Corpus or list[list[str]]; OOV tokens are dropped. Returns (num_docs, num_topics); columns align with :attr:labels.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.SAGE ¶

Content-covariate topic model (SAGE / the STM content model).

Topics are shared, but each topic's word distribution varies by a document-level group covariate, so you can read how a topic is worded differently across groups. Construct, then :meth:fit on documents plus a per-document group label.

module `class-attribute` ¶

__module__ = 'topica'

alpha `property` ¶

alpha

The symmetric document-topic Dirichlet prior α, shape (num_topics,). SAGE's sparse additive parameterization is on the word side; the document side is an ordinary Dirichlet, so this marks SAGE as a Dirichlet model for :func:topica.effects.composition_theta.

content_kappa `property` ¶

content_kappa

The fitted content deviations κ, as a dict of numpy arrays: "topic" (K×V), "group" (G×V), and "interaction" (K·G×V, row index k*G + g). log β_{k,g,v} = m_v + κ_topic[k,v] + κ_group[g,v] + κ_interaction[k·G+g, v] up to the softmax normalizer. Under a sparse prior most entries are ~0; the nonzero ones are the words each topic/group up- or down-weights relative to the background m.
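
The additive formula can be made concrete with a sketch that rebuilds one group-specific word distribution from the three κ arrays. `group_topic_word` is a hypothetical helper, the background log-frequencies `m` are passed in as an assumed input because they are not part of the `content_kappa` dict, and the numbers are made up.

```python
import math


def group_topic_word(m, kappa_topic, kappa_group, kappa_interaction, k, g):
    """beta_{k,g,.}: softmax of background plus topic/group/interaction deviations."""
    num_groups = len(kappa_group)
    inter = kappa_interaction[k * num_groups + g]  # row index k*G + g
    logits = [
        m[v] + kappa_topic[k][v] + kappa_group[g][v] + inter[v]
        for v in range(len(m))
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# One topic, two groups, three words; only one interaction entry is nonzero.
m = [math.log(0.5), math.log(0.3), math.log(0.2)]
kappa_topic = [[0.0, 0.0, 0.0]]
kappa_group = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
kappa_interaction = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
beta_a = group_topic_word(m, kappa_topic, kappa_group, kappa_interaction, k=0, g=0)
beta_b = group_topic_word(m, kappa_topic, kappa_group, kappa_interaction, k=0, g=1)
```

With all deviations at zero, group 0 reproduces the background distribution, while the single nonzero interaction entry up-weights the third word for group 1 only.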

converged `property` ¶

converged

True if the relative-change convergence criterion was satisfied before all iterations completed. Always False when convergence_tol=0.

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document, shape (num_docs,).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

fit_history `property` ¶

fit_history

Per-iteration log-likelihood trace: one (iter, ll) pair for every check_every sweeps (empty when check_every=0).

groups `property` ¶

groups

Group names, in the index order used by :attr:topic_word's second axis.

num_groups `property` ¶

num_groups

num_topics `property` ¶

num_topics

prior `property` ¶

prior

The prior on the κ content deviations ("laplace", "gaussian", or "jeffreys").

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

theta_draws `property` ¶

theta_draws

Thinned MCMC θ snapshots, shape (num_draws, num_docs, num_topics), dtype float32. None when fit with keep_theta_draws=False. These are real cross-sweep draws; use them with :func:topica.effects.composition_theta for uncertainty quantification.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word distributions per group, shape (num_topics, num_groups, num_words).

topic_word_marginal `property` ¶

topic_word_marginal

Group-neutral topic-word matrix, shape (num_topics, num_words): the per-group β_{k,g,·} averaged with equal weight over groups, β_k = (1/G) Σ_g β_{k,g}. This is a deliberate group-neutral summary of each topic's content (the topic with the group covariate marginalized out under a uniform group prior); it is not the empirical marginal Σ_g P(g|z=k) β_{k,g}, which would tilt topics toward the more prevalent groups. Use :attr:topic_word for the full per-group distributions.
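
The difference between the two summaries can be seen in a two-group toy case. Both functions below are hypothetical illustrations for a single topic; the group weights stand in for P(g|z=k) and are made up.

```python
def group_neutral(topic_word_k):
    """Equal-weight average over groups: beta_k = (1/G) * sum_g beta_{k,g}."""
    num_groups = len(topic_word_k)
    num_words = len(topic_word_k[0])
    return [sum(row[v] for row in topic_word_k) / num_groups for v in range(num_words)]


def prevalence_weighted(topic_word_k, group_weights):
    """Empirical marginal: sum_g P(g | z = k) * beta_{k,g}."""
    num_words = len(topic_word_k[0])
    return [
        sum(w * row[v] for w, row in zip(group_weights, topic_word_k))
        for v in range(num_words)
    ]


# One topic, two groups, two words; group 0 holds 90% of the topic's tokens.
beta_k = [[0.8, 0.2], [0.2, 0.8]]
neutral = group_neutral(beta_k)
weighted = prevalence_weighted(beta_k, [0.9, 0.1])
```

The equal-weight average treats both groups' wordings symmetrically, whereas the prevalence-weighted version leans toward the wording of the dominant group.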

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic (group-averaged), shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, groups, *, group_names=None, iters=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10)

Fit the model. data is a :class:Corpus or list[list[str]]; groups is a per-document group label (strings or ints), one per document. group_names fixes the group order (defaults to sorted union).

iters is the number of Gibbs sweeps. After burn-in, num_samples posterior snapshots are collected sample_interval sweeps apart for the retained draws. progress toggles a progress display; progress_interval sets how often the model-fit/log-likelihood trace is recorded (0 = ~50 evenly spaced points); report_interval is a deprecated alias for progress_interval. keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples composition_theta prefers over the Dirichlet approximation; set it False to save memory.

convergence_tol (default 0.0, disabled) enables opt-in early stopping: the run stops once the relative change in the recorded log-likelihood between the last two trace points, |ΔLL| / |LL|, falls below it, setting converged. The monitored quantity is the word-emission log-likelihood under the current topic assignments (Σ n·log β), not a full collapsed model-fit likelihood. It is a corpus constant until the first κ update, so the early-stop test is only applied after κ has been re-estimated (issue #422).

The comparison window is the trace cadence (check_every / progress_interval), so a coarser cadence compares more widely spaced sweeps. This is a pragmatic early-stop heuristic on the log-likelihood trace, not a guarantee the Gibbs chain has mixed. check_every is how often, in sweeps, the log-likelihood is recorded and the convergence_tol test is applied.
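
The early-stop test on the log-likelihood trace can be sketched as below. `relative_change` and `should_stop` are hypothetical helpers; dividing by the most recent log-likelihood is an assumption, and the gating on the first κ update is left out.

```python
def relative_change(trace):
    """|LL_t - LL_{t-1}| / |LL_t| for the last two recorded trace points."""
    (_, previous), (_, current) = trace[-2], trace[-1]
    return abs(current - previous) / abs(current)


def should_stop(trace, convergence_tol):
    """Early-stop test; a tolerance of 0 disables it."""
    if convergence_tol <= 0 or len(trace) < 2:
        return False
    return relative_change(trace) < convergence_tol


# Hypothetical (sweep, log-likelihood) trace recorded every 10 sweeps.
trace = [(10, -52000.0), (20, -50500.0), (30, -50499.0)]
stop_now = should_stop(trace, convergence_tol=1e-4)
```

Because only consecutive trace points are compared, recording the trace less often compares sweeps that are further apart, which makes the same tolerance harder to satisfy early in the run.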

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with SAGE.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None, group=None)

Top n words per topic. topic=None (default) returns a list of lists (one per topic); topic=k returns the list for topic k. With group (name or index) given, uses that group's word distribution; otherwise the group-averaged distribution is used.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer document-topic distributions for new, unseen documents under the fitted model (sklearn-style transform). Holds the fitted group-averaged topic-word distributions fixed and runs collapsed Gibbs to infer θ for each document. Returns shape (num_new_docs, num_topics) with rows summing to 1.

Approximation: held-out inference uses the group-averaged topic-word matrix (the marginal over groups) and does not condition on a group covariate for new documents. This is a baseline projection; the group-specific word distributions are a training-time device and cannot be recovered for documents whose group label is unknown.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

word_contrast `method descriptor` ¶

word_contrast(topic, group_a, group_b, n=10)

Words that most distinguish how topic is worded in group_a vs group_b, by log-ratio of the two groups' word probabilities. Returns (word, log_ratio) — positive favours group_a. n is the number of most contrastive words to return.
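
A log-ratio contrast of this kind can be sketched with two word distributions for the same topic. The function below is a hypothetical stand-in, and ranking by absolute log-ratio is an assumption about what "most distinguish" means; the probabilities are made up.

```python
import math


def word_contrast(beta_a, beta_b, vocabulary, n=10):
    """Rank words by |log(p_a / p_b)|; positive values favour group a."""
    ratios = [
        (word, math.log(p_a / p_b))
        for word, p_a, p_b in zip(vocabulary, beta_a, beta_b)
    ]
    ratios.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return ratios[:n]


vocabulary = ["tax", "relief", "burden", "policy"]
beta_a = [0.4, 0.4, 0.1, 0.1]  # topic wording in group a
beta_b = [0.4, 0.1, 0.4, 0.1]  # same topic in group b
contrast = word_contrast(beta_a, beta_b, vocabulary, n=2)
```

Words used equally by both groups have a log-ratio of zero and fall to the bottom of the ranking. The sketch assumes strictly positive probabilities, which a softmax parameterization provides.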

topica.CTM ¶

Correlated Topic Model (Blei & Lafferty; the STM core). Topics are drawn from a logistic-normal prior with a full covariance, so they can correlate — unlike LDA's Dirichlet. Fit by variational EM (STM's Laplace E-step).

This is the engine STM builds on; prevalence/content covariates layer on top.

The per-document E-step runs in parallel on all cores by default; cap it with fit(num_threads=...) (results are identical regardless). variational= chooses the covariance approximation ("laplace" full, or "diagonal" for a faster mean-field one at high K), and fit(keep_eta_cov=False) trades stored covariance for far less memory at large K.

module `class-attribute` ¶

__module__ = 'topica'

bound `property` ¶

bound

Final variational bound (approximate ELBO) at convergence — the quantity R stm reports as convergence$bound.

bound_history `property` ¶

bound_history

The variational bound after each EM iteration (the convergence trajectory). Its length is the number of iterations actually run.

converged `property` ¶

converged

True if EM stopped on the em_tol criterion; False if it hit the iters cap first (the fit may not have converged).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

eta_cov `property` ¶

eta_cov

Per-document variational posterior covariances ν of η, shape (num_docs, num_topics-1, num_topics-1). Stored as float32 in memory to halve the dominant memory term; cast to float64 with np.asarray(model.eta_cov, dtype=np.float64) when full precision is needed. Raises RuntimeError if the model was fit with keep_eta_cov=False; use :meth:_recompute_eta_cov to regenerate on demand.

eta_mean `property` ¶

eta_mean

Per-document variational posterior means λ of the logistic-normal η, shape (num_docs, num_topics-1). Pairs with :attr:eta_cov to sample θ draws (method-of-composition uncertainty).
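
Since η has one coordinate fewer than there are topics, turning it into θ means appending a zero for the reference topic and applying a softmax. The sketch below shows that mapping for a single document; `eta_to_theta` is a hypothetical helper name.

```python
import math


def eta_to_theta(eta):
    """Softmax over [eta_1, ..., eta_{K-1}, 0]; the last topic is the reference."""
    logits = list(eta) + [0.0]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


theta = eta_to_theta([1.0, -1.0])  # K = 3 topics from K - 1 = 2 coordinates
```

Sampling θ by the method of composition would draw η from a normal with mean eta_mean and covariance eta_cov for a document, then push each draw through this mapping.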

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, bound) pairs, one per EM iteration. The objective is the variational ELBO (same as :attr:bound_history).

initialization `property` ¶

initialization

The initialization route the fit actually took (issue #410): "spectral", "random-fallback" (spectral requested but recovery fell back to a seeded random init), or "random". None before the model is fitted, and after loading a model saved before this was recorded.

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

topic_correlation `property` ¶

topic_correlation

Topic-correlation matrix from the logistic-normal Σ, shape (num_topics, num_topics). Off-diagonal entries are genuine topic correlations (the whole point of CTM vs. LDA).

topic_covariance `property` ¶

topic_covariance

The fitted logistic-normal prior covariance Σ over η, shape (num_topics-1, num_topics-1) (the last topic is the softmax reference, so it is dropped). This is the model's own topic covariance — unlike :attr:topic_correlation, which is an across-document θ correlation.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word matrix β, shape (num_topics, num_words).

variational `property` ¶

variational

Variational-covariance mode: "laplace" (full ν = H⁻¹) or "diagonal" (mean-field ν = diag(1/H_ii)).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=500, convergence_tol=1e-05, inference='batch', batch_size=256, tau=64.0, kappa=0.7, beta_init=None, em_tol=None, keep_eta_cov=True, num_threads=None)

Fit by variational EM. data is a :class:Corpus or list[list[str]]. EM runs until the relative change in the variational bound drops below convergence_tol (R stm's emtol) or iters iterations are reached, whichever comes first. Pass convergence_tol=0 to always run iters steps. Check :attr:converged and :attr:bound afterward.

inference="svi" switches from full-batch variational EM to stochastic variational inference (online VB): documents are processed in minibatches of batch_size, taking a stochastic step on the global parameters with a decaying learning rate (tau + t)^(-kappa), for iters epochs. SVI is for very large corpora; on moderate corpora the default "batch" EM is preferable. SVI uses the base logistic-normal model only.

num_threads caps the worker pool for the parallel per-document E-step; the default None uses all available cores. The fit is bit-for-bit identical regardless of the worker count, so this only trades resource use (set it to 1 for a fully serial run).

keep_eta_cov=False does not store the per-document variational covariance (an O(N*K^2) array), cutting memory sharply at large K; posterior_theta_samples / estimate_effect with draws transparently recompute it on demand when needed.

beta_init is an optional initial topic-word matrix to warm-start from. em_tol is the relative-bound tolerance for EM early stopping: the run stops when the relative change in the variational evidence bound falls below it (the criterion R stm uses).
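
The SVI learning-rate schedule can be written out directly from the formula (tau + t)^(-kappa). In the sketch below, `svi_step_size` uses the default `tau` and `kappa` from the signature, while `blend` shows the generic online-VB update form; whether the library applies the step in exactly this way is an assumption.

```python
def svi_step_size(t, tau=64.0, kappa=0.7):
    """Decaying learning rate rho_t = (tau + t) ** (-kappa)."""
    return (tau + t) ** (-kappa)


def blend(global_param, minibatch_estimate, rho):
    """Stochastic update: move a fraction rho toward the minibatch estimate."""
    return [(1.0 - rho) * g + rho * m for g, m in zip(global_param, minibatch_estimate)]


rates = [svi_step_size(t) for t in range(0, 1000, 250)]
updated = blend([0.0, 1.0], [1.0, 0.0], rho=0.25)
```

A larger `tau` damps the earliest steps, and a larger `kappa` makes the rate decay faster, so later minibatches move the global parameters less.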

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with CTM.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform `method descriptor` ¶

transform(data)

Infer topic proportions θ for new documents by the variational E-step against the fitted globals (β, logistic-normal prior μ, Σ). data is a :class:Corpus or list[list[str]]; tokens outside the training vocabulary are dropped. Returns a (num_docs, num_topics) array.

topica.STM ¶

Structural Topic Model (Roberts, Stewart & Tingley). The correlated-topic core (:class:CTM) with prevalence covariates: a document's prior topic mean is a regression on its covariates, μ_d = X_d γ, so covariates shift which topics a document discusses. After fitting, prevalence_effects holds the learned γ; pair it with topica.stm.estimate_effect for inference.

The per-document E-step runs in parallel on all cores by default; cap it with fit(num_threads=...) (results are identical regardless). variational= chooses the covariance approximation ("laplace" full, or "diagonal" for a faster mean-field one at high K), and fit(keep_eta_cov=False) trades stored covariance for far less memory at large K.

bound `property` ¶

bound

Final variational bound (approximate ELBO) at convergence — the quantity R stm reports as convergence$bound.

bound_history `property` ¶

bound_history

The variational bound after each EM iteration (the convergence trajectory). Its length is the number of iterations actually run.

content_kappa `property` ¶

content_kappa

The SAGE content-model κ decomposition behind the per-group topic-word model, as a dict: m (num_words,), kappa_topic (num_topics, num_words), kappa_cov (num_groups, num_words), and kappa_interaction (num_topics, num_groups, num_words). The per-group log-probabilities are m + kappa_topic + kappa_cov + kappa_interaction (softmax over words). Requires content covariates. These additive parts are what R stm's sageLabels() / labelTopics() rank words by; the per-group β alone does not identify them.
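
A small sketch of how the additive parts combine into one per-group word distribution, using nested lists in place of the arrays in the dict. The toy values and the helper group_topic_word are hypothetical; only the sum-then-softmax rule comes from the description above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_topic_word(m, kappa_topic, kappa_cov, kappa_interaction, topic, group):
    """Word distribution for one (topic, group) cell from the additive parts."""
    logits = [
        m[w] + kappa_topic[topic][w] + kappa_cov[group][w]
        + kappa_interaction[topic][group][w]
        for w in range(len(m))
    ]
    return softmax(logits)


# Toy values: 3 words, 2 topics, 2 groups (all made up for illustration).
m = [0.0, -1.0, -2.0]
kappa_topic = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.5]]
kappa_cov = [[0.0, 0.0, 0.0], [0.0, 0.8, 0.0]]
kappa_interaction = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.3]],
]

beta_t1_g1 = group_topic_word(m, kappa_topic, kappa_cov, kappa_interaction, 1, 1)
```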

converged `property` ¶

converged

True if EM stopped on the em_tol criterion; False if it hit the iters cap first (the fit may not have converged).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

eta_cov `property` ¶

eta_cov

Per-document variational posterior covariances ν of η, shape (num_docs, num_topics-1, num_topics-1). Stored as float32 in memory to halve the dominant memory term; cast to float64 with np.asarray(model.eta_cov, dtype=np.float64) when full precision is needed. Raises RuntimeError if the model was fit with keep_eta_cov=False; use :meth:_recompute_eta_cov to regenerate on demand.
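
To judge whether keep_eta_cov=False is worth it, the size of this array can be estimated from the shape and dtype stated above. The helper name and the corpus size are made up for the example.

```python
def eta_cov_bytes(num_docs, num_topics, bytes_per_value=4):
    """Bytes for a (num_docs, K-1, K-1) covariance stack (float32 by default)."""
    k = num_topics - 1
    return num_docs * k * k * bytes_per_value


# Hypothetical corpus: 200,000 documents, 100 topics.
size_gib = eta_cov_bytes(num_docs=200_000, num_topics=100) / 2**30
```

For this hypothetical corpus the stack comes to roughly 7.3 GiB in float32, and twice that if cast to float64.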

eta_mean `property` ¶

eta_mean

Per-document variational posterior means λ of η, shape (num_docs, num_topics-1). With :attr:eta_cov this is the logistic-normal posterior used to draw θ samples for method-of-composition uncertainty in estimate_effect.
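
A minimal sketch of drawing θ samples from one document's logistic-normal posterior, assuming the last topic is the softmax reference (its coordinate is fixed at 0). The values of lam and nu are invented, and the hand-written Cholesky factor is only there to keep the example free of dependencies; it is not how the library samples.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L @ L.T == matrix (symmetric positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_theta(eta_mean, eta_cov, rng):
    """Draw eta ~ N(eta_mean, eta_cov), then map it to the simplex."""
    n = len(eta_mean)
    lower = cholesky(eta_cov)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eta = [
        eta_mean[i] + sum(lower[i][k] * noise[k] for k in range(i + 1))
        for i in range(n)
    ]
    logits = eta + [0.0]  # the last topic is the softmax reference
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
lam = [0.4, -0.2]                   # one document's eta_mean, K = 3
nu = [[0.30, 0.05], [0.05, 0.20]]   # that document's eta_cov
draws = [sample_theta(lam, nu, rng) for _ in range(1000)]
```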

feature_names `property` ¶

feature_names

Covariate names aligned with the rows of :attr:prevalence_effects ("intercept" first).

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, bound) pairs, one per EM iteration. The objective is the variational ELBO (same as :attr:bound_history).

groups `property` ¶

groups

Content-covariate group names (axis-1 order of :attr:topic_word_by_group).

initialization `property` ¶

initialization

The initialization route the fit actually took (issue #410): "spectral", "random-fallback" (spectral requested but recovery fell back to a seeded random init), or "random". None before the model is fitted, and after loading a model saved before this was recorded.

num_base_groups `property` ¶

num_base_groups

Number of base content groups (the content= levels). 0 if no content.

num_time_periods `property` ¶

num_time_periods

Number of ordered content-time periods (the content_time= levels), or 0 for a plain content model. When > 0, the saturated :attr:groups are the cross base@period with index = base*num_time_periods + period.
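
The flat group index can be computed and inverted as below. The helper names and the base level names are hypothetical, and the exact label strings stored in groups are assumed to follow the base@period pattern mentioned above.

```python
def saturated_group_index(base, period, num_time_periods):
    """Flat index of the base@period cell: base * num_time_periods + period."""
    if not 0 <= period < num_time_periods:
        raise ValueError("period out of range")
    return base * num_time_periods + period


def split_group_index(index, num_time_periods):
    """Inverse mapping: recover (base, period) from the flat index."""
    return divmod(index, num_time_periods)


base_names = ["blog", "news"]   # hypothetical content= levels
num_time_periods = 3
labels = [
    f"{name}@{period}"
    for name in base_names
    for period in range(num_time_periods)
]
```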

num_topics `property` ¶

num_topics

prevalence_effects `property` ¶

prevalence_effects

Prevalence coefficients γ, shape (num_features, num_topics-1) — how each covariate (row 0 is the intercept) shifts each topic's log-prior. The last topic is the softmax reference. For inference, prefer topica.stm.estimate_effect(model.doc_topic, X).
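
To read γ, it can help to push a covariate value through μ_d = X_d γ and the softmax by hand. The sketch below uses an invented 2 x (K-1) coefficient table with K = 3; prior_mean and softmax_with_reference are illustrative helpers, not library functions.

```python
import math


def prior_mean(covariates, gamma):
    """mu_d = X_d @ gamma, with the intercept prepended to the covariate row."""
    row = [1.0] + list(covariates)
    num_cols = len(gamma[0])
    return [
        sum(row[f] * gamma[f][k] for f in range(len(row)))
        for k in range(num_cols)
    ]


def softmax_with_reference(eta):
    """Map K-1 free coordinates to K proportions; the last topic is fixed at 0."""
    logits = list(eta) + [0.0]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy gamma: 2 rows (intercept, one covariate) by K-1 = 2 columns.
gamma = [[0.2, -0.1],
         [1.0, 0.0]]

mu_low = prior_mean([0.0], gamma)
mu_high = prior_mean([1.0], gamma)
theta_low = softmax_with_reference(mu_low)
theta_high = softmax_with_reference(mu_high)
```

Here the covariate has a positive coefficient on the first topic only, so raising it from 0 to 1 increases that topic's share at the prior mean and shrinks the other two.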

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

topic_correlation `property` ¶

topic_correlation

Topic-correlation matrix, shape (num_topics, num_topics).

topic_covariance `property` ¶

topic_covariance

The fitted logistic-normal prior covariance Σ over η, shape (num_topics-1, num_topics-1) (the last topic is the softmax reference, so it is dropped). This is the model's own topic covariance — unlike :attr:topic_correlation, which is an across-document θ correlation.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word matrix β, shape (num_topics, num_words).

topic_word_by_group `property` ¶

topic_word_by_group

Per-group topic-word distributions, shape (num_topics, num_groups, num_words) — only available when fit with content covariates.

variational `property` ¶

variational

Variational-covariance mode: "laplace" (full ν = H⁻¹) or "diagonal" (mean-field ν = diag(1/H_ii)).
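
The difference between the two modes on a toy 2 x 2 Hessian, as a sketch. The matrix is invented, and the closed-form inverse is only valid for the 2 x 2 case used here.

```python
def invert_2x2(h):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def diagonal_covariance(h):
    """Mean-field approximation: keep only 1 / H_ii on the diagonal."""
    n = len(h)
    return [[1.0 / h[i][i] if i == j else 0.0 for j in range(n)] for i in range(n)]


hessian = [[4.0, 1.0], [1.0, 2.0]]   # toy Hessian H
nu_laplace = invert_2x2(hessian)      # full covariance, H^{-1}
nu_diagonal = diagonal_covariance(hessian)
```

The diagonal mode drops the cross-topic terms, and for a positive-definite H its variances 1/H_ii are never larger than the diagonal of H⁻¹, so it tends to understate posterior uncertainty in exchange for speed.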

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.
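
For reference, the UMass score for a single topic can be sketched as follows. It uses the standard definition (log of smoothed co-document frequency over the document frequency of the higher-ranked word); whether the library's smoothing constant and reference corpus match this exactly is an assumption.

```python
import math


def umass_coherence(top_words, documents):
    """UMass score for one topic's ranked top words over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # (co-document frequency + 1) over the higher-ranked word's frequency;
            # every top word is assumed to occur in at least one document.
            score += math.log(
                (doc_freq(top_words[m], top_words[l]) + 1) / doc_freq(top_words[l])
            )
    return score


docs = [
    ["cat", "dog", "pet"],
    ["cat", "pet"],
    ["dog", "pet", "vet"],
    ["stock", "market"],
]
coherent = umass_coherence(["pet", "cat", "dog"], docs)
incoherent = umass_coherence(["pet", "stock", "market"], docs)
```

Scores closer to zero indicate top words that tend to appear in the same documents; the mixed word list scores lower than the consistent one.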

fit `method descriptor` ¶

fit(data, prevalence=None, *, prevalence_names=None, content=None, content_names=None, content_time=None, content_smooth=1.0, content_prior_var=0.5, content_prior='l2', iters=500, convergence_tol=1e-05, gamma_prior='pooled', gamma_enet=1.0, beta_init=None, em_tol=None, covariates=None, keep_eta_cov=True, num_threads=None)

Fit. data is a :class:Corpus or list[list[str]]. prevalence (optional, (num_docs, F) covariates) makes topic prevalence depend on covariates (μ_d = X_d γ); an intercept is prepended. content (optional, one group label per document) makes the topic-word distributions vary by group (the SAGE content model). At least one of prevalence/content should be given (else use :class:CTM).

EM runs until the relative change in the variational bound drops below em_tol (R stm's emtol) or iters iterations are reached, whichever comes first. Pass em_tol=0 to always run iters steps. Inspect :attr:converged and :attr:bound after fitting.
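
The stopping rule can be sketched on a recorded bound trajectory. This is an illustration under assumptions: the absolute relative change is used here, and the library's exact normalisation and sign handling may differ.

```python
def run_until_converged(bounds, tol=1e-5, max_iters=500):
    """Return (iterations_run, converged) for a sequence of per-iteration bounds."""
    previous = None
    for iteration, bound in enumerate(bounds, start=1):
        if previous is not None and tol > 0:
            relative_change = abs(bound - previous) / abs(previous)
            if relative_change < tol:
                return iteration, True
        if iteration >= max_iters:
            return iteration, False
        previous = bound
    return len(bounds), False


# A made-up bound trajectory that flattens out.
trajectory = [-1000.0, -900.0, -890.0, -889.999, -889.9989]
iters_run, converged = run_until_converged(trajectory, tol=1e-5)
```

With a tolerance of 0 the same function never stops early, which mirrors the "always run iters steps" behaviour described above.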

gamma_prior controls the prevalence-coefficient (γ) regression in the M-step. "pooled" (default) is a variational-Bayes ridge that estimates the coefficient and noise precisions from the data (adaptive shrinkage, intercept unpenalised), a faithful port of R stm's gamma.prior="Pooled" path (vb.variational.reg). The adaptive shrinkage keeps μ = Xγ stable across EM iterations on wide designs (e.g. a day spline), so EM converges in far fewer iterations than a fixed ridge would (see issue #247). "l1" fits an elastic-net path by coordinate descent with the penalty selected by AIC; it is recommended when the prevalence design is high-dimensional (many one-hot levels). gamma_enet is the elastic-net mix: 1.0 is pure lasso, values in (0, 1) add a ridge component (R stm's gamma.enet). gamma_enet is ignored when gamma_prior="pooled".

num_threads caps the worker pool for the parallel per-document E-step; the default None uses all available cores, and the fit is bit-for-bit identical regardless of the worker count (set it to 1 for a fully serial run). keep_eta_cov=False does not store the per-document variational covariance (an O(N*K^2) array), cutting memory sharply at large K; posterior_theta_samples / estimate_effect with draws recompute it on demand. The covariance approximation is set on the constructor via variational= ("laplace" default, or "diagonal" for a faster, lower-precision mean-field covariance).

prevalence_names and content_names are human-readable labels for the columns of the prevalence and content design matrices, surfaced in the effect outputs. content_time is an optional ordered (time) content covariate, one period index per document: its group-by-period deviations are tied by a first-order random walk, the temporal generalization of content. content_smooth controls that random-walk penalty strength (1/tau^2); larger values tie adjacent periods more tightly.

content_prior selects the prior on the content (SAGE κ) deviation blocks: "l2" (default) is a Gaussian ridge that keeps every kappa_topic, while "l1" puts a sparse Laplace prior (FISTA, exact zeros) that recovers sparse content contrasts, matching R stm's sparse content model. content_prior_var is the L2 prior variance on those content deviations (default 0.5); larger loosens regularization (more group-driven contrast), smaller tightens it toward the shared baseline. The "l2" path with content_time=None is bit-for-bit identical to the prior release.

convergence_tol is the relative-bound tolerance for EM early stopping: the run stops when the relative change in the variational evidence bound falls below it (the criterion R stm uses). beta_init is an optional initial topic-word matrix to warm-start from.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with STM.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform `method descriptor` ¶

transform(data, *, eta_prior_mean=None)

Infer topic proportions θ for new documents by the variational E-step against the fitted globals (β and the logistic-normal prior). data is a :class:Corpus or list[list[str]]; out-of-vocabulary tokens are dropped. Returns a (num_docs, num_topics) array.

When eta_prior_mean is None (the default), the covariate-free baseline μ learned at fit time is used for every document — the same inference that stm's fitNewDocuments performs when no new covariate design is supplied.

When eta_prior_mean is a (num_docs, num_topics-1) array, each document's prior mean is set to the corresponding row. This is the low-level hook used by :func:topica.stm.transform to apply the prevalence-covariate prior μ_d = X_d γ to held-out documents.

word_contrast `method descriptor` ¶

word_contrast(topic, group_a, group_b, n=10)

Words that most distinguish how topic is worded in group_a vs group_b (log word-probability ratio; positive favours group_a). Requires content covariates. n is the number of most contrastive words to return.
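
The log-ratio ranking can be sketched on two toy word distributions for the same topic. The helper below is a stand-in written for this example; in particular, ranking by absolute log ratio is an assumption about how "most distinguish" is ordered.

```python
import math


def word_contrast(beta_a, beta_b, vocabulary, n=10):
    """Rank words by log(beta_a / beta_b); positive scores favour group a."""
    scored = [
        (word, math.log(pa / pb))
        for word, pa, pb in zip(vocabulary, beta_a, beta_b)
    ]
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:n]


vocabulary = ["tax", "cut", "relief", "burden"]
beta_a = [0.40, 0.30, 0.25, 0.05]   # one topic's word probabilities in group a
beta_b = [0.40, 0.20, 0.05, 0.35]   # the same topic in group b
top = word_contrast(beta_a, beta_b, vocabulary, n=2)
```

In this toy case "burden" is the strongest contrast and favours group b (negative score), while "relief" favours group a.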

topica.STS ¶

Structural Topic and Sentiment-Discourse model (Chen & Mankad 2024, Management Science). STS extends STM with a per-document, per-topic continuous sentiment-discourse latent α^(s) that modulates the topic-word distribution, with both topic prevalence and sentiment-discourse driven by document covariates. Fit by Laplace variational EM (a faithful port of the authors' R sts package).

bound `property` ¶

bound

Final variational bound (approximate ELBO).

bound_history `property` ¶

bound_history

The variational bound after each EM iteration.

converged `property` ¶

converged

True if EM stopped on the em_tol criterion, False if it hit the iters cap.

doc_names `property` ¶

doc_names

Document labels (row order of :attr:doc_topic), default the document indices as strings.

doc_topic `property` ¶

doc_topic

Document-topic prevalence matrix θ, shape (num_docs, num_topics).

eta_cov `property` ¶

eta_cov

Per-document variational posterior covariances ν of η, shape (num_docs, 2*num_topics-1, 2*num_topics-1). Stored as float32 in memory to halve the dominant memory term; cast to float64 with np.asarray(model.eta_cov, dtype=np.float64) when full precision is needed. Raises RuntimeError if the model was fit with keep_eta_cov=False; use :meth:_recompute_eta_cov to regenerate on demand.

eta_mean `property` ¶

eta_mean

Per-document variational posterior means λ of the logistic-normal latent η = [α^(p)_{1..K-1}, α^(s)], shape (num_docs, 2*num_topics-1). Pairs with :attr:eta_cov as the joint prevalence/sentiment posterior for method-of-composition uncertainty.

feature_names `property` ¶

feature_names

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, bound) pairs.

initialization `property` ¶

initialization

The initialization route the fit actually took (issue #410): "spectral", "random-fallback", or "random". None before fit / for old saves.

num_topics `property` ¶

num_topics

prevalence_effects `property` ¶

prevalence_effects

Prevalence regression coefficients Γ^(p), shape (num_features, num_topics-1) — covariate effects on topic prevalence. Requires a prevalence design at fit time.

seed `property` ¶

seed

The random seed the model was constructed with.

sentiment `property` ¶

sentiment

Per-document topic sentiment-discourse α^(s), shape (num_docs, num_topics). Positive values mean the document discussed that topic with wording shifted along the κ^(s) (sentiment-discourse) direction.

sentiment_effects `property` ¶

sentiment_effects

Sentiment-discourse regression coefficients Γ^(s), shape (num_features, num_topics) — covariate effects on topic sentiment-discourse. Requires a prevalence design at fit time.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...].

topic_word `property` ¶

topic_word

Baseline topic-word matrix β at neutral sentiment, shape (num_topics, num_words). Use :meth:topic_word_at for other sentiment levels.

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, sentiment_seed, prevalence=None, *, prevalence_names=None, iters=30, convergence_tol=1e-05, kappa_estimation='ridge', kappa_ridge=0.001, em_tol=None, covariates=None, keep_eta_cov=True)

Fit. data is a :class:Corpus or list[list[str]]. sentiment_seed (required, one value per document) defines the discrete aggregation groups for the κ Poisson M-step and seeds the initial sentiment — typically a document attribute the sentiment should track (e.g. a star rating). prevalence (optional, (num_docs, F) covariates) makes both topic prevalence and sentiment-discourse depend on covariates (α_d ~ N(X_d Γ, Σ)); an intercept is prepended.

EM runs until the relative change in the variational bound drops below convergence_tol or iters iterations are reached.

kappa_estimation chooses the topic-word (κ) estimator: "ridge" (default) is a fast ridge-penalized Poisson fit (kappa_ridge sets the ridge); "lasso" is an L1 Poisson path with AIC-selected penalty, matching the reference R sts exactly (sparser κ) at a higher cost. The two give the same topics on well-conditioned corpora.

prevalence_names are human-readable labels for the prevalence design-matrix columns, surfaced in the effect outputs. em_tol is the relative-bound tolerance for EM early stopping: the run stops when the relative change in the variational evidence bound falls below it. keep_eta_cov (default True) stores the full per-document logistic-normal covariances; set it False to save memory.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with :meth:STS.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) at neutral sentiment, as (word, probability) pairs.

topic_word_at `method descriptor` ¶

topic_word_at(level)

Topic-word matrix β at sentiment level level (the same value applied to every topic), shape (num_topics, num_words). Inspect the wording at positive vs. negative sentiment by passing percentiles of :attr:sentiment.
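
One way to pick the levels is to pool the fitted sentiment values and take a low and a high percentile. The sketch below uses a made-up nested list in place of model.sentiment and a hand-written percentile helper; the two topic_word_at calls appear only in a comment because they need a fitted model.

```python
def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1.0 - fraction) + ordered[upper] * fraction


# Stand-in for model.sentiment, shape (num_docs, num_topics); values are made up.
sentiment = [
    [-1.2, 0.3],
    [0.1, 0.8],
    [0.9, -0.4],
    [1.5, 0.2],
]
pooled = [value for row in sentiment for value in row]
low_level = percentile(pooled, 10)
high_level = percentile(pooled, 90)
# With a fitted model one would then compare, for example,
# model.topic_word_at(low_level) against model.topic_word_at(high_level).
```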

transform `method descriptor` ¶

transform(data)

Infer topic prevalence θ for new documents by the Laplace E-step against the fitted globals (κ, m, Σ) with a zero prior mean (held-out documents carry no covariates). data is a :class:Corpus or list[list[str]]; tokens outside the training vocabulary are dropped. Returns a (num_docs, num_topics) array of prevalence proportions.

topica.ProdLDA ¶

ProdLDA (Srivastava & Sutton 2017), the AVITM autoencoding-variational topic model. ProdLDA is LDA with the word-level mixture replaced by a product of experts: each topic is an unnormalized expert and the word distribution is softmax(beta . theta) rather than softmax(beta) . theta, which yields noticeably more coherent topics. Inference is amortized -- an encoder network maps a document's bag of words to a logistic-normal posterior over theta, trained by minibatch Adam on the ELBO -- so new documents transform with a single forward pass. Batch normalization and high-momentum Adam guard against the component collapse that otherwise afflicts this model. Unlike ETM you bring no embeddings: beta is learned directly.
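
The difference between the mixture and the product of experts is easiest to see on a toy example. The two functions below are illustrative stand-ins with invented values, not the model's decoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_words(beta, theta):
    """LDA-style mixture: normalize each topic first, then average by theta."""
    topic_dists = [softmax(row) for row in beta]
    vocab = len(beta[0])
    return [
        sum(theta[k] * topic_dists[k][w] for k in range(len(beta)))
        for w in range(vocab)
    ]


def product_of_experts_words(beta, theta):
    """ProdLDA-style: mix the unnormalized logits, then normalize once."""
    vocab = len(beta[0])
    logits = [
        sum(theta[k] * beta[k][w] for k in range(len(beta)))
        for w in range(vocab)
    ]
    return softmax(logits)


beta = [[3.0, 0.0, -3.0],    # topic 0 likes word 0, dislikes word 2
        [-3.0, 0.0, 3.0]]    # topic 1 is the mirror image
theta = [0.5, 0.5]

p_mixture = mixture_words(beta, theta)
p_product = product_of_experts_words(beta, theta)
```

In the mixture, each topic contributes its favourite word regardless of the other topic. In the product, an expert's strongly negative logit cancels the other's preference, so mass shifts toward the word that both experts tolerate.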

bound `property` ¶

bound

The ELBO (negative training loss) at the final epoch.

bound_history `property` ¶

bound_history

Per-epoch ELBO trajectory.

contrastive `property` ¶

contrastive

Whether contrastive (InfoNCE) regularization is enabled.

converged `property` ¶

converged

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic proportions theta (num_docs, num_topics); rows sum to 1.

epochs_run `property` ¶

epochs_run

fit_history `property` ¶

fit_history

Uniform convergence trace: (epoch, elbo) pairs, one per training epoch (same as :attr:bound_history but indexed).

num_topics `property` ¶

num_topics

prior `property` ¶

prior

The document-topic prior: "laplace" (default), "dirichlet", or "stick_breaking".

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). prior is reported as its public string; convergence_tol is the effective tolerance in force (the deprecated em_tol alias is folded into it and reported as None).

topic_names `property` ¶

topic_names

topic_word `property` ¶

topic_word

Topic-word matrix (num_topics, vocab); each row is softmax(beta_k).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=None, convergence_tol=None)

Fit on data (a Corpus or list of token lists). iters sets the number of training epochs (default 200). convergence_tol overrides the constructor value for this run (when given).

fit_transform `method descriptor` ¶

fit_transform(data)

Fit on data, then return the document-topic proportions.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path (topica's binary format).

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

transform `method descriptor` ¶

transform(data)

Held-out topic proportions for new documents: one encoder forward pass each (running batchnorm statistics, no sampling), mapped to the simplex with the training prior's map — softmax(mu) for laplace, the normalized Weibull median for dirichlet, stick-breaking for stick_breaking. Tokens outside the vocabulary are dropped. Returns (num_docs, num_topics).

topica.HDP ¶

Hierarchical Dirichlet Process topic model (Teh, Jordan, Beal & Blei 2006): LDA that infers the number of topics rather than fixing it. Fit by the direct-assignment Gibbs sampler (the Chinese Restaurant Franchise). The two concentration parameters alpha (document level) and gamma (corpus level) govern how readily new topics appear; by default both are resampled from the data (a faithful port of blei-lab/hdp), so you typically don't tune them.

alpha `property` ¶

alpha

The fitted document-level concentration α0 (resampled if enabled).

concentration_history `property` ¶

concentration_history

The learned-concentration trace: (iteration, alpha, gamma) triples sampled during fit (only informative when resample_conc=True). Empty if tracing was disabled.

converged `property` ¶

converged

HDP does not implement an early-stop criterion; always False.

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document, shape (num_docs,).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, log_likelihood) pairs (same as :attr:log_likelihood_history).

gamma `property` ¶

gamma

The fitted corpus-level concentration γ (resampled if enabled).

log_likelihood_history `property` ¶

log_likelihood_history

The convergence trace: (iteration, per-token log-likelihood) pairs sampled during fit. Empty if tracing was disabled.

num_topics `property` ¶

num_topics

The inferred number of topics K.

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). eta is a deprecated alias for beta, folded at construction, so it always reports None here.

theta_draws `property` ¶

theta_draws

Thinned θ draws, shape (num_draws, num_docs, num_topics), dtype float32. None when fit with keep_theta_draws=False. Because HDP's K changes during training, these draws are sampled from the final Dirichlet posterior after the Gibbs chain ends.

topic_count_history `property` ¶

topic_count_history

The topic-discovery trajectory: (iteration, num_topics) pairs sampled during fit. Watching K stabilize is the nonparametric model's headline convergence check (it grows and shrinks before settling). Sampled every progress_interval sweeps (0 picks about 50 evenly spaced points; report_interval is the deprecated alias); empty if disabled.
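
A rough way to automate the "has K settled" check on such a trace is sketched below. The helper, its thresholds, and the trace values are all hypothetical; it is a heuristic, not a convergence diagnostic the library provides.

```python
def topics_settled(topic_count_history, tail_fraction=0.25, max_spread=1):
    """True if K varies by at most max_spread over the final part of the trace."""
    if not topic_count_history:
        return False
    tail_len = max(1, int(len(topic_count_history) * tail_fraction))
    tail = [k for _, k in topic_count_history[-tail_len:]]
    return max(tail) - min(tail) <= max_spread


# Made-up (iteration, num_topics) pairs in the shape the property describes.
trace = [(0, 1), (10, 14), (20, 31), (30, 27), (40, 24), (50, 23), (60, 23), (70, 24)]
settled = topics_settled(trace)
```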

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word matrix β, shape (num_topics, num_words).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=150, progress_interval=0, keep_theta_draws=True, num_theta_draws=25, report_interval=None)

Fit by Gibbs sampling for iters sweeps. data is a :class:Corpus or list[list[str]]. The inferred topic count is available as num_topics. progress_interval sets how often the discovery trace is recorded (0 = ~50 evenly spaced points); report_interval is a deprecated alias for it. keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples composition_theta prefers over the Dirichlet approximation; set it False to save memory.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with HDP.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer topic proportions θ for new documents over the discovered topics, by collapsed Gibbs against the fixed topic-word matrix. data is a :class:Corpus or list[list[str]]; OOV tokens are dropped. The document-level prior is symmetric with total mass equal to the learned concentration α. Returns a (num_docs, num_topics) array.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.
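
How a symmetric prior of total mass α and snapshot averaging turn per-sweep topic counts into θ can be sketched as follows. The counts are invented and the helper names are not library functions; the sampler that produces the counts is omitted.

```python
def smoothed_theta(topic_counts, alpha):
    """theta_k = (n_k + alpha / K) / (N + alpha) under a symmetric prior of mass alpha."""
    num_topics = len(topic_counts)
    total = sum(topic_counts)
    return [(n + alpha / num_topics) / (total + alpha) for n in topic_counts]


def average_snapshots(snapshots):
    """Average several theta snapshots element-wise."""
    num_topics = len(snapshots[0])
    return [sum(s[k] for s in snapshots) / len(snapshots) for k in range(num_topics)]


# Token-to-topic counts for one new document at three retained sweeps (made up).
count_snapshots = [[6, 3, 1], [7, 2, 1], [5, 4, 1]]
alpha = 1.5
theta = average_snapshots([smoothed_theta(c, alpha) for c in count_snapshots])
```

Averaging over several spaced snapshots reduces the noise of any single Gibbs state, which is the purpose of num_samples and sample_interval.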

topica.DTM ¶

Dynamic Topic Model (Blei & Lafferty 2006): topics whose word distributions evolve across time slices. Each topic-word chain follows a Gaussian state-space model; inference is variational with Kalman smoothing, a faithful port of Blei's C dtm / gensim's LdaSeqModel. After fitting, query a topic's word distribution at any slice with topic_word(time) and trace a word's trajectory with word_evolution(topic, word).
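
The generative side of the state-space model can be sketched by simulating one topic's chain: the natural parameters take a Gaussian random-walk step between slices, and each slice's word distribution is their softmax. The step size sigma, the starting values, and the helper names are assumptions made for the example; fitting (variational Kalman smoothing) is not shown.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_topic_chain(initial_logits, num_times, sigma, rng):
    """Random-walk natural parameters; each slice's word distribution is their softmax."""
    logits = list(initial_logits)
    chain = [softmax(logits)]
    for _ in range(num_times - 1):
        logits = [x + rng.gauss(0.0, sigma) for x in logits]
        chain.append(softmax(logits))
    return chain


rng = random.Random(0)
chain = simulate_topic_chain([1.0, 0.0, -1.0], num_times=5, sigma=0.2, rng=rng)
# Trajectory of word 0 across slices, analogous to what word_evolution reports.
word0_trajectory = [dist[0] for dist in chain]
```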

bound `property` ¶

bound

The final variational bound (ELBO) reached during fitting.

converged `property` ¶

converged

DTM does not implement an early-stop criterion; always False.

fit_history `property` ¶

fit_history

DTM has no per-iteration ELBO trace yet; always returns [].

initialization `property` ¶

initialization

The initialization route the fit actually took (issue #410): "spectral", "random-fallback" (spectral fell back to the seeded static-LDA init), or "random". None before fit / for old saves.

num_times `property` ¶

num_times

The number of time slices (available after fit).

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

vocabulary `property` ¶

vocabulary

fit `method descriptor` ¶

fit(data, times, *, iters=20)

Fit by variational EM. data is a :class:Corpus or list[list[str]]; times gives each document's integer time-slice index (0-based, contiguous). The number of slices is inferred as max(times) + 1. iters is the number of variational-EM iterations.
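As an illustration of the times argument, the sketch below maps arbitrary labels (years here) to 0-based slice indices and checks them before fitting. It reads "contiguous" as "every slice between 0 and max(times) holds at least one document", which is an assumption; count_time_slices is a hypothetical helper, not part of topica.

```python
def count_time_slices(times):
    """Validate 0-based, contiguous slice indices and return the slice count."""
    if not times:
        raise ValueError("times must not be empty")
    if any((not isinstance(t, int)) or t < 0 for t in times):
        raise ValueError("time-slice indices must be non-negative integers")
    num_times = max(times) + 1
    # Assumed meaning of "contiguous": no slice index in the range is unused.
    missing = sorted(set(range(num_times)) - set(times))
    if missing:
        raise ValueError(f"no documents in time slices {missing}")
    return num_times


# Map per-document years to 0-based slice indices in chronological order.
years = [2019, 2021, 2020, 2021]
year_to_slice = {year: i for i, year in enumerate(sorted(set(years)))}
times = [year_to_slice[year] for year in years]
num_times = count_time_slices(times)
```

Building the index from the sorted distinct labels guarantees that no slice is left empty, which is the property the check above enforces.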

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with DTM.load.

top_words `method descriptor` ¶

top_words(topic, time, n=10)

Top n words for a topic at one time slice as (word, probability).

topic_word `method descriptor` ¶

topic_word(time)

Topic-word matrix at time slice time, shape (num_topics, num_words); rows sum to 1.

word_drift `method descriptor` ¶

word_drift(topic, *, n=10, from_time=0, to_time=None)

Which words inside topic drift most between two time slices.

For each word, the change in its probability within the topic from from_time to to_time (defaults: the first and last slices) is computed. Returns a dict with two keys, "rising" and "falling", each a list of (word, delta) pairs (largest gain first; largest drop first). This is how you see what makes a topic's vocabulary evolve, not just that it does. n is the number of top drifting words to return per direction.
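The computation described above can be sketched on plain dictionaries. This is a minimal illustration, not the library's code: drift_between is a hypothetical name, and treating only strictly positive changes as "rising" and strictly negative ones as "falling" is an assumption.

```python
def drift_between(probs_from, probs_to, n=10):
    """Rank words by probability change between two slices of one topic.

    probs_from / probs_to map each word to its probability within the topic
    at the earlier and later slice respectively.
    """
    deltas = {word: probs_to[word] - probs_from[word] for word in probs_from}
    # Largest gain first.
    rising = sorted(((w, d) for w, d in deltas.items() if d > 0),
                    key=lambda pair: pair[1], reverse=True)[:n]
    # Largest drop (most negative delta) first.
    falling = sorted(((w, d) for w, d in deltas.items() if d < 0),
                     key=lambda pair: pair[1])[:n]
    return {"rising": rising, "falling": falling}


early = {"web": 0.5, "mobile": 0.1, "fax": 0.4}
late = {"web": 0.5, "mobile": 0.45, "fax": 0.05}
drift = drift_between(early, late, n=2)
```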

word_evolution `method descriptor` ¶

word_evolution(topic, word)

Trajectory of a word's probability in a topic across slices, shape (num_times,). word is a vocabulary string or its integer id.

topica.SupervisedLDA ¶

Supervised LDA (Blei & McAuliffe 2007): LDA in which each document carries a real-valued response y_d ~ N(ηᵀ z̄_d, σ²) regressed on its topic usage. Fitting is supervised by the response, so topics are shaped to be predictive and the coefficients η report how each topic moves y. Fit by variational EM; predict returns ŷ for new documents.

alpha `property` ¶

alpha

The symmetric document-topic Dirichlet prior α, shape (num_topics,). Marks SupervisedLDA as a Dirichlet model for :func:topica.effects.composition_theta.

coefficient_se `property` ¶

coefficient_se

Standard error of each regression coefficient η, shape (num_topics,), from the OLS-style covariance σ²M⁻¹ where M = Σ_d E[z̄ z̄ᵀ] is the normal-equations matrix the fit solves for η. This is a conditional approximation: it treats the fitted topics, β, and the variational moments E[z̄ z̄ᵀ] as fixed and known, so it does not propagate uncertainty in the learned topics or β. Read |η| > ~2·SE as an informal ordering/importance cue under those assumptions, not a calibrated significance test. Aligned to coefficients. None for models saved before this was added.
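To make the formula concrete, here is a minimal two-topic sketch of SE_k = sqrt(σ² · (M⁻¹)_kk). It is an illustration under stated assumptions, not the library's implementation: E[z̄ z̄ᵀ] is approximated by the outer product z̄ z̄ᵀ of each document's mean topic frequencies, and coefficient_se_2topics is a hypothetical name.

```python
import math


def coefficient_se_2topics(zbars, sigma2):
    """Standard errors sqrt(sigma2 * diag(M^-1)) for a two-topic model.

    zbars: one (z1, z2) pair of mean topic frequencies per document.
    Assumption: E[z z^T] is approximated by the outer product z z^T.
    """
    m11 = sum(z1 * z1 for z1, _ in zbars)
    m12 = sum(z1 * z2 for z1, z2 in zbars)
    m22 = sum(z2 * z2 for _, z2 in zbars)
    det = m11 * m22 - m12 * m12
    if det <= 0:
        raise ValueError("M is singular; standard errors are undefined")
    # Diagonal of the 2x2 inverse is [m22, m11] / det.
    return [math.sqrt(sigma2 * m22 / det), math.sqrt(sigma2 * m11 / det)]


zbars = [(0.8, 0.2), (0.3, 0.7), (0.5, 0.5)]
eta = [1.5, -0.4]  # illustrative coefficients, not fitted values
se = coefficient_se_2topics(zbars, sigma2=0.25)
# Informal importance cue from the text: |eta| greater than about 2 * SE.
notable = [abs(e) > 2 * s for e, s in zip(eta, se)]
```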

coefficients `property` ¶

coefficients

Regression coefficients η, shape (num_topics,) — how each topic moves the response (in the response's units, per unit of topic frequency).

converged `property` ¶

converged

True if the relative-change convergence criterion was satisfied before all EM iterations completed. Always False when convergence_tol=0.

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document, shape (num_docs,).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

fit_history `property` ¶

fit_history

Per-EM-iteration response log-likelihood trace. Returns one (iter, ll) pair per check_every EM iterations (empty when check_every=0).

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

sigma2 `property` ¶

sigma2

The fitted response variance σ².

theta_draws `property` ¶

theta_draws

Variational θ draws, shape (num_draws, num_docs, num_topics), dtype float32. None when fit with keep_theta_draws=False. These are independent samples from each document's fitted variational Dirichlet(γ_d) (the mean-field posterior approximation), taken after fitting — not thinned MCMC or cross-sweep snapshots.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

Topic-word matrix β, shape (num_topics, num_words).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.
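For orientation, the sketch below computes UMass coherence for a single topic in the standard form of Mimno et al. (2011): the sum over ordered top-word pairs of log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents containing the given words. Whether topica uses exactly this smoothing, ordering, or reference corpus is an assumption; umass_coherence is a hypothetical standalone function.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic (Mimno et al. 2011 form, +1 smoothing).

    top_words: the topic's top words, most probable first.
    documents: tokenised reference documents (list of token lists).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for tokens in doc_sets if all(w in tokens for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word absent from the reference documents; skip pair
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
coherent = umass_coherence(["a", "b"], docs)
incoherent = umass_coherence(["a", "c"], docs)
```

Scores are usually zero or negative; values closer to zero indicate top words that co-occur more often.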

fit `method descriptor` ¶

fit(data, y, *, iters=25, var_iters=15, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=1)

Fit by variational EM. data is a :class:Corpus or list[list[str]]; y is the per-document real-valued response (length = number of docs).

iters is the number of variational-EM iterations; var_iters is the number of variational E-step iterations per document.

keep_theta_draws (default True) retains num_theta_draws θ samples in theta_draws. For SupervisedLDA these are independent draws from each document's fitted variational Dirichlet(γ_d), the mean-field posterior approximation, not MCMC or cross-sweep snapshots; composition_theta can use them in place of the plug-in Dirichlet mean. Set it to False to save memory.

convergence_tol (default 0.0, disabled) enables opt-in early stopping: the run stops once the relative change in the recorded variational objective between the last two trace points, |ΔL| / |L|, falls below it, and converged is set to True. The monitored quantity is the variational-EM log-likelihood bound; the comparison window is the trace cadence (check_every), so a coarser cadence compares more widely spaced iterations. This is a pragmatic early-stop heuristic on the bound trace, not a convergence guarantee.

check_every is how often, in EM iterations, the bound is recorded and the convergence_tol test is applied.
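The early-stop rule can be sketched on a recorded trace. This is a minimal illustration, not the library's code: it assumes the denominator |L| is the most recent trace value, and relative_change and first_stop are hypothetical names.

```python
def relative_change(previous, current):
    """|delta L| / |L| between two consecutive trace points."""
    if current == 0:
        return float("inf")
    return abs(current - previous) / abs(current)


def first_stop(trace, convergence_tol):
    """Return the iteration at which the early-stop test first passes, or None.

    trace: (iteration, objective) pairs, one per check_every iterations.
    """
    if convergence_tol <= 0:
        return None  # 0.0 disables early stopping
    for (_, prev), (iteration, curr) in zip(trace, trace[1:]):
        if relative_change(prev, curr) < convergence_tol:
            return iteration
    return None


trace = [(1, -1000.0), (2, -900.0), (3, -899.5), (4, -899.4)]
stopped_at = first_stop(trace, convergence_tol=1e-3)
```

Because the test only compares adjacent trace points, a larger check_every widens the gap between the compared iterations and makes the same tolerance harder to meet.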

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

predict `method descriptor` ¶

predict(data, *, var_iters=20, return_std=False)

Predict the response ŷ for new documents (list[list[str]] or a :class:Corpus). Out-of-vocabulary words are ignored.

With return_std=False (default) returns a 1-D array of predictions. With return_std=True returns (mean, std), where std propagates the new document's variational topic uncertainty through the regression, ηᵀ Cov(z̄) η, plus the residual variance σ². This is a conditional predictive spread — it holds the fitted β, η, and σ² fixed and uses the mean-field Cov(z̄), so it is not a full Bayesian posterior-predictive interval (it does not propagate uncertainty in the learned topics or coefficients). mean ± 1.96·std is a Gaussian approximation under those conditions. var_iters is the number of variational E-step iterations per new document.
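The predictive spread described above can be written out directly: std² = ηᵀ Cov(z̄) η + σ². The sketch below is illustrative only, with made-up inputs; predictive_interval is a hypothetical helper and the numbers are not outputs of any fitted model.

```python
import math


def predictive_interval(eta, zbar_mean, zbar_cov, sigma2, z=1.96):
    """Return (mean, std, lower, upper) for one new document.

    eta: regression coefficients; zbar_mean / zbar_cov: mean and covariance of
    the document's topic frequencies; sigma2: residual response variance.
    """
    mean = sum(e * m for e, m in zip(eta, zbar_mean))
    # Quadratic form eta^T Cov(zbar) eta: topic uncertainty pushed through eta.
    k = len(eta)
    topic_var = sum(eta[i] * zbar_cov[i][j] * eta[j]
                    for i in range(k) for j in range(k))
    std = math.sqrt(topic_var + sigma2)
    return mean, std, mean - z * std, mean + z * std


mean, std, lower, upper = predictive_interval(
    eta=[2.0, -1.0],
    zbar_mean=[0.5, 0.5],
    zbar_cov=[[0.01, -0.01], [-0.01, 0.01]],
    sigma2=0.16,
)
```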

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with SupervisedLDA.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer topic proportions θ for new documents by collapsed Gibbs against the fitted topic-word matrix (the response is not used; this is the unsupervised E-step). data is a :class:Corpus or list[list[str]]; OOV tokens are dropped. Returns a (num_docs, num_topics) array. To predict the response for new documents from these proportions, take transform(data) @ coefficients (η); predict gives the variational counterpart.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.PT ¶

Pseudo-document Topic Model (Zuo et al. 2016) for short texts. Documents are aggregated into num_pseudo pseudo-documents that carry the topic distributions, so the topic structure is estimated from richer aggregated statistics than individual short documents would provide. Collapsed Gibbs.

alpha `property` ¶

alpha

The symmetric document-topic Dirichlet prior α, shape (num_topics,). Marks PT as a Dirichlet model for :func:topica.effects.composition_theta.

converged `property` ¶

converged

True if the relative-change convergence criterion was satisfied before all iterations completed. Always False when convergence_tol=0.

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document, shape (num_docs,).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

fit_history `property` ¶

fit_history

Per-iteration log-likelihood trace. Returns one (iter, ll) pair for every check_every sweeps (empty when check_every=0).

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

theta_draws `property` ¶

theta_draws

Thinned MCMC θ snapshots, shape (num_draws, num_docs, num_topics), dtype float32. None when fit with keep_theta_draws=False.

topic_names `property` ¶

topic_names

One label per topic, in topic order. Defaults to ["topic_0", ...] after fit; assign a list of the same length to override.

topic_word `property` ¶

topic_word

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=1000, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10)

Fit by collapsed Gibbs sampling for iters sweeps.

keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples that composition_theta prefers over the Dirichlet approximation; set it to False to save memory.

convergence_tol (default 0.0, disabled) enables opt-in early stopping: the run stops once the relative change in the recorded log-likelihood between the last two trace points, |ΔLL| / |LL|, falls below it, and converged is set to True. The monitored quantity is the collapsed model-fit log-likelihood; the comparison window is the trace cadence (check_every / progress_interval), so a coarser cadence compares more widely spaced sweeps. This is a pragmatic early-stop heuristic on the log-likelihood trace, not a guarantee that the Gibbs chain has mixed.

check_every is how often, in sweeps, the log-likelihood is recorded and the convergence_tol test is applied.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with PT.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer document-topic distributions for new, unseen documents under the fitted model (sklearn-style transform). Holds the fitted topic-word distributions fixed and runs collapsed Gibbs to infer θ for each document. Returns shape (num_new_docs, num_topics) with rows summing to 1.

Approximation: the pseudo-document layer is a training-time aggregation device. Held-out documents infer θ over the K topics directly under the fitted topic-word matrix, without pseudo-document assignment.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.GSDMM ¶

GSDMM — the "Movie Group Process" (Yin & Wang 2014). A mixture model for short texts (tweets, survey answers, headlines) where each document belongs to exactly one topic, not a mixture. You set an upper bound K on the number of clusters; empty clusters die out during sampling, so the effective num_topics is inferred from the data (≤ K). Handles the sparsity of short documents far better than LDA.

cluster_count_history `property` ¶

cluster_count_history

The cluster-discovery trajectory: (iteration, num_clusters) pairs over the fit. The Movie Group Process starts from the K clusters you set (the num_topics cap reported in settings) and empties most of them; watching the count collapse to a stable value is its headline convergence check. Sampled every progress_interval sweeps (0 = auto, ≈ 50 points); empty if disabled.

converged `property` ¶

converged

GSDMM does not implement an early-stop criterion; always False.

doc_cluster `property` ¶

doc_cluster

Hard cluster assignment of each document, shape (num_docs,); values in 0..num_topics - 1. GSDMM gives each document a single cluster.

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, log_likelihood) pairs (same as :attr:log_likelihood_history).

log_likelihood_history `property` ¶

log_likelihood_history

The convergence trace: (iteration, per-token log-likelihood) pairs (each document scored under its assigned cluster). Empty if disabled.

num_topics `property` ¶

num_topics

The number of non-empty clusters discovered (≤ the K you set).

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). num_topics is the max-cluster cap.

topic_names `property` ¶

topic_names

topic_word `property` ¶

topic_word

Topic-word matrix β, shape (num_topics, num_words) (used clusters only).

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=30, progress_interval=0, report_interval=None)

Fit by the Movie Group Process (collapsed Gibbs) for iters sweeps. progress_interval controls the cluster-discovery trace (cluster_count_history / log_likelihood_history): 0 = auto (~50 points), a positive value records every that-many sweeps. report_interval is a deprecated alias for progress_interval.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with GSDMM.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

topica.SeededLDA ¶

Seeded LDA (guided topic modeling): you supply a few seed words per topic and the model is steered so those topics form around them, while the rest of each topic's vocabulary (and any residual unseeded topics) is still learned. Useful when theory tells you which themes to expect (Jagarlamudi et al. 2012; the seeding follows koheiw/seededlda — seed words get a weight × 100 prior pseudocount in their topic).
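The seeding idea can be pictured as a topic-word prior matrix in which each seed word receives an extra weight × 100 pseudocount in its own topic. The sketch below is a hedged illustration, not topica's internals: adding the pseudocount on top of a base β, the default values of beta and weight, and the name seeded_word_prior are all assumptions.

```python
def seeded_word_prior(vocabulary, seed_words, residual=0, beta=0.1, weight=0.01):
    """Build a (num_topics x num_words) prior with boosted seed-word entries.

    seed_words: dict mapping topic name -> list of seed words.
    Returns (topic_names, prior) with seeded topics first, then residual ones.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    topic_names = list(seed_words) + [f"residual_{i + 1}" for i in range(residual)]
    prior = [[beta] * len(vocabulary) for _ in topic_names]
    for row, name in enumerate(seed_words):
        for word in seed_words[name]:
            if word in index:  # seed words missing from the vocabulary are skipped
                prior[row][index[word]] += weight * 100
    return topic_names, prior


vocabulary = ["tax", "vote", "goal"]
seeds = {"economy": ["tax"], "politics": ["vote", "missing"]}
names, prior = seeded_word_prior(vocabulary, seeds, residual=1)
```

Residual topics keep the flat base prior, so nothing steers them and they remain free to absorb themes the seeds do not cover.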

alpha `property` ¶

alpha

The symmetric document-topic Dirichlet prior α, broadcast to (num_topics,). Marks SeededLDA as a Dirichlet model for :func:topica.effects.composition_theta.

converged `property` ¶

converged

True if the convergence criterion was met (convergence_tol > 0); False if the full iters ran.

doc_lengths `property` ¶

doc_lengths

Per-document token counts (length D), in doc_topic row order, so composition_theta can recover N_d without re-threading the Corpus.

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

fit_history `property` ¶

fit_history

Per-iteration log-likelihood trace. Each entry is (iteration, log_likelihood) recorded every check_every sweeps during :meth:fit. Non-empty after fitting.

num_topics `property` ¶

num_topics

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). The seed_words guidance is data, not a hyperparameter, so it is not reported here.

theta_draws `property` ¶

theta_draws

Thinned MCMC θ draws, shape (num_draws, num_docs, num_topics), or None when fit with keep_theta_draws=False. Real cross-sweep posterior samples that :func:topica.composition_theta prefers over the within-document Dirichlet approximation.

topic_names `property` ¶

topic_names

The topic labels: the seed names you gave, then residual_1 … for any unseeded topics. Settable after fit; length must equal num_topics.

topic_word `property` ¶

topic_word

vocabulary `property` ¶

vocabulary

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=2000, doc_topic_prior=None, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10)

Fit by collapsed Gibbs for iters sweeps. Seeded topics come first (in the order given), then the residual topics.

doc_topic_prior (optional, (num_docs, num_topics)) supplies a per-document asymmetric Dirichlet prior α_{d,k} that replaces the symmetric alpha, biasing each document's topic mixture toward chosen topics (e.g. from a document embedding). It is a prior, so the sampler can still move a document away from it.
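One way such a matrix might be assembled from per-document topic scores (for example, similarities between a document embedding and topic anchors) is sketched below. The recipe, the parameter names base_alpha and strength, and the function build_doc_topic_prior are all hypothetical; only the (num_docs, num_topics) shape comes from the description above.

```python
def build_doc_topic_prior(scores, base_alpha=0.1, strength=1.0):
    """Turn non-negative per-document topic scores into a prior matrix.

    scores: one row per document, one non-negative score per topic.
    Each row becomes base_alpha + strength * (normalised scores), so every
    entry stays strictly positive, as a Dirichlet parameter must be.
    """
    prior = []
    for row in scores:
        total = sum(row)
        if total > 0:
            prior.append([base_alpha + strength * s / total for s in row])
        else:
            prior.append([base_alpha] * len(row))  # no signal: symmetric row
    return prior


doc_topic_prior = build_doc_topic_prior([[3.0, 1.0], [0.0, 0.0]])
```

Keeping base_alpha in every entry means a document is nudged toward its high-scoring topics without ruling out the others.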

convergence_tol (default 0.0, disabled) enables early stopping: after every check_every sweeps the relative change in the log-likelihood is computed; if it falls below convergence_tol the loop stops and :attr:converged is set to True. When it is 0 (the default), all iters sweeps run.

keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples that composition_theta prefers over the Dirichlet approximation; set it to False to save memory.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with SeededLDA.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer document-topic distributions for new, unseen documents under the fitted model (sklearn-style transform). Holds the fitted topic-word distributions fixed and runs collapsed Gibbs to infer θ for each document. Returns shape (num_new_docs, num_topics) with rows summing to 1.

Approximation: the seed-word boost is baked into the fitted topic-word matrix. New documents infer θ under those distributions without re-estimating the seed prior.

The collapsed-Gibbs controls are per-document: iters sweeps each new document, discarding the first burn_in, then averaging num_samples θ snapshots taken sample_interval sweeps apart; seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.KeyATM ¶

Keyword-Assisted Topic Model (keyATM Base). Like LDA, but some topics carry a researcher-supplied keyword list; a token in a keyword topic comes either from a distribution over only that topic's keywords or from the topic's full distribution. This anchors keyword topics to their keywords while still learning the rest of the vocabulary. Faithful to keyATM/keyATM.
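The two-source description above implies a simple mixture for a word's probability within a keyword topic: with switch probability π_k the token comes from the keyword-only distribution, otherwise from the topic's full distribution. The sketch below marginalises over that switch; it illustrates the generative idea only, and keyword_topic_word_prob and the example distributions are hypothetical.

```python
def keyword_topic_word_prob(word, pi_k, keyword_dist, full_dist):
    """Marginal probability of `word` in a keyword topic.

    pi_k: probability that a token is drawn from the keyword-only distribution.
    keyword_dist: word -> probability, non-zero only for the topic's keywords.
    full_dist: word -> probability over the whole vocabulary.
    """
    from_keywords = pi_k * keyword_dist.get(word, 0.0)
    from_full = (1.0 - pi_k) * full_dist.get(word, 0.0)
    return from_keywords + from_full


keyword_dist = {"tax": 1.0}
full_dist = {"tax": 0.1, "budget": 0.9}
p_tax = keyword_topic_word_prob("tax", 0.2, keyword_dist, full_dist)
p_budget = keyword_topic_word_prob("budget", 0.2, keyword_dist, full_dist)
```

A keyword gains probability from both sources, which is how keyword topics stay anchored to their keywords while the full distribution still covers the rest of the vocabulary.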

alpha `property` ¶

alpha

The document-topic Dirichlet prior α, shape (num_topics,). For the base model this is the estimated asymmetric prior (R keyATM's alpha); the covariate and dynamic models use a per-document prior, so this falls back to the symmetric base value. Marks keyATM as a Dirichlet model for :func:topica.effects.composition_theta.

alpha_history `property` ¶

alpha_history

Trace of the estimated document-topic prior α as (iteration, alpha) pairs, where alpha is the length-K asymmetric prior at that sweep — keyATM's plot_alpha / values_iter$alpha_iter. Base model only; empty for the covariate model (which traces λ) and dynamic model.

converged `property` ¶

converged

True if the Gibbs run early-stopped because the relative change in the recorded model_fit log-likelihood fell below convergence_tol; False when the full iters sweeps ran (the default, and always for the CVB0 backend, which keeps no trace).

doc_lengths `property` ¶

doc_lengths

Per-document token counts (length D), in doc_topic row order, so composition_theta can recover N_d without re-threading the Corpus.

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

feature_effect_se `property` ¶

feature_effect_se

Covariate model: standard errors of feature_effects (λ), same shape (num_topics, F+1) and column order, on the original covariate scale. From the observed information of the penalized Dirichlet-multinomial in the standardized fit space, mapped back by the standardization Jacobian (issue #316). A coefficient is notable when |feature_effects| / feature_effect_se exceeds ~2. Entries are NaN where the standardized λ hit the ±5 bound (the constrained estimate has no valid asymptotic SE). None when λ was never optimized to a stationary point (#418). Raises if the model was fit without covariates.

feature_effects `property` ¶

feature_effects

Covariate model: learned DMR coefficients λ, shape (num_topics, F+1); column 0 is the intercept. Raises if the model was fit without covariates.

feature_names `property` ¶

feature_names

Covariate model: names aligned with feature_effects columns ("intercept" first). Empty for the base model.

fit_history `property` ¶

fit_history

Uniform convergence trace: (iteration, log_likelihood) pairs (the first two columns of :attr:log_likelihood_history; perplexity column dropped for cross-model uniformity).

keyword_rate `property` ¶

keyword_rate

Per-topic keyword switch rate π_k (the share of a keyword topic's mass drawn from its keyword distribution); 0 for regular topics.

log_likelihood_history `property` ¶

log_likelihood_history

Convergence trace as a list of (iteration, log_likelihood, perplexity) triples — the three columns of keyATM's model_fit (plot_modelfit). log_likelihood is the collapsed marginal log-likelihood and perplexity is exp(-log_likelihood / total_weighted_tokens), both on R keyATM's scale. Sampled every report_interval sweeps during :meth:fit (auto ≈ 50 points). Empty if tracing was disabled.
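The relationship between the three columns, and between this trace and fit_history, can be sketched as follows. The numbers are invented for illustration; perplexity and to_fit_history are hypothetical standalone helpers that simply apply the formula and the column drop described in the text.

```python
import math


def perplexity(log_likelihood, total_weighted_tokens):
    """exp(-log_likelihood / total_weighted_tokens), as defined above."""
    return math.exp(-log_likelihood / total_weighted_tokens)


def to_fit_history(log_likelihood_history):
    """Drop the perplexity column, keeping (iteration, log_likelihood) pairs."""
    return [(iteration, ll) for iteration, ll, _ in log_likelihood_history]


total_tokens = 200
trace = [
    (iteration, ll, perplexity(ll, total_tokens))
    for iteration, ll in [(10, -300.0), (20, -200.0 * math.log(4))]
]
pairs = to_fit_history(trace)
```

Lower perplexity corresponds to higher log-likelihood, so the two columns carry the same information on different scales.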

num_topics `property` ¶

num_topics

pi_history `property` ¶

pi_history

Trace of the per-topic keyword switch rate π as (iteration, pi) pairs (pi length K, 0 for regular topics) — keyATM's plot_pi / values_iter$pi_iter. Empty for a keyword-free model.

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). The keywords guidance is data, not a hyperparameter, so it is not reported here; num_topics and alpha are the effective values resolved at construction.

theta_draws `property` ¶

theta_draws

Thinned MCMC θ draws, shape (num_draws, num_docs, num_topics), or None when fit with keep_theta_draws=False. Real cross-sweep posterior samples that :func:topica.composition_theta prefers over the within-document Dirichlet approximation.

time_labels `property` ¶

time_labels

Dynamic model: the distinct, sorted timestamp labels, one per time segment (length T). Empty for non-dynamic models.

time_prevalence `property` ¶

time_prevalence

Dynamic model: smoothed topic prevalence per time segment, shape (T, num_topics), rows sum to 1, aligned with time_labels. Raises if the model was fit without timestamps.

time_state `property` ¶

time_state

Dynamic model: the latent HMM state (regime) of each time segment, length T, aligned with time_labels. Empty for non-dynamic models.

topic_names `property` ¶

topic_names

The keyword topic labels (then any regular topic labels). Settable after fit; length must equal num_topics.

topic_word `property` ¶

topic_word

transition_matrix `property` ¶

transition_matrix

Dynamic model: the left-to-right state transition matrix, shape (num_states, num_states). Raises if fit without timestamps.

vocabulary `property` ¶

vocabulary

new `builtin` ¶

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

repr `method descriptor` ¶

__repr__()

Return repr(self).

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.
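For orientation, here is a minimal sketch of the standard UMass score for a single topic, computed from document co-occurrence counts. It is not the library's implementation: the function name umass_coherence, the toy documents, and the +1 smoothing constant are assumptions, and the exact variant used by coherence() is not stated here.

```python
import math


def umass_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs l < m.

    top_words must be ordered from most to least probable, and every
    word must occur in at least one document so that D(w_l) > 0.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


# Toy corpus, invented for illustration.
documents = [["cat", "dog"], ["cat", "dog", "fish"], ["fish", "boat"], ["cat"]]
coherent = umass_coherence(["cat", "dog"], documents)
incoherent = umass_coherence(["cat", "boat"], documents)
```

Words that co-occur often score closer to zero; words that never share a document pull the score down.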

fit `method descriptor` ¶

fit(data, *, iters=1500, covariates=None, feature_names=None, times=None, timestamps=None, num_states=5, weights='information-theory', num_threads=None, optimize_interval=50, burn_in=200, prior_variance=1.0, lbfgs_iters=20, progress_interval=0, prior_offset=None, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, report_interval=None, turbo_alpha_stride=1)

Fit by collapsed Gibbs for iters sweeps. Keyword topics come first (in the order given), then any regular topics.

Pass covariates (a (num_docs, F) array or list of float lists) for the covariate keyATM: the document-topic prior becomes a Dirichlet-multinomial regression, α_{d,k} = exp(x_d · λ_k) (an intercept is prepended). feature_names (length F) labels the columns; the learned λ is exposed as feature_effects (on the original covariate scale). With no covariates, this is the base symmetric-α keyATM. Following R keyATM, the covariates are standardized internally and λ is bounded (±5 in standardized space) under the N(0,1) prior, which keeps a high-dimensional design (e.g. many one-hot levels) from driving α to a degenerate fit on one topic (issue #270).
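The prior α_{d,k} = exp(x_d · λ_k) with a prepended intercept can be illustrated as follows. This sketch works directly on the original covariate scale and ignores the internal standardization and the ±5 bound; the function name dmr_alpha and the numbers are hypothetical.

```python
import math


def dmr_alpha(covariates, lam):
    """Per-document, per-topic Dirichlet parameters.

    covariates: (num_docs, F) rows of floats.
    lam: (num_topics, F + 1) coefficients, column 0 being the intercept.
    """
    alphas = []
    for x in covariates:
        x_ext = [1.0] + list(x)  # prepend the intercept term
        row = []
        for lam_k in lam:
            dot = sum(xi * li for xi, li in zip(x_ext, lam_k))
            row.append(math.exp(dot))
        alphas.append(row)
    return alphas


# Two documents, one covariate, two topics (made-up numbers).
covariates = [[0.0], [1.0]]
lam = [[0.0, 0.0],
       [math.log(2.0), math.log(3.0)]]
alpha = dmr_alpha(covariates, lam)
```

With these values, topic 0 has α = 1 for both documents, while topic 1 has α = 2 for the first document and α = 6 for the second, because its covariate coefficient is positive.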

Pass times (one value per document) for the dynamic keyATM: a Chib (1998) change-point HMM lets topic prevalence shift over time across num_states latent regimes. Documents are sorted by time internally; the smoothed prevalence path is exposed as time_prevalence (aligned with time_labels) and the per-segment regime as time_state. times and covariates are mutually exclusive. timestamps= is an accepted alias for times= (the canonical cross-model name, as in DTM).

weights is keyATM's token weighting: "information-theory" (default, each token counts by its word's surprisal in bits), "inv-freq" or "none". num_threads overrides the constructor's num_threads for this fit call only (None = constructor value).

The covariate model's λ is re-estimated by L-BFGS every optimize_interval sweeps starting after burn_in, with lbfgs_iters steps per update, under a Gaussian prior of variance prior_variance on λ. prior_offset is an optional (num_docs, num_topics) fixed per-document log-prior offset (covariate variant only, ignored otherwise).

keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples composition_theta prefers over the Dirichlet approximation; set it to False to save memory.

progress_interval sets how often model_fit is recorded for log_likelihood_history (0 = ~50 evenly spaced points); report_interval is a deprecated alias for it. convergence_tol (default 0.0, disabled) enables opt-in early stopping: the run stops once the relative change in the recorded model-fit log-likelihood between the last two trace points falls below it, setting converged (ignored by the CVB0 backend, which keeps no trace).

turbo_alpha_stride (default 1, exact) is an approximate speed knob for the base model's α slice-sampler: it evaluates the data term over every s-th document (fixed stride in corpus order) and scales it up by s, cutting the dominant lgamma cost to ~1/s. It is not unbiased: the slice sampler then targets the subsampled posterior rather than the full-data one, and because the stride subset is deterministic the bias also depends on document order. Use stride=1 for the exact α (base model only, estimate_alpha=True).
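The strided approximation behind turbo_alpha_stride can be illustrated with a toy sum. The per-document terms below are arbitrary stand-ins for the lgamma-based data term, and strided_data_term is a hypothetical helper, not part of the library.

```python
def strided_data_term(per_doc_terms, stride=1):
    """Sum every stride-th document's term and scale the result by stride."""
    return stride * sum(per_doc_terms[::stride])


# Made-up per-document contributions, in corpus order.
terms = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

exact = strided_data_term(terms, stride=1)
approx = strided_data_term(terms, stride=2)
approx_reordered = strided_data_term(list(reversed(terms)), stride=2)
```

Here the exact sum is 21.0, the stride-2 estimate is 18.0, and the same documents in reverse order give 24.0, which shows the dependence on document order noted above.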

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with KeyATM.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.
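The two return shapes can be pictured with a small stand-in built from a topic_word matrix and a vocabulary. This is a sketch of the behaviour described above, not the library's code; the helper name top_words_sketch and the toy data are assumptions, and tie-breaking in the real method may differ.

```python
def top_words_sketch(topic_word, vocabulary, n=10, topic=None):
    """Return (word, probability) pairs, highest probability first."""

    def one_topic(row):
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n]
        return [(vocabulary[i], row[i]) for i in order]

    if topic is not None:
        return one_topic(topic_word[topic])  # a single topic's list
    return [one_topic(row) for row in topic_word]  # one list per topic


# Toy 2-topic, 3-word model (made-up probabilities).
vocabulary = ["apple", "bank", "cell"]
topic_word = [[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]]

all_topics = top_words_sketch(topic_word, vocabulary, n=2)
only_second = top_words_sketch(topic_word, vocabulary, n=2, topic=1)
```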

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer document-topic distributions for new, unseen documents under the fitted model (sklearn-style transform). Holds the fitted effective topic-word distributions fixed and runs collapsed Gibbs to infer θ for each document. Returns shape (num_new_docs, num_topics) with rows summing to 1.

Approximation: held-out inference uses the fitted effective P(w | topic), which already marginalizes over the keyword switch, and the estimated asymmetric document-topic prior α (falling back to the symmetric base value when α was not estimated). The keyword switch variable is not re-estimated for new tokens.

The collapsed-Gibbs controls are per-document: each new document gets iters sweeps, the first burn_in are discarded, and num_samples θ snapshots taken sample_interval sweeps apart are then averaged. seed seeds the inference RNG. iterations is a deprecated alias for iters.

weighted_lda `staticmethod` ¶

weighted_lda(num_topics, *, alpha=0.1, beta=0.01, seed=42)

Weighted LDA (keyATM's weightedLDA): a keyword-free model with no keyword topics, so it is plain LDA fit with keyATM's token weighting and estimated asymmetric α (collapsed Gibbs). Use it as the unsupervised baseline next to a keyword-assisted :class:KeyATM. Call fit the same way (the weights argument controls the token weighting); the keyword-specific outputs (keyword_rate, pi_history) are empty.

num_topics is the number of topics K; alpha is the document-topic Dirichlet prior (the estimated asymmetric α starts here), beta the topic-word Dirichlet smoothing; seed seeds the Gibbs RNG.

topica.PA ¶

Pachinko Allocation Model (Li & McCallum 2006): a DAG of num_super super-topics over num_sub shared sub-topics over words, capturing topic correlations — super_sub reports which sub-topics each super-topic groups together. Collapsed Gibbs over (super, sub) pairs.

doc `class-attribute` ¶

__doc__ = 'Pachinko Allocation Model (Li & McCallum 2006): a DAG of `num_super`\nsuper-topics over `num_sub` shared sub-topics over words, capturing topic\n*correlations* — `super_sub` reports which sub-topics each super-topic groups\ntogether. Collapsed Gibbs over (super, sub) pairs.'

module `class-attribute` ¶

__module__ = 'topica'

alpha `property` ¶

alpha

The symmetric sub-topic Dirichlet prior α, broadcast to the columns of :attr:doc_topic, shape (num_sub,). Marks PA as a Dirichlet model for :func:topica.effects.composition_theta.

converged `property` ¶

converged

True if the relative-change convergence criterion was satisfied before all iterations completed. Always False when convergence_tol=0.

doc_lengths `property` ¶

doc_lengths

Number of tokens in each training document, shape (num_docs,).

doc_names `property` ¶

doc_names

doc_topic `property` ¶

doc_topic

Document × sub-topic proportions, shape (num_docs, num_sub).

fit_history `property` ¶

fit_history

Log-likelihood trace: one (iter, ll) pair for every check_every sweeps; empty when check_every=0. (The fit signature defaults check_every to 10.)

num_sub `property` ¶

num_sub

num_super `property` ¶

num_super

num_topics `property` ¶

num_topics

Alias for num_sub (the word-level topics).

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400).

super_sub `property` ¶

super_sub

Super-topic → sub-topic association, shape (num_super, num_sub); row s shows which sub-topics super-topic s groups together (the correlations).

theta_draws `property` ¶

theta_draws

Thinned MCMC θ snapshots, shape (num_draws, num_docs, num_sub), dtype float32. None when fit with keep_theta_draws=False.

topic_names `property` ¶

topic_names

topic_word `property` ¶

topic_word

Sub-topic word distributions, shape (num_sub, num_words).

vocabulary `property` ¶

vocabulary

new `builtin` ¶

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

repr `method descriptor` ¶

__repr__()

Return repr(self).

coherence `method descriptor` ¶

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,). n is the number of top words per topic scored.

fit `method descriptor` ¶

fit(data, *, iters=1000, keep_theta_draws=True, num_theta_draws=25, convergence_tol=0.0, check_every=10)

Fit by collapsed Gibbs sampling for iters sweeps. keep_theta_draws (default True) retains num_theta_draws thinned MCMC θ snapshots in theta_draws, the cross-sweep posterior samples composition_theta prefers over the Dirichlet approximation; set it False to save memory. convergence_tol (default 0.0, disabled) enables opt-in early stopping: the run stops once the relative change in the recorded log-likelihood between the last two trace points, |ΔLL| / |LL|, falls below it, setting converged. The monitored quantity is the collapsed model-fit log-likelihood; the comparison window is the trace cadence (check_every / progress_interval), so a coarser cadence compares more widely spaced sweeps. This is a pragmatic early-stop heuristic on the log-likelihood trace, not a guarantee the Gibbs chain has mixed. check_every is how often, in sweeps, the log-likelihood is recorded and the convergence_tol test is applied.
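The early-stop test on the last two trace points might look like the sketch below. The function name has_converged and the trace values are hypothetical, and normalizing by the most recent log-likelihood is an assumption, since the description only gives |ΔLL| / |LL|.

```python
def has_converged(trace, tol):
    """Relative-change test on the last two (iteration, log_likelihood) points."""
    if tol <= 0.0 or len(trace) < 2:
        return False  # disabled, or not enough trace points to compare
    (_, prev_ll), (_, last_ll) = trace[-2], trace[-1]
    if last_ll == 0.0:
        return False  # avoid dividing by zero
    return abs(last_ll - prev_ll) / abs(last_ll) < tol


# Made-up trace recorded every 10 sweeps.
trace = [(10, -1000.0), (20, -900.0), (30, -899.5)]

early = has_converged(trace[:2], tol=1e-3)   # relative change about 0.11
late = has_converged(trace, tol=1e-3)        # relative change about 0.00056
disabled = has_converged(trace, tol=0.0)
```

Because only recorded points are compared, a larger check_every compares sweeps that are further apart, as noted above.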

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with PA.load.

top_words `method descriptor` ¶

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

transform `method descriptor` ¶

transform(data, *, iters=100, burn_in=10, num_samples=10, sample_interval=5, seed=None, iterations=None)

Infer sub-topic proportions for new, unseen documents under the fitted model (sklearn-style transform). Holds the fitted sub-topic–word distributions fixed and runs collapsed Gibbs to infer θ over the num_sub sub-topics for each document. Returns shape (num_new_docs, num_sub) with rows summing to 1.

Approximation: held-out inference projects directly onto the fitted sub-topics, marginalizing the super-topic layer. The super-topic assignments are a training-time device and are not re-estimated for new documents.

The collapsed-Gibbs controls are per-document: each new document gets iters sweeps, the first burn_in are discarded, and num_samples θ snapshots taken sample_interval sweeps apart are then averaged. seed seeds the inference RNG. iterations is a deprecated alias for iters.

topica.HLDA ¶

Hierarchical LDA (Blei, Griffiths & Jordan): topics organized in a tree of fixed depth, inferred by the nested Chinese Restaurant Process. The root is the shared (general) topic; deeper nodes are progressively more specific. Each document follows a root-to-leaf path. Inspect the tree with topic_word/node_levels/node_parents/doc_paths.

doc `class-attribute` ¶

__doc__ = 'Hierarchical LDA (Blei, Griffiths & Jordan): topics organized in a tree of\nfixed `depth`, inferred by the nested Chinese Restaurant Process. The root is\nthe shared (general) topic; deeper nodes are progressively more specific.\nEach document follows a root-to-leaf path. Inspect the tree with\n`topic_word`/`node_levels`/`node_parents`/`doc_paths`.'

module `class-attribute` ¶

__module__ = 'topica'

converged `property` ¶

converged

HLDA does not implement an early-stop criterion; always False.

doc_paths `property` ¶

doc_paths

Each document's root-to-leaf path (a list of node ids), length num_docs.

fit_history `property` ¶

fit_history

HLDA has no per-iteration trace yet (part B); always returns [].

leaves `property` ¶

leaves

The leaf node ids (nodes that are no node's parent).

node_levels `property` ¶

node_levels

The tree level (0 = root) of each node, length num_nodes.

node_parents `property` ¶

node_parents

The parent node id of each node (-1 for the root), length num_nodes.
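The relationship between node_parents, node_levels, leaves and a root-to-leaf path can be sketched with a parent array alone. The tree below is invented for illustration and the helper names are hypothetical; a fitted model exposes these quantities directly as properties.

```python
# Hypothetical parent array: node 0 is the root (-1), nodes 1 and 2 are its
# children, nodes 3 and 4 hang under node 1, node 5 under node 2.
node_parents = [-1, 0, 0, 1, 1, 2]


def find_leaves(parents):
    """Leaves are the nodes that are no node's parent."""
    used_as_parent = set(parents)
    return [n for n in range(len(parents)) if n not in used_as_parent]


def path_from_root(node, parents):
    """Walk up the parent links, then reverse to get root-to-node order."""
    path = []
    while node != -1:
        path.append(node)
        node = parents[node]
    return path[::-1]


def levels(parents):
    """Tree level of each node, with 0 for the root."""
    return [len(path_from_root(n, parents)) - 1 for n in range(len(parents))]


leaf_ids = find_leaves(node_parents)
path_to_4 = path_from_root(4, node_parents)
node_level_list = levels(node_parents)
```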

num_nodes `property` ¶

num_nodes

The number of topic nodes in the inferred tree.

seed `property` ¶

seed

The random seed the model was constructed with.

settings `property` ¶

settings

The constructor configuration as a JSON-serialisable dict, keyword-named to match __init__ (issue #400). beta is the effective topic-word Dirichlet in force (the internal eta field, after resolving the deprecated eta= alias); eta is the deprecated alias and is not retained, so it always reports None.

topic_names `property` ¶

topic_names

topic_word `property` ¶

topic_word

Per-node word distributions, shape (num_nodes, num_words).

vocabulary `property` ¶

vocabulary

new `builtin` ¶

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

repr `method descriptor` ¶

__repr__()

Return repr(self).

fit `method descriptor` ¶

fit(data, *, iters=500)

Fit by nested-CRP collapsed Gibbs sampling for iters sweeps.

load `staticmethod` ¶

load(path)

Load a model previously written by :meth:save.

save `method descriptor` ¶

save(path)

Save the fitted model to path. Reload with HLDA.load.

top_words `method descriptor` ¶

top_words(node, n=10)

Top n words for one topic node as (word, probability) pairs.

topica.Corpus ¶

A preprocessed, integer-encoded document collection.

Build one from already-tokenised documents with :meth:Corpus.from_documents, from a raw text file with :meth:Corpus.from_text_file, or load a binary corpus written by the preprocess CLI with :meth:Corpus.load.

doc `class-attribute` ¶

__doc__ = 'A preprocessed, integer-encoded document collection.\n\nBuild one from already-tokenised documents with\n:meth:`Corpus.from_documents`, from a raw text file with\n:meth:`Corpus.from_text_file`, or load a binary corpus written by the\n``preprocess`` CLI with :meth:`Corpus.load`.'

module `class-attribute` ¶

__module__ = 'topica'

doc_labels `property` ¶

doc_labels

doc_lengths `property` ¶

doc_lengths

Tokens per document in the pruned vocabulary, one entry per kept document (parallel to the rows of a fitted model's doc_topic). This is the document length N_d that :func:topica.dirichlet_theta_samples needs to recover each document's Dirichlet posterior for method-of-composition standard errors.

doc_names `property` ¶

doc_names

kept_indices `property` ¶

kept_indices

Original document indices that survived pruning, parallel to the rows of this corpus. Use it to realign an external covariate array or DataFrame to the documents the corpus actually kept: X = X[corpus.kept_indices].

metadata `property` ¶

metadata

Optional per-document metadata, already aligned to the surviving rows (set by :func:topica.from_dataframe, or assign your own). None if unset.

num_docs `property` ¶

num_docs

num_words `property` ¶

num_words

preprocessing `property` ¶

preprocessing

The vocabulary-filtering parameters Topica applied when this corpus was built (min_doc_freq, max_doc_fraction, min_cf, rm_top), as a dict. None for a corpus loaded from disk, where they are not stored.

total_tokens `property` ¶

total_tokens

vocabulary `property` ¶

vocabulary

word_counts `property` ¶

word_counts

Corpus word frequencies: total occurrences of each vocabulary term across all documents, parallel to :attr:vocabulary (length num_words). This is the empirical P(w) (up to normalization) that stm's lift and FREX James-Stein shrinkage use; pass it (or the corpus) to :func:topica.label_topics / :func:topica.frex for stm-faithful labels.

new `builtin` ¶

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

repr `method descriptor` ¶

__repr__()

Return repr(self).

documents `method descriptor` ¶

documents()

The corpus as token lists — one list of word strings per document, in the pruned vocabulary and the kept-document order. The inverse of from_documents: use it to recover tokens for prepare_pyldavis, coherence, or any function that wants list[list[str]] after you have committed to a Corpus.

from_documents `staticmethod` ¶

from_documents(documents, *, doc_names=None, doc_labels=None, stopwords=None, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0)

Build a corpus from pre-tokenised documents.

documents is a sequence of token lists. Optional doc_names / doc_labels (each the same length as documents) attach an id and a label to every document. stopwords are dropped. Vocabulary is pruned by min_doc_freq (minimum document frequency) and max_doc_fraction (maximum fraction of documents), by min_cf (minimum collection/total frequency), and by rm_top (drop the N most frequent words) — matching tomotopy's min_df / min_cf / rm_top.

A document left with no tokens by pruning is dropped, so num_docs can be smaller than len(documents). The surviving original indices are in kept_indices; realign any external covariate matrix with X[corpus.kept_indices]. (An input document that is empty before any pruning is retained.)
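To make the pruning and realignment rules concrete, here is a small pure-Python sketch of document-frequency pruning that records the surviving indices. It is an illustration only: prune_sketch is a hypothetical helper, the inclusive thresholds are assumptions, and min_cf, rm_top and stopwords are left out.

```python
from collections import Counter


def prune_sketch(documents, min_doc_freq=1, max_doc_fraction=1.0):
    """Prune the vocabulary by document frequency and drop emptied documents."""
    num_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in set(doc))
    keep = {w for w, df in doc_freq.items()
            if df >= min_doc_freq and df / num_docs <= max_doc_fraction}

    pruned, kept_indices = [], []
    for i, doc in enumerate(documents):
        tokens = [w for w in doc if w in keep]
        # Drop a document only if pruning emptied it; a document that was
        # already empty on input is retained.
        if tokens or not doc:
            pruned.append(tokens)
            kept_indices.append(i)
    return pruned, kept_indices


documents = [["a", "b"], ["rare"], ["a", "c"], ["b", "c"]]
pruned, kept_indices = prune_sketch(documents, min_doc_freq=2)

# Realign an external covariate table to the surviving documents.
# With a NumPy array this is simply X[kept_indices].
X = [[0.1], [0.2], [0.3], [0.4]]
X_aligned = [X[i] for i in kept_indices]
```

The second document contains only a word seen in one document, so it is emptied by min_doc_freq=2 and dropped, and its covariate row is skipped during realignment.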

from_text_file `staticmethod` ¶

from_text_file(path, *, format='plain', id_field=False, id_column=0, label_column=1, text_column=2, token_regex=None, stopwords=None, min_doc_freq=1, max_doc_fraction=1.0)

Load and tokenise a raw text file (MALLET-style), matching the preprocess CLI.

format is "plain" (one document per line) or "tsv". In plain mode, id_field=True treats the first whitespace token as the doc id. In tsv mode, id_column/label_column/text_column select columns (label_column=None disables labels).

load `staticmethod` ¶

load(path)

Load a binary corpus file written by the preprocess CLI or :meth:save.

save `method descriptor` ¶

save(path)

Write this corpus to a binary file (the preprocess format), so it can be reused by the CLI tools or reloaded with :meth:load.
