STM toolkit¶

The structural / covariate operations live in topica.stm. The general post-hoc diagnostics (labeling, alignment, pyLDAvis, …) are on the Diagnostics page.

topica.standard_errors ¶

standard_errors(model, corpus=None, *, of='effect', method='composition', formula=None, data=None, X=None, feature_names=None, nsims=25, n_boot=200, topn=10, ci=0.95, seed=0, min_alignment=0.5, min_margin=0.1, model_factory=None, refit=None, **fit_kwargs)

Standard errors for the quantities people publish, with topic-estimation uncertainty propagated — one entry point across the model families (issue #15).

Parameters:

Name	Type	Description	Default
`model`	`fitted topica model`	A fitted topica model.	required
`corpus`	`Corpus or token lists`	The `Corpus` (or token lists) the model was fit on. Required for `method="composition"` on a Gibbs model (for document lengths) and for `method="bootstrap"` (to resample documents).	`None`
`of`	`str`	`"effect"` (covariate effects; needs `formula`/`data` or `X`), `"prevalence"` (each topic's mean proportion), or `"top_words"` (per-topic top-word stability; `method="bootstrap"` only).	`'effect'`
`method`	`str`	`"composition"` (default) draws theta from the model's posterior and pools by Rubin's rules: cheap, no refit, and sound for effects/prevalence on STM/CTM/LDA/keyATM. `"bootstrap"` refits on resampled documents and aligns topics across refits; it is the only route for `of="top_words"` and for the embedding models, but it flags topics whose alignment is unstable.	`'composition'`
`nsims`	`int`	Number of composition theta draws.	`25`
`n_boot`	`int`	Number of bootstrap resamples.	`200`
`min_alignment`	`float`	A bootstrap topic whose mean top-word Jaccard with the reference falls below this is flagged `reliable=False` and its SE is suppressed (set to NaN), since a split/merge corrupts the estimate.	`0.5`
`min_margin`	`float`	A topic is also flagged unreliable when its match is ambiguous: the best-matching refit topic is less than `min_margin` better (in Jaccard) than the next-best. This catches the case a high Jaccard misses: topics whose top words are not distinct, so the alignment is arbitrary.	`0.1`

Returns:

Type	Description
`list[TopicEffect]`	When `of="effect"` (the same objects as :func:`estimate_effect`).
`list[TopicPrevalence]`	When `of="prevalence"`.
`list[TopWordUncertainty]`	When `of="top_words"`.
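
The `min_alignment` and `min_margin` rules can be illustrated with a small sketch in plain Python. The helper names (`jaccard`, `alignment_flags`) and the toy word lists are hypothetical and not part of topica; the library averages Jaccard over bootstrap refits and may match topics differently, so treat this as one reading of the two thresholds for a single refit.

```python
def jaccard(a, b):
    """Jaccard similarity between two top-word lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def alignment_flags(reference_words, refit_topics, min_alignment=0.5, min_margin=0.1):
    """Score one reference topic against every topic of a single refit.

    Returns (best_score, margin, reliable), where margin is the gap between
    the best and the second-best Jaccard score.
    """
    scores = sorted((jaccard(reference_words, t) for t in refit_topics), reverse=True)
    best = scores[0]
    runner_up = scores[1] if len(scores) > 1 else 0.0
    margin = best - runner_up
    # Unreliable if the match is weak OR if two refit topics match about equally well.
    reliable = best >= min_alignment and margin >= min_margin
    return best, margin, reliable


reference = ["tax", "budget", "deficit", "spending"]
distinct = [["tax", "budget", "deficit", "revenue"], ["health", "care", "insurance", "tax"]]
ambiguous = [["tax", "budget", "deficit", "revenue"], ["tax", "budget", "deficit", "debt"]]

print(alignment_flags(reference, distinct))   # clear winner: flagged reliable
print(alignment_flags(reference, ambiguous))  # tie between two topics: flagged unreliable
```

The second case is the one `min_margin` exists for: the best Jaccard clears `min_alignment`, yet two refit topics match equally well, so the choice between them is arbitrary.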

topica.stm.estimate_effect ¶

estimate_effect(doc_topic, X=None, *, data=None, formula=None, feature_names=None, topics=None, add_intercept=True, ci=0.95, cluster=None, weights=None, random=None, link='identity', corpus=None, nsims=None, seed=0)

Regress each topic's proportion on document covariates.

Pass a point estimate of θ for an ordinary OLS, or a stack of posterior draws of θ for the method of composition — the uncertainty-propagating procedure R stm uses (Treier & Jackman 2008). With draws, each one is regressed and the results are pooled by Rubin's rules, so the reported standard errors include the topic-estimation uncertainty, not just OLS sampling error. Get draws with :func:posterior_theta_samples.

For paper-grade inference, four options matter:

cluster — a length-num_docs array of group labels (e.g. speaker, user, outlet). Text data is almost always nested, and ignoring it understates uncertainty. Supplying it switches the standard errors to the cluster-robust (CR1) sandwich estimator. (With posterior draws, each draw is clustered and the per-draw covariances are then Rubin-pooled.)
link — "identity" (default OLS), "logit" (fractional logit, via binomial quasi-likelihood), or "log" (quasi-Poisson). Because topic proportions live in [0, 1], the logit link keeps fitted values in bounds where OLS can wander outside them (Papke & Wooldridge). Non-identity links report heteroskedasticity- or cluster-robust standard errors.
weights — a length-num_docs array of (survey) weights, or a column name in data. Switches to weighted least squares: documents enter the regression in proportion to their weight, so a weighted sample (e.g. a survey-weighted corpus, or documents weighted by length) estimates the population-level effect. Composes with cluster (weighted cluster-robust SEs) and with link. Matches faSTM's weighted estimateEffect.
random — an lme4-style random-intercept term "(1 | group)" (with group a column of data). Fits a mixed model — the fixed-effect design plus a random intercept per group — by REML for each posterior draw, then Rubin-pools the fixed effects, matching faSTM's estimateEffect(... ~ x + (1 | group)). Use it when documents are nested in units (state, outlet, author) whose baseline topic level varies: the random intercept soaks up that between-unit variation so the fixed-effect SEs are not understated. The estimated group and residual standard deviations are attached as TopicEffect.varcomp. Only a random intercept is supported (not random slopes), with link="identity" and no cluster/weights.

Specifying the design. Give the covariates one of two ways: a prebuilt design matrix as X (with feature_names), or an R-style formula together with a data frame, which builds X for you via :func:topica.design_matrix. Use the same design you fit the model with. The effects regression is on the covariates you pass here, not on whatever went into STM.fit; if they differ, the coefficients answer a different question than the model. The reliable pattern is to build the design once and pass the identical X (or the identical formula + data) to both fit and estimate_effect.

Parameters:

Name	Type	Description	Default
`doc_topic`	`array or fitted model`	Either `(num_docs, num_topics)` — a point θ (`model.doc_topic`) for plain OLS — or `(nsims, num_docs, num_topics)` — posterior θ draws for method-of-composition pooling. You may also pass the fitted model itself: with `nsims` (and `corpus=` for a Gibbs model) the right θ posterior is drawn for you; without `nsims` its point θ is used.	required
`X`	`array(num_docs, p)`	Document covariates (design matrix); build nonlinear/interaction terms with :func:`spline` / :func:`interaction`. An intercept is prepended when `add_intercept` is True.	`None`
`feature_names`	`list[str]`	Column names for `X`. Defaults to `feature_0 ...`.	`None`
`data`	`DataFrame`	Used with `formula` to build the design matrix; ignored when `X` is given. A string `cluster` is read as a column of this frame.	`None`
`formula`	`str`	R-style formula (e.g. `"~ party + spline(year, df=3)"`) evaluated against `data` to build `X` and `feature_names`, via :func:`topica.design_matrix` (needs the optional `topica[formula]` extra). Pass either `X` or `formula` + `data`, not both.	`None`
`topics`	`sequence[int]`	Restrict to these topics. Defaults to all.	`None`
`ci`	`float`	Confidence level for the (normal-approximation) intervals.	`0.95`

Returns:

Type	Description
`list[TopicEffect]`	One regression per topic. For a tidy long table with one row per (topic, feature), concatenate the per-topic frames: `pd.concat([e.to_frame() for e in result], ignore_index=True)` (after `import pandas as pd`).
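
The Rubin's-rules pooling used by the method of composition can be sketched for a single coefficient as follows. This is the textbook form of the rules in plain Python, not topica's code; `rubin_pool` and the toy numbers are hypothetical.

```python
import math
from statistics import mean, variance


def rubin_pool(estimates, std_errors):
    """Pool one coefficient across m per-draw regressions.

    estimates[i] and std_errors[i] come from regressing the i-th theta draw
    on the design matrix.
    """
    m = len(estimates)
    pooled = mean(estimates)                          # pooled point estimate
    within = mean([se ** 2 for se in std_errors])     # average sampling variance
    between = variance(estimates) if m > 1 else 0.0   # variance across theta draws
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


estimates = [0.10, 0.12, 0.08]
std_errors = [0.02, 0.02, 0.02]
pooled, pooled_se = rubin_pool(estimates, std_errors)
print(pooled, pooled_se)
```

The pooled standard error exceeds every per-draw standard error here because the between-draw term adds the topic-estimation uncertainty on top of the ordinary sampling error.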

topica.stm.average_marginal_effects ¶

average_marginal_effects(doc_topic, covariate, *, formula, data, topics=None, h=None, ci=0.95, cluster=None, weights=None, corpus=None, nsims=None, seed=0, add_intercept=True)

Average marginal effects of a covariate on topic prevalence.

The average expected change in a topic's proportion per unit of covariate, averaged over the observed documents. For a continuous covariate this is the average numeric derivative (central difference); for a factor it is the average contrast of each non-reference level against the reference level. This is cleaner than reading raw regression coefficients, especially when the design has splines or interactions, where no single coefficient is the effect (cf. the margins package, and faSTM's ame()).

The marginal effect is computed on the identity (proportion) scale: each topic's prevalence is regressed on the design via the method of composition (the same path as :func:estimate_effect), and the averaged design-change vector is contracted with the per-topic coefficient posterior, propagating topic-estimation uncertainty into the standard error via the Rubin-pooled coefficient covariance.

Parameters:

Name	Type	Description	Default
`doc_topic`	`array or fitted model`	As in :func:`estimate_effect` — a fitted model (theta drawn internally when `nsims` is given), a `(num_docs, K)` point theta, or `(nsims, num_docs, K)` posterior draws.	required
`covariate`	`str`	Column in `data` to compute marginal effects for.	required
`formula`	`str`	R-style formula for the design (must reference `covariate`). Splines are replayed with the training knots, so a perturbed covariate uses the same basis as the fit.	required
`data`	`DataFrame`	One row per document; the design is rebuilt on perturbed copies of it.	required
`topics`	`sequence[int]`	Restrict to these topics. Defaults to all.	`None`
`h`	`float`	Step for the numeric derivative of a continuous covariate. Defaults to `0.01 * sd(covariate)`.	`None`
`ci`	`float`	Confidence level for the (normal-approximation) intervals.	`0.95`
`cluster`	`optional`	Passed through to :func:`estimate_effect` for the underlying regression (cluster-robust SEs).	`None`
`weights`	`optional`	Passed through to :func:`estimate_effect` for the underlying regression (survey weights). When `weights` is given the design-change is averaged with those weights (a population marginal effect); otherwise it is a plain sample average.	`None`
`corpus`	`optional`	As in :func:`estimate_effect`.	`None`
`nsims`	`optional`	As in :func:`estimate_effect`.	`None`
`seed`	`optional`	As in :func:`estimate_effect`.	`0`
`add_intercept`	`optional`	As in :func:`estimate_effect`.	`True`

Returns:

Type	Description
`AverageMarginalEffects`	Iterable of :class:`MarginalEffect`; `.to_frame()` for a tidy table.
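
For a continuous covariate, the central-difference average can be sketched as below. It differentiates a toy prediction function directly rather than contracting a design-change vector with coefficient draws, so it illustrates only the averaging step. `average_marginal_effect` and `predict` are hypothetical names, and using the sample standard deviation for the default step is an assumption.

```python
from statistics import mean, stdev


def average_marginal_effect(predict, xs, h=None):
    """Average central-difference derivative of `predict` over the observed xs."""
    if h is None:
        h = 0.01 * stdev(xs)  # default step: 0.01 * sd(covariate)
    return mean([(predict(x + h) - predict(x - h)) / (2 * h) for x in xs])


# Toy prevalence curve with a quadratic term, so no single coefficient is "the" effect.
def predict(x):
    return 0.2 + 0.03 * x - 0.002 * x ** 2


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ame = average_marginal_effect(predict, xs)
print(ame)
```

The derivative of this curve is `0.03 - 0.004 * x`, so the average over the observed documents is the derivative at the mean of `xs`, which is the quantity the central difference recovers.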

topica.stm.MarginalEffect ¶

One average marginal effect: a (topic, covariate term) pair.

Produced by :func:average_marginal_effects. topic_name is the topic's label. ame is the average expected change in the topic's proportion for the covariate term (a unit change for a continuous covariate, a level-vs-reference contrast for a factor), averaged over the observed documents. se is its standard error and ci_low/ci_high the bounds of the confidence interval.

Attributes:

Name	Type	Description
`topic`	`int`	Topic index.
`topic_name`	`str`	The topic's label.
`term`	`str`	The covariate term the effect refers to.
`ame`	`float`	Average marginal effect on the topic's proportion.
`se`	`float`	Standard error of `ame`.
`ci_low`	`float`	Lower bound of the confidence interval.
`ci_high`	`float`	Upper bound of the confidence interval.

topica.stm.AverageMarginalEffects ¶

The full set of average marginal effects for one covariate.

Returned by :func:average_marginal_effects. Iterate .effects for the per-(topic, term) :class:MarginalEffect rows, or call :meth:to_frame for a tidy DataFrame.

Attributes:

Name	Type	Description
`covariate`	`str`	The covariate the marginal effects were computed for.
`effects`	`list`	The per-(topic, term) :class:`MarginalEffect` rows.

to_frame ¶

to_frame()

Return a tidy pandas DataFrame, one row per (topic, term).

topica.effects.dirichlet_theta_samples ¶

dirichlet_theta_samples(doc_topic, doc_lengths, *, nsims=25, seed=0, prior=0.0)

Draw nsims samples of the document-topic matrix θ for a Gibbs model.

A collapsed-Gibbs model's doc_topic is the posterior mean of each document's θ given its token-topic assignments, where θ_d ~ Dirichlet(α + n_d) and (α + n_d) = doc_topic_d · (N_d + Σα). With the document length N_d we recover that Dirichlet and sample it, so the draws carry each document's within-document estimation uncertainty. Feed the result to :func:estimate_effect for method-of-composition standard errors on a model that has no logistic-normal posterior of its own.

Parameters:

Name	Type	Description	Default
`doc_topic`	`array(num_docs, num_topics)`	The fitted θ (rows sum to one), e.g. `model.doc_topic`.	required
`doc_lengths`	`array(num_docs)`	Tokens per document (`[len(d) for d in docs]`). Longer documents give tighter draws, exactly as they pin θ more firmly in the model.	required
`nsims`	`int`	Number of θ draws.	`25`
`seed`	`int`	RNG seed.	`0`
`prior`	`float`	Extra concentration added to every document (a flat pseudo-count `Σα` spread over the topics). 0 uses the token counts alone.	`0.0`

Returns:

Type	Description
`array(nsims, num_docs, num_topics)`	Matches :func:`posterior_theta_samples`, ready for :func:`estimate_effect`.
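
The recovery step described above, concentration = `doc_topic_d · (N_d + Σα)`, can be sketched with the standard library alone. This is an illustration under that formula, not topica's implementation; `dirichlet_theta_draws` is a hypothetical name and it returns nested lists rather than an array.

```python
import random
from statistics import pstdev


def dirichlet_theta_draws(doc_topic, doc_lengths, nsims=25, seed=0, prior=0.0):
    """Draw theta rows from Dirichlet(doc_topic_d * (N_d + prior))."""
    rng = random.Random(seed)
    draws = []
    for _ in range(nsims):
        sample = []
        for theta, n_tokens in zip(doc_topic, doc_lengths):
            # Recover the Dirichlet concentration from the posterior mean;
            # the small floor keeps gammavariate's shape argument positive.
            conc = [max(p * (n_tokens + prior), 1e-12) for p in theta]
            # A Dirichlet draw is a vector of independent gammas, normalized.
            gammas = [rng.gammavariate(a, 1.0) for a in conc]
            total = sum(gammas)
            sample.append([g / total for g in gammas])
        draws.append(sample)
    return draws  # nested lists shaped (nsims, num_docs, num_topics)


doc_topic = [[0.5, 0.5], [0.5, 0.5]]
draws = dirichlet_theta_draws(doc_topic, doc_lengths=[10, 10_000], nsims=200, seed=1)
short_sd = pstdev([d[0][0] for d in draws])  # 10-token document
long_sd = pstdev([d[1][0] for d in draws])   # 10,000-token document
print(short_sd, long_sd)
```

Both documents share the same point θ, but the draws for the long document are far tighter, which is the within-document uncertainty the method of composition then propagates.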

topica.stm.posterior_theta_samples ¶

posterior_theta_samples(model, nsims=25, seed=0)

Draw nsims samples of the document-topic matrix θ from a fitted :class:STM/:class:CTM's variational posterior.

Each document's logistic-normal posterior is η_d ~ N(λ_d, ν_d) (from model.eta_mean / model.eta_cov); a draw of η is mapped through the softmax (with the reference category fixed at 0) to a θ row. Feed the result to :func:estimate_effect for method-of-composition uncertainty.

Returns an array of shape (nsims, num_docs, num_topics).
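
A minimal sketch of one such draw, in plain Python. Two simplifications are assumptions of this example rather than facts about topica: the posterior covariance is treated as diagonal (one variance per free logit), and the reference category is taken to be the last topic. `softmax_with_reference` and `draw_theta` are hypothetical names.

```python
import math
import random


def softmax_with_reference(eta):
    """Map K-1 free logits to a K-vector theta; the reference logit is fixed at 0."""
    logits = list(eta) + [0.0]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def draw_theta(eta_mean, eta_var, nsims=25, seed=0):
    """Sample eta_d ~ N(mean_d, diag(var_d)) per document and map it to theta."""
    rng = random.Random(seed)
    return [
        [
            softmax_with_reference(
                [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean_d, var_d)]
            )
            for mean_d, var_d in zip(eta_mean, eta_var)
        ]
        for _ in range(nsims)
    ]


# One document, three topics (two free logits plus the reference).
draws = draw_theta(eta_mean=[[0.2, -0.1]], eta_var=[[0.05, 0.05]], nsims=5)
print(draws[0][0])
```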

topica.effects.model_family ¶

model_family(model)

Which method-of-composition theta sampler suits model.

"logistic_normal" for STM/CTM (a variational eta posterior), "dirichlet" for the collapsed-Gibbs models (LDA, keyATM, SeededLDA, ...), or "none" for models with no posterior over theta (the embedding models), which need method="bootstrap".

topica.stm.spline ¶

spline(x, df=4, knots=None)

Restricted (natural) cubic-spline basis for a covariate — the building block for nonlinear prevalence terms like R stm's ~ s(day).

Uses Harrell's restricted-cubic-spline parameterization: df+1 knots (at evenly spaced quantiles of x unless knots is given) yield df basis columns whose first is the linear term. np.column_stack the result into your design matrix and extend feature_names with the returned names.

Returns (basis (n, df), names).
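
The restricted-cubic-spline parameterization can be sketched as follows, with knots passed in explicitly (quantile placement is left out). The function name `rcs_basis` is hypothetical, and dividing the nonlinear terms by the squared knot range is a common normalization assumed here, not something this page states about topica.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell's form) for a list of values.

    With k knots each row has k - 1 columns: the linear term followed by
    k - 2 nonlinear terms that become linear beyond the last knot.
    """
    k = len(knots)
    t_last, t_prev = knots[-1], knots[-2]
    scale = (knots[-1] - knots[0]) ** 2  # assumed normalization

    def cube_plus(u):
        return u ** 3 if u > 0 else 0.0

    rows = []
    for v in x:
        row = [v]  # first column is the linear term
        for t_j in knots[: k - 2]:
            term = (
                cube_plus(v - t_j)
                - cube_plus(v - t_prev) * (t_last - t_j) / (t_last - t_prev)
                + cube_plus(v - t_last) * (t_prev - t_j) / (t_last - t_prev)
            )
            row.append(term / scale)
        rows.append(row)
    return rows


knots = [0.0, 1.0, 2.0, 3.0, 4.0]  # df + 1 = 5 knots, so df = 4 columns
x = [-1.0, 0.5, 5.0, 6.0, 7.0]
basis = rcs_basis(x, knots)
print(len(basis), len(basis[0]))
```

The correction terms at the last two knots cancel the cubic and quadratic parts of each truncated cubic, which is what keeps the fitted curve linear in the upper tail.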

topica.stm.interaction ¶

interaction(a, b, name='interaction')

Interaction columns between two covariate blocks (all pairwise products of their columns) — for terms like R stm's ~ treatment * party.

a, b are 1-D or 2-D arrays with the same number of rows. Returns (products (n, ncols), names); np.column_stack into your design matrix.
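
"All pairwise products of their columns" can be made concrete with a short sketch on lists of rows. `interaction_columns` and the `a:b` naming scheme are hypothetical choices for this example; topica's `interaction` builds its names from the `name` argument.

```python
def interaction_columns(a, b, a_names, b_names):
    """Pairwise products of the columns of two covariate blocks (lists of rows)."""
    products = [
        [ai * bj for ai in row_a for bj in row_b]
        for row_a, row_b in zip(a, b)
    ]
    names = [f"{an}:{bn}" for an in a_names for bn in b_names]
    return products, names


treatment = [[1.0], [0.0], [1.0]]            # one column
party = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # two dummy columns
products, names = interaction_columns(treatment, party, ["treatment"], ["party_D", "party_R"])
print(names)
print(products)
```

A block with `p` columns interacted with a block with `q` columns yields `p * q` product columns, here 1 × 2 = 2.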

Predicted prevalence¶

Compute predicted topic prevalence at chosen covariate values, with simulation-based credible intervals — the model-agnostic counterpart of R stm's plot.estimateEffect.

topica.predicted_prevalence ¶

predicted_prevalence(model, *, X=None, formula=None, data=None, feature_names=None, at=None, contrast=None, continuous=None, npoints=50, topics=None, link='identity', ci=0.95, nsims=25, n_sim=2000, corpus=None, seed=0, add_intercept=True)

Predicted topic prevalence at chosen covariate values, with simulation-based CIs.

This is the model-agnostic counterpart of R stm's plot.estimateEffect. It works on any model whose document-topic matrix supports :func:~topica.effects.composition_theta (STM, CTM, LDA, keyATM covariate, DMR, SeededLDA, ...) because it regresses the composition-theta draws on the design matrix — exactly as :func:estimate_effect does — and then pushes coefficient posterior draws through the link at new covariate values rather than reporting the coefficients themselves.

Three modes mirror stm's method argument:

at= (point grid) — a dict {covariate: value} or a small DataFrame of reference rows; returns predicted theta per topic per row, with CI.
contrast= (difference) — two covariate settings, e.g. contrast={"party": ["D", "R"]}; returns the difference in predicted theta between the two settings per topic, with CI.
continuous= (smooth curve) — a column name; sweeps the covariate over its observed range on a npoints-point grid, holding all other columns at their means. Spline terms in formula are evaluated with the training knots, not re-fit to the new grid.

Parameters:

Name	Type	Description	Default
`model`	`fitted topica model`	Any model whose theta supports the composition method (Gibbs or logistic-normal). Pass the model itself; theta draws are generated internally.	required
`X`	`array(num_docs, p)`	Raw design matrix. Provide either `X` (with optional `feature_names`) or `formula` + `data`.	`None`
`formula`	`str`	R-style formula, e.g. `"~ party + spline(year, df=3)"`.	`None`
`data`	`DataFrame`	One row per document; required with `formula=`. Also used to build reference rows for `continuous=` / `contrast=`.	`None`
`feature_names`	`list[str]`	Column names for `X`. Required for `continuous=` or `contrast=` when using the raw `X` path.	`None`
`at`	`dict or DataFrame`	Reference covariate settings for point predictions.	`None`
`contrast`	`dict or 2-tuple`	Two covariate settings; the result is their difference.	`None`
`continuous`	`str`	Column name to sweep over its observed range.	`None`
`npoints`	`int`	Number of grid points for `continuous=`. Default 50.	`50`
`topics`	`list[int]`	Restrict to these topics. Defaults to all.	`None`
`link`	`str`	`"identity"` (default), `"logit"`, or `"log"`. Applied to the linear predictor when computing predicted prevalence.	`'identity'`
`ci`	`float`	Confidence level for the simulation-based interval. Default 0.95.	`0.95`
`nsims`	`int`	Composition theta draws for Rubin's-rules pooling. Default 25.	`25`
`n_sim`	`int`	Number of coefficient posterior draws for the simulation CI. Default 2000.	`2000`
`corpus`	`Corpus or token lists`	Required for Gibbs models that did not retain `theta_draws`.	`None`
`seed`	`int`	RNG seed.	`0`
`add_intercept`	`bool`	Prepend an intercept column to the design matrix. Default True.	`True`

Returns:

Type	Description
`list[PredictedPrevalence]`	One object per topic (in `topics` order, or all topics). Each has `.estimate`, `.ci_low`, `.ci_high` arrays (one entry per grid point) and a `.to_frame()` method for a tidy DataFrame.
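
The simulation step (coefficient draws pushed through the link at a new covariate row, then summarized by quantiles) can be sketched as below. The sketch draws each coefficient independently from a normal, which ignores the coefficient covariance; that simplification, the function name `simulate_prediction`, and the toy numbers are assumptions of this example.

```python
import math
import random


def simulate_prediction(coef_mean, coef_se, x_new, link="identity",
                        ci=0.95, n_sim=2000, seed=0):
    """Return (estimate, ci_low, ci_high) for one topic at one design row."""
    inverse_link = {
        "identity": lambda z: z,
        "logit": lambda z: 1.0 / (1.0 + math.exp(-z)),
        "log": math.exp,
    }[link]
    rng = random.Random(seed)
    sims = sorted(
        inverse_link(sum(rng.gauss(m, s) * x for m, s, x in zip(coef_mean, coef_se, x_new)))
        for _ in range(n_sim)
    )
    tail = (1.0 - ci) / 2.0
    low = sims[int(tail * (n_sim - 1))]
    high = sims[int(round((1.0 - tail) * (n_sim - 1)))]
    return sum(sims) / n_sim, low, high


# Intercept plus one covariate, evaluated at covariate value 2.0.
estimate, ci_low, ci_high = simulate_prediction(
    coef_mean=[-1.0, 0.5], coef_se=[0.1, 0.05], x_new=[1.0, 2.0], link="logit"
)
print(estimate, ci_low, ci_high)
```

With the logit link every simulated value stays inside (0, 1), so the interval cannot leave the range of a proportion.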

topica.PredictedPrevalence ¶

Predicted topic prevalence at a covariate grid, with simulation-based CIs.

Produced by :func:predicted_prevalence. Each entry covers one topic across all grid points (for at/continuous) or the contrast between two settings (for contrast).

Attributes:

Name	Type	Description
`topic`	`int`	Zero-based topic index.
`topic_name`	`str`	Human-readable label (`topic_names` from the model, or `"topic_k"`).
`mode`	`str`	One of `"at"`, `"contrast"`, or `"continuous"`.
`grid`	`list`	Reference covariate values: a list of dicts for `at` / `continuous` (one per grid row), or `[setting_a, setting_b]` for `contrast`.
`estimate`	`ndarray`	Mean predicted prevalence (or contrast), one entry per grid point.
`ci_low`	`ndarray`	Lower bound of the `ci`-level simulation interval.
`ci_high`	`ndarray`	Upper bound.
`covariate`	`str or None`	For `continuous`, the name of the swept covariate (convenient for plotting); `None` otherwise.

to_frame ¶

to_frame()

Return a tidy pandas DataFrame with one row per grid point.

Columns are topic, topic_name, any covariate column(s), estimate, ci_low, and ci_high.

Permutation test¶

Distribution-free test of whether a binary prevalence covariate genuinely shifts topic prevalence, or whether the association could arise by chance.

topica.permutation_test ¶

permutation_test(model, corpus, covariate, *, n_perm=100, topics=None, topn=10, seed=0, model_factory=None, iters=None)

Permutation test for a binary prevalence covariate (R stm's permutationTest).

Assesses whether a binary document-level covariate genuinely shifts topic prevalence, or whether an apparent association could arise by chance. Each permutation randomly reassigns the covariate at its empirical rate, refits the model from fresh starting values (passing the permuted covariate to the model for covariate-aware families), aligns the refit topics to the reference with the Hungarian top-word matcher (topica.validation._hungarian), and records the covariate's effect on every topic. The observed effect for each topic is then compared with the permutation null to compute a two-sided p-value.

Parameters:

Name	Type	Description	Default
`model`	`fitted topica model`	The reference fit. Its type is used to build each permutation refit (`type(model)(num_topics=K, seed=s)`), unless `model_factory` is given. The model must expose `doc_topic`, `topic_word`, and `vocabulary`.	required
`corpus`	`list of token lists or Corpus`	The documents the model was fit on. Each permutation refits on the same documents with a shuffled covariate.	required
`covariate`	`array-like, shape (num_docs,)`	The binary prevalence covariate to test (0/1 or True/False). Must have exactly two unique values; they are mapped to 0 and 1 in sorted order.	required
`n_perm`	`int`	Number of permutation refits. Higher values give more stable p-values; 100 is enough for a screening test, 500 for publication.	`100`
`topics`	`sequence of int`	Restrict the output to these topic indices. Defaults to all topics.	`None`
`topn`	`int`	Top-word count used for topic alignment across refits.	`10`
`seed`	`int`	Master RNG seed. Permutation seeds are derived as `seed + perm_index`.	`0`
`model_factory`	`callable(seed) -> unfitted model`	Override the default `type(model)(num_topics=K, seed=s)` builder. Use this when the model's constructor needs extra arguments (e.g. keyword lists for KeyATM, or content covariates for STM).	`None`
`iters`	`int`	Iterations for each permutation refit. When `None` the refit's default iteration count is used. Pass a smaller value (e.g. `iters=100`) to speed up screening tests.	`None`

Returns:

Type	Description
list of :class:`PermutationResult`	One entry per topic (restricted to `topics` when given), each with the observed effect, the permutation null distribution, and a two-sided p-value.

Notes

For covariate-aware models (STM, DMR, KeyATM) the permuted covariate is passed directly to each refit so the null model is correctly specified (matching R stm's permutationTest behaviour). For covariate-free models (LDA, HDP, etc.) permuting the labels used only in the effect statistic remains a valid null.

The p-value uses the (1 + count) / (1 + n_perm) convention, so it is never exactly zero. Permutation statistics that are NaN (from unmatched topics in variable-K refits such as HDP) are dropped from the null before computing the p-value; the effective n is reduced accordingly.
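As a rough illustration of the p-value convention in the notes, here is a stdlib-only sketch. It covers only the covariate-free case, where the labels are shuffled and the effect statistic recomputed without refitting; there is no topic alignment step. The helper names and the toy numbers are assumptions for illustration, not topica's internals.

```python
import math
import random


def mean_diff(theta_k, labels):
    """Difference in mean prevalence of one topic: group 1 minus group 0."""
    g1 = [t for t, g in zip(theta_k, labels) if g == 1]
    g0 = [t for t, g in zip(theta_k, labels) if g == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)


def permutation_pvalue(observed, null):
    """Two-sided p-value with the (1 + count) / (1 + n) convention.

    NaN entries are dropped first, so n is the effective number of permutations.
    """
    kept = [v for v in null if not math.isnan(v)]
    count = sum(abs(v) >= abs(observed) for v in kept)
    return (1 + count) / (1 + len(kept))


# Toy data (made up): one topic's proportions for 8 documents, plus a binary covariate.
theta_k = [0.70, 0.65, 0.60, 0.72, 0.20, 0.25, 0.30, 0.18]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

observed = mean_diff(theta_k, labels)

rng = random.Random(0)
null = []
for _ in range(200):
    shuffled = labels[:]
    rng.shuffle(shuffled)  # a shuffle keeps the empirical rate of 1s
    null.append(mean_diff(theta_k, shuffled))

pvalue = permutation_pvalue(observed, null)
print(round(observed, 4), pvalue)
```

Because of the added 1 in the numerator, the smallest attainable p-value is 1 / (1 + n), which is why a larger `n_perm` is needed to report small p-values.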

topica.PermutationResult ¶

Result of :func:permutation_test for one topic.

Attributes:

Name	Type	Description
`topic`	`int`	Topic index (aligned to the reference model).
`topic_name`	`str`	Topic label (or `"topic_t"` when no labels are set).
`observed`	`float`	Observed difference in mean prevalence between the two covariate groups.
`null`	`ndarray, shape (n_perm,)`	Per-permutation covariate effects (the null distribution).
`pvalue`	`float`	Two-sided p-value: proportion of permutations whose absolute effect equals or exceeds the absolute observed effect.

as_dict ¶

as_dict() -> dict

Return a plain-dict summary (omits the full null array).

Per-group prevalence with credible bands¶

Model-neutral per-group topic prevalence with posterior credible intervals drawn from the model's retained MCMC theta draws (or the logistic-normal posterior for STM/CTM).

topica.prevalence_ci ¶

prevalence_ci(model, groups, *, ci=0.95, normalize=True, corpus=None, nsims=None, seed=0, labels=None)

Per-group topic prevalence with posterior credible bands.

Splits the documents by groups (one label per document) and, within each group, reports the mean topic prevalence with an empirical credible interval drawn from the model's posterior over theta. For each posterior draw and each group we average theta over the documents in that group, giving a (S, num_groups, num_topics) stack of per-draw group prevalences; the point estimate is the posterior mean over draws and the band is the empirical (1-ci)/2 and (1+ci)/2 quantiles.

This is the draws-based companion to :func:by_strata. by_strata widens a descriptive interval by Rubin's rules (a normal approximation); prevalence_ci reads the credible band straight off the posterior draws, which is what keyATM's plot_timetrend does. It is model-neutral: the draws come from :func:composition_theta, so it prefers a Gibbs model's retained MCMC theta_draws (pass keep_theta_draws=True at fit) and otherwise falls back to the Dirichlet approximation (Gibbs, needs corpus=) or the logistic-normal posterior (STM/CTM). :func:topica.time_prevalence_ci is the dynamic-keyATM wrapper, with groups the timestamps and labels fixed to the model's time_labels.

Parameters:

Name	Description	Default
`model`	A fitted model with a posterior over theta (any Dirichlet or logistic-normal model).	required
`groups`	One label per document; documents are pooled within each distinct value.	required
`ci`	Credible-interval coverage (default 0.95 gives a 95 percent band).	`0.95`
`normalize`	When `True` (default), each per-draw per-group prevalence row is rescaled to sum to 1 before the summary statistics, so it reads as a topic share.	`True`
`corpus`	The `Corpus` (or token lists) the model was fit on. Needed only when the model has no retained `theta_draws` and the Dirichlet fallback must recover document lengths.	`None`
`nsims`	Number of posterior draws. `None` (default) uses all retained MCMC draws as they are; an integer resamples to that many via :func:`composition_theta`.	`None`
`seed`	Seed for the draw sampler (used only on the resample / fallback paths).	`0`
`labels`	Optional explicit ordering of the group labels (matched to `groups` by string form). When omitted, groups are sorted by their string form.	`None`

Returns:

Type	Description
`dict`	`labels` (the group labels in row order), and `mean`, `ci_low`, `ci_high`, `sd` each a `(num_groups, num_topics)` array.
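The draws-based summary can be sketched without topica. The example below assumes the posterior draws are already available as a nested list of shape (S, D, K); the `group_prevalence` and `quantile` helpers are illustrative names, the draws are simulated, and the `sd` output is left out for brevity.

```python
import random


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def group_prevalence(theta_draws, groups, ci=0.95, normalize=True):
    """Summarize per-group topic prevalence from theta draws of shape (S, D, K)."""
    labels = sorted(set(groups), key=str)
    num_topics = len(theta_draws[0][0])
    # per_draw[s][g][k]: mean of theta over the documents in group g, for draw s
    per_draw = []
    for theta in theta_draws:
        rows = []
        for label in labels:
            docs = [theta[d] for d, g in enumerate(groups) if g == label]
            row = [sum(doc[k] for doc in docs) / len(docs) for k in range(num_topics)]
            if normalize:
                total = sum(row)
                row = [v / total for v in row]
            rows.append(row)
        per_draw.append(rows)

    out = {"labels": labels, "mean": [], "ci_low": [], "ci_high": []}
    for g in range(len(labels)):
        means, lows, highs = [], [], []
        for k in range(num_topics):
            vals = sorted(draw[g][k] for draw in per_draw)
            means.append(sum(vals) / len(vals))
            lows.append(quantile(vals, (1 - ci) / 2))
            highs.append(quantile(vals, (1 + ci) / 2))
        out["mean"].append(means)
        out["ci_low"].append(lows)
        out["ci_high"].append(highs)
    return out


# Simulated draws (not from a real model): 40 draws, 6 documents, 3 topics.
rng = random.Random(0)
theta_draws = []
for _ in range(40):
    draw = []
    for _ in range(6):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(3)]
        draw.append([v / sum(g) for v in g])
    theta_draws.append(draw)

groups = ["a", "a", "a", "b", "b", "b"]
summary = group_prevalence(theta_draws, groups)
print(summary["labels"], summary["mean"][0])
```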

Covariate-aware held-out inference¶

Infer topic proportions for new documents using a fitted STM's prevalence model, setting the per-document prior from mu_d = X_d gamma.

topica.stm.transform ¶

transform(model, docs, *, prevalence=None, data=None, formula=None, X=None)

Infer topic proportions for new documents, optionally using prevalence covariates.

When prevalence information is supplied the per-document prior mean is set to mu_d = X_d @ gamma (where gamma = model.prevalence_effects), which mirrors R stm's fitNewDocuments behavior. Without covariates the covariate-free baseline prior learned at fit time is used, giving the same result as model.transform(docs) directly.

The topic-word matrix used is always the marginal model.topic_word; a content model's per-group beta is not applied here. Documents should first be aligned to the fitted vocabulary with :func:align_corpus if the new corpus may contain out-of-vocabulary tokens.

Parameters:

Name	Type	Description	Default
`model`	`fitted STM`	A fitted `topica.STM` with `prevalence_effects` available when covariates are supplied.	required
`docs`	`list[list[str]] or Corpus`	Token lists (or a Corpus) for the new documents.	required
`prevalence`	`array-like, shape (num_docs, F)`	Raw covariate matrix for the new documents, without the intercept column. An intercept is prepended to match how `gamma` was learned. Supply either `prevalence` or `X`; they are equivalent.	`None`
`data`	`DataFrame`	Document-level DataFrame for the new documents. Required when `formula` is given.	`None`
`formula`	`str`	R-style formula string (e.g. `"~ party + author"`). When supplied with `data`, the design matrix is built from the formula using the same column encoding as at fit time (categorical coding, intercept stripping); an intercept is then prepended so the column order matches `gamma`. Formulas with a `spline()` term are rejected here, because their knots would be recomputed on the new documents rather than reused from fit; build the design with `design_matrix_predict` and the fit-time knot context (as :func:`predicted_prevalence` does) and pass it as `X=`.	`None`
`X`	`array-like, shape (num_docs, p)`	Pre-built design matrix without the intercept column. Alternative to `prevalence`; they are equivalent.	`None`

Returns:

Type	Description
`ndarray`	Topic proportions, shape `(num_docs, num_topics)`.
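The prior-mean step, mu_d = X_d @ gamma with an intercept prepended, is small enough to sketch in plain Python. The `prior_means` helper is an illustrative name, and the layout of `gamma` used here (one row per design column with the intercept first, one column per modelled topic dimension) is an assumption rather than a statement about how topica stores `prevalence_effects`.

```python
def prior_means(X, gamma):
    """Prepend an intercept to X and return mu = [1, X] @ gamma, one row per document."""
    design = [[1.0] + list(row) for row in X]
    if any(len(row) != len(gamma) for row in design):
        raise ValueError("gamma needs one row per design column, intercept included")
    num_cols = len(gamma[0])
    return [
        [sum(x * gamma[j][k] for j, x in enumerate(row)) for k in range(num_cols)]
        for row in design
    ]


# Hypothetical coefficients: 2 covariates, so gamma has 3 rows (intercept first).
gamma = [
    [0.5, -0.2],  # intercept
    [1.0, 0.0],   # covariate 1
    [0.0, 2.0],   # covariate 2
]
X_new = [[1.0, 0.0], [0.0, 0.5]]  # raw covariates, no intercept column

mu = prior_means(X_new, gamma)
print(mu)
```

Passing a matrix that already contains an intercept column would shift every coefficient by one position, which is why the parameters above ask for the matrix without it.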

Map new token lists onto the fitted vocabulary before calling transform, dropping any out-of-vocabulary tokens.

topica.align_corpus ¶

align_corpus(new_docs, model)

Restrict token lists to the fitted model's vocabulary before transform.

Each document in new_docs is filtered to keep only tokens that appear in model.vocabulary. Tokens outside that vocabulary are silently dropped. Documents that become empty after filtering are represented as empty lists.

Parameters:

Name	Type	Description	Default
`new_docs`	`list[list[str]]`	Token lists for the new documents (one list per document).	required
`model`	`fitted STM or CTM`	A fitted model with a `vocabulary` attribute (list of strings).	required

Returns:

Type	Description
`list[list[str]]`	Aligned token lists ready to pass to `model.transform` or `topica.stm.transform`. Each output list is a subset of the corresponding input list, with out-of-vocabulary tokens removed.
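The filtering behaviour described here amounts to a per-document membership test. A minimal sketch, with `align_tokens` as an illustrative stand-in that takes the vocabulary list directly instead of a fitted model:

```python
def align_tokens(new_docs, vocabulary):
    """Keep only in-vocabulary tokens; documents may end up empty."""
    vocab = set(vocabulary)  # set lookup keeps the filter linear in corpus size
    return [[tok for tok in doc if tok in vocab] for doc in new_docs]


vocabulary = ["tax", "vote", "court"]
new_docs = [["tax", "reform", "vote"], ["weather"], ["court", "court"]]

aligned = align_tokens(new_docs, vocabulary)
print(aligned)
```

Token order and repeats are preserved, and a document with no known tokens stays in place as an empty list, so document indices still line up with any covariate rows.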

Model selection at fixed K¶

Run multiple initializations at a fixed K and compare candidates on the coherence-exclusivity frontier — the analogue of R stm's selectModel.

topica.select_model ¶

select_model(docs, K, *, runs=20, model='lda', prevalence=None, word_embeddings=None, vocabulary=None, doc_embeddings=None, iters=500, num_samples=3, sample_interval=10, seed=42, coherence_n=10, fraction=None, burn_in_iters=None)

Run N initializations at a fixed K and return the fitted candidates (stm's selectModel).

Every candidate model (runs of them in total) is fit from a different random seed. With fraction set, the procedure uses two stages: a short burn-in (burn_in_iters, defaulting to 20% of iters) followed by full training of the top ceil(fraction * runs) models, ranked by their objective (ELBO where the model has one, else log-likelihood, else mean coherence). This mirrors stm's "run briefly, keep the best ~20%" heuristic.

This is for models whose fit depends on the random seed — the ones that scatter across local optima. ETM, ProdLDA, FASTopic, CombinedTM, and ZeroShotTM all benefit. STM/CTM use a deterministic spectral init, so every run is identical and multi-start buys nothing — pick one of the stochastic models instead. (DTM is not selected here: its topics are time-varying, so coherence/exclusivity are not a single number; use DTM(init="spectral") for a deterministic fit.)

Parameters:

Name	Description	Default
`docs`	Training documents (`list[list[str]]` or a `Corpus`).	required
`K`	Number of topics for every run.	required
`runs`	Number of random initializations.	`20`
`model`	Which model to fit. One of `"lda"` (default), `"stm"`, `"prodlda"`, `"etm"`, `"fastopic"`, `"combinedtm"`, `"zeroshottm"`.	`'lda'`
`prevalence`	Covariate design matrix; required when `model="stm"`.	`None`
`word_embeddings`	`(vocab, dim)` word-embedding matrix; required when `model="etm"` (paired with `vocabulary`).	`None`
`vocabulary`	The word list aligning `word_embeddings` rows; required when `model="etm"`.	`None`
`doc_embeddings`	`(num_docs, dim)` document-embedding matrix; required when `model` is `"fastopic"`, `"combinedtm"`, or `"zeroshottm"`.	`None`
`iters`	Full-training iterations per run (or per survivor when `fraction` is used).	`500`
`num_samples`	Gibbs samples per run (LDA only).	`3`
`sample_interval`	Iterations between Gibbs samples (LDA only).	`10`
`seed`	Base RNG seed; run `r` uses seed `seed + r`.	`42`
`coherence_n`	Top-word count for coherence and exclusivity.	`10`
`fraction`	If given (a float in `(0, 1]`), keep only the top `ceil(fraction * runs)` models (by their objective) after `burn_in_iters` and run those survivors to full `iters`. `None` (default) runs all initializations to full `iters`.	`None`
`burn_in_iters`	Burn-in length used for early discard; defaults to `max(1, round(0.2 * iters))` when `fraction` is set.	`None`

Returns:

Type	Description
`SelectModelResult`	A result with `models`, `coherence`, `exclusivity`, and `run_seeds` arrays of length equal to the number of survivors (all `runs` when `fraction` is `None`).
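The bookkeeping behind the two-stage procedure (per-run seeds, default burn-in length, survivor count, ranking by objective) can be sketched as follows. The `selection_plan` and `keep_top` helpers are hypothetical names, and the objective values are made up. No model is fit here.

```python
import math


def selection_plan(runs=20, iters=500, seed=42, fraction=None, burn_in_iters=None):
    """Return the seeds, burn-in length, and survivor count implied by the arguments."""
    run_seeds = [seed + r for r in range(runs)]  # run r uses seed + r
    if fraction is None:
        # Single stage: every initialization runs to full iters.
        return {"run_seeds": run_seeds, "burn_in_iters": None, "survivors": runs}
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if burn_in_iters is None:
        burn_in_iters = max(1, round(0.2 * iters))
    return {
        "run_seeds": run_seeds,
        "burn_in_iters": burn_in_iters,
        "survivors": math.ceil(fraction * runs),
    }


def keep_top(objectives, n):
    """Indices of the n runs with the highest objective after burn-in."""
    order = sorted(range(len(objectives)), key=lambda i: objectives[i], reverse=True)
    return sorted(order[:n])


plan = selection_plan(runs=8, iters=500, seed=42, fraction=0.25)
# Made-up burn-in objectives (for example ELBO values), one per run.
objectives = [-910.0, -870.5, -955.2, -880.1, -901.7, -930.4, -865.3, -899.9]
survivors = keep_top(objectives, plan["survivors"])
print(plan["burn_in_iters"], survivors)
```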

topica.SelectModelResult ¶

Result of :func:select_model.

Attributes:

Name	Type	Description
`models`	`list`	The N fitted models, one per run.
`coherence`	`ndarray, shape (N,)`	Per-run mean UMass coherence.
`exclusivity`	`ndarray, shape (N,)`	Per-run mean top-word exclusivity.
`run_seeds`	`ndarray, shape (N,)`	Seed used for each run.

Visualize the coherence-versus-exclusivity scatter across candidate runs.

topica.plot_models ¶

plot_models(result, *, ax=None, label_runs=True)

Coherence-vs-exclusivity scatter for :func:select_model candidates (stm's plotModels).

Each point is one run. The upper-right corner is the best region: both coherent (interpretable) and exclusive (distinctive). Use this plot to pick a run from :func:select_model before fitting your full analysis.

Parameters:

Name	Description	Default
`result`	A `SelectModelResult` returned by `select_model`.	required
`ax`	Matplotlib `Axes` to draw on; a new figure is created if `None`.	`None`
`label_runs`	Annotate each point with its run index.	`True`

Returns:

Type	Description
`Axes`	The matplotlib `Axes`.
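One way to shortlist runs from the scatter without eyeballing it is to keep only the runs that no other run beats on both axes. This is a generic Pareto-front filter, not something the library is documented to provide; `pareto_front` is an illustrative name and the scores below are invented.

```python
def pareto_front(coherence, exclusivity):
    """Indices of runs not dominated on (coherence, exclusivity); higher is better on both."""
    points = list(zip(coherence, exclusivity))
    front = []
    for i, (ci, ei) in enumerate(points):
        dominated = any(
            cj >= ci and ej >= ei and (cj > ci or ej > ei)
            for j, (cj, ej) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Invented per-run scores; UMass coherence is typically negative, closer to 0 is better.
coherence = [-1.2, -0.9, -1.5, -1.0]
exclusivity = [0.80, 0.60, 0.85, 0.55]

front = pareto_front(coherence, exclusivity)
print(front)
```

With arrays taken from a `SelectModelResult`, the returned indices would point into its `models` and `run_seeds` in the same order. Choosing among the frontier runs remains a judgment call.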
