STM toolkit

The structural / covariate operations live in topica.stm. The general post-hoc diagnostics (labeling, alignment, pyLDAvis, …) are on the Diagnostics page.

topica.stm.estimate_effect

estimate_effect(doc_topic, X=None, *, data=None, formula=None, feature_names=None, topics=None, add_intercept=True, ci=0.95, cluster=None, link='identity')

Regress each topic's proportion on document covariates.

Pass a point estimate of θ for plain OLS, or a stack of posterior draws of θ for the method of composition, the uncertainty-propagating procedure R stm uses (Treier & Jackman 2008). With draws, each draw is regressed separately and the results are pooled by Rubin's rules, so the reported standard errors include topic-estimation uncertainty, not just OLS sampling error. Get draws with posterior_theta_samples.
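The pooling step described above can be sketched with NumPy alone. This is a minimal illustration of Rubin's rules for a single topic, not topica's implementation: the helper name pool_ols_over_draws and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pool_ols_over_draws(theta_draws, X):
    """Hypothetical helper: OLS per draw, pooled by Rubin's rules.

    theta_draws: (nsims, num_docs) draws of ONE topic's proportion.
    X: (num_docs, p) design matrix with the intercept column already included.
    Returns the pooled coefficients, total covariance, and within-draw covariance.
    """
    nsims, n = theta_draws.shape
    p = X.shape[1]
    xtx_inv = np.linalg.inv(X.T @ X)
    betas = np.empty((nsims, p))
    covs = np.empty((nsims, p, p))
    for s in range(nsims):
        y = theta_draws[s]
        beta = xtx_inv @ (X.T @ y)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - p)      # classical OLS error variance
        betas[s] = beta
        covs[s] = sigma2 * xtx_inv
    beta_bar = betas.mean(axis=0)             # pooled point estimate
    within = covs.mean(axis=0)                # average sampling covariance
    centered = betas - beta_bar
    between = centered.T @ centered / (nsims - 1)   # spread across draws
    total = within + (1.0 + 1.0 / nsims) * between  # Rubin's total covariance
    return beta_bar, total, within

# Synthetic stand-in for posterior draws of one topic's proportion.
rng = np.random.default_rng(0)
n, nsims = 200, 25
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
theta_draws = (0.3 + 0.05 * x) + rng.normal(scale=0.02, size=(nsims, n))

beta_bar, total, within = pool_ols_over_draws(theta_draws, X)
se_pooled = np.sqrt(np.diag(total))
se_within = np.sqrt(np.diag(within))
```

The pooled standard errors are never smaller than the within-draw ones, because the between-draw term is added on top; that extra term is the topic-estimation uncertainty.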

For publication-quality inference, two extra arguments matter:

  • cluster: a length-num_docs array of group labels (e.g. speaker, user, outlet). Text data is almost always nested, and ignoring that nesting typically understates uncertainty. Supplying cluster switches the standard errors to the cluster-robust (CR1) sandwich estimator. (With posterior draws, each draw is clustered and the per-draw covariances are then Rubin-pooled.)
  • link: "identity" (the default, OLS), "logit" (fractional logit, via binomial quasi-likelihood), or "log" (quasi-Poisson). Because topic proportions live in [0, 1], the logit link keeps fitted values in bounds where OLS can wander outside them (Papke & Wooldridge). Non-identity links report heteroskedasticity- or cluster-robust standard errors.
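To make the cluster option concrete, here is a minimal NumPy sketch of a cluster-robust sandwich covariance for OLS. It is not topica's code: the helper name cr1_cov is hypothetical, and the finite-sample factor G/(G-1) · (n-1)/(n-p) is the Stata-style convention, assumed here because the source does not say which CR1 scaling is used.

```python
import numpy as np

def cr1_cov(X, resid, groups):
    """Hypothetical helper: cluster-robust (CR1-style) covariance for OLS.

    X: (n, p) design matrix, resid: (n,) OLS residuals,
    groups: (n,) cluster labels.
    """
    n, p = X.shape
    bread = np.linalg.inv(X.T @ X)
    labels = np.unique(groups)
    num_groups = len(labels)
    meat = np.zeros((p, p))
    for g in labels:
        mask = groups == g
        score = X[mask].T @ resid[mask]   # summed score within one cluster
        meat += np.outer(score, score)
    # Assumed small-sample correction (Stata-style).
    correction = (num_groups / (num_groups - 1)) * ((n - 1) / (n - p))
    return correction * bread @ meat @ bread

# Synthetic nested data: 20 groups of 10 documents with a shared group shock.
rng = np.random.default_rng(1)
n = 200
groups = np.repeat(np.arange(20), 10)
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
group_shock = rng.normal(scale=0.05, size=20)[groups]
y = 0.3 + 0.05 * x + group_shock + rng.normal(scale=0.02, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
cov_cluster = cr1_cov(X, resid, groups)
se_cluster = np.sqrt(np.diag(cov_cluster))
```

With every document in its own cluster, this expression reduces to the familiar HC1 heteroskedasticity-robust covariance, which is a convenient sanity check.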

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_topic | array | Either (num_docs, num_topics), a point θ (model.doc_topic) for plain OLS, or (nsims, num_docs, num_topics), posterior θ draws for method-of-composition pooling. | required |
| X | array (num_docs, p) | Document covariates (design matrix); build nonlinear/interaction terms with spline / interaction. An intercept is prepended when add_intercept is True. | None |
| feature_names | list[str] | Column names for X. Defaults to feature_0 .... | None |
| topics | sequence[int] | Restrict to these topics. Defaults to all. | None |
| ci | float | Confidence level for the (normal-approximation) intervals. | 0.95 |

Returns:

| Type | Description |
| --- | --- |
| list[TopicEffect] | One regression per topic. [e.as_dict() for e in result] builds a table. |

topica.stm.posterior_theta_samples

posterior_theta_samples(model, nsims=25, seed=0)

Draw nsims samples of the document-topic matrix θ from the variational posterior of a fitted STM or CTM.

Each document's logistic-normal posterior is η_d ~ N(λ_d, ν_d) (from model.eta_mean / model.eta_cov); a draw of η is mapped through the softmax (with the reference category fixed at 0) to a θ row. Feed the result to estimate_effect for method-of-composition uncertainty.

Returns an array of shape (nsims, num_docs, num_topics).
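The sampling step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name sample_theta is hypothetical, and the shapes assumed for the inputs, (num_docs, K-1) for the means and (num_docs, K-1, K-1) for the covariances, are guesses about what model.eta_mean and model.eta_cov hold.

```python
import numpy as np

def sample_theta(eta_mean, eta_cov, nsims=25, seed=0):
    """Hypothetical sketch: draw theta from per-document logistic-normal posteriors.

    eta_mean: (num_docs, K-1) posterior means (assumed shape).
    eta_cov:  (num_docs, K-1, K-1) posterior covariances (assumed shape).
    Returns an array of shape (nsims, num_docs, K).
    """
    rng = np.random.default_rng(seed)
    num_docs, k_minus_1 = eta_mean.shape
    theta = np.empty((nsims, num_docs, k_minus_1 + 1))
    for d in range(num_docs):
        eta = rng.multivariate_normal(eta_mean[d], eta_cov[d], size=nsims)
        # Append the reference category, whose eta is fixed at 0.
        eta_full = np.hstack([eta, np.zeros((nsims, 1))])
        # Numerically stable softmax over the K categories.
        eta_full -= eta_full.max(axis=1, keepdims=True)
        expd = np.exp(eta_full)
        theta[:, d, :] = expd / expd.sum(axis=1, keepdims=True)
    return theta

# Toy posterior: 4 documents, 3 topics (so eta has 2 free dimensions).
eta_mean = np.array([[0.5, -0.2], [0.0, 0.0], [1.0, 1.0], [-1.0, 0.3]])
eta_cov = np.tile(0.1 * np.eye(2), (4, 1, 1))
theta_draws = sample_theta(eta_mean, eta_cov, nsims=25, seed=0)
```

Every drawn row is a valid probability vector, so the stack can be passed to a per-draw regression without further normalization.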

topica.stm.spline

spline(x, df=4, knots=None)

Restricted (natural) cubic-spline basis for a covariate — the building block for nonlinear prevalence terms like R stm's ~ s(day).

Uses Harrell's restricted-cubic-spline parameterization: df+1 knots (at evenly spaced quantiles of x unless knots is given) yield df basis columns, the first of which is the linear term. Append the result to your design matrix with np.column_stack and extend feature_names with the returned names.

Returns (basis (n, df), names).
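For readers unfamiliar with the parameterization, a restricted cubic spline basis in Harrell's form can be written in a few lines of NumPy. This is a sketch rather than topica's implementation: the name rcs_basis is hypothetical, and both the default knot placement (df+1 interior quantiles) and the scaling by the squared knot range are assumptions.

```python
import numpy as np

def rcs_basis(x, df=4, knots=None):
    """Hypothetical sketch of a restricted (natural) cubic spline basis.

    With k = df + 1 knots this returns df columns; column 0 is x itself.
    """
    x = np.asarray(x, dtype=float)
    if knots is None:
        # Assumed placement: df + 1 evenly spaced interior quantiles.
        probs = np.linspace(0.0, 1.0, df + 3)[1:-1]
        knots = np.quantile(x, probs)
    t = np.asarray(knots, dtype=float)
    k = len(t)

    def cube_plus(u):
        # Truncated cubic: max(u, 0) ** 3
        return np.maximum(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2          # keeps columns on a scale similar to x
    last_gap = t[k - 1] - t[k - 2]
    cols = [x]
    for j in range(k - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[k - 2]) * (t[k - 1] - t[j]) / last_gap
            + cube_plus(x - t[k - 1]) * (t[k - 2] - t[j]) / last_gap
        )
        cols.append(term / scale)
    return np.column_stack(cols)

x = np.linspace(0.0, 10.0, 200)
basis = rcs_basis(x, df=4)

# Beyond the last knot every column is linear in x (the "restricted" part).
tail = rcs_basis(np.array([5.0, 6.0, 7.0, 8.0]), knots=[0.0, 1.0, 2.0, 3.0, 4.0])
```

The cubic and quadratic terms cancel past the outer knots, which is why the fitted curve extrapolates linearly instead of exploding in the tails.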

topica.stm.interaction

interaction(a, b, name='interaction')

Interaction columns between two covariate blocks (all pairwise products of their columns) — for terms like R stm's ~ treatment * party.

a and b are 1-D or 2-D arrays with the same number of rows. Returns (products (n, ncols), names); append the products to your design matrix with np.column_stack.
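The "all pairwise products" construction is straightforward to sketch with broadcasting. The helper below is illustrative only: the name pairwise_products, the column ordering, and the naming scheme for the output columns are assumptions, not the library's behavior.

```python
import numpy as np

def pairwise_products(a, b, name="interaction"):
    """Hypothetical sketch: all pairwise column products of two blocks."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    if a.shape[0] != b.shape[0]:
        raise ValueError("a and b must have the same number of rows")
    # (n, pa, 1) * (n, 1, pb) -> (n, pa, pb), flattened to (n, pa * pb)
    products = (a[:, :, None] * b[:, None, :]).reshape(a.shape[0], -1)
    names = [
        f"{name}_{i}x{j}"                 # assumed naming scheme
        for i in range(a.shape[1])
        for j in range(b.shape[1])
    ]
    return products, names

treatment = np.array([0.0, 1.0, 1.0, 0.0])
party = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
products, names = pairwise_products(treatment, party, name="treat_x_party")

# Main effects are stacked alongside the products to form the design matrix.
design = np.column_stack([treatment, party, products])
```

Note that an R formula such as ~ treatment * party expands to the main effects plus their products, so the main-effect columns are stacked explicitly here.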