STM toolkit
The structural / covariate operations live in topica.stm. The general
post-hoc diagnostics (labeling, alignment, pyLDAvis, …) are on the
Diagnostics page.
topica.stm.estimate_effect
estimate_effect(doc_topic, X=None, *, data=None, formula=None, feature_names=None, topics=None, add_intercept=True, ci=0.95, cluster=None, link='identity')
Regress each topic's proportion on document covariates.
Pass a point estimate of θ for a plain OLS fit, or a stack of posterior
draws of θ for the method of composition, the uncertainty-propagating
procedure R stm uses (Treier & Jackman 2008). With draws, each draw is
regressed separately and the results are pooled by Rubin's rules, so the
reported standard errors include the topic-estimation uncertainty, not just
OLS sampling error. Get draws with posterior_theta_samples.
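The pooling step can be sketched in plain NumPy. This is a minimal illustration of Rubin's rules under stated assumptions, not topica's implementation: the `ols` and `rubin_pool` helpers and the simulated draws are hypothetical stand-ins, and a real analysis would pass the draws to estimate_effect instead.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and their classical covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, sigma2 * XtX_inv

def rubin_pool(betas, covs):
    """Pool m per-draw fits: total = within + (1 + 1/m) * between."""
    betas = np.asarray(betas)
    m = betas.shape[0]
    pooled = betas.mean(axis=0)
    within = np.mean(covs, axis=0)        # average per-draw covariance
    dev = betas - pooled
    between = dev.T @ dev / (m - 1)       # spread of estimates across draws
    return pooled, within + (1 + 1 / m) * between

rng = np.random.default_rng(0)
n, m = 200, 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Hypothetical stand-in for m posterior draws of one topic's proportion.
base = 0.3 + 0.05 * X[:, 1]
draws = base + rng.normal(scale=0.02, size=(m, n))

fits = [ols(X, y) for y in draws]
betas = [b for b, _ in fits]
covs = [c for _, c in fits]

pooled, total = rubin_pool(betas, covs)
se = np.sqrt(np.diag(total))                         # pooled standard errors
within_se = np.sqrt(np.diag(np.mean(covs, axis=0)))  # OLS-only part
```

The between-draw term is what carries the topic-estimation uncertainty: the pooled `se` can never be smaller than `within_se`, and it grows as the draws disagree more.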
For paper-grade inference two extras matter:
- cluster: a length-num_docs array of group labels (e.g. speaker, user, outlet). Text data is almost always nested, and ignoring it understates uncertainty. Supplying it switches the standard errors to the cluster-robust (CR1) sandwich estimator. (With posterior draws, each draw is clustered and the per-draw covariances are then Rubin-pooled.)
- link: "identity" (default, OLS), "logit" (fractional logit, via binomial quasi-likelihood), or "log" (quasi-Poisson). Because topic proportions live in [0, 1], the logit link keeps fitted values in bounds where OLS can wander outside them (Papke & Wooldridge). Non-identity links report heteroskedasticity- or cluster-robust standard errors.
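To make the cluster option concrete, here is a minimal NumPy sketch of a CR1 cluster-robust sandwich for OLS. It assumes the common CR1 small-sample factor G/(G-1) · (N-1)/(N-K); the `cr1_ols` helper and the simulated data are hypothetical, and the exact correction topica applies is not stated on this page.

```python
import numpy as np

def cr1_ols(X, y, groups):
    """OLS coefficients with a CR1 cluster-robust covariance (sketch)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    labels = np.unique(groups)
    meat = np.zeros((k, k))
    for g in labels:
        mask = groups == g
        score = X[mask].T @ resid[mask]   # summed score within one cluster
        meat += np.outer(score, score)

    n_groups = len(labels)
    # Assumed CR1 finite-sample correction.
    factor = n_groups / (n_groups - 1) * (n - 1) / (n - k)
    return beta, factor * XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(1)
n_speakers, per_speaker = 30, 10
groups = np.repeat(np.arange(n_speakers), per_speaker)

# Covariate and noise that are shared within each (hypothetical) speaker.
x = rng.normal(size=n_speakers)[groups]
shock = rng.normal(scale=0.05, size=n_speakers)[groups]
y = 0.3 + 0.02 * x + shock + rng.normal(scale=0.02, size=groups.size)

X = np.column_stack([np.ones(groups.size), x])
beta, cov = cr1_ols(X, y, groups)
cluster_se = np.sqrt(np.diag(cov))
```

When every document is its own cluster, this expression reduces to the heteroskedasticity-robust HC1 estimator, which is a handy sanity check.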
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_topic | array | Either a point estimate of θ or a stack of posterior draws of θ. | required |
| X | array(num_docs, p) | Document covariates (design matrix); build nonlinear/interaction terms with spline / interaction. | None |
| feature_names | list[str] | Column names for X. | None |
| topics | sequence[int] | Restrict to these topics. Defaults to all. | None |
| ci | float | Confidence level for the (normal-approximation) intervals. | 0.95 |
Returns:
| Type | Description |
|---|---|
| list[TopicEffect] | One regression per topic. |
topica.stm.posterior_theta_samples
Draw nsims samples of the document-topic matrix θ from the variational
posterior of a fitted STM or CTM.
Each document's logistic-normal posterior is η_d ~ N(λ_d, ν_d) (from
model.eta_mean / model.eta_cov); a draw of η is mapped through the
softmax (with the reference category fixed at 0) to a θ row. Feed the result
to estimate_effect for method-of-composition uncertainty.
Returns an array of shape (nsims, num_docs, num_topics).
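The sampling step described above can be sketched as follows. This is an illustration, not the library's code: it assumes eta_mean has shape (num_docs, K-1), eta_cov has shape (num_docs, K-1, K-1), and that the reference category is the last topic. The `sample_theta` helper and the toy inputs are hypothetical.

```python
import numpy as np

def sample_theta(eta_mean, eta_cov, nsims, seed=0):
    """Draw theta from per-document logistic-normal posteriors (sketch)."""
    rng = np.random.default_rng(seed)
    num_docs, k_minus_1 = eta_mean.shape
    theta = np.empty((nsims, num_docs, k_minus_1 + 1))
    for d in range(num_docs):
        # eta_d ~ N(lambda_d, nu_d), one row per simulation
        eta = rng.multivariate_normal(eta_mean[d], eta_cov[d], size=nsims)
        # Append the reference category, fixed at 0 (assumed to be last).
        eta = np.column_stack([eta, np.zeros(nsims)])
        # Numerically stable softmax over topics.
        eta -= eta.max(axis=1, keepdims=True)
        expd = np.exp(eta)
        theta[:, d, :] = expd / expd.sum(axis=1, keepdims=True)
    return theta

# Toy variational parameters standing in for model.eta_mean / model.eta_cov.
rng = np.random.default_rng(2)
num_docs, num_topics = 4, 3
eta_mean = rng.normal(size=(num_docs, num_topics - 1))
A = rng.normal(scale=0.3, size=(num_docs, num_topics - 1, num_topics - 1))
eta_cov = A @ np.swapaxes(A, 1, 2) + 0.01 * np.eye(num_topics - 1)

theta = sample_theta(eta_mean, eta_cov, nsims=50)  # (nsims, num_docs, num_topics)
```

Every θ row is strictly positive and sums to 1, so each slice `theta[s]` is a valid document-topic matrix that can be regressed on covariates.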
topica.stm.spline
Restricted (natural) cubic-spline basis for a covariate — the building
block for nonlinear prevalence terms like R stm's ~ s(day).
Uses Harrell's restricted-cubic-spline parameterization: df+1 knots (at
evenly spaced quantiles of x unless knots is given) yield df basis
columns whose first is the linear term. np.column_stack the result into
your design matrix and extend feature_names with the returned names.
Returns (basis (n, df), names).
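For readers who want to see what such a basis looks like, here is a minimal NumPy sketch of Harrell's restricted cubic spline formula. It is not topica's implementation: the `rcs_basis` and `spline_sketch` helpers, the column-name scheme, and the choice of interior evenly spaced quantiles for the knots are all assumptions made for illustration.

```python
import numpy as np

def rcs_basis(x, knots):
    """Harrell-style restricted cubic spline basis: k knots -> k-1 columns."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    scale = (t[-1] - t[0]) ** 2          # Harrell's normalization
    gap = t[-1] - t[-2]

    def pos3(u):
        return np.maximum(u, 0.0) ** 3   # truncated cubic (u)_+^3

    cols = [x]                           # first column is the linear term
    for j in range(len(t) - 2):
        col = (pos3(x - t[j])
               - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
               + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        cols.append(col / scale)
    return np.column_stack(cols)

def spline_sketch(x, df=3, name="x"):
    """df+1 knots at interior evenly spaced quantiles (assumed placement)."""
    knots = np.quantile(x, np.linspace(0, 1, df + 3)[1:-1])
    names = [name] + [f"{name}_rcs{j}" for j in range(1, df)]
    return rcs_basis(x, knots), names, knots

rng = np.random.default_rng(3)
day = rng.uniform(0, 100, size=500)
basis, names, knots = spline_sketch(day, df=4, name="day")

# Stack into a design matrix and extend the feature names to match.
design = np.column_stack([np.ones(len(day)), basis])
feature_names = ["intercept"] + names
```

The "restricted" part is visible in the formula: the cubic and quadratic terms cancel beyond the last knot, so each column is linear in the tails, which keeps extrapolation tame.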
topica.stm.interaction
Interaction columns between two covariate blocks (all pairwise products of
their columns) — for terms like R stm's ~ treatment * party.
a, b are 1-D or 2-D arrays with the same number of rows. Returns
(products (n, ncols), names); np.column_stack into your design matrix.
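The pairwise-product construction can be sketched in a few lines of NumPy. The `pairwise_products` helper, the "a:b" naming scheme, and the toy treatment/party data are hypothetical and only illustrate the idea; they are not topica's code.

```python
import numpy as np

def pairwise_products(a, b, a_names, b_names):
    """All products of a's columns with b's columns (sketch)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[:, None] if a.ndim == 1 else a   # promote 1-D input to one column
    b = b[:, None] if b.ndim == 1 else b
    if a.shape[0] != b.shape[0]:
        raise ValueError("a and b must have the same number of rows")
    # (n, pa, 1) * (n, 1, pb) -> (n, pa, pb), flattened to (n, pa * pb)
    products = (a[:, :, None] * b[:, None, :]).reshape(a.shape[0], -1)
    names = [f"{an}:{bn}" for an in a_names for bn in b_names]
    return products, names

# Toy data: a binary treatment and dummies for two of three parties
# (the third party is the omitted reference level).
treatment = np.array([0, 1, 1, 0])
party = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])

products, names = pairwise_products(
    treatment, party, ["treatment"], ["party_b", "party_c"]
)

# Main effects plus interactions, as in a treatment * party term.
design = np.column_stack([np.ones(4), treatment, party, products])
```

Note that a `treatment * party` term needs the main-effect columns as well as the products, which is why both blocks are stacked alongside the interaction columns.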