Skip to content

4. Measure effects properly

Most social-science topic-model papers don't stop at "here are the themes." They ask how topic prevalence relates to something: time, group, treatment, ideology. Doing this credibly means modeling the relationship and reporting honest uncertainty.

Make prevalence depend on covariates: STM

The Structural Topic Model lets a document's topic proportions depend on its metadata. Fit prevalence as a regression on your covariates:

import topica

X, names = topica.one_hot(party)                     # or build any design matrix
model = topica.STM(num_topics=20, seed=1)
model.fit(docs, prevalence=X, prevalence_names=names)

Estimate effects with honest uncertainty

A naive regression of point topic proportions on covariates treats θ as if it were observed exactly. It isn't. R's stm uses the method of composition (Treier & Jackman 2008): draw θ from the model's posterior, regress each draw, and pool by Rubin's rules so the standard errors include topic-estimation uncertainty. topica does the same:

from topica import stm

draws = stm.posterior_theta_samples(model, nsims=50, seed=0)   # (50, D, K)
effects = stm.estimate_effect(draws, X, feature_names=names)
for e in effects:
    d = e.as_dict()
    print(f"Topic {d['topic']}: {names[0]} coef={d[names[0]]['coef']:+.4f} "
          f"z={d[names[0]]['z']:+.2f}")

For non-linear time trends and interactions, build the design matrix with stm.spline and stm.interaction, the same ~ s(year) and ~ a*b you'd write in R.

It is not just for STM

topica.estimate_effect regresses any model's θ on covariates, and topica.by_strata (mean prevalence by group, with intervals) and topica.top_topics work on any model's θ too.

STM and CTM have a logistic-normal posterior, so posterior_theta_samples draws from it directly. A Gibbs model (LDA, keyATM, SeededLDA, ...) has no such posterior, but each document's θ still has a Dirichlet conditional given its length, so dirichlet_theta_samples gives you draws to feed the same method-of-composition machinery:

import numpy as np, topica

model = topica.LDA(num_topics=20, seed=1)
model.fit(docs, iterations=1000)

lengths = np.array([len(d) for d in docs])
draws = topica.dirichlet_theta_samples(model.doc_topic, lengths, nsims=50, seed=0)
effects = topica.estimate_effect(draws, X, feature_names=names)

Pass the point θ (model.doc_topic) instead of draws and you get a plain OLS fit with no uncertainty propagation, which is the right baseline but understates the standard errors.

Cluster your standard errors

Text data is almost always nested: multiple speeches by the same legislator, many tweets per user, several articles per outlet, or, if you split long documents, many chunks per source document. Ignoring this nesting understates uncertainty and is a common reason reviewers reject a result.

Pass a cluster variable and the standard errors become cluster-robust (CR1):

effects = stm.estimate_effect(
    draws, X, feature_names=names,
    cluster=speaker_id,        # one label per document
)

This composes with the method of composition: each posterior draw is clustered, then the per-draw covariances are Rubin-pooled.

Topic proportions live in [0, 1], but OLS on θ can predict values outside that range. For a bounded model, use a fractional-logit link (Papke & Wooldridge), fit by quasi-likelihood with robust standard errors:

effects = stm.estimate_effect(draws, X, feature_names=names, link="logit")
# link="log" gives a quasi-Poisson alternative.

Report effects as a table

import pandas as pd
rows = []
for e in effects:
    d = e.as_dict()
    for feat in e.feature_names:
        rows.append({"topic": d["topic"], "term": feat,
                     "estimate": d[feat]["coef"], "se": d[feat]["se"],
                     "z": d[feat]["z"]})
table = pd.DataFrame(rows)

Report point estimates, (clustered) standard errors, and confidence intervals, and say plainly which effects clear conventional thresholds and which don't.

→ Next: Report and make reproducible.