Skip to content

Covariates & STM

The Structural Topic Model lets topics depend on document metadata in two ways: prevalence (how much a document discusses each topic) and content (how a topic is worded). For the publication-grade version of this workflow, with proper uncertainty and clustered errors, see Measure effects properly.

Prevalence covariates

import topica

X, names = topica.one_hot(party)                      # design matrix + column names
model = topica.STM(num_topics=20, seed=1)
model.fit(docs, prevalence=X, prevalence_names=names)

model.prevalence_effects        # learned γ
topica.topic_correlation(model.doc_topic)

Content covariates

A content model makes the topic-word distribution vary by group (the SAGE mechanism), so the same topic is phrased differently across, say, conservative and liberal sources:

model = topica.STM(num_topics=20, seed=1)
model.fit(docs, prevalence=X, content=source, content_names=groups)

model.topic_word_by_group        # per-group β
# words that most distinguish how a topic is worded across two groups:
# model.word_contrast(topic, "liberal", "conservative")

Estimating effects

Regress topic proportions on covariates with honest uncertainty, using the method of composition, optionally with clustered standard errors and GLM links:

from topica import stm

draws = stm.posterior_theta_samples(model, nsims=50, seed=0)
effects = stm.estimate_effect(
    draws, X, feature_names=names,
    cluster=source_id,     # cluster-robust SEs for nested data
    # link="logit",        # keep predictions in [0, 1]
)
for e in effects:
    print(e.as_dict())

Build non-linear and interaction terms with stm.spline and stm.interaction. Full detail and the journal-grade treatment are in the Publishing track.

Choosing K for STM

Use search_k, the coherence×exclusivity frontier, and an HDP sanity check. See Choose and justify K.