Structural Topic Model: the stm vignette
Source. Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R
Package for Structural Topic Models. Journal of Statistical Software, 91(2).
The stm package is the field standard for prevalence- and content-covariate
topic models in the social sciences.
topica's STM reimplements the same model: correlated topics with a prevalence
regression, a content (SAGE) covariate, spectral initialization, and effect
estimation by the method of composition. This page asks whether it produces the
same answers as R's stm.
What "replicate" means for STM
STM is fit by variational EM, which is non-convex: the objective has many local
optima, and the solution depends on where the optimization starts. R's own stm
does not return one canonical answer. Fit it twice from different random seeds
and the two topic-word matrices agree only to a cosine of about 0.81. So the bar
is not bit-identical output. It is statistical: under a matched initialization,
topica should land in the same neighborhood of solutions that R lands in, and its
agreement with R should sit inside the spread of R's agreement with itself.
We feed identical integer-coded documents to both engines and align topics
one-to-one before comparing. The harness lives in parity/stm_r_compare.py and
parity/stm_content_r_compare.py.
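The one-to-one alignment step can be illustrated with a minimal sketch. This is not the code in the parity harness: the function names `cosine` and `aligned_cosine` are hypothetical, the toy matrices are made up, and the brute-force search over permutations is only practical for small K (an assignment solver such as the Hungarian algorithm would be the usual choice for larger K).

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def aligned_cosine(topics_a, topics_b):
    """Find the one-to-one topic matching with the highest mean cosine.

    topics_a, topics_b: K rows each, one word distribution per topic.
    Returns (perm, score), where topic i of A is matched to topic perm[i] of B.
    Brute force over all K! permutations, so only suitable for small K.
    """
    k = len(topics_a)
    sim = [[cosine(a, b) for b in topics_b] for a in topics_a]
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(k)):
        score = sum(sim[i][perm[i]] for i in range(k)) / k
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Toy example (made-up numbers): the second fit lists the same two topics
# in the opposite order, so the best matching swaps them.
fit_a = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
fit_b = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
perm, score = aligned_cosine(fit_a, fit_b)
print(perm, round(score, 3))
```

Aligning first matters because topic labels are arbitrary: two fits can recover the same topics in a different order, and comparing row i to row i without alignment would understate their agreement.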
Content model: exact agreement
The content (SAGE) covariate is the deterministic part of STM: given the topic
assignments, the per-group word distributions follow in closed form. Here topica
and R agree exactly. On a bilingual corpus fit with content = ~group, K = 2,
the best-aligned cosine between R's and topica's per-group word distributions is
1.000 in both groups, and both engines separate the two topics rather than
collapsing them (topic-separation near 0 in each):
| Content group | topica–R cosine | Both separated |
|---|---|---|
| de | 1.000 | yes |
| en | 1.000 | yes |
This is the path where a symmetric-initialization bug once collapsed all topics to the background; the exact match against R is how we know it is fixed.
Prevalence model: same neighborhood as R
For the prevalence model we compare topica's spectral fit to R's spectral fit on a 339-document, 303-word corpus, against the floor of R's agreement with itself:
| Comparison | aligned cosine |
|---|---|
| R Spectral vs R Random (R's own basin spread) | 0.62 |
| R Random vs R Random (R's self-consistency) | 0.81 |
| R Spectral vs topica Spectral | 0.51 |
topica's agreement with R (0.51) sits 0.11 below R's own Spectral-versus-Random agreement (0.62), a smaller gap than the 0.19 that separates R's two self-comparisons (0.62 and 0.81). The two engines find the same family of solutions and differ by the local optimum the optimizer settled in, as two R runs do. This is the expected behavior for a non-convex model, not a discrepancy: there is no single STM fit to reproduce.
What does replicate stably across optima is the substantive conclusion. The
Poliblog and Gadarian
worked examples refit the canonical stm vignettes end to end and recover the
same prevalence effects the package documents, with honest standard errors from
the method of composition.
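For readers unfamiliar with the method of composition, here is a minimal sketch of the idea for a single topic and a single covariate. It is not topica's or stm's implementation: the function names `ols_slope` and `composition_effect` are hypothetical, the data are simulated, and a univariate normal draw stands in for the multivariate draw a full regression would use. Each posterior draw of the topic proportions is regressed on the covariate, a coefficient is drawn from that regression's sampling distribution, and the pooled draws carry both sources of uncertainty.

```python
import random
from statistics import mean, stdev


def ols_slope(x, y):
    """Slope estimate and its standard error for the regression y ~ a + b*x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return b, se


def composition_effect(x, theta_draws, rng):
    """Pool slope draws across posterior draws of one topic's proportions.

    theta_draws: list of draws; each draw holds one proportion per document.
    Returns the mean and standard deviation of the pooled slope draws.
    """
    pooled = []
    for theta in theta_draws:
        b, se = ols_slope(x, theta)
        # Draw a slope from this regression's approximate sampling distribution.
        pooled.append(rng.gauss(b, se))
    return mean(pooled), stdev(pooled)


# Simulated data (assumption): a binary covariate shifts the topic
# proportion by 0.3, and each posterior draw adds noise per document.
rng = random.Random(0)
x = [0.0] * 50 + [1.0] * 50
theta_draws = [
    [0.2 + 0.3 * xi + rng.gauss(0.0, 0.05) for xi in x] for _ in range(200)
]
effect, effect_se = composition_effect(x, theta_draws, rng)
print(round(effect, 2), effect_se > 0)
```

The pooled standard deviation is wider than the standard error from any single regression because it also reflects variation across the posterior draws of the topic proportions, which is what keeps the reported uncertainty from being overconfident.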
Speed
On matched iterations from a spectral start, topica fits the same model 3–22×
faster than R stm, single-threaded, and more with multiple cores, since topica
parallelizes the variational E-step while stm is single-threaded. The full
table is on the benchmarks page.