Topica
topica is a fast topic-modeling library for Python with more than a dozen
models, built for social scientists who want to move from text data to
publishable results in a single workflow. It brings together models and tools
usually split across JVM software like MALLET and R packages like stm, and
runs them on a parallel Rust core competitive with the standard implementations,
with every fit reproducible from a fixed seed. Each model comes with the
validation, covariate-effect, and reporting tools to meet the standards
reviewers expect.
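```python
import topica

# Toy corpus: 15 "animal" documents and 15 "space" documents, already tokenized.
docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15

# Fit a two-topic LDA with a fixed seed.
model = topica.LDA(num_topics=2, seed=42)
model.fit(docs, iterations=1000)
# Each entry of a topic's top words is a pair whose first item is the word.
for i, words in enumerate(model.top_words(5)):
    print(f"Topic {i}:", " ".join(w for w, _ in words))
```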
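The seed-reproducibility claim above can be checked directly. The following is a minimal sketch, not part of topica: the helper names `top_word_lists` and `same_seed_same_topics` are hypothetical, and the only topica calls used are the ones shown in the quick start (`topica.LDA(num_topics=..., seed=...)`, `fit(docs, iterations=...)`, `top_words(n)`). It fits two fresh models with the same seed and compares their top words.

```python
try:
    import topica  # the library documented here
except ImportError:  # keeps the helpers importable when topica is not installed
    topica = None


def top_word_lists(model, n=5):
    """Return each topic's top-n words as plain lists, dropping the second item of each pair."""
    return [[w for w, _ in words] for words in model.top_words(n)]


def same_seed_same_topics(make_model, docs, iterations=200, n=5):
    """Fit two fresh models built by make_model() and report whether their top words match."""
    results = []
    for _ in range(2):
        model = make_model()
        model.fit(docs, iterations=iterations)
        results.append(top_word_lists(model, n))
    return results[0] == results[1]


if topica is not None:
    docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
    ok = same_seed_same_topics(lambda: topica.LDA(num_topics=2, seed=42), docs)
    print("identical top words across two fits:", ok)
```

Passing a factory rather than a model instance means each fit starts from a freshly constructed model, so the comparison does not depend on any state left over from an earlier fit.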
Why topica
- One package, many models. LDA, DMR, Labeled LDA, SAGE, CTM, the full STM (prevalence and content covariates), HDP, dynamic topics, supervised LDA, short-text models, and embedding-based models (BERTopic, Top2Vec, ETM, FASTopic). See the models and embedding topics.
- Built for social science. Covariate effects with the method of composition, clustered standard errors, GLM links, Fighting Words, intrusion tests, bootstrap stability, and searchK: the things reviewers ask for. See covariates and diagnostics.
- Fast and deterministic. A Rust core with bit-for-bit reproducible fits. The variational models parallelize across cores automatically.
- No heavy dependencies. NumPy is the only required dependency. Optional integrations (pyLDAvis, matplotlib) are enabled when those packages are installed.
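For readers unfamiliar with Fighting Words, mentioned in the list above, here is a rough standalone sketch of the idea: a z-scored log-odds ratio with a Dirichlet prior, comparing word use between two groups of documents. It is not topica's implementation or API. The function name `fighting_words` is hypothetical, and the flat prior (the same small pseudo-count for every word) is a simplifying assumption; an informative prior built from corpus frequencies is a common alternative.

```python
import math
from collections import Counter


def fighting_words(docs_a, docs_b, prior=0.01):
    """Return {word: z-score}; positive scores mark words more typical of group A."""
    counts_a = Counter(w for doc in docs_a for w in doc)
    counts_b = Counter(w for doc in docs_b for w in doc)
    vocab = sorted(set(counts_a) | set(counts_b))
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words to form odds")
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)  # flat prior: an assumption of this sketch
    scores = {}
    for w in vocab:
        ya = counts_a[w] + prior  # smoothed count in group A
        yb = counts_b[w] + prior  # smoothed count in group B
        # Difference of smoothed log-odds between the two groups.
        delta = (math.log(ya / (n_a + prior_total - ya))
                 - math.log(yb / (n_b + prior_total - yb)))
        # Approximate variance of that difference.
        variance = 1.0 / ya + 1.0 / yb
        scores[w] = delta / math.sqrt(variance)
    return scores


group_a = [["tax", "cut", "tax"], ["tax", "growth"]]
group_b = [["health", "care"], ["care", "tax"]]
for word, z in sorted(fighting_words(group_a, group_b).items(), key=lambda kv: kv[1]):
    print(f"{word:>8} {z:+.2f}")
```

Dividing by the estimated standard deviation is what keeps rare words from dominating the ranking: a word seen once in one group has a large raw log-odds difference but also a large variance.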
The model families
| Model | What it's for |
|---|---|
| LDA | Classic topics via fast collapsed-Gibbs (SparseLDA) |
| DMR | Topics conditioned on document metadata |
| LabeledLDA | Supervised topics tied to document labels |
| CTM | Correlated topics (logistic-normal) |
| STM | Structural Topic Model: prevalence and content covariates |
| SAGE | The same topic worded differently across groups |
| HDP | Nonparametric LDA that infers the number of topics |
| DTM | Dynamic topics that evolve across time slices |
| SupervisedLDA | Topics shaped to predict a per-document response |
| PT / GSDMM | Short-text models for tweets, survey answers |
| SeededLDA / KeyATM | Guided topics steered by seed words |
| PA / HLDA | Topic hierarchies (Pachinko, nested-CRP) |
| BERTopic / Top2Vec | Cluster document embeddings you supply into topics |
| ETM / FASTopic | Generative topics from embeddings (factored β; optimal transport) |
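To make "collapsed Gibbs" in the LDA row concrete, here is a small pure-Python sketch of the plain (dense) collapsed Gibbs sampler for LDA. It is for illustration only and is not topica's code: the library's sampler is the Rust SparseLDA variant named in the table, while the function `gibbs_lda`, its parameters, and the default hyperparameters below are assumptions of this sketch.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Dense collapsed Gibbs sampling for LDA; returns (vocab, doc_topic, topic_word) counts."""
    rng = random.Random(seed)  # private generator, so a fixed seed gives a fixed run
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    num_words = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * num_words for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # p(z = t | everything else) is proportional to
                # (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + num_words * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word


docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
vocab, doc_topic, topic_word = gibbs_lda(docs, num_topics=2, iterations=50, seed=42)
for k, row in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: -row[v])[:3]
    print(f"Topic {k}:", " ".join(vocab[v] for v in top))
```

"Collapsed" refers to integrating out the topic-word and document-topic distributions, so the sampler only tracks count tables and per-token topic assignments. Sparse variants such as SparseLDA reorganize the same conditional so that most of the per-token work touches only nonzero counts.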
Worked examples
Three end-to-end analyses on real, redistributable corpora:
- W.E.B. Du Bois in The Crisis: 704 articles, 1910–1934, the full workflow from preprocessing to dynamic topics.
- Gadarian immigration experiment: the canonical STM vignette, reproduced.
- Political blogs: STM with ideology and time covariates.
topica is open source on GitHub.