Topica

topica is a fast topic-modeling library for Python with more than two dozen models, built for social scientists who want to move from text data to publishable results in a single workflow. It brings together models and tools usually split across JVM software like MALLET and R packages like stm, and runs them on a parallel Rust core competitive with the standard implementations, with every fit reproducible from a fixed seed. Each model comes with the validation, covariate-effect, and reporting tools to meet the standards reviewers expect.

pip install topica

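import topica

# Toy corpus: 30 pre-tokenized documents, 15 about animals and 15 about astronomy.
docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
model = topica.LDA(num_topics=2, seed=42)  # fixed seed, so the fit is reproducible
model.fit(docs, iters=1000)

# Print up to five top words per topic; each entry is a pair whose first element is the word.
for i, words in enumerate(model.top_words(5)):
    print(f"Topic {i}:", " ".join(w for w, _ in words))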

Why topica

One package, many models. LDA, ProdLDA, DMR, Labeled LDA, SAGE, CTM, the full STM (prevalence and content covariates) and its STS sentiment-discourse extension, HDP, dynamic topics, supervised LDA, short-text models, and embedding-based models (BERTopic, Top2Vec, ETM, FASTopic). See the models and embedding topics.
Built for social science. Covariate effects with the method of composition, clustered standard errors, GLM links, Fighting Words, intrusion tests, bootstrap stability, and searchK: the things reviewers ask for. See covariates and diagnostics.
Fast and deterministic. A Rust core with bit-for-bit reproducible fits. The variational models parallelize across cores automatically.
No heavy dependencies. A NumPy-only core. Optional extras add what you need: topica[viz] for plots, topica[formula] for the formula interface, topica[polars] for Polars, and topica[llm] for LLM labels and embeddings. PyTorch is never required. See installation.
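
Fighting Words, listed among the social-science tools above, ranks words by how strongly they separate two groups of documents. The sketch below computes the underlying statistic, the z-scored log-odds ratio with a Dirichlet prior from Monroe, Colaresi, and Quinn (2008), in plain Python. It is an illustration only and does not show topica's API: the function name `fighting_words`, the `prior` parameter, and the flat prior of 0.01 are all assumptions, and the sketch expects at least two distinct words across the two groups.

```python
import math
from collections import Counter


def fighting_words(docs_a, docs_b, prior=0.01):
    """Z-scored log-odds ratios with a flat Dirichlet prior.

    `prior` (hypothetical name) is the pseudo-count added to every word.
    Positive scores mark words favored by group A, negative by group B.
    """
    counts_a = Counter(w for doc in docs_a for w in doc)
    counts_b = Counter(w for doc in docs_b for w in doc)
    vocab = sorted(set(counts_a) | set(counts_b))

    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    a0 = prior * len(vocab)  # total prior mass over the vocabulary

    scores = {}
    for w in vocab:
        ya = counts_a[w] + prior  # smoothed count in group A
        yb = counts_b[w] + prior  # smoothed count in group B
        log_odds_a = math.log(ya / (n_a + a0 - ya))
        log_odds_b = math.log(yb / (n_b + a0 - yb))
        delta = log_odds_a - log_odds_b
        variance = 1.0 / ya + 1.0 / yb  # approximate variance of delta
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy groups of tokenized documents; "jobs" appears equally often in both.
group_a = [["tax", "cut", "jobs"], ["tax", "jobs", "growth"]]
group_b = [["health", "care", "jobs"], ["health", "jobs", "access"]]

scores = fighting_words(group_a, group_b)
for word, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>8} {z:+.3f}")
```

Dividing by the estimated standard deviation is what keeps rare words from dominating the ranking: a word seen once in one group and never in the other has a large raw log-odds difference but also a large variance.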

The model families

Model	What it's for
`LDA`	Classic topics via fast collapsed-Gibbs (SparseLDA)
`ProdLDA`	Sharper, more coherent topics via a product-of-experts word model (amortized VAE)
`DMR`	Topics conditioned on document metadata
`LabeledLDA`	Supervised topics tied to document labels
`CTM`	Correlated topics (logistic-normal)
`STM`	Structural Topic Model: prevalence and content covariates
`STS`	Structural Topic and Sentiment-Discourse: covariate-driven topic sentiment on top of STM
`SAGE`	The same topic worded differently across groups
`HDP`	Nonparametric LDA that infers the number of topics
`DTM`	Dynamic topics that evolve across time slices
`SupervisedLDA`	Topics shaped to predict a per-document response
`PT` / `GSDMM`	Short-text models for tweets and survey answers
`SeededLDA` / `KeyATM`	Guided topics steered by seed words
`PA` / `HLDA`	Topic hierarchies (Pachinko, nested-CRP)
`BERTopic` / `Top2Vec`	Cluster document embeddings you supply into topics
`ETM` / `FASTopic`	Generative topics from embeddings (factored β; optimal transport)
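
To make the first row of the table concrete, here is a minimal sketch of plain collapsed Gibbs sampling for LDA in pure Python. It is a teaching illustration, not topica's implementation and not the SparseLDA variant: the function `gibbs_lda` and the parameters `alpha`, `beta`, and `top_n` are hypothetical names chosen for this example. It also shows one way a fit can be made reproducible from a fixed seed, by drawing every random number from a private generator and iterating over a sorted vocabulary.

```python
import random


def gibbs_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0, top_n=3):
    """Plain collapsed Gibbs sampling for LDA; returns the top words per topic."""
    rng = random.Random(seed)  # private RNG: the run depends only on the seed
    vocab = sorted({w for doc in docs for w in doc})  # sorted for a stable word order
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]       # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word v is assigned to topic k
    topic_total = [0] * num_topics                     # tokens assigned to topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token from all counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tv + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Rank words within each topic by their assignment counts.
    top = []
    for k in range(num_topics):
        ranked = sorted(range(V), key=lambda v: topic_word[k][v], reverse=True)
        top.append([vocab[v] for v in ranked[:top_n]])
    return top


docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
first = gibbs_lda(docs, num_topics=2, seed=42)
second = gibbs_lda(docs, num_topics=2, seed=42)
print(first)
print("same seed, same result:", first == second)
```

Each sweep costs time proportional to the number of tokens times the number of topics, which is the cost that sparse samplers such as SparseLDA are designed to reduce.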

Worked examples

End-to-end analyses on real, redistributable corpora:

W.E.B. Du Bois in The Crisis: 704 articles, 1910–1934, the full workflow from preprocessing to dynamic topics.
Gadarian immigration experiment: the canonical STM vignette, reproduced.
Political blogs: STM with ideology and time covariates.
Party platforms: how Democrats and Republicans word the same topics, and how that contrast has shifted across 20 elections (STM content_time).
LLM embeddings and labels: embedding-based topics with contextual document vectors and LLM-generated topic labels.

topica is open source on GitHub.