Topica

topica is a fast topic-modeling library for Python with more than a dozen models, built for social scientists who want to move from text data to publishable results in a single workflow. It brings together models and tools usually split across JVM software like MALLET and R packages like stm, and runs them on a parallel Rust core whose speed is competitive with those standard implementations. Every fit is reproducible from a fixed seed, and each model comes with the validation, covariate-effect, and reporting tools needed to meet the standards reviewers expect.

pip install topica

import topica

docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
model = topica.LDA(num_topics=2, seed=42)
model.fit(docs, iterations=1000)

for i, words in enumerate(model.top_words(5)):
    print(f"Topic {i}:", " ".join(w for w, _ in words))

Why topica

  • One package, many models. LDA, DMR, Labeled LDA, SAGE, CTM, the full STM (prevalence and content covariates), HDP, dynamic topics, supervised LDA, short-text models, and embedding-based models (BERTopic, Top2Vec, ETM, FASTopic). See the models and embedding topics.
  • Built for social science. Covariate effects with the method of composition, clustered standard errors, GLM links, Fighting Words, intrusion tests, bootstrap stability, and searchK: the things reviewers ask for. See covariates and diagnostics.
  • Fast and deterministic. A Rust core with bit-for-bit reproducible fits. The variational models parallelize across cores automatically.
  • No heavy dependencies. NumPy only. Optional integrations (pyLDAvis, matplotlib) light up if installed.
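Of the tools named above, Fighting Words is compact enough to sketch in a few lines. The block below is a minimal, standard-library illustration of the weighted log-odds z-score from Monroe, Colaresi and Quinn (2008), assuming a flat Dirichlet prior. It is not topica's implementation, and the function name `fighting_words` and its `prior` parameter are hypothetical names chosen for this example; the source does not show topica's own interface for this feature.

```python
import math
from collections import Counter


def fighting_words(tokens_a, tokens_b, prior=0.01):
    """Weighted log-odds z-scores for two token lists.

    Assumption: a flat Dirichlet prior (`prior` pseudo-counts per word).
    Positive scores mark words over-represented in group A,
    negative scores mark words over-represented in group B.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)

    scores = {}
    for word in vocab:
        y_a, y_b = counts_a[word], counts_b[word]
        # Smoothed log-odds of the word within each group.
        log_odds_a = math.log((y_a + prior) / (n_a + prior_total - y_a - prior))
        log_odds_b = math.log((y_b + prior) / (n_b + prior_total - y_b - prior))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (y_a + prior) + 1.0 / (y_b + prior)
        scores[word] = delta / math.sqrt(variance)
    return scores


# Toy data, invented for this example.
group_a = "tax cut tax jobs economy".split()
group_b = "health care jobs health economy".split()
scores = fighting_words(group_a, group_b)

for word, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {z:+.2f}")
```

Words used equally often in both groups score zero, and swapping the two groups flips every sign. In practice the prior is often estimated from a larger background corpus rather than set to a constant.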

The model families

Model                   What it's for
LDA                     Classic topics via fast collapsed-Gibbs (SparseLDA)
DMR                     Topics conditioned on document metadata
LabeledLDA              Supervised topics tied to document labels
CTM                     Correlated topics (logistic-normal)
STM                     Structural Topic Model: prevalence and content covariates
SAGE                    The same topic worded differently across groups
HDP                     Nonparametric LDA that infers the number of topics
DTM                     Dynamic topics that evolve across time slices
SupervisedLDA           Topics shaped to predict a per-document response
PT / GSDMM              Short-text models for tweets, survey answers
SeededLDA / KeyATM      Guided topics steered by seed words
PA / HLDA               Topic hierarchies (Pachinko, nested-CRP)
BERTopic / Top2Vec      Cluster document embeddings you supply into topics
ETM / FASTopic          Generative topics from embeddings (factored β; optimal transport)
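For readers new to the sampling approach named in the LDA row, the following is a rough pure-Python sketch of plain collapsed Gibbs sampling, run on the same toy corpus as the quick-start example. It is not SparseLDA and not topica's Rust core. The function `toy_lda_gibbs` and its `alpha` and `beta` defaults are assumptions made for illustration. It also shows, in miniature, why a fixed seed makes a fit repeatable: every random draw comes from one seeded generator.

```python
import random


def toy_lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01,
                  seed=0, top_n=3):
    """Plain collapsed Gibbs sampler for LDA; returns top words per topic."""
    rng = random.Random(seed)  # single seeded source of randomness
    vocab = sorted({w for doc in docs for w in doc})  # sorted for stable ids
    word_id = {w: i for i, w in enumerate(vocab)}
    num_words = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * num_words for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + num_words * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    top_words = []
    for t in range(num_topics):
        # Rank by count, breaking ties alphabetically for a stable order.
        ranked = sorted(range(num_words),
                        key=lambda v: (-topic_word[t][v], vocab[v]))
        top_words.append([vocab[v] for v in ranked[:top_n]])
    return top_words


docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
run_a = toy_lda_gibbs(docs, num_topics=2, seed=42)
run_b = toy_lda_gibbs(docs, num_topics=2, seed=42)
print(run_a)
print("identical runs:", run_a == run_b)
```

Each token is resampled in proportion to how much its document already uses a topic times how much that topic already uses the word. A production sampler such as SparseLDA reorganizes this computation to avoid scoring every topic for every token.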

Worked examples

Three end-to-end analyses on real, redistributable corpora:


topica is open source on GitHub.