Skip to content

Guided topics (seed words)

Plain LDA is unsupervised: you label the topics after the fact and have no control over whether the themes you care about appear. Guided models let you inject prior knowledge as a few seed words per topic, so a topic forms around words you already believe belong together. You seed topics with keywords, not documents with labels, so there is no hand-coding.

This is squarely a social-science tool: it improves measurement validity and reproducibility, the things reviewers push on. topica has two, matching the two standard R packages.

SeededLDA

Seed-word priors steer some topics; residual unseeded topics are learned freely. Faithful to the seededlda package (Watanabe): a seed word gets a weight × 100 prior pseudocount in its topic, plus seeded initialization.

import topica

model = topica.SeededLDA(
    {"economy": ["jobs", "wages", "tax"],
     "immigration": ["border", "visa", "deport"]},
    residual=3,           # 3 extra unseeded topics
    seed=1,
)
model.fit(docs, iters=2000)

model.topic_names         # ['economy', 'immigration', 'residual_1', ...]
for t in range(model.num_topics):
    print(model.topic_names[t], [w for w, _ in model.top_words(8, topic=t)])

KeyATM

The Keyword-Assisted Topic Model (Eshima, Imai & Sasaki 2024) is the modern, well-validated version. A token in a keyword topic comes either from a distribution over only that topic's keywords or from the topic's full distribution; the learned mix is the keyword rate.

model = topica.KeyATM(
    {"economy": ["jobs", "wages", "tax"],
     "immigration": ["border", "visa", "deport"]},
    num_topics=10,        # 2 keyword topics + 8 regular topics
    seed=1,
)
model.fit(docs, iters=1500)

model.keyword_rate        # per-topic share drawn from the keyword distribution

By default keyATM applies information-theory token weighting (each token counts by its word's surprisal in bits), which downweights frequent words and sharpens topics. Set weights="inv-freq" or weights="none" to change it. On large corpora, pass num_threads=N to sample document partitions in parallel (approximate distributed Gibbs); both options apply to every variant below.

Covariate keyATM

Pass covariates to let document metadata shape topic prevalence, the keyATM covariate model. The document-topic prior becomes a Dirichlet-multinomial regression, α_{d,k} = exp(x_d · λ_k) (Mimno & McCallum 2008, the same engine as DMR), so you can ask whether a covariate moves a named topic. An intercept is prepended; the learned coefficients are in feature_effects.

import numpy as np
is_dem = np.array([...]).reshape(-1, 1)          # one row per document
model = topica.KeyATM(seeds, num_topics=2, seed=1)
model.fit(docs, covariates=is_dem, feature_names=["is_dem"], iters=1000)

model.feature_names       # ['intercept', 'is_dem']
model.feature_effects     # (num_topics, 2): coefficient of each covariate per topic

A larger feature_effects[k, j] means covariate j raises topic k's prevalence. For uncertainty, pair the fitted doc_topic with estimate_effect.

Dynamic keyATM

Pass timestamps (one per document) to let topic prevalence shift over time. This is the keyATM dynamic model, a Chib (1998) change-point hidden Markov model: the timeline is split into num_states latent regimes, each with its own document-topic prior, and the model estimates where prevalence changes. Following the keyATM Supreme Court application (Eshima, Imai & Sasaki 2024, Section 3.3), documents carry a year and the model recovers when each topic rises or falls.

model = topica.KeyATM(seeds, num_topics=14, seed=1)
model.fit(docs, timestamps=years, num_states=5, iters=3000)

model.time_labels        # ['1946', '1947', ..., '2012']  (T distinct timestamps)
model.time_state         # [0, 0, 1, 1, ..., 4]  regime of each segment
model.time_prevalence    # (T, num_topics): smoothed prevalence path, rows sum to 1
model.transition_matrix  # (num_states, num_states), left-to-right

Documents may be passed in any order; they are sorted by timestamp internally and doc_topic is returned in the original order. Plot a column of time_prevalence against time_labels to see a topic's trajectory.

Embedding-guided topics (EmbeddingLDA)

SeededLDA and KeyATM ask you to name the seed words. EmbeddingLDA instead discovers them from a pre-trained embedding space: it clusters the vocabulary's embeddings into num_topics semantic groups, seeds each topic with the words nearest its cluster centroid, and fits a SeededLDA underneath. The embeddings warm-start where topics form; the Gibbs sampler can still override any seed the text contradicts, so this is a prior, not a constraint.

You supply the embeddings (topica does not call any model itself), aligned to the vocabulary:

from sentence_transformers import SentenceTransformer
import topica

vocab = sorted({w for d in docs for w in d})
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(vocab)

model = topica.EmbeddingLDA(num_topics=10, embeddings=emb, vocabulary=vocab,
                            top_m=20, weight=1.0)
model.fit(docs, iters=1000)
for i, words in enumerate(model.top_words(8)):
    print(f"Topic {i}:", ", ".join(w for w, _ in words))

top_m sets how many of each cluster's nearest words become seeds, and weight how hard they anchor (a seed gets weight * 100 prior pseudocounts; raise it to hold topics closer to their semantic cluster, lower it to let the data lead). The whole fitted-model surface (topic_word, doc_topic, top_words, coherence, ...) is delegated to the underlying SeededLDA, and model.seeds holds the embedding-derived seed sets. topica.embedding_seeds(...) exposes just the clustering step if you want to inspect or edit the seeds before fitting.

Which to use

  • KeyATM is the better-validated choice and the one with the political- science following; prefer it for new work.
  • SeededLDA is simpler and maps directly onto the seededlda workflow.
  • EmbeddingLDA when you have embeddings but no hand-picked seed list, and want the topic structure anchored to semantic similarity.

Both feed the same diagnostics, effects, and validation as every other model.

Faithful to the references

On a shared corpus with identical seeds, topica recovers the same seeded-topic vocabulary as R's seededlda and the same keyword-topic words as R's keyATM (verified word-for-word against both packages).