Guided topics (seed words)
Plain LDA is unsupervised: you label the topics after the fact and have no control over whether the themes you care about appear. Guided models let you inject prior knowledge as a few seed words per topic, so a topic forms around words you already believe belong together. You seed topics with keywords, not documents with labels, so there is no hand-coding.
This is squarely a social-science tool: it improves measurement validity and reproducibility, the things reviewers push on. topica has two guided models, SeededLDA and KeyATM, matching the two standard R packages, seededlda and keyATM.
SeededLDA
Seed-word priors steer some topics; residual unseeded topics are learned
freely. Faithful to the seededlda package (Watanabe): a seed word gets a
weight × 100 prior pseudocount in its topic, plus seeded initialization.
import topica
model = topica.SeededLDA(
    {"economy": ["jobs", "wages", "tax"],
     "immigration": ["border", "visa", "deport"]},
    residual=3,  # 3 extra unseeded topics
    seed=1,
)
model.fit(docs, iters=2000)
model.topic_names # ['economy', 'immigration', 'residual_1', ...]
for t in range(model.num_topics):
    print(model.topic_names[t], [w for w, _ in model.top_words(8, topic=t)])
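To make the seed prior concrete, here is a minimal standalone sketch of how a topic-word prior with seed pseudocounts could be laid out. It is not topica's or seededlda's implementation: the helper name `seeded_topic_word_prior`, the flat `base` prior, and the choice to add the pseudocount on top of that base are all assumptions made for illustration. Only the weight × 100 pseudocount and the topic naming come from the description above.

```python
def seeded_topic_word_prior(seeds, vocab, residual, weight, base=0.1):
    """Return (topic_names, prior); prior[k][w] is the prior pseudocount of word w in topic k.

    `base` (a flat prior for every word) is an illustrative assumption.
    """
    topic_names = list(seeds) + [f"residual_{i + 1}" for i in range(residual)]
    prior = [{w: base for w in vocab} for _ in topic_names]
    for k, name in enumerate(seeds):
        for w in seeds[name]:
            if w in prior[k]:  # seed words absent from the vocabulary are skipped
                prior[k][w] += weight * 100  # the seed pseudocount described above
    return topic_names, prior


seeds = {"economy": ["jobs", "wages", "tax"],
         "immigration": ["border", "visa", "deport"]}
vocab = ["jobs", "wages", "tax", "border", "visa", "deport", "vote"]

topic_names, prior = seeded_topic_word_prior(seeds, vocab, residual=3, weight=0.01)
print(topic_names)
print(prior[0]["jobs"], prior[0]["border"], prior[2]["jobs"])
```

Seed words get a boosted prior only in their own topic; the residual topics keep the flat prior everywhere, which is why they are free to pick up whatever the seeded topics leave behind.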
KeyATM
The Keyword-Assisted Topic Model (Eshima, Imai & Sasaki 2024) is the modern, well-validated version. A token in a keyword topic comes either from a distribution over only that topic's keywords or from the topic's full distribution; the learned mix is the keyword rate.
model = topica.KeyATM(
    {"economy": ["jobs", "wages", "tax"],
     "immigration": ["border", "visa", "deport"]},
    num_topics=10,  # 2 keyword topics + 8 regular topics
    seed=1,
)
model.fit(docs, iters=1500)
model.keyword_rate # per-topic share drawn from the keyword distribution
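The two-source mixture behind the keyword rate can be sketched in a few lines. This is a toy illustration of the generative step described above, not topica's sampler: the function names, the example distributions, and the rate of 0.3 are all invented for the sketch.

```python
import random


def marginal_word_prob(word, keyword_rate, keyword_dist, full_dist):
    """p(word | keyword topic) = rate * keyword_dist + (1 - rate) * full_dist."""
    return (keyword_rate * keyword_dist.get(word, 0.0)
            + (1.0 - keyword_rate) * full_dist.get(word, 0.0))


def draw_token(rng, keyword_rate, keyword_dist, full_dist):
    """Draw one word for a token that is already assigned to a keyword topic."""
    # With probability keyword_rate the word comes from the keyword-only
    # distribution, otherwise from the topic's full distribution.
    source = keyword_dist if rng.random() < keyword_rate else full_dist
    words = list(source)
    return rng.choices(words, weights=[source[w] for w in words], k=1)[0]


# Toy distributions (assumed values, each sums to 1).
keyword_dist = {"jobs": 0.5, "wages": 0.3, "tax": 0.2}
full_dist = {"jobs": 0.2, "wages": 0.1, "tax": 0.1, "the": 0.4, "vote": 0.2}

rng = random.Random(1)
tokens = [draw_token(rng, 0.3, keyword_dist, full_dist) for _ in range(1000)]
p_jobs = marginal_word_prob("jobs", 0.3, keyword_dist, full_dist)
print(p_jobs, tokens[:5])
```

A higher keyword rate shifts probability mass toward the keywords, so a topic with a rate near zero is behaving almost like an ordinary unseeded topic.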
By default keyATM applies information-theory token weighting (each token counts
by its word's surprisal in bits), which downweights frequent words and sharpens
topics. Set weights="inv-freq" or weights="none" to change it. On large
corpora, pass num_threads=N to sample document partitions in parallel
(approximate distributed Gibbs); both options apply to every variant below.
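As a rough illustration of surprisal weighting, the sketch below gives each word a weight of -log2 of its relative corpus frequency. It is a standalone example under that assumption; whether topica rescales or normalizes these weights afterwards is not stated here, and `surprisal_weights` is a hypothetical helper, not part of the library.

```python
import math
from collections import Counter


def surprisal_weights(docs):
    """Weight each word by -log2(relative corpus frequency), i.e. its surprisal in bits."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}


docs = [["the", "tax", "the", "jobs"],
        ["the", "visa", "the", "tax"]]

weights = surprisal_weights(docs)
# "the" is half of all tokens: 1 bit. "tax" is a quarter: 2 bits.
# "jobs" and "visa" are one eighth each: 3 bits.
print(weights)
```

Frequent words such as "the" therefore contribute less to the counts the sampler sees than rare, more informative words.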
Covariate keyATM
Pass covariates to let document metadata shape topic prevalence; this is the keyATM
covariate model. The document-topic prior becomes a Dirichlet-multinomial
regression, α_{d,k} = exp(x_d · λ_k) (Mimno & McCallum 2008, the same engine as
DMR), so you can ask whether a covariate moves a named topic.
An intercept is prepended; the learned coefficients are in feature_effects.
import numpy as np
is_dem = np.array([...]).reshape(-1, 1) # one row per document
model = topica.KeyATM(seeds, num_topics=2, seed=1)
model.fit(docs, covariates=is_dem, feature_names=["is_dem"], iters=1000)
model.feature_names # ['intercept', 'is_dem']
model.feature_effects # (num_topics, 2): coefficient of each covariate per topic
A larger feature_effects[k, j] means covariate j raises topic k's
prevalence. For uncertainty, pair the fitted doc_topic with
estimate_effect.
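The prior α_{d,k} = exp(x_d · λ_k) is easy to evaluate by hand. The sketch below does so for one binary covariate with made-up coefficients; the helper names and the λ values are assumptions for illustration and are unrelated to any fitted model.

```python
import math


def doc_topic_prior(x, lam):
    """alpha_{d,k} = exp(x_d . lambda_k), with an intercept prepended to x_d."""
    row = [1.0] + list(x)
    return [math.exp(sum(xi * li for xi, li in zip(row, lam_k))) for lam_k in lam]


def expected_prevalence(alpha):
    """Mean of a Dirichlet(alpha): each alpha_k divided by the total."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Illustrative coefficients, one row per topic: [intercept, is_dem].
lam = [[0.0, 1.0],   # topic 0: is_dem raises its prior weight
       [0.0, 0.0]]   # topic 1: no covariate effect

alpha_dem = doc_topic_prior([1.0], lam)    # is_dem = 1
alpha_other = doc_topic_prior([0.0], lam)  # is_dem = 0

print(expected_prevalence(alpha_dem))
print(expected_prevalence(alpha_other))
```

With a coefficient of +1 on is_dem, the prior expected share of topic 0 moves from one half to e / (e + 1), about 0.73, which is the sense in which a larger coefficient raises a topic's prevalence.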
Dynamic keyATM
Pass timestamps (one per document) to let topic prevalence shift over time.
This is the keyATM dynamic model, a Chib (1998) change-point hidden Markov model:
the timeline is split into num_states latent regimes, each with its own
document-topic prior, and the model estimates where prevalence changes. Following
the keyATM Supreme Court application (Eshima, Imai & Sasaki 2024, Section 3.3),
documents carry a year and the model recovers when each topic rises or falls.
model = topica.KeyATM(seeds, num_topics=14, seed=1)
model.fit(docs, timestamps=years, num_states=5, iters=3000)
model.time_labels # ['1946', '1947', ..., '2012'] (T distinct timestamps)
model.time_state # [0, 0, 1, 1, ..., 4] regime of each segment
model.time_prevalence # (T, num_topics): smoothed prevalence path, rows sum to 1
model.transition_matrix # (num_states, num_states), left-to-right
Documents may be passed in any order; they are sorted by timestamp internally and
doc_topic is returned in the original order. Plot a column of time_prevalence
against time_labels to see a topic's trajectory.
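The "left-to-right" structure of the change-point model means each regime can only persist or hand over to the next one. The sketch below builds such a transition matrix with a fixed, assumed `stay` probability (in the actual model these probabilities are estimated), and reads change points off a regime sequence. Both helpers are hypothetical, and the plain lists stand in for `time_labels` and `time_state`.

```python
def left_to_right_transition(num_states, stay):
    """Each regime persists with probability `stay` or advances to the next; the last absorbs."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for s in range(num_states - 1):
        matrix[s][s] = stay
        matrix[s][s + 1] = 1.0 - stay
    matrix[-1][-1] = 1.0
    return matrix


def change_points(time_labels, time_state):
    """Labels of the time points at which the regime differs from the previous one."""
    return [time_labels[i] for i in range(1, len(time_state))
            if time_state[i] != time_state[i - 1]]


P = left_to_right_transition(3, stay=0.9)
for row in P:
    print(row)

# Stand-in values shaped like model.time_labels and model.time_state.
labels = ["1946", "1947", "1948", "1949", "1950"]
states = [0, 0, 1, 1, 2]
print(change_points(labels, states))
```

Because regimes never move backwards, num_states regimes imply at most num_states - 1 change points along the timeline.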
Embedding-guided topics (EmbeddingLDA)
SeededLDA and KeyATM ask you to name the seed words. EmbeddingLDA instead
discovers them from a pre-trained embedding space: it clusters the vocabulary's
embeddings into num_topics semantic groups, seeds each topic with the words
nearest its cluster centroid, and fits a SeededLDA underneath. The embeddings
warm-start where topics form; the Gibbs sampler can still override any seed the
text contradicts, so this is a prior, not a constraint.
You supply the embeddings (topica does not call any model itself), aligned to the vocabulary:
from sentence_transformers import SentenceTransformer
import topica
vocab = sorted({w for d in docs for w in d})
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(vocab)
model = topica.EmbeddingLDA(num_topics=10, embeddings=emb, vocabulary=vocab,
                            top_m=20, weight=1.0)
model.fit(docs, iters=1000)
for i, words in enumerate(model.top_words(8)):
    print(f"Topic {i}:", ", ".join(w for w, _ in words))
top_m sets how many of each cluster's nearest words become seeds, and weight
how hard they anchor (a seed gets weight * 100 prior pseudocounts; raise it to
hold topics closer to their semantic cluster, lower it to let the data lead).
The whole fitted-model surface (topic_word, doc_topic, top_words,
coherence, ...) is delegated to the underlying SeededLDA, and model.seeds
holds the embedding-derived seed sets. topica.embedding_seeds(...) exposes just
the clustering step if you want to inspect or edit the seeds before fitting.
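The cluster-then-seed idea can be sketched without any library. The example below runs a plain Lloyd's k-means on toy 2-D "embeddings" and keeps the top_m words nearest each centroid. It is only an illustration: topica's actual clustering method, distance measure, and initialization are not specified here, and `sketch_embedding_seeds` is a hypothetical stand-in, not `topica.embedding_seeds`.

```python
import math


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means, initialised from the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def sketch_embedding_seeds(vocab, emb, k, top_m):
    """For each cluster, return the top_m member words closest to its centroid."""
    centroids, labels = kmeans(emb, k)
    seeds = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: math.dist(emb[i], centroids[c]))
        seeds.append([vocab[i] for i in members[:top_m]])
    return seeds


# Toy 2-D vectors standing in for real sentence-transformer embeddings.
vocab = ["jobs", "border", "wages", "visa", "tax", "deport"]
emb = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.0], [5.2, 5.0], [1.0, 0.0], [6.0, 5.0]]

seeds = sketch_embedding_seeds(vocab, emb, k=2, top_m=2)
print(seeds)  # [['wages', 'jobs'], ['visa', 'border']]
```

Words far from their centroid ("tax" and "deport" here) are left out of the seed sets, so a smaller top_m keeps only the most central words of each semantic group.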
Which to use
KeyATM is the better-validated choice and the one with the political-science following; prefer it for new work.
SeededLDA is simpler and maps directly onto the seededlda workflow.
EmbeddingLDA fits when you have embeddings but no hand-picked seed list, and want the topic structure anchored to semantic similarity.
All three feed the same diagnostics, effects, and validation as every other model.
Faithful to the references
On a shared corpus with identical seeds, topica recovers the same
seeded-topic vocabulary as R's seededlda and the same keyword-topic words
as R's keyATM (verified word-for-word against both packages).