2. Pick the right model¶
Before K, choose the model. This is a substantive decision. It encodes what
you think generates the text and what you want to learn, and reviewers will ask
you to justify it. Pick the simplest model that matches your question and your
data structure.
First: is a topic model even the right tool?¶
Good fit
- Exploratory: what themes exist in this corpus?
- Large collections where manual reading is infeasible.
- You want to discover structure, not impose categories.
- You need to track theme prevalence over time or across groups.
Poor fit
- Purely confirmatory counting against a fixed code list, where you do not need topics at all: a dictionary or supervised classifier is simpler. If you want to measure pre-theorized concepts as topics, a guided model fits, and the framing below explains where.
- Very short documents (< ~50 words) → use a short-text model, not standard LDA.
- Highly technical or formulaic text, or a need for crisp category boundaries.
Start here: do you already know your topics?¶
One question settles most of the choice. Do you know, from theory, the topics you are looking for, or are you trying to find out what is in the corpus?
- No. You want to discover what is there. Impose as little structure as
possible and read what emerges. Use
LDA, orHDPif you would rather not fix the number of topics in advance. - Yes. You have concepts and want to measure them. Name each topic with a few
seed words and let the model anchor a topic to them. This is the guided path,
KeyATMorSeededLDA, and it buys measurement validity and reproducibility a post-hoc reading of LDA topics cannot. From there you can regress prevalence on covariates or model it over time within the same family. - Yes, and you also care how the same concept is worded differently across
groups. Use
STMwith a content covariate. Modeling the topic-word distribution as a function of a group is the one thing the guided models cannot do.
Discovery and measurement are not a ranking. They answer different questions, and strong papers often run a guided model for the concepts they theorized and an unsupervised model as a robustness check on what they might have missed.
Match the model to your question¶
| Your question / data | Model | Why |
|---|---|---|
| "What themes are here?" (baseline) | LDA |
Standard, fast, well understood |
| "Does theme prevalence depend on metadata?" | STM |
Prevalence covariates with valid effect estimation; the social-science default |
| "Is the same theme worded differently by group?" | STM (content) / SAGE |
Content covariates on the topic-word distribution |
| "Do themes co-occur?" | CTM / STM |
Logistic-normal allows topic correlation |
| "How many themes are there?" | HDP |
Infers K nonparametrically (a check on your K) |
| "How does theme vocabulary drift over time?" | DTM |
Topics evolve across ordered time slices |
| "Do themes predict an outcome?" | SupervisedLDA / DMR |
Response- or covariate-conditioned topics |
| "I already know the themes I expect" | KeyATM, SeededLDA |
Seed words steer named topics for better validity and reproducibility (guided topics) |
| Tweets / survey answers / headlines | GSDMM, PT |
Built for short, sparse documents |
How far keyATM reaches, and where it stops
Within measurement, KeyATM is close to a complete
toolkit on its own: keyword-anchored topics, prevalence regression on
covariates, and a change-point model for time trends, all in one family
(validated against the R package in the replication).
Its boundary is its premise. Because it needs keywords, it does not discover
unknown topics, so reach for LDA or HDP there. And it models how much
each group discusses a topic, not how they word it, so reach for STM or
SAGE when content is the question.
Why STM is the social-science default¶
For most published social-science work the answer is the Structural Topic
Model. It is LDA's correlated cousin plus a regression layer, so a single fit
gives you topics, their correlations, and how covariates move them, with
proper uncertainty. The R stm package made it the field standard.
topica gives you the same model (spectral init, prevalence and content
covariates, estimateEffect, searchK, FREX) in Python, faster, with
clustered standard errors and GLM links on top.
Justify it in the paper¶
State the model and why it fits your design in one or two sentences:
Because our research question concerns how immigration framing differs by party and shifts across the campaign, we use the Structural Topic Model (STM), which lets topic prevalence depend on covariates (party, week) and provides valid uncertainty for those effects via the method of composition.
A nonparametric (HDP) or alternative-model robustness check pre-empts the
"why this model?" objection: "the substantive topics were stable when we instead
used LDA / a different K."
→ Next: Choose and justify K.