Skip to content

3. Validate the topics

This is the step that separates a publishable analysis from a fishing expedition. You must show, with evidence, that your topics are (a) coherent to humans, (b) reproducible, and (c) interpretable as substantive themes.

Human validation: intrusion tests

The field-standard test (Chang et al. 2009, Reading Tea Leaves) asks whether a human can tell a topic apart from noise. topica builds both variants for you, with an answer key.

Word intrusion: each topic's top words plus one intruder from another topic.

import topica

tests = topica.word_intrusion(model, n_words=5, seed=0)
for t in tests:
    print(f"Topic {t['topic']}: {t['words']}")
    # answer key for coding:
    # intruder '{t['intruder']}' at position {t['intruder_index']}

Show the shuffled words to coders (hide the key); a topic where coders reliably spot the intruder is coherent.

Document intrusion: a topic's representative documents plus one where the topic is nearly absent. This tests whether the topic captures real document similarity.

tests = topica.document_intrusion(model, texts=texts, n_docs=3, seed=0)

Report intrusion accuracy (and, ideally, inter-coder agreement) in your supplementary materials.

Computational validation: coherence and exclusivity

Per-topic metrics complement the human tests. Report both together: a good topic is coherent and exclusive.

coherence   = model.coherence(10)               # UMass, per topic
cv          = topica.coherence(model, texts, coherence_type="c_v", topn=10)
exclusivity = topica.exclusivity(model, 10)          # per topic

frontier = topica.quality_frontier(model, n=10)      # tidy: coherence, exclusivity, prevalence

c_v correlates best with human judgement; UMass is a fast intrinsic check; NPMI is the normalized middle ground. The coherence×exclusivity scatter is the canonical STM quality plot: weak topics sit in the lower-left.

Stability: answer the "fishing expedition" critique head-on

A topic that dissolves when you perturb the corpus is not a finding. Refit on bootstrap resamples and report which topics are robust:

boot = topica.bootstrap_stability(docs, k=20, n_boot=50, iterations=800)
for t, s in zip(boot["topic"], boot["stability"]):
    print(f"Topic {t}: stability {s:.2f}")     # mean top-word Jaccard across resamples
print("overall:", boot["mean"])

Also check that the same topics emerge across random seeds, aligning topics between fits and scoring their overlap:

a = topica.LDA(num_topics=20, seed=1); a.fit(docs, iterations=800)
b = topica.LDA(num_topics=20, seed=2); b.fit(docs, iterations=800)
pairs = topica.align_topics(a, b)                    # one-to-one matching (Hungarian)
print("stability across seeds:", topica.topic_stability([a, b], topn=10))

Flag fragile topics explicitly. A paper that says "topics 4 and 11 were unstable across resamples and we interpret them cautiously" is more credible, not less.

Interpretation: close reading, not just top words

Distant reading (top words) and close reading (the actual documents) are both required. Pull a topic's most representative documents and read them, with the topic's words highlighted in your notebook:

labels = topica.label_topics(model.topic_word, model.vocabulary, n=8)  # prob & FREX
html = topica.find_thoughts_html(model, texts, n_docs=3, n_words=8)    # highlighted quotes

Use FREX (frequent and exclusive) words, not just the most probable ones, when labeling. They are far more evocative of what makes a topic distinctive.

What topics are NOT

Topics are statistical patterns of word co-occurrence. They are not necessarily coherent concepts, and not necessarily what a document is "about." Write "this document has high probability for Topic 3" and "words associated with Topic 3 suggest…" Never write "the model discovered that…" or "this document is about Topic 3."

Common problems and what to do

Symptom Likely cause Fix
Junk topics (stopwords, numbers) Under-pruned vocabulary Add custom stopwords; raise min_doc_freq; rm_top
Duplicate topics K too high, or duplicate documents Lower K; dedupe; check align_topics / topic_correlation
Uninterpretable topics Not every topic is meaningful Document as "Mixed/Other"; consider lower K
One dominant topic Corpus-wide vocabulary May be legitimate; or treat as a background topic and rm_top

→ Next: Measure effects properly.