Skip to content

2. Choose and justify K

The principle

K is a research decision, not a tuning parameter. Multiple values of K are often defensible; your job is to pick one for a reason and show your conclusions don't hinge on it.

K sets the granularity of your themes: roughly, K=10 for broad themes, K=30 for specific topics, K=100 for fine distinctions. There is no single "correct" K, and you should resist any procedure that pretends otherwise.

Three converging justifications

Good practice combines all three:

1. Theory-driven. How many themes would you expect in this corpus? What level of granularity answers your research question? Start from theory and adjust.

2. Diagnostic-guided. Scan a range and look at quality metrics:

Metric What it measures Reading
Coherence (c_v / UMass) Do a topic's top words co-occur? Higher is better
Exclusivity Are words distinctive to a topic? Higher is better
Held-out perplexity Fit on unseen documents Lower is better

Do not just maximize coherence

A model with K=5 may have higher mean coherence yet miss distinctions that matter to your argument. Coherence trades off against exclusivity and against substantive richness. Use the metrics to inform a judgment, not to replace it.

3. Interpretability-focused. For each candidate K: can you label every topic? Do the topics make substantive sense? How many are "junk" (stopwords, artifacts)? Do topics split and merge sensibly as K grows?

A concrete procedure

import topica
import numpy as np

# 1) Scan a theoretically plausible range.
held_out = test_docs                     # a held-out split for perplexity
results = topica.search_k(
    train_docs, ks=[10, 15, 20, 25, 30],
    held_out=held_out, iterations=800,
)
for r in results:
    print(f"K={r['k']:>3}  coherence={r['coherence']:.3f}  "
          f"exclusivity={r['exclusivity']:.3f}  perplexity={r.get('perplexity'):.0f}")

Then, for the two or three best candidates, fit the model and read the topics. Count how many you can label, and look at the coherence×exclusivity spread per topic:

model = topica.STM(num_topics=20, seed=1)
model.fit(docs, prevalence=X)

frontier = topica.quality_frontier(model, n=10)   # per-topic coherence & exclusivity
# scatter frontier["coherence"] vs frontier["exclusivity"];
# weak topics cluster in the lower-left.

A nonparametric model is a useful sanity check on your choice: it infers a topic count rather than taking one.

hdp = topica.HDP(eta=0.3, seed=1)
hdp.fit(docs, iters=300)
print("HDP suggests ~", hdp.num_topics, "topics")

Report sensitivity

Pick the K that balances metrics, interpretability, and theory, then show your finding survives nearby K. Re-run the headline result at K-5 and K+5; if a covariate effect or a key topic only appears at one exact K, say so. Reviewers read "we used K=20" charitably only when followed by "results were robust to K ∈ {15, 25}."

→ Next: Validate the topics.