1. Build a defensible corpus
Before any modeling, a reviewer wants to know exactly what you fed the model and why. Preprocessing choices change topics, so they are part of your method, not plumbing to hide.
State the population and the unit
Report, in prose and with counts:
- What population the documents are drawn from, and how they were sampled.
- The unit of analysis. An article? A paragraph? A speech? A tweet? This is a substantive choice, not a technical one.
- The time span and any covariates you will use later.
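The counts asked for above can be produced directly from your metadata. Here is a minimal sketch using only the standard library; the `source` and `year` keys and the helper name `describe_corpus` are hypothetical stand-ins for whatever covariates your corpus actually carries.

```python
from collections import Counter

# Hypothetical metadata: one dict of covariates per document.
meta = [
    {"source": "magazine_a", "year": 1961},
    {"source": "magazine_a", "year": 1963},
    {"source": "magazine_b", "year": 1963},
    {"source": "magazine_b", "year": 1970},
]

def describe_corpus(meta, group_key, time_key):
    """Return the document count, counts per group, and the time span."""
    per_group = Counter(m[group_key] for m in meta)
    times = [m[time_key] for m in meta]
    return {
        "n_documents": len(meta),
        "per_group": dict(per_group),
        "time_span": (min(times), max(times)),
    }

summary = describe_corpus(meta, group_key="source", time_key="year")
print(summary)
```

Numbers like these go in the prose of your data section, next to the sampling rule that produced them.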
Choose the unit deliberately: split long documents
LDA and STM assume documents are roughly comparable bags of words. A corpus of 60-word tweets and 6,000-word transcripts violates that badly. If your documents are long and heterogeneous, segment them into comparable chunks, keeping each chunk tied to its source document's metadata.
import topica
# `texts` are long documents; `meta` is one dict of covariates per document.
chunks, chunk_meta = topica.split_documents(
    texts, meta,
    max_words=200,  # target chunk length
    min_words=50,   # merge a short tail back rather than drop it
)
# chunk_meta[j] carries the source covariates plus `parent` and `chunk`.
Report the splitting rule and how many chunks resulted. When you analyze effects later, remember that chunks from the same source document are nested, which is exactly when you'll want clustered standard errors.
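To make the splitting rule concrete enough to report, here is a plain-Python sketch of one way such a rule could work. It is not topica's implementation: `split_words` and `split_corpus` are hypothetical helpers, and storing the parent as a list index is an assumption. Note that a merged tail makes the last chunk longer than `max_words`, which is worth stating when you describe the rule.

```python
from collections import Counter

def split_words(text, max_words=200, min_words=50):
    """Split text into chunks of at most max_words words.

    A final chunk shorter than min_words is merged into the previous
    chunk instead of being dropped, so that chunk may exceed max_words.
    """
    words = text.split()
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(c) for c in chunks]

def split_corpus(texts, meta, **kwargs):
    """Chunk every document, copying its covariates onto each chunk."""
    chunks, chunk_meta = [], []
    for parent, (text, m) in enumerate(zip(texts, meta)):
        for k, chunk in enumerate(split_words(text, **kwargs)):
            chunks.append(chunk)
            chunk_meta.append({**m, "parent": parent, "chunk": k})
    return chunks, chunk_meta

# Toy documents of 430 and 120 words.
texts = ["w " * 430, "w " * 120]
meta = [{"year": 1961}, {"year": 1970}]
chunks, chunk_meta = split_corpus(texts, meta, max_words=200, min_words=50)

# Chunks per source document: the cluster sizes for later inference.
chunks_per_parent = Counter(m["parent"] for m in chunk_meta)
print(len(chunks), dict(chunks_per_parent))
```

The per-parent counts are also the cluster sizes, so reporting them now documents the nesting you will have to account for later.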
Tokenize and prune the vocabulary, and say how
from topica import Corpus, tokenize
# Read the custom stoplist (whitespace-separated words) and close the file.
with open("stoplist.txt") as f:
    stop = f.read().split()
docs = [tokenize(t, stopwords=stop, min_length=3) for t in chunks]
corpus = Corpus.from_documents(
    docs,
    min_doc_freq=10,       # a word must appear in >= 10 documents
    max_doc_fraction=0.5,  # drop words in > 50% of documents
    rm_top=20,             # drop the 20 most frequent residual words
)
A few defensible defaults, all of which you should report:
- Lowercase, drop punctuation and very short tokens. Standard.
- Do not stem. Stemming wrecks interpretability (citizen, citizenship, and city can collapse together). Prefer lemmatization in your own pipeline if you need it; topica deliberately does neither for you.
- Prune rare and ubiquitous terms (min_doc_freq, max_doc_fraction, rm_top). Rare terms add noise and junk topics; ubiquitous terms add nothing.
- Custom stopwords for corpus-specific boilerplate (a magazine's own name, a transcription artifact). Report the list.
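To report what each pruning threshold removed, it helps to see the three filters spelled out. The sketch below is a plain-Python illustration, not topica's code: the helper `prune_vocabulary` is hypothetical, and both the order of the filters and the choice to rank `rm_top` by total term frequency are assumptions.

```python
from collections import Counter

def prune_vocabulary(docs, min_doc_freq=10, max_doc_fraction=0.5, rm_top=20):
    """Return (kept, dropped): surviving words and removed words by reason."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))  # docs containing w
    term_freq = Counter(w for doc in docs for w in doc)      # occurrences of w

    too_rare = {w for w, df in doc_freq.items() if df < min_doc_freq}
    too_common = {w for w, df in doc_freq.items()
                  if df / n_docs > max_doc_fraction}
    residual = set(doc_freq) - too_rare - too_common

    # Among the survivors, drop the rm_top most frequent words
    # (ties broken alphabetically so the result is deterministic).
    ranked = sorted(residual, key=lambda w: (-term_freq[w], w))
    top = set(ranked[:rm_top])

    kept = residual - top
    dropped = {"too_rare": too_rare, "too_common": too_common, "rm_top": top}
    return kept, dropped

# Toy tokenized documents with thresholds scaled down to match.
docs = [
    ["the", "tax", "vote"],
    ["the", "tax", "tax", "court"],
    ["the", "vote", "court"],
    ["the", "tax", "bill"],
]
kept, dropped = prune_vocabulary(docs, min_doc_freq=2,
                                 max_doc_fraction=0.8, rm_top=1)
print(sorted(kept), {k: sorted(v) for k, v in dropped.items()})
```

Listing the dropped words by reason lets a reader check that no substantively important term was pruned.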
Preprocessing is a researcher degree of freedom
Different preprocessing yields different topics. Pick choices before you look at results, motivate them substantively, and check that your conclusions survive reasonable alternatives (see validation).
Detect phrases before modeling
Fixed expressions (jim crow, health care, climate change) carry more
meaning together than apart. Detect them first so a topic can be about the
phrase, not its scattered parts.
phrase_model = topica.learn_phrases(docs, min_count=8, threshold=12.0)
docs = topica.apply_phrases(docs, phrase_model)
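If you need to explain what `min_count` and `threshold` mean in your methods section, a worked score helps. The sketch below uses the word2vec-style collocation score, which is one common rule for this task. Whether `topica.learn_phrases` uses this exact formula is an assumption, and `score_bigrams` is a hypothetical helper.

```python
from collections import Counter

def score_bigrams(docs, min_count=8, threshold=12.0):
    """Score adjacent word pairs; return {(a, b): score} above threshold.

    score = (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
    """
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)

    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too few co-occurrences to trust
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy corpus: "climate change" always co-occurs; "policy" appears everywhere.
docs = [["climate", "change", "policy"]] * 10 + [["policy", "debate"]] * 10
phrases = score_bigrams(docs, min_count=2, threshold=0.2)
print(phrases)
```

A pair scores highly when it co-occurs far more often than its parts' separate frequencies would suggest, which is why a word that appears everywhere rarely anchors a detected phrase.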
Inspect what survived
Report the document count, vocabulary size, and token count after pruning. Those three numbers belong in your methods section.
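The three numbers can be computed from the tokenized documents themselves. This sketch works on plain lists of tokens rather than on topica's `Corpus` object, whose attributes are not shown here; `corpus_summary` is a hypothetical helper, and counting documents that pruning left empty is an added assumption about what you may want to report.

```python
def corpus_summary(docs):
    """Count documents, distinct word types, and tokens in tokenized docs."""
    non_empty = [doc for doc in docs if doc]
    vocabulary = {w for doc in non_empty for w in doc}
    return {
        "n_documents": len(non_empty),
        "n_empty_dropped": len(docs) - len(non_empty),
        "vocabulary_size": len(vocabulary),
        "n_tokens": sum(len(doc) for doc in non_empty),
    }

# Toy documents after pruning; the second one lost all of its tokens.
pruned_docs = [["tax", "vote"], [], ["tax", "court", "court"]]
summary = corpus_summary(pruned_docs)
print(summary)
```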
→ Next: Choose and justify K.