Preprocessing

topica takes either pre-tokenized documents (a list[list[str]]) or a Corpus. You control tokenization and vocabulary because those choices are part of your method (see Build a defensible corpus).

Tokenize

from topica import tokenize

with open("stoplist.txt", encoding="utf-8") as f:  # close the file when done
    stop = f.read().split()                        # a list, not a set
tokens = tokenize(text, stopwords=stop, min_length=3)  # text: one raw document string

tokenize lowercases the text, applies a regex, and drops stopwords and short tokens. It does not stem, because stemming hurts interpretability; lemmatize in your own pipeline if you need it.
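
To make those steps concrete, here is a minimal plain-Python sketch of the same sequence. It is not topica's implementation: the regex (runs of ASCII letters), the function name simple_tokenize, and the reading of min_length as "at least this many characters" are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: runs of ASCII letters. topica's actual regex is not stated here.
TOKEN_RE = re.compile(r"[a-z]+")


def simple_tokenize(text, stopwords=(), min_length=1):
    """Lowercase, extract regex matches, then drop stopwords and short tokens."""
    stop = set(stopwords)  # accepts a list; the set only speeds up membership tests
    return [
        tok
        for tok in TOKEN_RE.findall(text.lower())
        if len(tok) >= min_length and tok not in stop  # assumed: min_length is inclusive
    ]


tokens = simple_tokenize(
    "The cats are running in the garden.",
    stopwords=["the", "are", "in"],
    min_length=3,
)
print(tokens)  # ['cats', 'running', 'garden']
```

Because nothing is stemmed, "cats" and "running" come through unchanged, which keeps the eventual topic words readable.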

Build a Corpus and prune the vocabulary

from topica import Corpus

corpus = Corpus.from_documents(
    docs,
    min_doc_freq=10,        # keep words in >= 10 documents
    max_doc_fraction=0.5,   # drop words in > 50% of documents
    min_cf=0,               # collection-frequency cutoff
    rm_top=20,              # drop the N most frequent residual words
)
print(corpus.num_docs, corpus.num_words, corpus.total_tokens)

The vocabulary is compiled in Rust, so even multi-gigabyte corpora build quickly. A Corpus can also load from disk (one document per line, or MALLET-style TSV).
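
The four pruning parameters interact, so a small plain-Python sketch may help you reason about what survives. This is an illustration, not topica's code: the function name prune_vocabulary is hypothetical, and the order of operations (frequency cutoffs first, then rm_top on whatever remains) is an assumption based on the "residual words" wording above.

```python
from collections import Counter


def prune_vocabulary(docs, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0):
    """Return the kept vocabulary and the documents filtered down to it."""
    doc_freq = Counter()   # number of documents containing each word
    coll_freq = Counter()  # total occurrences of each word in the collection
    for doc in docs:
        coll_freq.update(doc)
        doc_freq.update(set(doc))

    n_docs = len(docs)
    kept = {
        w
        for w in coll_freq
        if doc_freq[w] >= min_doc_freq
        and doc_freq[w] <= max_doc_fraction * n_docs
        and coll_freq[w] >= min_cf
    }
    # Assumed order: cutoffs first, then drop the rm_top most frequent survivors.
    ranked = sorted(kept, key=lambda w: (-coll_freq[w], w))
    kept -= set(ranked[:rm_top])

    filtered = [[w for w in doc if w in kept] for doc in docs]
    return kept, filtered


docs = [
    ["tax", "policy", "reform", "tax"],
    ["tax", "health", "policy"],
    ["tax", "health", "care"],
    ["tax", "school", "policy"],
]
vocab, filtered = prune_vocabulary(docs, min_doc_freq=2, max_doc_fraction=0.75, rm_top=1)
print(sorted(vocab))  # ['health']
print(filtered)       # [[], ['health'], ['health'], []]
```

In this toy run "tax" is removed for appearing in every document, the rare words fall below min_doc_freq, and rm_top then removes "policy", leaving two documents empty. Checking for emptied documents after pruning is worth doing on real data as well.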

Detect phrases

Fixed expressions such as "health care" carry meaning as a unit. Detect collocations and rewrite the tokens before modeling:

import topica
phrases = topica.learn_phrases(docs, min_count=8, threshold=12.0)
docs = topica.apply_phrases(docs, phrases)            # "health care" -> "health_care"
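
The scoring rule behind learn_phrases is not described here, so the following is only a sketch of one common approach: a discounted co-occurrence score in the style of the word2vec phrase tool, scaled by the total token count. The function names, the score formula, and the greedy left-to-right merge are assumptions for illustration, not topica's API or algorithm.

```python
from collections import Counter


def learn_phrases_sketch(docs, min_count=1, threshold=1.0):
    """Return adjacent word pairs whose collocation score exceeds threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    total = sum(unigrams.values())

    phrases = set()
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # Assumed score: discounted pair count relative to the unigram counts.
        score = (n_ab - min_count) * total / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.add((a, b))
    return phrases


def apply_phrases_sketch(docs, phrases):
    """Greedily merge detected pairs, left to right, into underscore-joined tokens."""
    out = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        out.append(merged)
    return out


docs = [
    ["health", "care", "costs", "rise"],
    ["health", "care", "reform", "stalls"],
    ["costs", "of", "health", "care"],
    ["reform", "of", "school", "funding"],
]
phrases = learn_phrases_sketch(docs, min_count=2, threshold=1.0)
print(phrases)                                 # {('health', 'care')}
print(apply_phrases_sketch(docs, phrases)[0])  # ['health_care', 'costs', 'rise']
```

Under this kind of score, raising min_count discards rare pairs outright, while raising threshold demands that a pair co-occur much more often than its individual word frequencies would suggest.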

Split long documents

Long, heterogeneous documents violate the bag-of-words assumption. Segment them into comparable chunks, copying each source's metadata onto every chunk:

chunks, chunk_meta = topica.split_documents(
    texts, metadata, max_words=200, min_words=50,
)
# chunk_meta[j] = the source row + {"parent": i, "chunk": j}

Chunks from the same source are nested within it and are not independent observations, so cluster standard errors by source when you model effects.
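
As a rough picture of the chunking step and of where the cluster identifiers come from, here is a plain-Python sketch. It is not topica's implementation: the function name is hypothetical, splitting on whitespace is assumed, "chunk" is assumed to count chunks within each parent, and folding a too-short trailing piece into the previous chunk is only one possible way to honor min_words.

```python
def split_documents_sketch(texts, metadata, max_words=200, min_words=50):
    """Split each text into chunks of about max_words whitespace-separated words."""
    chunks, chunk_meta = [], []
    for i, (text, meta) in enumerate(zip(texts, metadata)):
        words = text.split()
        pieces = [words[k:k + max_words] for k in range(0, len(words), max_words)]
        # Assumed policy: fold a too-short trailing piece into the previous chunk,
        # which may then exceed max_words. A short whole document is kept as-is.
        if len(pieces) > 1 and len(pieces[-1]) < min_words:
            tail = pieces.pop()
            pieces[-1].extend(tail)
        for j, piece in enumerate(pieces):
            chunks.append(" ".join(piece))
            # Copy the source row and record which document the chunk came from.
            chunk_meta.append({**meta, "parent": i, "chunk": j})
    return chunks, chunk_meta


texts = [
    "one two three four five six seven eight nine",
    "alpha beta gamma",
]
metadata = [{"year": 2019}, {"year": 2021}]

chunks, chunk_meta = split_documents_sketch(texts, metadata, max_words=4, min_words=2)
groups = [m["parent"] for m in chunk_meta]  # one cluster id per chunk

print(chunks)  # ['one two three four', 'five six seven eight nine', 'alpha beta gamma']
print(groups)  # [0, 0, 1]
```

The groups list is what you would hand to a cluster-robust estimator as the grouping variable, so that the two chunks from the first document are treated as one cluster.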