
Keywords & preprocessing

Distinguishing words

topica.fighting_words

fighting_words(corpus_a, corpus_b, *, prior=0.01, informative=False, min_count=1)

Monroe-Colaresi-Quinn Fighting Words — words that distinguish corpus A from corpus B, with their statistical significance.

For each word, the weighted log-odds ratio under a Dirichlet prior (symmetric by default; informative when informative=True) is computed and standardized to a z-score ζ:

δ_w = log[(y_Aw+α_w)/(n_A+α₀-y_Aw-α_w)] - log[(y_Bw+α_w)/(n_B+α₀-y_Bw-α_w)],
Var(δ_w) ≈ 1/(y_Aw+α_w) + 1/(y_Bw+α_w),
ζ_w = δ_w / √Var(δ_w),

where y_·w are word counts, n_· are corpus token totals, and α₀ = Σ_w α_w. A large positive ζ marks a word distinctive of corpus A; a large negative ζ marks one distinctive of corpus B. Because the variance term grows for rare words, their ζ is pulled toward zero and they do not dominate the ranking. Treating ζ as approximately standard normal, |ζ| > 1.96 is a defensible ~95% cutoff.
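The formulas above can be sketched in pure Python. This is an illustration under stated assumptions, not topica's implementation: the name fighting_words_sketch is hypothetical, n_A and n_B are taken as totals over all tokens, and α₀ is summed over the words that survive min_count.

```python
import math
from collections import Counter


def fighting_words_sketch(corpus_a, corpus_b, prior=0.01, informative=False, min_count=1):
    """Z-scored log-odds ratio per word (illustrative, not the library code)."""
    count_a = Counter(tok for doc in corpus_a for tok in doc)
    count_b = Counter(tok for doc in corpus_b for tok in doc)
    n_a = sum(count_a.values())  # assumption: totals include every token
    n_b = sum(count_b.values())

    # Keep words whose combined count reaches min_count.
    vocab = [w for w in set(count_a) | set(count_b)
             if count_a[w] + count_b[w] >= min_count]

    # alpha_w: a flat pseudocount, or one scaled by the combined count.
    alpha = {w: (prior * (count_a[w] + count_b[w]) if informative else prior)
             for w in vocab}
    alpha0 = sum(alpha.values())

    scored = []
    for w in vocab:
        ya = count_a[w] + alpha[w]
        yb = count_b[w] + alpha[w]
        delta = (math.log(ya / (n_a + alpha0 - ya))
                 - math.log(yb / (n_b + alpha0 - yb)))
        variance = 1.0 / ya + 1.0 / yb
        scored.append((w, delta / math.sqrt(variance)))

    # Descending zeta: corpus-A markers first, corpus-B markers last.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


corpus_a = [["tax", "cut", "tax"], ["tax", "freedom"]]
corpus_b = [["health", "care", "health"], ["care", "tax"]]
ranking = fighting_words_sketch(corpus_a, corpus_b)
```

In this toy input "cut" has a larger raw log-odds difference than "tax", but its variance is much larger because it appears only once, so "tax" ends up with the higher ζ.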

Parameters:

Name Type Description Default
corpus_a sequence of token lists (list[list[str]]).
required
corpus_b sequence of token lists (list[list[str]]).
required
prior float

The Dirichlet pseudocount. With informative=False it is a symmetric prior α_w = prior for every word. With informative=True the prior is scaled by each word's overall frequency, α_w = prior · c_w where c_w is the word's combined count — Monroe et al.'s informative Dirichlet prior (IDP), which pulls extreme estimates toward the corpus background.

0.01
min_count int

Drop words whose combined count across both corpora is below this.

1

Returns:

Type Description
list[(word, zeta)]

Sorted by descending zeta: corpus-A markers at the top, corpus-B markers at the bottom.

topica.top_fighting_words

top_fighting_words(corpus_a, corpus_b, *, n=20, **kwargs)

Convenience wrapper around fighting_words that returns the n most distinctive words for each corpus as a dict {"a": [...], "b": [...]}. Each value is a list of (word, zeta) pairs; the corpus-B list puts the most negative z-scores first. Keyword arguments are passed through to fighting_words.
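To make the ordering of the two lists concrete, here is a minimal sketch of the slicing step, assuming the input is already a list of (word, zeta) pairs sorted by descending zeta. The helper name is hypothetical, and the real wrapper may handle short vocabularies differently.

```python
def top_fighting_words_sketch(scored, n=20):
    """Split a descending (word, zeta) ranking into the two ends."""
    top_a = scored[:n]        # largest positive zeta first
    top_b = scored[::-1][:n]  # most negative zeta first
    return {"a": top_a, "b": top_b}


scored = [("tax", 2.5), ("cut", 1.0), ("care", -0.5), ("health", -3.0)]
top = top_fighting_words_sketch(scored, n=2)
```

If the vocabulary has fewer than 2·n words, the two lists overlap in this sketch.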

Preprocessing

topica.tokenize builtin

tokenize(text, *, lowercase=True, stopwords=None, token_regex=None, min_length=1)

Tokenize a string the way the corpus loader does: find regex tokens, optionally lowercase, drop short tokens and stopwords. Handy for building list[list[str]] input outside of Corpus.from_text_file.
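The pipeline described above (regex match, optional lowercasing, then length and stopword filtering) might look roughly like this. The pattern r"\w+" is an assumption standing in for the library's default token_regex, which this page does not state, and tokenize_sketch is a hypothetical name.

```python
import re


def tokenize_sketch(text, lowercase=True, stopwords=None, token_regex=None, min_length=1):
    """Regex tokenizer following the order described in the docs."""
    # Assumption: r"\w+" is a placeholder for the library's default pattern.
    pattern = re.compile(token_regex or r"\w+")
    tokens = pattern.findall(text)
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    blocked = set(stopwords or ())
    return [tok for tok in tokens
            if len(tok) >= min_length and tok not in blocked]


docs = [tokenize_sketch(t, stopwords={"the"}) for t in ["The quick fox", "The lazy dog"]]
```

Because stopwords are checked after lowercasing in this sketch, a stopword list should itself be lowercase when lowercase=True.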

topica.split_documents

split_documents(texts, metadata=None, *, max_words=200, min_words=50, sentence_aware=True)

Segment long documents into shorter chunks, propagating metadata.

Each source document is split into chunks of roughly max_words words; the metadata row for the source document is copied onto every chunk it produces, with two bookkeeping keys added: parent (the source document's index) and chunk (the chunk's position within that document). The chunked texts and chunked metadata stay aligned, so you can feed the chunks to a model and still aggregate or condition on the original document-level covariates.

Parameters:

Name Type Description Default
texts sequence of str (raw text) or list[str] (pre-tokenized).

Pre-tokenized input is chunked by token count and returned as token lists; raw strings are chunked and returned as strings.

required
metadata sequence aligned with texts, optional.

One entry per source document — typically a dict of covariates. A mapping is shallow-copied onto each chunk; a non-mapping value is stored under a "metadata" key. If omitted, chunk metadata carries just parent / chunk.

None
max_words int

Target chunk length in words/tokens.

200
min_words int

A trailing chunk shorter than this is merged back into the previous chunk (so no text is dropped and no runt chunks are produced). A whole document shorter than min_words is still emitted as a single chunk.

50
sentence_aware bool

For raw-string input, pack whole sentences up to max_words rather than cutting mid-sentence (a sentence longer than max_words is hard-split). Ignored for pre-tokenized input.

True

Returns:

Type Description
(chunks, chunk_metadata)

The chunked documents (same element type as the input) and a list of metadata dicts, aligned and of the same length.
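For pre-tokenized input, the chunking and metadata propagation described above can be sketched as follows. This is an assumption-laden illustration (the function name is hypothetical, and sentence-aware packing of raw strings is not covered), but it shows the trailing-chunk merge and the parent / chunk bookkeeping keys.

```python
from collections.abc import Mapping


def split_tokens_sketch(docs, metadata=None, max_words=200, min_words=50):
    """Chunk token lists and copy each document's metadata onto its chunks."""
    chunks, chunk_metadata = [], []
    for parent, tokens in enumerate(docs):
        # Fixed-size slices; an empty document yields no chunks in this sketch.
        pieces = [tokens[i:i + max_words] for i in range(0, len(tokens), max_words)]
        # Merge a short trailing piece back into the previous one.
        if len(pieces) > 1 and len(pieces[-1]) < min_words:
            tail = pieces.pop()
            pieces[-1] = pieces[-1] + tail
        row = metadata[parent] if metadata is not None else {}
        for position, piece in enumerate(pieces):
            # Mappings are shallow-copied; other values go under "metadata".
            meta = dict(row) if isinstance(row, Mapping) else {"metadata": row}
            meta["parent"] = parent
            meta["chunk"] = position
            chunks.append(piece)
            chunk_metadata.append(meta)
    return chunks, chunk_metadata


docs = [list("abcdefghijkl"), ["x", "y"]]
chunks, meta = split_tokens_sketch(docs, [{"year": 2001}, "raw"], max_words=5, min_words=3)
```

Note that merging the tail means the last chunk of a document can exceed max_words by up to min_words - 1 tokens.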

topica.one_hot

one_hot(values, *, drop_first=True, prefix='')

One-hot encode a categorical covariate for use as DMR features.

Given a sequence of category labels (one per document), returns (matrix, names) where matrix is a (num_docs, num_categories) float array of 0/1 indicators and names are the corresponding column names. With drop_first=True (default) the first category (sorted) is omitted as the reference level, leaving num_categories - 1 columns; this avoids collinearity with the DMR intercept. Pass the result straight to DMR.fit(docs, matrix, feature_names=names); combine multiple covariates with numpy.hstack.
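A small pure-Python sketch of the encoding may help. It returns nested lists rather than a numpy array, and the column naming scheme (prefix followed by the label) is an assumption, since this page does not specify how names are built; one_hot_sketch is a hypothetical name.

```python
def one_hot_sketch(values, drop_first=True, prefix=""):
    """One-hot encode labels; the first sorted level is the reference."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # reference level gets an all-zero row
    # Assumption: column names are prefix + label.
    names = [f"{prefix}{level}" for level in levels]
    matrix = [[1.0 if value == level else 0.0 for level in levels]
              for value in values]
    return matrix, names


matrix, names = one_hot_sketch(["north", "south", "east", "south"], prefix="region=")
```

Here "east" sorts first and becomes the reference level, so a document labeled "east" is encoded as a row of zeros.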

DataFrames & metadata

These accept pandas or Polars frames (align also takes numpy arrays and lists) and keep document metadata aligned to the rows that survive pruning.

topica.from_dataframe

from_dataframe(df, *, text_col, metadata_cols=None, tokenizer=None, stopwords=None, min_length=1, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0)

Build a Corpus from a pandas or Polars DataFrame, keeping per-document metadata aligned to the documents that survive pruning.

df[text_col] is tokenized (with tokenizer if given, otherwise topica.tokenize) and a Corpus is built with the usual pruning options. The surviving rows of metadata_cols (default: every column except text_col) are attached as corpus.metadata, a DataFrame of the same kind you passed in (pandas in, pandas out; Polars in, Polars out), aligned one-to-one with the corpus documents in the same row order. You can feed that metadata straight to an STM prevalence design with no manual alignment.

Parameters:

Name Type Description Default
df pandas DataFrame or Polars DataFrame

One row per document.

required
text_col str

Column holding the document text.

required
metadata_cols sequence[str]

Columns to carry as aligned metadata. Defaults to all columns except text_col.

None
tokenizer callable

str -> list[str]. Defaults to topica.tokenize with the stopwords and min_length arguments below.

None

topica.align

align(x, corpus)

Realign an external covariate array, DataFrame, Series, or list to the documents a Corpus kept after pruning. Accepts pandas and Polars DataFrames/Series, numpy arrays, and plain lists.

Use it when your covariates were built against the original documents and the corpus dropped some during pruning:

corpus = topica.Corpus.from_documents(docs, min_doc_freq=5)
X = topica.align(X, corpus)          # now aligned to corpus rows
model.fit(corpus, X, prevalence_names=names)
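Conceptually, realignment is a row selection by the original indices of the surviving documents. The sketch below illustrates that idea only: kept_indices is a hypothetical stand-in for whatever record the corpus keeps of surviving documents, which this page does not describe.

```python
def align_rows_sketch(x, kept_indices):
    """Select the rows of x that correspond to surviving documents."""
    # kept_indices: hypothetical list of original document positions.
    return [x[i] for i in kept_indices]


covariates = [[0.1], [0.2], [0.3], [0.4]]
kept_indices = [0, 2, 3]  # document 1 was dropped during pruning
aligned = align_rows_sketch(covariates, kept_indices)
```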

topica.design_matrix

design_matrix(formula, data)

Build a design matrix from an R-style formula and a data frame (pandas or Polars).

Returns (X, feature_names) where X is a (n_rows, p) float array and feature_names are the column labels. The intercept that formulaic adds is stripped, because topica.estimate_effect and the STM prevalence model add their own. Categorical columns become treatment-coded dummies; a * b and a:b expand to interactions; spline(x, df=k) uses topica's restricted cubic spline. A Polars frame is converted to pandas for formulaic.

Requires the optional formulaic package.

Embeddings

topica.llm_embed

llm_embed(texts, model='text-embedding-3-small', *, batch=True, cache=None)

Embed texts with the llm library's embedding models, as a dense (n, dim) float array.

The embedding models in topica (BERTopic, Top2Vec, ETM, FASTopic) and embedding_seeds all take embeddings you supply; this is one way to produce them. model names any embedding model that llm (https://llm.datasette.io/) can reach: OpenAI's "text-embedding-3-small" or "text-embedding-3-large" (needs an API key), or a local model such as "sentence-transformers/all-MiniLM-L6-v2" via the llm-sentence-transformers plugin (no API, runs offline). Pass document texts for document embeddings, or the vocabulary for word embeddings.

Embeddings are costly, so pass cache=path to embed once and reuse: if the file exists and was saved for the same texts, it is loaded and no model is called; otherwise the embeddings are computed and written there (see save_embeddings).

Requires the optional llm package (pip install "topica[llm]"). The embeddings are the only thing topica needs from a model; everything downstream runs in the wheel.

topica.save_embeddings

save_embeddings(path, embeddings, *, texts=None, model=None) -> str

Save an embedding matrix to a .npz file so a costly corpus is embedded once and reused.

embeddings is any (n, dim) array. When given, texts (one per row) is stored as a hash and model as a string, so load_embeddings and the cache= argument of llm_embed can confirm a cache matches the current inputs. The path gets a .npz suffix if it lacks one; returns the path written. Works on any embeddings, not just those from llm_embed.

topica.load_embeddings

load_embeddings(path, *, with_meta=False)

Load an embedding matrix saved by save_embeddings.

Returns the (n, dim) array, or (array, meta) when with_meta=True; meta carries model and texts_hash if they were saved. The .npz suffix is added if the path lacks one and the bare path does not exist.
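The cache check implied by the model and texts_hash metadata can be sketched with the standard library. The hashing scheme here (SHA-256 over length-prefixed UTF-8 texts) is an assumption for illustration; the page does not say how topica hashes texts, and both function names are hypothetical.

```python
import hashlib


def texts_hash_sketch(texts):
    """Order-sensitive digest of a sequence of strings."""
    digest = hashlib.sha256()
    for text in texts:
        encoded = text.encode("utf-8")
        # Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()


def cache_matches_sketch(meta, texts, model):
    """True when saved metadata agrees with the current texts and model."""
    return (meta.get("texts_hash") == texts_hash_sketch(texts)
            and meta.get("model") == model)


texts = ["first document", "second document"]
meta = {"model": "text-embedding-3-small", "texts_hash": texts_hash_sketch(texts)}
reuse = cache_matches_sketch(meta, texts, "text-embedding-3-small")
```

Storing a hash rather than the texts keeps the cache file small while still detecting any change in the inputs or their order.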

Phrases

topica.learn_phrases

learn_phrases(docs: List[List[str]], *, min_count: int = 5, threshold: float = 10.0, scoring: str = 'default', delimiter: str = '_') -> Phrases

Learn a collocation (phrase) model from tokenized documents.

Counts unigram and adjacent-bigram frequencies across all documents; scores each candidate bigram and keeps those meeting both the min_count and threshold criteria.

Parameters:

Name Type Description Default
docs list[list[str]]

Tokenized documents — each document is a list of string tokens.

required
min_count int

A bigram must appear at least this many times to be considered. Default 5.

5
threshold float

Minimum score for a bigram to be kept. For scoring="default" a value of 10.0 is a reasonable starting point; for scoring="npmi" use a value in [-1, 1] (e.g. 0.5). Default 10.0.

10.0
scoring "default" or "npmi"

Which association measure to use (see module docstring for formulas). Default "default".

'default'
delimiter str

Character used to join tokens when transforming documents. Default "_".

'_'

Returns:

Type Description
Phrases

A fitted Phrases object whose transform method merges detected collocations in new documents.

Examples:

Bigrams:

p = learn_phrases(docs, min_count=5, threshold=10.0)
docs_bi = p.transform(docs)

Trigrams via composition:

p2 = learn_phrases(docs_bi, min_count=5, threshold=10.0)
docs_tri = p2.transform(docs_bi)
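The count-then-score procedure can be sketched for the "npmi" option using the standard normalized pointwise mutual information, ln(p(a,b) / (p(a) p(b))) / (-ln p(a,b)). This is an illustration, not topica's code: the exact formulas live in the module docstring, and estimating all probabilities by dividing counts by the total token count is an assumption made here.

```python
import math
from collections import Counter


def npmi_bigrams_sketch(docs, min_count=5, threshold=0.5):
    """Score adjacent bigrams by NPMI and keep those passing both filters."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))  # adjacent pairs only
    total = sum(unigrams.values())

    kept = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # Assumption: all probabilities are counts over the token total.
        p_ab = n_ab / total
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        if score >= threshold:
            kept[(a, b)] = score
    return kept


docs = [["new", "york", "is", "big"]] * 5 + [["new", "idea"], ["york", "minster"]]
collocations = npmi_bigrams_sketch(docs, min_count=5, threshold=0.5)
```

NPMI is bounded in [-1, 1], which is why the threshold for this scoring is chosen on that scale rather than near 10.0.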

topica.apply_phrases

apply_phrases(docs: List[List[str]], phrases: Phrases) -> List[List[str]]

Apply a Phrases model to tokenized documents.

Performs a greedy left-to-right scan of each document: whenever an adjacent pair (tok[i], tok[i+1]) is a known collocation the pair is merged into "tok[i]{delimiter}tok[i+1]" and the cursor advances by 2 (so the merged token cannot overlap with the next merge). Tokens not involved in a collocation are passed through unchanged.

Parameters:

Name Type Description Default
docs list[list[str]]

Tokenized documents.

required
phrases Phrases

A fitted Phrases model (from learn_phrases).

required

Returns:

Type Description
list[list[str]]

New documents with collocations merged.
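The greedy left-to-right scan described for apply_phrases can be sketched as below. The sketch takes a plain set of (first, second) token pairs in place of a fitted Phrases object, whose internals are not documented here; the function name and that set are assumptions for illustration.

```python
def apply_phrases_sketch(docs, collocations, delimiter="_"):
    """Merge known adjacent pairs, scanning each document left to right."""
    merged_docs = []
    for doc in docs:
        out = []
        i = 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in collocations:
                out.append(doc[i] + delimiter + doc[i + 1])
                i += 2  # skip both tokens so merges never overlap
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


pairs = {("new", "york"), ("york", "city")}
merged = apply_phrases_sketch([["new", "york", "city", "hall"]], pairs)
```

Because the cursor jumps past "york" after the first merge, the pair ("york", "city") is never considered in this document, which is the non-overlap behavior the description refers to.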