Keywords & preprocessing¶

Distinguishing words¶

topica.fighting_words ¶

fighting_words(corpus_a, corpus_b, *, prior=0.01, informative=False, min_count=1)

Monroe-Colaresi-Quinn Fighting Words — words that distinguish corpus A from corpus B, with their statistical significance.

For each word, the weighted log-odds ratio under a Dirichlet prior (symmetric by default, informative when informative=True) is computed and standardized to a z-score ζ:

δ_w = log[(y_Aw+α_w)/(n_A+α₀-y_Aw-α_w)] - log[(y_Bw+α_w)/(n_B+α₀-y_Bw-α_w)]

Var(δ_w) ≈ 1/(y_Aw+α_w) + 1/(y_Bw+α_w)

ζ_w = δ_w / √Var(δ_w)

where y_·w are word counts, n_· are corpus token totals, and α₀ = Σ_w α_w. A large positive ζ marks a word distinctive of corpus A; a large negative ζ marks one distinctive of corpus B. The variance term grows for rare words, which shrinks their ζ and keeps low-count words from dominating the ranking. Treating ζ as approximately standard normal, |ζ| > 1.96 is a defensible ~95% cutoff.

Parameters:

Name	Type	Description	Default
`corpus_a`	`list[list[str]]`	Sequence of token lists.	required
`corpus_b`	`list[list[str]]`	Sequence of token lists.	required
`prior`	`float`	The Dirichlet pseudocount. With `informative=False` it is a symmetric prior `α_w = prior` for every word. With `informative=True` the prior is scaled by each word's overall frequency, `α_w = prior · c_w` where `c_w` is the word's combined count. This is Monroe et al.'s informative Dirichlet prior (IDP), which pulls extreme estimates toward the corpus background.	`0.01`
`informative`	`bool`	Selects the prior: `False` for the symmetric prior, `True` for the frequency-scaled informative prior (see `prior`).	`False`
`min_count`	`int`	Drop words whose combined count across both corpora is below this.	`1`

Returns:

Type	Description
`list[(word, zeta)]`	Sorted by descending `zeta`: corpus-A markers at the top, corpus-B markers at the bottom.
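
The formula above can be sketched in plain Python. This is an illustration of the arithmetic only, not topica's implementation: the function name `fighting_words_sketch` is hypothetical, and `min_count` filtering is left out.

```python
import math
from collections import Counter


def fighting_words_sketch(corpus_a, corpus_b, prior=0.01, informative=False):
    """Weighted log-odds z-scores (hypothetical helper, not topica's code)."""
    y_a = Counter(tok for doc in corpus_a for tok in doc)
    y_b = Counter(tok for doc in corpus_b for tok in doc)
    vocab = sorted(set(y_a) | set(y_b))
    n_a, n_b = sum(y_a.values()), sum(y_b.values())

    # Symmetric prior: alpha_w = prior. Informative prior: alpha_w = prior * c_w.
    alpha = {w: prior * (y_a[w] + y_b[w]) if informative else prior for w in vocab}
    alpha_0 = sum(alpha.values())

    scores = []
    for w in vocab:
        a = y_a[w] + alpha[w]
        b = y_b[w] + alpha[w]
        delta = math.log(a / (n_a + alpha_0 - a)) - math.log(b / (n_b + alpha_0 - b))
        zeta = delta / math.sqrt(1.0 / a + 1.0 / b)
        scores.append((w, zeta))
    # Descending zeta: corpus-A markers first, corpus-B markers last.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus_a = [["tax", "cut", "tax"], ["tax", "growth"]]
corpus_b = [["health", "care", "tax"], ["health", "growth"]]
ranked = fighting_words_sketch(corpus_a, corpus_b)
```

In this toy input "growth" occurs once in each corpus and both corpora have the same token total, so its ζ is zero; "tax" ranks first and "health" last.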

topica.top_fighting_words ¶

top_fighting_words(corpus_a, corpus_b, *, n=20, **kwargs)

Convenience wrapper around `fighting_words` returning the n most distinctive words for each corpus: a dict {"a": [...], "b": [...]} where each value is a list of (word, zeta) pairs (the corpus-B list has the most negative z-scores first). Keyword arguments are passed through to fighting_words.
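
A rough sketch of the return shape, assuming the wrapper simply slices both ends of the sorted list. The helper name is hypothetical, and the library may handle ties or sign differently.

```python
def top_fighting_words_sketch(ranked, n=20):
    """Split a descending (word, zeta) list into per-corpus top-n lists."""
    top_a = ranked[:n]          # largest positive zeta first
    top_b = ranked[::-1][:n]    # most negative zeta first
    return {"a": top_a, "b": top_b}


ranked = [("tax", 1.5), ("cut", 0.5), ("growth", 0.0), ("care", -0.5), ("health", -0.6)]
top = top_fighting_words_sketch(ranked, n=2)
```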

Preprocessing¶

topica.tokenize `builtin` ¶

tokenize(text, *, lowercase=True, stopwords=None, token_regex=None, min_length=1)

Tokenize a string the way the corpus loader does: find regex tokens, optionally lowercase, drop short tokens and stopwords. Handy for building list[list[str]] input outside of Corpus.from_text_file.

text is the input string. token_regex is the token-matching pattern (None = the default word regex). min_length drops tokens shorter than that many characters.
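
The order of operations described above might look like the following sketch. The pattern `\w+` is an assumption standing in for the default word regex, which this page does not spell out, and `tokenize_sketch` is a hypothetical name.

```python
import re


def tokenize_sketch(text, *, lowercase=True, stopwords=None, token_regex=None, min_length=1):
    # Assumption: r"\w+" stands in for the library's default word regex.
    pattern = re.compile(token_regex or r"\w+")
    stop = set(stopwords or ())
    tokens = pattern.findall(text)                 # 1. find regex tokens
    if lowercase:
        tokens = [t.lower() for t in tokens]       # 2. optionally lowercase
    # 3. drop short tokens and stopwords
    return [t for t in tokens if len(t) >= min_length and t not in stop]


tokens = tokenize_sketch("The cat sat on the mat.", stopwords={"the", "on"}, min_length=2)
```

Because lowercasing happens before the stopword check in this sketch, a lowercase stopword list also removes capitalized occurrences such as "The".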

topica.split_documents ¶

split_documents(texts, metadata=None, *, max_words=200, min_words=50, sentence_aware=True)

Segment long documents into shorter chunks, propagating metadata.

Each source document is split into chunks of roughly max_words words; the metadata row for the source document is copied onto every chunk it produces, with two bookkeeping keys added: parent (the source document's index) and chunk (the chunk's position within that document). The chunked texts and chunked metadata stay aligned, so you can feed the chunks to a model and still aggregate or condition on the original document-level covariates.

Parameters:

Name	Type	Description	Default
`texts`	sequence of `str` (raw text) or `list[str]` (pre-tokenized)	Pre-tokenized input is chunked by token count and returned as token lists; raw strings are chunked and returned as strings.	required
`metadata`	sequence aligned with `texts`, optional	One entry per source document, typically a `dict` of covariates. A mapping is shallow-copied onto each chunk; a non-mapping value is stored under a `"metadata"` key. If omitted, chunk metadata carries just `parent` / `chunk`.	`None`
`max_words`	`int`	Target chunk length in words/tokens.	`200`
`min_words`	`int`	A trailing chunk shorter than this is merged back into the previous chunk (so no text is dropped and no runt chunks are produced). A whole document shorter than `min_words` is still emitted as a single chunk.	`50`
`sentence_aware`	`bool`	For raw-string input, pack whole sentences up to `max_words` rather than cutting mid-sentence (a sentence longer than `max_words` is hard-split). Ignored for pre-tokenized input.	`True`

Returns:

Type	Description
`(chunks, chunk_metadata)`	The chunked documents (same element type as the input) and a list of metadata dicts, aligned and the same length.
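
For pre-tokenized input, the chunking and metadata rules described above can be sketched as follows. This is a minimal illustration under the stated rules, not the library's code; `split_tokens_sketch` is a hypothetical name and sentence-aware packing of raw strings is not covered.

```python
def split_tokens_sketch(docs, metadata=None, *, max_words=200, min_words=50):
    chunks, chunk_meta = [], []
    for parent, tokens in enumerate(docs):
        pieces = [tokens[i:i + max_words] for i in range(0, len(tokens), max_words)] or [tokens]
        # Merge a short trailing piece into its predecessor so no text is dropped.
        if len(pieces) > 1 and len(pieces[-1]) < min_words:
            tail = pieces.pop()
            pieces[-1] = pieces[-1] + tail
        base = metadata[parent] if metadata is not None else {}
        for position, piece in enumerate(pieces):
            # Mappings are shallow-copied; other values go under a "metadata" key.
            row = dict(base) if isinstance(base, dict) else {"metadata": base}
            row.update(parent=parent, chunk=position)
            chunks.append(piece)
            chunk_meta.append(row)
    return chunks, chunk_meta


docs = [list("abcdefghijk"), list("xy")]   # 11 tokens and 2 tokens
meta = [{"party": "A"}, {"party": "B"}]
chunks, chunk_meta = split_tokens_sketch(docs, meta, max_words=5, min_words=3)
```

The 11-token document yields chunks of 5 and 6 tokens (the 1-token remainder is merged back), and the 2-token document is still emitted as a single chunk.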

topica.one_hot ¶

one_hot(values, *, drop_first=True, prefix='')

One-hot encode a categorical covariate for use as DMR features.

Given a sequence of category labels (one per document), returns (matrix, names) where matrix is a (num_docs, num_categories) float array of 0/1 indicators and names are the corresponding column names. With drop_first=True (default) the first category (sorted) is omitted as the reference level, which avoids collinearity with the DMR intercept. Pass the result straight to DMR.fit(docs, matrix, feature_names=names); combine multiple covariates with numpy.hstack.
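
The reference-level behaviour can be illustrated with a pure-Python sketch that returns nested lists instead of an array. The helper name and the column naming (prefix followed by the label) are assumptions for illustration.

```python
def one_hot_sketch(values, *, drop_first=True, prefix=""):
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]    # first sorted level becomes the reference
    names = [f"{prefix}{c}" for c in categories]
    matrix = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return matrix, names


matrix, names = one_hot_sketch(["dem", "rep", "ind", "dem"], prefix="party_")
```

Here "dem" sorts first and is dropped, so documents labelled "dem" get an all-zero row and are absorbed by the intercept.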

Prune rare (and optionally common) vocabulary from a corpus, keeping metadata row-aligned with the documents that survive — the analogue of R stm's prepDocuments.

topica.prep_documents ¶

prep_documents(corpus, meta=None, *, lower_thresh=1, upper_thresh=None, rm_top=0)

Filter rare (and optionally common) vocabulary from a corpus while keeping metadata row-aligned with the documents that survive.

This is topica's analogue of R stm's prepDocuments. Terms that appear in fewer than lower_thresh documents are dropped from the vocabulary; after dropping, documents that become empty are removed. The meta frame is subsetted to exactly the rows of the surviving documents, so the returned corpus and metadata stay one-to-one and in the same order. Feeding the returned meta straight into an STM prevalence design requires no further alignment.

Parameters:

Name	Type	Description	Default
`corpus`	`Corpus`	A `Corpus` built by `Corpus.from_documents` or `topica.from_dataframe`. The corpus may already carry a `corpus.metadata` attribute; if `meta` is also supplied, `meta` takes precedence and `corpus.metadata` is ignored.	required
`meta`	`pandas.DataFrame, polars.DataFrame, sequence, or numpy.ndarray`	Per-document covariates, one entry per document in `corpus` (before this call's filtering). Accepts a pandas or Polars DataFrame, a numpy array, or a plain list/sequence. When `None`, `corpus.metadata` is used if present; the returned metadata may then be `None` if neither is set.	`None`
`lower_thresh`	`int`	Minimum document frequency for a term to be kept. Terms appearing in fewer than `lower_thresh` documents are removed. `lower_thresh=1` keeps all terms (no filtering); `lower_thresh=2` drops terms that occur in only one document.	`1`
`upper_thresh`	`int or None`	Maximum document frequency for a term to be kept. Terms appearing in more than `upper_thresh` documents are removed. `None` disables the upper bound. Removing the N most frequent terms is handled separately by `rm_top`; `upper_thresh` is a raw document-count ceiling.	`None`
`rm_top`	`int`	Number of the most-frequent terms to remove (regardless of count). Mirrors the `rm_top` parameter of `Corpus.from_documents`.	`0`

Returns:

Name	Type	Description
`filtered_corpus`	`Corpus`	A new corpus with the rare-term vocabulary and empty documents removed. `filtered_corpus.kept_indices` reports which of the input corpus's document positions survived; `filtered_corpus.doc_lengths` is parallel to the returned `filtered_meta` rows.
`filtered_meta`	same type as `meta`, or `None`	The subset of `meta` (or `corpus.metadata`) rows corresponding to the surviving documents, in the same order. Guaranteed `len(filtered_meta) == len(filtered_corpus.doc_lengths)` when `meta` is not `None`.
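
The document-frequency filter and the row alignment it preserves can be sketched on plain token lists. This is an illustration of the idea only: `prep_documents_sketch` is a hypothetical helper that works on lists rather than a `Corpus`, and `rm_top` is omitted.

```python
from collections import Counter


def prep_documents_sketch(docs, meta, *, lower_thresh=1, upper_thresh=None):
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    def keep(term):
        df = doc_freq[term]
        return df >= lower_thresh and (upper_thresh is None or df <= upper_thresh)

    filtered = [[t for t in doc if keep(t)] for doc in docs]
    # Documents emptied by the vocabulary filter are removed.
    kept_indices = [i for i, doc in enumerate(filtered) if doc]
    return ([filtered[i] for i in kept_indices],
            [meta[i] for i in kept_indices],
            kept_indices)


docs = [["tax", "cut"], ["tax", "growth"], ["zebra"]]
meta = [{"year": 2001}, {"year": 2002}, {"year": 2003}]
kept_docs, kept_meta, kept_indices = prep_documents_sketch(docs, meta, lower_thresh=2)
```

With `lower_thresh=2` only "tax" survives, the third document becomes empty and is dropped, and the metadata keeps the two matching rows in order.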

Sweep document-frequency thresholds and visualize how many documents and vocabulary terms are removed at each level, to inform the choice of lower_thresh.

topica.plot_removed ¶

plot_removed(corpus, thresholds, *, ax=None)

Sweep document-frequency thresholds and plot how many documents and words are removed at each level (R stm's plotRemoved).

For each threshold value in thresholds, `prep_documents` is called and the numbers of removed documents and removed vocabulary terms are recorded. The result is a two-line chart that helps you choose a threshold: a very low threshold removes few items, while a high threshold may eliminate many documents whose only terms are rare, which would corrupt a downstream covariate analysis.

Parameters:

Name	Type	Description	Default
`corpus`	`Corpus`	The corpus to sweep. Passed unchanged to `prep_documents` at each threshold.	required
`thresholds`	`sequence of int`	Document-frequency thresholds to evaluate (x-axis). Typically a range such as `range(1, 10)`.	required
`ax`	`Axes`	Axes to draw into. When `None` a new figure is created.	`None`

Returns:

Name	Type	Description
`ax`	`Axes`	The primary axes (left y-axis = documents removed; right y-axis = words removed).

DataFrames & metadata¶

These accept pandas or Polars frames (and align also takes numpy arrays and lists), keeping document metadata aligned to the rows that survive pruning.

topica.from_dataframe ¶

from_dataframe(df, *, text_col, metadata_cols=None, tokenizer=None, stopwords=None, min_length=1, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0)

Build a `Corpus` from a pandas or Polars DataFrame, keeping per-document metadata aligned to the documents that survive pruning.

df[text_col] is tokenized (with tokenizer if given, otherwise `topica.tokenize`), a `Corpus` is built with the usual pruning options, and the surviving rows of metadata_cols (default: every column except text_col) are attached as corpus.metadata. That metadata is a DataFrame of the same kind you passed in (pandas in, pandas out; Polars in, Polars out), aligned one-to-one with the corpus documents, in the same row order. Feed it straight to an STM prevalence design with no manual alignment.

To turn that metadata into a design matrix with an R-style formula, pass corpus.metadata to `topica.design_matrix`, which needs the optional formulaic package (pip install "topica[formula]"); or build the design by hand with `topica.one_hot` / `topica.spline`, which need no extra dependency.

Parameters:

Name	Type	Description	Default
`df`	`pandas.DataFrame or polars.DataFrame`	One row per document.	required
`text_col`	`str`	Column holding the document text.	required
`metadata_cols`	`sequence[str]`	Columns to carry as aligned metadata. Defaults to all columns except `text_col`.	`None`
`tokenizer`	`callable`	`str -> list[str]`. Defaults to `topica.tokenize` with this function's `stopwords` and `min_length` arguments. This is also where you plug in lemmatization: the default tokenizer does not stem (stemming truncates words to roots like `militari`/`economi`, which read as broken in a topic table), so pass a lemmatizing tokenizer here if you want to merge inflections while keeping readable surface forms. See the preprocessing guide ("Readable topic words: lemmatize, don't stem").	`None`

topica.align ¶

align(x, corpus)

Realign an external covariate array, DataFrame, Series, or list to the documents a `Corpus` kept after pruning. Accepts pandas and Polars DataFrames/Series, numpy arrays, and plain lists.

Use it when your covariates were built against the original documents and the corpus dropped some during pruning:

corpus = topica.Corpus.from_documents(docs, min_doc_freq=5)
X = topica.align(X, corpus)          # now aligned to corpus rows
model.fit(corpus, X, prevalence_names=names)
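
Conceptually, realignment is a row selection by the positions of the surviving documents. The sketch below assumes those positions are available as a list of indices; both the helper name and the way the indices are obtained are illustrative assumptions, not the library's internals.

```python
def align_sketch(x, kept_indices):
    """Select the rows of x for the documents that survived pruning."""
    return [x[i] for i in kept_indices]


covariates = [[0.1, 1.0], [0.2, 0.0], [0.3, 1.0], [0.4, 0.0]]
kept_indices = [0, 2, 3]          # hypothetical: document 1 was pruned
aligned = align_sketch(covariates, kept_indices)
```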

topica.design_matrix ¶

design_matrix(formula, data, _knot_ctx=None)

Build a design matrix from an R-style formula and a data frame (pandas or Polars).

Returns (X, feature_names) where X is a (n_rows, p) float array and feature_names are the column labels. The intercept that formulaic adds is stripped, because `topica.estimate_effect` and the STM prevalence model add their own. Categorical columns become treatment-coded dummies; a * b / a:b expand interactions; spline(x, df=k) uses topica's restricted cubic spline. A Polars frame is converted to pandas for formulaic.

Requires the optional formulaic package: install it with pip install "topica[formula]" (or pip install formulaic). Without it, this raises ImportError at call time. If you would rather build the design matrix without that dependency, use `topica.one_hot`, `topica.spline`, and `topica.interaction` directly.

Parameters:

Name	Type	Description	Default
`formula`	`str`	R-style formula, e.g. `"~ party + spline(year, df=3)"`.	required
`data`	`pandas.DataFrame or polars.DataFrame`	One row per document; columns referenced in `formula` must be present.	required
`_knot_ctx`	`_KnotCapturingContext`	When supplied, the `spline` evaluations use the context's training mode so the knots are recorded. Pass the same object to `design_matrix_predict` to replay those knots on new data.	`None`

Embeddings¶

topica.llm_embed ¶

llm_embed(texts, model='text-embedding-3-small', *, key=None, batch=True, cache=None)

Embed texts with the llm library's embedding models, as a dense (n, dim) float array.

The embedding models in topica (BERTopic, Top2Vec, ETM, FASTopic) and `embedding_seeds` all take embeddings you supply; this is one way to produce them. model names any embedding model that llm (https://llm.datasette.io/) can reach: OpenAI's "text-embedding-3-small" / "text-embedding-3-large" (needs an API key), or a local model such as "sentence-transformers/all-MiniLM-L6-v2" via the llm-sentence-transformers plugin (no API, runs offline). Pass document texts for document embeddings, or the vocabulary for word embeddings.

By default the API key (for hosted embedders) is resolved by llm itself: a stored llm keys value, else the provider's environment variable (OPENAI_API_KEY for OpenAI). Pass key to override it explicitly.

Embeddings are costly, so pass cache=path to embed once and reuse: if the file exists and was saved for the same texts, it is loaded and no model is called; otherwise the embeddings are computed and written there (see `save_embeddings`).

Requires the optional llm package (pip install "topica[llm]"). The embeddings are the only thing topica needs from a model; everything downstream runs in the wheel.

topica.save_embeddings ¶

save_embeddings(path, embeddings, *, texts=None, model=None) -> str

Save an embedding matrix to a .npz file so a costly corpus is embedded once and reused.

embeddings is any (n, dim) array. When given, texts (one per row) is stored as a hash and model as a string, so `load_embeddings` and the cache= option of `llm_embed` can confirm a cache matches the current inputs. The path gets a .npz suffix if it lacks one; returns the path written. Works on any embeddings, not just those from `llm_embed`.

topica.load_embeddings ¶

load_embeddings(path, *, with_meta=False)

Load an embedding matrix saved by `save_embeddings`.

Returns the (n, dim) array, or (array, meta) when with_meta=True; meta carries model and texts_hash if they were saved. The .npz suffix is added if the path lacks one and the bare path does not exist.
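
The cache check described for these functions (a stored `texts_hash` and `model` compared against the current inputs) can be sketched like this. The hashing scheme shown, SHA-256 over a JSON encoding, is an assumption for illustration; the scheme the library actually uses is not documented on this page.

```python
import hashlib
import json


def texts_hash_sketch(texts):
    # Assumption: SHA-256 over a JSON encoding; the library's real scheme may differ.
    payload = json.dumps(list(texts), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cache_matches(meta, texts, model):
    """True when the saved metadata agrees with the current texts and model."""
    return meta.get("texts_hash") == texts_hash_sketch(texts) and meta.get("model") == model


texts = ["first document", "second document"]
meta = {"texts_hash": texts_hash_sketch(texts), "model": "text-embedding-3-small"}
hit = cache_matches(meta, texts, "text-embedding-3-small")
miss = cache_matches(meta, texts + ["third document"], "text-embedding-3-small")
```

Storing a hash rather than the texts keeps the cache file small while still detecting any change to the input.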

Phrases¶

topica.learn_phrases ¶

learn_phrases(docs: List[List[str]], *, min_count: int = 5, threshold: float = 10.0, scoring: str = 'default', delimiter: str = '_') -> Phrases

Learn a collocation (phrase) model from tokenized documents.

Counts unigram and adjacent-bigram frequencies across all documents; scores each candidate bigram and keeps those meeting both the min_count and threshold criteria.

Parameters:

Name	Type	Description	Default
`docs`	`list[list[str]]`	Tokenized documents — each document is a list of string tokens.	required
`min_count`	`int`	A bigram must appear at least this many times to be considered.	`5`
`threshold`	`float`	Minimum score for a bigram to be kept. For `scoring="default"` a value of `10.0` is a reasonable starting point; for `scoring="npmi"` use a value in `[-1, 1]` (e.g. `0.5`).	`10.0`
`scoring`	`"default"` or `"npmi"`	Which association measure to use (see module docstring for formulas).	`'default'`
`delimiter`	`str`	Character used to join tokens when transforming documents.	`'_'`

Returns:

Type	Description
`Phrases`	A fitted `Phrases` object whose `transform` method merges detected collocations in new documents.

Examples:

Bigrams:

p = learn_phrases(docs, min_count=5, threshold=10.0)
docs_bi = p.transform(docs)

Trigrams via composition (a second model learned on the bigram-merged documents):

p2 = learn_phrases(docs_bi, min_count=5, threshold=10.0)
docs_tri = p2.transform(docs_bi)
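
To make the count-then-score step concrete, here is a sketch of NPMI scoring, the standard normalized pointwise mutual information. The exact formulas live in the module docstring, which is not reproduced here, so the probability estimates below (counts divided by the total token count) and the helper name are assumptions.

```python
import math
from collections import Counter


def npmi_bigrams_sketch(docs, *, min_count=5, threshold=0.5):
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    total = sum(unigrams.values())
    kept = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue                   # fails the min_count criterion
        p_ab = count / total
        p_a, p_b = unigrams[a] / total, unigrams[b] / total
        # NPMI = PMI / -log p(a, b), which lies in [-1, 1].
        score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        if score >= threshold:
            kept[(a, b)] = score
    return kept


docs = [["new", "york", "is", "big"], ["new", "york", "is", "old"], ["the", "city", "is", "big"]]
phrases = npmi_bigrams_sketch(docs, min_count=2, threshold=0.9)
```

In this toy corpus "new" and "york" only ever occur together, so their NPMI is at the maximum of 1, while pairs involving the frequent word "is" score lower and fall under the 0.9 threshold.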

topica.apply_phrases ¶

apply_phrases(docs: List[List[str]], phrases: Phrases) -> List[List[str]]

Apply a `Phrases` model to tokenized documents.

Performs a greedy left-to-right scan of each document: whenever an adjacent pair (tok[i], tok[i+1]) is a known collocation the pair is merged into "tok[i]{delimiter}tok[i+1]" and the cursor advances by 2 (so the merged token cannot overlap with the next merge). Tokens not involved in a collocation are passed through unchanged.

Parameters:

Name	Type	Description	Default
`docs`	`list[list[str]]`	Tokenized documents.	required
`phrases`	`Phrases`	A fitted `Phrases` model (from `learn_phrases`).	required

Returns:

Type	Description
`list[list[str]]`	New documents with collocations merged.
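
The greedy left-to-right scan can be sketched as follows. For illustration the fitted model is replaced by a plain set of token pairs, and `apply_phrases_sketch` is a hypothetical name.

```python
def apply_phrases_sketch(docs, collocations, delimiter="_"):
    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in collocations:
                out.append(f"{doc[i]}{delimiter}{doc[i + 1]}")
                i += 2    # skip both tokens so merges cannot overlap
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


collocations = {("new", "york"), ("york", "city")}
merged = apply_phrases_sketch([["new", "york", "city", "hall"]], collocations)
```

Both ("new", "york") and ("york", "city") are known pairs, but the scan merges the leftmost one first and then resumes at "city", so "york" is not reused.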

topica.add_ngrams ¶

add_ngrams(docs, ngram_range=(1, 2), min_df=1, sep='_')

Expand pre-tokenized documents with contiguous n-grams.

The mechanical, exhaustive counterpart to `learn_phrases`: rather than keeping only statistically significant collocations, it emits every contiguous n-gram, mirroring scikit-learn's CountVectorizer(ngram_range=..., min_df=...). For each document and each n in range(min_n, max_n + 1) it adds the joined n-grams (e.g. "machine_learning"), then drops terms occurring in fewer than min_df documents. Use it before fitting an embedding model so its class-based TF-IDF topic words can include bigrams.

Parameters:

Name	Type	Description	Default
`docs`	`list[list[str]]`	List of token lists.	required
`ngram_range`	`(min_n, max_n)`	`(1, 2)` keeps unigrams and adds bigrams; `(2, 2)` is bigrams only.	`(1, 2)`
`min_df`	`int`	Drop terms appearing in fewer than this many documents (an integer document-frequency cut, as in scikit-learn). `1` keeps everything.	`1`
`sep`	`str`	The string joining the words of an n-gram.	`'_'`

Returns:

Type	Description
`list[list[str]]`	New token lists, one per input document. An emptied document stays as an empty list, so the result stays aligned with any per-document embeddings.
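
A minimal sketch of the expansion and the document-frequency cut, assuming n-grams are emitted in order of increasing n (the ordering within a document is not specified on this page). `add_ngrams_sketch` is a hypothetical name.

```python
from collections import Counter


def add_ngrams_sketch(docs, ngram_range=(1, 2), min_df=1, sep="_"):
    min_n, max_n = ngram_range
    expanded = []
    for doc in docs:
        terms = []
        for n in range(min_n, max_n + 1):
            # Every contiguous window of n tokens, joined with sep.
            terms.extend(sep.join(doc[i:i + n]) for i in range(len(doc) - n + 1))
        expanded.append(terms)
    # Document frequency counts each term once per document.
    doc_freq = Counter(term for terms in expanded for term in set(terms))
    return [[t for t in terms if doc_freq[t] >= min_df] for terms in expanded]


docs = [["machine", "learning", "rocks"], ["machine", "learning"], ["rocks"]]
result = add_ngrams_sketch(docs, min_df=2)
```

"learning_rocks" appears in only one document and is dropped by `min_df=2`, while the output keeps one list per input document.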
