Keywords & preprocessing
Distinguishing words
topica.fighting_words
Monroe-Colaresi-Quinn Fighting Words — words that distinguish corpus A from corpus B, with their statistical significance.
For each word the weighted log-odds-ratio with an informative Dirichlet prior
is computed and standardized to a z-score ζ:
δ_w = log[(y_Aw+α_w)/(n_A+α₀-y_Aw-α_w)] - log[(y_Bw+α_w)/(n_B+α₀-y_Bw-α_w)],
Var(δ_w) ≈ 1/(y_Aw+α_w) + 1/(y_Bw+α_w), ζ_w = δ_w / √Var(δ_w),
where y_·w are word counts, n_· are corpus token totals, and α₀ =
Σ_w α_w. A large positive ζ marks a word distinctive of corpus A; a
large negative ζ marks one distinctive of corpus B. Because the variance
term grows for rare words, their ζ is shrunk toward zero, so |ζ| > 1.96 is
a defensible ~95% cutoff.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| corpus_a | sequence of token lists (``list[list[str]]``) | | required |
| corpus_b | sequence of token lists (``list[list[str]]``) | | required |
| prior | float | The Dirichlet pseudocount. With … | 0.01 |
| min_count | int | Drop words whose combined count across both corpora is below this. | 1 |
Returns:
| Type | Description |
|---|---|
| list[(word, zeta)] | Sorted by descending ``zeta``: corpus-A markers at the top, corpus-B markers at the bottom. |
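The formulas above can be sketched in plain Python. This is an illustrative re-implementation, not topica's code: the function name `log_odds_z` is hypothetical, and it assumes a flat prior in which every α_w equals the scalar `prior`.

```python
import math
from collections import Counter


def log_odds_z(corpus_a, corpus_b, prior=0.01):
    """Return [(word, zeta)] sorted by descending zeta (hypothetical helper)."""
    counts_a = Counter(tok for doc in corpus_a for tok in doc)
    counts_b = Counter(tok for doc in corpus_b for tok in doc)
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    alpha0 = prior * len(vocab)  # assumption: alpha_w = prior for every word

    scored = []
    for w in vocab:
        ya = counts_a[w] + prior  # y_Aw + alpha_w
        yb = counts_b[w] + prior  # y_Bw + alpha_w
        delta = (math.log(ya / (n_a + alpha0 - ya))
                 - math.log(yb / (n_b + alpha0 - yb)))
        variance = 1.0 / ya + 1.0 / yb
        scored.append((w, delta / math.sqrt(variance)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


corpus_a = [["tax", "tax", "cut"], ["tax", "the"]]
corpus_b = [["care", "care", "the"], ["care", "cut"]]
scores = log_odds_z(corpus_a, corpus_b)
```

With these toy corpora, "tax" ends up at the top of the list and "care" at the bottom, while words used equally often in both corpora score zero.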
topica.top_fighting_words
Convenience wrapper around fighting_words returning the n most
distinctive words for each corpus: a dict {"a": [...], "b": [...]} where
each value is a list of (word, zeta) pairs (the corpus-B list has the most
negative z-scores first). Keyword args are passed through to fighting_words.
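As a rough sketch of the return shape described above, the following splits an already-scored list into the two ends. The helper name `top_each` is hypothetical, and whether the real wrapper also filters by sign or significance is not stated here, so none is applied.

```python
def top_each(scored, n=10):
    """Split [(word, zeta)] into the n strongest markers for each corpus."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return {
        "a": ranked[:n],        # largest positive zeta first
        "b": ranked[::-1][:n],  # most negative zeta first
    }


scored = [("tax", 2.5), ("cut", 0.1), ("the", -0.2), ("care", -3.0)]
top = top_each(scored, n=2)
```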
Preprocessing
topica.tokenize (builtin)
Tokenize a string the way the corpus loader does: find regex tokens,
optionally lowercase, drop short tokens and stopwords. Handy for building
list[list[str]] input outside of Corpus.from_text_file.
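The steps named above (regex tokens, optional lowercasing, dropping short tokens and stopwords) can be approximated with the standard library. This is only a sketch: the function name `simple_tokenize`, the `\w+` pattern, and the parameter names are assumptions, and topica's own regex and defaults may differ.

```python
import re

TOKEN_RE = re.compile(r"\w+")  # assumed pattern; the library's regex may differ


def simple_tokenize(text, lowercase=True, min_length=2, stopwords=frozenset()):
    """Find regex tokens, optionally lowercase, drop short tokens and stopwords."""
    tokens = TOKEN_RE.findall(text)
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    return [tok for tok in tokens
            if len(tok) >= min_length and tok not in stopwords]


tokens = simple_tokenize("The cat sat on a mat.", stopwords={"the", "on"})
```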
topica.split_documents
Segment long documents into shorter chunks, propagating metadata.
Each source document is split into chunks of roughly max_words words; the
metadata row for the source document is copied onto every chunk it produces,
with two bookkeeping keys added: parent (the source document's index) and
chunk (the chunk's position within that document). The chunked texts and
chunked metadata stay aligned, so you can feed the chunks to a model and still
aggregate or condition on the original document-level covariates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| texts | sequence of ``str`` (raw text) or ``list[str]`` (pre-tokenized) | Pre-tokenized input is chunked by token count and returned as token lists; raw strings are chunked and returned as strings. | required |
| metadata | sequence aligned with ``texts``, optional | One entry per source document, typically a … | None |
| max_words | int | Target chunk length in words/tokens. | 200 |
| min_words | int | A trailing chunk shorter than this is merged back into the previous chunk (so no text is dropped and no runt chunks are produced). A whole document shorter than … | 50 |
| sentence_aware | bool | For raw-string input, pack whole sentences up to … | True |
Returns:
| Type | Description |
|---|---|
| (chunks, chunk_metadata) | The chunked documents (same element type as the input) and a list of metadata dicts, aligned and the same length. |
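The chunking and metadata propagation can be illustrated with a small sketch for pre-tokenized input only. The helper names are hypothetical, sentence-aware packing is left out, and keeping a short document as a single chunk is an assumption, since the description of that case is cut off above.

```python
def chunk_tokens(tokens, max_words=200, min_words=50):
    """Cut a token list into max_words pieces, merging a short trailing piece."""
    chunks = [tokens[i:i + max_words] for i in range(0, len(tokens), max_words)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail  # no text dropped, no runt chunk
    return chunks


def split_with_metadata(docs, metadata, max_words=200, min_words=50):
    """Chunk every document and copy its metadata onto each chunk."""
    out_chunks, out_meta = [], []
    for parent, (tokens, meta) in enumerate(zip(docs, metadata)):
        for pos, chunk in enumerate(chunk_tokens(tokens, max_words, min_words)):
            out_chunks.append(chunk)
            # bookkeeping keys: source document index and chunk position
            out_meta.append({**meta, "parent": parent, "chunk": pos})
    return out_chunks, out_meta


docs = [["w"] * 430, ["w"] * 120]
metadata = [{"author": "x"}, {"author": "y"}]
chunks, chunk_meta = split_with_metadata(docs, metadata)
```

Here the 430-token document yields chunks of 200 and 230 tokens, because the 30-token remainder is below `min_words` and is merged into the previous chunk.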
topica.one_hot
One-hot encode a categorical covariate for use as DMR features.
Given a sequence of category labels (one per document), returns
(matrix, names) where matrix is a (num_docs, num_categories)
float array of 0/1 indicators and names are the corresponding column
names. With drop_first=True (default) the first category (sorted) is
omitted as the reference level, which avoids collinearity with the DMR
intercept. Pass the result straight to DMR.fit(docs, matrix,
feature_names=names); combine multiple covariates with
numpy.hstack.
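The encoding with a dropped reference level can be sketched as follows. The function name `one_hot_rows` is hypothetical, plain nested lists stand in for the float array, and using the bare category labels as column names is an assumption about naming.

```python
def one_hot_rows(labels, drop_first=True):
    """Return (matrix, names) of 0/1 indicators, one row per label."""
    levels = sorted(set(labels))
    if drop_first:
        levels = levels[1:]  # first sorted category becomes the reference level
    matrix = [[1.0 if label == level else 0.0 for level in levels]
              for label in labels]
    return matrix, levels


matrix, names = one_hot_rows(["b", "a", "c", "a"])
```

Documents in the reference category ("a" here) get an all-zero row, which is what keeps the columns from being collinear with an intercept.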
DataFrames & metadata
These accept pandas or Polars frames (and align also takes numpy arrays and
lists), keeping document metadata aligned to the rows that survive pruning.
topica.from_dataframe
from_dataframe(df, *, text_col, metadata_cols=None, tokenizer=None, stopwords=None, min_length=1, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0)
Build a Corpus from a pandas or Polars DataFrame, keeping
per-document metadata aligned to the documents that survive pruning.
df[text_col] is tokenized (with tokenizer if given, otherwise
topica.tokenize) and a Corpus is built with the usual pruning options.
The surviving rows of metadata_cols (default: every column except text_col)
are attached as corpus.metadata, a DataFrame of the same kind you passed in
(pandas in, pandas out; Polars in, Polars out), aligned one-to-one with the
corpus documents, in the same row order. Feed that metadata straight to an
STM prevalence design with no manual alignment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas or Polars DataFrame | One row per document. | required |
| text_col | str | Column holding the document text. | required |
| metadata_cols | sequence[str] | Columns to carry as aligned metadata. Defaults to all columns except text_col. | None |
| tokenizer | callable | | None |
topica.align
Realign an external covariate array, DataFrame, Series, or list to the
documents a Corpus kept after pruning. Accepts pandas and Polars
DataFrames/Series, numpy arrays, and plain lists.
Use it when your covariates were built against the original documents and the corpus dropped some during pruning:

    import topica

    # docs, X, names, and model are assumed to be defined already
    corpus = topica.Corpus.from_documents(docs, min_doc_freq=5)
    X = topica.align(X, corpus)  # now aligned to corpus rows
    model.fit(corpus, X, prevalence_names=names)
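To show why realignment is needed, here is a standalone sketch of the idea. It does not use topica: `prune` and `align_rows` are hypothetical helpers, and the pruning rule (remove words below a document-frequency floor, then drop documents left empty) is an assumption about how documents get dropped.

```python
from collections import Counter


def prune(docs, min_doc_freq=2):
    """Drop rare words, then drop documents left empty; return survivors and their indices."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    kept_docs, kept_idx = [], []
    for i, doc in enumerate(docs):
        filtered = [word for word in doc if doc_freq[word] >= min_doc_freq]
        if filtered:
            kept_docs.append(filtered)
            kept_idx.append(i)
    return kept_docs, kept_idx


def align_rows(rows, kept_idx):
    """Keep only the covariate rows whose documents survived pruning."""
    return [rows[i] for i in kept_idx]


docs = [["tax", "cut"], ["zebra"], ["tax", "care"], ["care", "cut"]]
X = [[0.1], [0.2], [0.3], [0.4]]  # one covariate row per original document
kept_docs, kept_idx = prune(docs)
X_aligned = align_rows(X, kept_idx)
```

The second document contains only a rare word, so it is dropped and its covariate row has to go with it.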
topica.design_matrix
Build a design matrix from an R-style formula and a data frame
(pandas or Polars).
Returns (X, feature_names) where X is a (n_rows, p) float array
and feature_names are the column labels. The intercept that formulaic
adds is stripped, because topica.estimate_effect and the STM
prevalence model add their own. Categorical columns become treatment-coded
dummies; a * b and a:b expand to interactions; spline(x, df=k) uses
topica's restricted cubic spline. A Polars frame is converted to pandas for
formulaic.
Requires the optional formulaic package.
Embeddings
topica.llm_embed
Embed texts with the llm library's embedding models, as a dense
(n, dim) float array.
The embedding models in topica (BERTopic, Top2Vec, ETM,
FASTopic) and embedding_seeds all take embeddings you supply; this
is one way to produce them. model names any embedding model that
llm (https://llm.datasette.io/) can reach: OpenAI's
"text-embedding-3-small" or "text-embedding-3-large" (needs an API key), or a
local model such as "sentence-transformers/all-MiniLM-L6-v2" via the
llm-sentence-transformers plugin (no API, runs offline). Pass document
texts for document embeddings, or the vocabulary for word embeddings.
Embeddings are costly, so pass cache=path to embed once and reuse: if the
file exists and was saved for the same texts, it is loaded and no model is
called; otherwise the embeddings are computed and written there (see
save_embeddings).
Requires the optional llm package (pip install "topica[llm]"). The
embeddings are the only thing topica needs from a model; everything downstream
runs in the wheel.
topica.save_embeddings
Save an embedding matrix to a .npz file so a costly corpus is embedded
once and reused.
embeddings is any (n, dim) array. When given, texts (one per row)
is stored as a hash and model as a string, so load_embeddings and
llm_embed's cache= can confirm a cache matches the current inputs.
The path gets a .npz suffix if it lacks one; returns the path written.
Works on any embeddings, not just those from llm_embed.
topica.load_embeddings
Load an embedding matrix saved by save_embeddings.
Returns the (n, dim) array, or (array, meta) when with_meta=True;
meta carries model and texts_hash if they were saved. The .npz
suffix is added if the path lacks one and the bare path does not exist.
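The cache check and suffix handling described for these functions can be sketched with the standard library. All three helper names are hypothetical, and the hashing scheme (SHA-256 over the texts with a separator byte) is an assumption; how topica actually hashes texts is not documented here.

```python
import hashlib
from pathlib import Path


def texts_hash(texts):
    """Hash a sequence of texts; the separator keeps ["ab"] distinct from ["a", "b"]."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def with_npz_suffix(path):
    """Append .npz unless the path already ends with it."""
    path = Path(path)
    return path if path.suffix == ".npz" else path.with_name(path.name + ".npz")


def cache_is_valid(meta, texts, model):
    """True when saved metadata matches the texts and model being requested."""
    return meta.get("texts_hash") == texts_hash(texts) and meta.get("model") == model


meta = {"texts_hash": texts_hash(["first doc", "second doc"]), "model": "m"}
hit = cache_is_valid(meta, ["first doc", "second doc"], "m")
miss = cache_is_valid(meta, ["first doc", "changed doc"], "m")
```

Appending the suffix, instead of replacing an existing one, keeps a name like `emb.v1` from silently turning into `emb.npz`.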
Phrases
topica.learn_phrases
learn_phrases(docs: List[List[str]], *, min_count: int = 5, threshold: float = 10.0, scoring: str = 'default', delimiter: str = '_') -> Phrases
Learn a collocation (phrase) model from tokenized documents.
Counts unigram and adjacent-bigram frequencies across all documents; scores
each candidate bigram and keeps those meeting both the min_count and
threshold criteria.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| docs | list[list[str]] | Tokenized documents; each document is a list of string tokens. | required |
| min_count | int | A bigram must appear at least this many times to be considered. | 5 |
| threshold | float | Minimum score for a bigram to be kept. For … | 10.0 |
| scoring | ``"default"`` or ``"npmi"`` | Which association measure to use (see module docstring for formulas). | 'default' |
| delimiter | str | Character used to join tokens when transforming documents. | '_' |
Returns:
| Type | Description |
|---|---|
| Phrases | A fitted Phrases model. |
Examples:
Bigrams:

    # docs is a list of token lists
    p = learn_phrases(docs, min_count=5, threshold=10.0)
    docs_bi = p.transform(docs)

Trigrams via composition:

    # learn on the bigram-merged documents to pick up three-word phrases
    p2 = learn_phrases(docs_bi, min_count=5, threshold=10.0)
    docs_tri = p2.transform(docs_bi)
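For intuition about the counting and filtering step, here is a standalone sketch. It assumes a Mikolov-style score, (count(a, b) - min_count) * vocab_size / (count(a) * count(b)), of the kind gensim's Phrases uses; topica's actual "default" and "npmi" formulas live in its module docstring and may differ. The function name `score_bigrams` is hypothetical.

```python
from collections import Counter


def score_bigrams(docs, min_count=5, threshold=10.0):
    """Count unigrams and adjacent bigrams; keep bigrams passing both criteria."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))  # adjacent pairs only
    vocab_size = len(unigrams)

    kept = {}
    for (a, b), count_ab in bigrams.items():
        if count_ab < min_count:
            continue
        # assumed Mikolov-style association score
        score = (count_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score >= threshold:
            kept[(a, b)] = score
    return kept


docs = [
    ["new", "york", "pizza"], ["new", "york", "subway"], ["new", "york", "pizza"],
    ["good", "pizza"], ["good", "subway"], ["good", "pizza"],
]
kept = score_bigrams(docs, min_count=2, threshold=0.5)
```

In this toy corpus only ("new", "york") survives: the other pairs either occur fewer than `min_count` times or score zero once `min_count` is subtracted.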
topica.apply_phrases
Apply a Phrases model to tokenized documents.
Performs a greedy left-to-right scan of each document: whenever an adjacent
pair (tok[i], tok[i+1]) is a known collocation the pair is merged into
"tok[i]{delimiter}tok[i+1]" and the cursor advances by 2 (so the merged
token cannot overlap with the next merge). Tokens not involved in a
collocation are passed through unchanged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| docs | list[list[str]] | Tokenized documents. | required |
| phrases | Phrases | A fitted Phrases model. | required |
Returns:
| Type | Description |
|---|---|
| list[list[str]] | New documents with collocations merged. |
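The greedy left-to-right scan described for apply_phrases can be sketched as below. A plain set of (first, second) tuples stands in for the fitted Phrases model, and the function name `merge_bigrams` is hypothetical.

```python
def merge_bigrams(docs, bigrams, delimiter="_"):
    """Merge known adjacent pairs, scanning each document left to right."""
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in bigrams:
                merged.append(doc[i] + delimiter + doc[i + 1])
                i += 2  # skip both tokens so merges cannot overlap
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


bigrams = {("new", "york"), ("york", "city")}
result = merge_bigrams([["new", "york", "city", "hall"]], bigrams)
```

Because the cursor jumps past "york" after the first merge, ("york", "city") is never considered, which is the non-overlap behaviour described above.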