Models¶
All models share the same API shape: construct with hyperparameters and a
seed, call fit(documents, ...), then read topic_word (φ) and doc_topic (θ),
call top_words(n) and coherence(n), and persist with save / load.
topica.LDA ¶
SparseLDA topic model (the MALLET algorithm).
Construct with the hyperparameters, then call :meth:fit on a
:class:Corpus or a list of token lists. After fitting, the estimated
distributions are available as :attr:topic_word (φ) and
:attr:doc_topic (θ).
topic_divergence
property
¶
Pairwise Jensen-Shannon divergence between topic-word distributions,
shape (num_topics, num_topics) (base 2, in [0, 1]; 0 on the diagonal).
Low off-diagonal values flag near-duplicate topics.
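A minimal sketch of the quantity described above, written against plain Python lists rather than the library's arrays. The helper names `js_divergence` and `pairwise_js` and the toy matrix `phi` are illustrative assumptions, not part of topica.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base 2) between two distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

def pairwise_js(topic_word):
    """Square matrix of divergences between every pair of topic rows."""
    k = len(topic_word)
    return [[js_divergence(topic_word[i], topic_word[j]) for j in range(k)]
            for i in range(k)]

# Hypothetical 3-topic, 4-word matrix: topics 0 and 1 are near-duplicates.
phi = [
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]
div = pairwise_js(phi)
```

Here `div[0][1]` is far smaller than `div[0][2]`, which is the pattern that flags topics 0 and 1 as candidates for merging.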
coherence
method descriptor
¶
UMass topic coherence for each topic, shape (num_topics,).
Intrinsic (no external corpus): for each topic's top-n words,
Σ_{i>j} log[(codoc(w_i,w_j)+1)/docfreq(w_j)] over the training corpus.
Higher (closer to 0) is more coherent. numpy.mean(...) gives the
usual single-number summary.
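The formula above can be sketched in a few lines of standard-library Python. This is an illustration of the UMass sum only, assuming `top_words` is ordered from most to least probable and that every top word occurs in at least one document; the function name and the toy corpus are hypothetical.

```python
import math

def umass_coherence(top_words, documents):
    """Sum over i > j of log[(codoc(w_i, w_j) + 1) / docfreq(w_j)]."""
    doc_sets = [set(doc) for doc in documents]

    def docfreq(word):
        return sum(1 for d in doc_sets if word in d)

    def codoc(a, b):
        return sum(1 for d in doc_sets if a in d and b in d)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((codoc(w_i, w_j) + 1) / docfreq(w_j))
    return score

docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "cat"],
]
coherent = umass_coherence(["cat", "dog"], docs)     # words that co-occur often
mixed = umass_coherence(["cat", "market"], docs)     # words that rarely co-occur
```

The pair that co-occurs in more documents scores higher, matching the "higher is more coherent" reading.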
diagnostics
method descriptor
¶
Per-topic diagnostics (MALLET-style), one dict per topic, suitable for
pandas.DataFrame(model.diagnostics()).
Keys mirror MALLET's topic diagnostics: topic, tokens (assignments to
the topic), coherence (UMass), exclusivity (mean top-word share of φ
vs. other topics; higher = more distinctive), effective_words
(exp(H(φ_t)), MALLET's eff_num_words; lower = more focused),
document_entropy (entropy of the topic's token allocation across
documents), uniform_dist (KL of φ_t from uniform) and corpus_dist
(KL of φ_t from the corpus word distribution), rank1_docs (documents
whose dominant topic is this one), alpha, and top_words.
evaluate
method descriptor
¶
Held-out evaluation via the Wallach et al. (2009) left-to-right
estimator (the method MALLET's evaluate-topics uses).
data is a held-out :class:Corpus or list[list[str]]; its tokens are
matched to the training vocabulary by string (out-of-vocabulary tokens
are dropped). Returns a dict with log_likelihood (total held-out log
P(data)), perplexity (exp(-LL / num_tokens), lower is better),
num_tokens (scored), and num_oov (dropped). Cost grows with the
square of document length, so keep num_particles modest.
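As a quick check on the perplexity definition above, the following sketch recomputes it from a log-likelihood and a token count. The function is a stand-alone illustration, not the library's implementation; the uniform-model numbers are made up for the sanity check.

```python
import math

def perplexity(log_likelihood, num_tokens):
    """exp(-LL / num_tokens); lower is better."""
    if num_tokens <= 0:
        raise ValueError("no in-vocabulary tokens were scored")
    return math.exp(-log_likelihood / num_tokens)

# Sanity check: a model that gives every token probability 1/V has
# perplexity V (up to floating-point rounding).
vocab_size = 50
num_tokens = 1000
uniform_ll = num_tokens * math.log(1.0 / vocab_size)
uniform_ppl = perplexity(uniform_ll, num_tokens)
```

Because OOV tokens are dropped before scoring, `num_tokens` here should be the scored count, not the raw length of the held-out data.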
fit
method descriptor
¶
fit(data, *, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)
Run Gibbs sampling on data, then average num_samples snapshots
(taken sample_interval iterations apart) into the final φ/θ estimates.
data may be a :class:Corpus or a list of token lists (each a list of
strings). When a token list is passed, an internal corpus is built with
no frequency filtering; to filter by frequency, build a :class:Corpus explicitly.
progress, if given, is called as progress(iteration, ll_per_token)
every progress_interval iterations during the main loop.
load_state
staticmethod
¶
Reconstruct a fitted model from a MALLET-format Gibbs state file (the
inverse of :meth:save_state; MALLET's --input-state). The file may
be gzip-compressed or plain text. The vocabulary, documents, per-token
topic assignments, and the #alpha/#beta hyperparameters are read
back, so the loaded model supports the full read-only surface
(topic_word, doc_topic, top_words, …) and transform on new
documents, and can re-emit the state with :meth:save_state.
log_likelihood
method descriptor
¶
MALLET-formula model log-likelihood of the final sampler state.
perplexity
method descriptor
¶
Held-out perplexity (lower is better) — convenience wrapper over
:meth:evaluate. See evaluate for data/num_particles semantics.
save
method descriptor
¶
Save the fitted model to path (compact binary). Reload with LDA.load.
save_doc_topic
method descriptor
¶
Write document-topic probabilities to a TSV file (the train CLI format).
save_state
method descriptor
¶
Write the token-level Gibbs state to a gzipped file in MALLET's
--output-state format: a header, the #alpha/#beta hyperparameter
lines, then one row per token — doc source pos typeindex type topic —
giving the final topic assignment of every token in the training corpus.
Researchers pipe this into custom visualizations (e.g. pyLDAvis) or
corpus metrics.
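For readers who want to consume the state file directly, here is a hedged sketch of a reader for the layout described above. It assumes space-separated fields with no spaces inside the source or type columns, and it writes a tiny synthetic file so the example is self-contained; `read_state` is a hypothetical helper, not a topica function.

```python
import gzip
import os
import tempfile
from collections import Counter

def read_state(path):
    """Return (header_lines, rows) from a gzipped MALLET-style state file."""
    header, rows = [], []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("#"):
                # Column header plus the #alpha / #beta hyperparameter lines.
                header.append(line)
                continue
            doc, source, pos, typeindex, word, topic = line.split(" ")
            rows.append((int(doc), source, int(pos), int(typeindex), word, int(topic)))
    return header, rows

# Synthetic two-document state, invented for illustration.
sample = "\n".join([
    "#doc source pos typeindex type topic",
    "#alpha : 0.5 0.5",
    "#beta : 0.01",
    "0 NA 0 0 cat 0",
    "0 NA 1 1 dog 0",
    "1 NA 0 2 stock 1",
    "1 NA 1 0 cat 0",
]) + "\n"
path = os.path.join(tempfile.mkdtemp(), "state.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    fh.write(sample)

header, rows = read_state(path)
tokens_per_topic = Counter(topic for *_, topic in rows)
```

Counting the last column, as `tokens_per_topic` does, gives the per-topic token totals from the token-level assignments.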
save_topic_word
method descriptor
¶
Write topic-word probabilities to a TSV file (the train CLI format).
similar_documents
method descriptor
¶
The n training documents most similar to document doc (by index),
as (doc_name, divergence) pairs sorted by ascending Jensen-Shannon
divergence of their document-topic distributions.
top_documents
method descriptor
¶
The n training documents most strongly associated with topic, as
(doc_name, weight) pairs sorted by descending θ for that topic.
top_words
method descriptor
¶
Top n words per topic as (word, probability) pairs.
Returns a list of n-length lists (one per topic), or — when topic
is given — just that topic's list.
transform
method descriptor
¶
Infer document-topic distributions for new, unseen documents under the
fitted model (sklearn-style transform). data is a :class:Corpus or
list[list[str]]; tokens are matched to the training vocabulary by
string (OOV dropped). A document with no in-vocabulary tokens gets the
prior θ. Returns an array of shape (num_new_docs, num_topics) whose
rows sum to 1.
topica.DMR ¶
Dirichlet-Multinomial Regression topic model (Mimno & McCallum, 2008).
Like :class:LDA, but the per-document topic prior is a log-linear function
of document features: α_{d,t} = exp(λ_t · x_d). After fitting, the
learned weights are available as :attr:feature_effects — how each covariate
shifts each topic's prevalence.
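The log-linear prior can be made concrete with a small worked example. The sketch below assumes the weight layout documented for feature_effects, shape (num_topics, num_features) with the intercept in column 0; the function name and the numbers are invented for illustration.

```python
import math

def dmr_prior(lam, x):
    """alpha_{d,t} = exp(lambda_t . x_d) for one document.

    lam: one row of weights per topic, intercept weight first.
    x:   the document's features without the intercept.
    """
    x_full = [1.0] + list(x)  # prepend the intercept term
    return [math.exp(sum(w * v for w, v in zip(row, x_full))) for row in lam]

# Two topics, one binary covariate (hypothetical weights).
lam = [
    [-1.0, 0.8],   # topic 0: the covariate raises its prior
    [-1.0, -0.8],  # topic 1: the covariate lowers its prior
]
alpha_with = dmr_prior(lam, [1.0])
alpha_without = dmr_prior(lam, [0.0])
```

A positive weight multiplies that topic's prior by exp(weight) per unit of the feature, so topic 0 becomes more prevalent when the covariate is 1 and topic 1 less so.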
feature_effects
property
¶
Learned feature weights λ, shape (num_topics, num_features) — how
each feature (column 0 is the intercept) shifts each topic's log-prior.
Positive ⇒ the feature raises that topic's prevalence.
feature_names
property
¶
Feature names aligned with the columns of :attr:feature_effects
("intercept" first).
fit
method descriptor
¶
fit(data, features, *, feature_names=None, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)
Fit the model. data is a :class:Corpus or list[list[str]];
features is a (num_docs, F) numpy array or list of float lists (an
intercept column is prepended automatically). feature_names (length F)
names the columns; an "intercept" name is prepended.
save
method descriptor
¶
Save the fitted model to path (compact binary). Reload with DMR.load.
top_words
method descriptor
¶
Top n words per topic as (word, probability) pairs (all topics, or
one when topic is given).
transform
method descriptor
¶
transform(data, features=None, *, iterations=100, burn_in=10, num_samples=10, sample_interval=5, seed=None)
Infer topic proportions θ for new documents by collapsed Gibbs against
the fitted topic-word matrix. data is a :class:Corpus or
list[list[str]]; OOV tokens are dropped. features (optional, a
(num_docs, F) covariate array matching training, no intercept) sets
each document's Dirichlet prior α_{d,t} = exp(λ_t · x_d); if omitted the
intercept-only baseline prior is used. Returns (num_docs, num_topics).
topica.LabeledLDA ¶
Supervised topic model (Ramage et al., 2009): each document carries a set of labels, each label is a topic, and a document's tokens are constrained to its labels' topics. The number of topics is the number of distinct labels.
Documents with an empty label set are treated as unconstrained (all topics).
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); for each
document only its label topics are non-zero, and rows sum to 1.
fit
method descriptor
¶
fit(data, labels, *, label_names=None, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)
Fit the model. data is a :class:Corpus or list[list[str]];
labels is a list (one per document) of label lists. The topic set is
the union of all labels (or label_names, which also fixes topic order).
An empty label list leaves that document unconstrained.
top_words
method descriptor
¶
Top n words for one topic (by label name or index) or all topics.
transform
method descriptor
¶
Infer label (topic) proportions θ for new documents by collapsed Gibbs
against the fitted topic-word matrix, treating every label as available
(unsupervised inference). data is a :class:Corpus or
list[list[str]]; OOV tokens are dropped. Returns (num_docs,
num_topics); columns align with :attr:labels.
topica.SAGE ¶
Content-covariate topic model (SAGE / the STM content model).
Topics are shared, but each topic's word distribution varies by a
document-level group covariate, so you can read how a topic is worded
differently across groups. Construct, then :meth:fit on documents plus a
per-document group label.
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.
topic_word
property
¶
Topic-word distributions per group, shape (num_topics, num_groups, num_words).
topic_word_marginal
property
¶
Group-averaged topic-word matrix, shape (num_topics, num_words).
coherence
method descriptor
¶
UMass topic coherence per topic (group-averaged), shape (num_topics,).
fit
method descriptor
¶
fit(data, groups, *, group_names=None, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)
Fit the model. data is a :class:Corpus or list[list[str]];
groups is a per-document group label (strings or ints), one per
document. group_names fixes the group order (defaults to sorted union).
top_words
method descriptor
¶
Top n words for a topic. With group (name or index) given, uses that
group's word distribution; otherwise the group-averaged distribution.
word_contrast
method descriptor
¶
Words that most distinguish how topic is worded in group_a vs
group_b, by log-ratio of the two groups' word probabilities. Returns
(word, log_ratio) — positive favours group_a.
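The log-ratio contrast can be illustrated on two hand-written word distributions. This sketch assumes strictly positive probabilities and returns the n words most favouring group_a; how the library orders or truncates its output is not stated above, so treat the ranking rule here as an assumption.

```python
import math

def word_contrast(vocab, phi_a, phi_b, n=2):
    """(word, log_ratio) pairs, largest log(p_a / p_b) first."""
    ratios = [(w, math.log(pa / pb)) for w, pa, pb in zip(vocab, phi_a, phi_b)]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)[:n]

# One topic worded differently by two hypothetical groups.
vocab = ["tax", "relief", "burden", "reform"]
phi_a = [0.4, 0.4, 0.1, 0.1]
phi_b = [0.4, 0.1, 0.4, 0.1]
contrast = word_contrast(vocab, phi_a, phi_b)
```

Words both groups use equally ("tax", "reform") get a log-ratio of 0, so only the group-specific wording stands out.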
topica.CTM ¶
Correlated Topic Model (Blei & Lafferty; the STM core). Topics are drawn from a logistic-normal prior with a full covariance, so they can correlate — unlike LDA's Dirichlet. Fit by variational EM (STM's Laplace E-step).
This is the engine STM builds on; prevalence/content covariates layer on top.
bound
property
¶
Final variational bound (approximate ELBO) at convergence — the quantity
R stm reports as convergence$bound.
bound_history
property
¶
The variational bound after each EM iteration (the convergence trajectory). Its length is the number of iterations actually run.
converged
property
¶
True if EM stopped on the em_tol criterion; False if it hit the
em_iters cap first (the fit may not have converged).
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.
eta_cov
property
¶
Per-document variational posterior covariances ν of η, shape
(num_docs, num_topics-1, num_topics-1).
eta_mean
property
¶
Per-document variational posterior means λ of the logistic-normal η,
shape (num_docs, num_topics-1). Pairs with :attr:eta_cov to sample
θ draws (method-of-composition uncertainty).
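A sketch of how θ draws could be produced from one document's posterior mean and covariance. It assumes the last topic is the reference category with its η fixed at 0, which is why η has num_topics-1 entries; the helper names and the numbers are illustrative, and the Cholesky routine is a textbook version written out to keep the example dependency-free.

```python
import math
import random

def cholesky(a):
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(a[i][i] - s)
            else:
                chol[i][j] = (a[i][j] - s) / chol[j][j]
    return chol

def eta_to_theta(eta):
    """Softmax over eta with a 0 appended for the reference topic."""
    full = list(eta) + [0.0]
    top = max(full)
    exps = [math.exp(v - top) for v in full]  # subtract max for stability
    total = sum(exps)
    return [v / total for v in exps]

def sample_theta(mean, cov, num_draws, rng):
    """Draw eta ~ N(mean, cov) and map each draw to the simplex."""
    chol = cholesky(cov)
    draws = []
    for _ in range(num_draws):
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        eta = [m + sum(chol[i][k] * z[k] for k in range(i + 1))
               for i, m in enumerate(mean)]
        draws.append(eta_to_theta(eta))
    return draws

# One document in a hypothetical 3-topic model (eta has 2 entries).
mean = [1.0, -0.5]
cov = [[0.2, 0.05], [0.05, 0.1]]
draws = sample_theta(mean, cov, 200, random.Random(0))
```

Summaries computed across `draws` (for example, the spread of each topic's share) then carry the posterior uncertainty that a single point estimate of θ hides.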
topic_correlation
property
¶
Topic-correlation matrix from the logistic-normal Σ, shape
(num_topics, num_topics). Off-diagonal entries are genuine topic
correlations (the whole point of CTM vs. LDA).
fit
method descriptor
¶
Fit by variational EM. data is a :class:Corpus or list[list[str]].
EM runs until the relative change in the variational bound drops below
em_tol (R stm's emtol) or em_iters iterations are reached,
whichever comes first. Pass em_tol=0 to always run em_iters steps.
Check :attr:converged and :attr:bound afterward.
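The stopping rule can be sketched independently of the model. The loop below takes any iterator of bound values and applies the relative-change test with a strict comparison, so a tolerance of 0 never triggers; `run_em` and `toy_bounds` are hypothetical stand-ins, not the library's internals.

```python
def toy_bounds():
    """A made-up bound sequence that rises towards -1000."""
    t = 0
    while True:
        yield -1000.0 - 500.0 * 0.5 ** t
        t += 1

def run_em(bounds, em_iters=500, em_tol=1e-5):
    """Return (bound_history, converged) under the em_tol / em_iters rule."""
    history = []
    for _ in range(em_iters):
        history.append(next(bounds))  # one EM iteration yields one bound
        if len(history) >= 2:
            old, new = history[-2], history[-1]
            if old != 0 and abs(new - old) / abs(old) < em_tol:
                return history, True
    return history, False

history, converged = run_em(toy_bounds())
capped_history, capped = run_em(toy_bounds(), em_iters=10)
fixed_history, fixed = run_em(toy_bounds(), em_iters=50, em_tol=0)
```

In the second call the cap is hit first, so the flag is False and the history has exactly em_iters entries, the situation the converged attribute warns about.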
top_words
method descriptor
¶
Top n words per topic (or one topic) as (word, probability) pairs.
transform
method descriptor
¶
Infer topic proportions θ for new documents by the variational E-step
against the fitted globals (β, logistic-normal prior μ, Σ). data is a
:class:Corpus or list[list[str]]; tokens outside the training
vocabulary are dropped. Returns a (num_docs, num_topics) array.
topica.STM ¶
Structural Topic Model (Roberts, Stewart & Tingley). The correlated-topic
core (:class:CTM) with prevalence covariates: a document's prior topic
mean is a regression on its covariates, μ_d = X_d γ, so covariates shift
which topics a document discusses. After fitting, prevalence_effects holds
the learned γ; pair it with topica.stm.estimate_effect for inference.
bound
property
¶
Final variational bound (approximate ELBO) at convergence — the quantity
R stm reports as convergence$bound.
bound_history
property
¶
The variational bound after each EM iteration (the convergence trajectory). Its length is the number of iterations actually run.
converged
property
¶
True if EM stopped on the em_tol criterion; False if it hit the
em_iters cap first (the fit may not have converged).
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.
eta_cov
property
¶
Per-document variational posterior covariances ν of η, shape
(num_docs, num_topics-1, num_topics-1).
eta_mean
property
¶
Per-document variational posterior means λ of η, shape
(num_docs, num_topics-1). With :attr:eta_cov this is the
logistic-normal posterior used to draw θ samples for
method-of-composition uncertainty in estimate_effect.
feature_names
property
¶
Covariate names aligned with the rows of :attr:prevalence_effects
("intercept" first).
prevalence_effects
property
¶
Prevalence coefficients γ, shape (num_features, num_topics-1) — how
each covariate (row 0 is the intercept) shifts each topic's log-prior.
The last topic is the softmax reference. For inference, prefer
topica.stm.estimate_effect(model.doc_topic, X).
topic_correlation
property
¶
Topic-correlation matrix, shape (num_topics, num_topics).
topic_word_by_group
property
¶
Per-group topic-word distributions, shape (num_topics, num_groups,
num_words) — only available when fit with content covariates.
fit
method descriptor
¶
fit(data, prevalence=None, *, prevalence_names=None, content=None, content_names=None, em_iters=500, em_tol=1e-05)
Fit. data is a :class:Corpus or list[list[str]]. prevalence
(optional, (num_docs, F) covariates) makes topic prevalence depend on
covariates (μ_d = X_d γ); an intercept is prepended. content
(optional, one group label per document) makes the topic-word
distributions vary by group (the SAGE content model). At least one of
prevalence/content should be given (else use :class:CTM).
EM runs until the relative change in the variational bound drops below
em_tol (R stm's emtol) or em_iters iterations are reached,
whichever comes first — matching stm's convergence behavior rather than
a fixed iteration count. Pass em_tol=0 to always run em_iters
steps. Inspect :attr:converged and :attr:bound after fitting.
top_words
method descriptor
¶
Top n words per topic (or one topic) as (word, probability) pairs.
transform
method descriptor
¶
Infer topic proportions θ for new documents by the variational E-step
against the fitted globals (β and the logistic-normal prior). data is a
:class:Corpus or list[list[str]]; out-of-vocabulary tokens are
dropped. Returns a (num_docs, num_topics) array.
Note: the prior mean used is the covariate-free baseline μ learned at fit
time (prevalence covariates for held-out docs are not applied here), and
for a content model the marginal topic-word β is used. This is the same
held-out inference stm's fitNewDocuments performs when no new covariate
design is supplied.
word_contrast
method descriptor
¶
Words that most distinguish how topic is worded in group_a vs
group_b (log word-probability ratio; positive favours group_a).
Requires content covariates.
topica.HDP ¶
Hierarchical Dirichlet Process topic model (Teh, Jordan, Beal & Blei 2006):
LDA that infers the number of topics rather than fixing it. Fit by the
direct-assignment Gibbs sampler (the Chinese Restaurant Franchise). The two
concentration parameters alpha (document level) and gamma (corpus level)
govern how readily new topics appear; by default both are resampled from the
data (a faithful port of blei-lab/hdp), so you typically don't tune them.
concentration_history
property
¶
The learned-concentration trace: (iteration, alpha, gamma) triples
sampled during fit (only informative when resample_conc=True). Empty
if tracing was disabled.
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.
log_likelihood_history
property
¶
The convergence trace: (iteration, per-token log-likelihood) pairs
sampled during fit. Empty if tracing was disabled.
topic_count_history
property
¶
The topic-discovery trajectory: (iteration, num_topics) pairs sampled
during fit. Watching K stabilize is the nonparametric model's headline
convergence check (it grows and shrinks before settling). Sampled every
report_interval sweeps (auto ≈ 50 points); empty if disabled.
fit
method descriptor
¶
Fit by Gibbs sampling for iters sweeps. data is a :class:Corpus or
list[list[str]]. The inferred topic count is available as num_topics.
top_words
method descriptor
¶
Top n words per topic (or one topic) as (word, probability) pairs.
transform
method descriptor
¶
Infer topic proportions θ for new documents over the discovered topics,
by collapsed Gibbs against the fixed topic-word matrix. data is a
:class:Corpus or list[list[str]]; OOV tokens are dropped. The
document-level prior is symmetric with total mass equal to the learned
concentration α. Returns a (num_docs, num_topics) array.
topica.DTM ¶
Dynamic Topic Model (Blei & Lafferty 2006): topics whose word distributions
evolve across time slices. Each topic-word chain follows a Gaussian
state-space model; inference is variational with Kalman smoothing, a faithful
port of Blei's C dtm / gensim's LdaSeqModel. After fitting, query a
topic's word distribution at any slice with topic_word(time) and trace a
word's trajectory with word_evolution(topic, word).
fit
method descriptor
¶
Fit by variational EM. data is a :class:Corpus or list[list[str]];
times gives each document's integer time-slice index (0-based,
contiguous). The number of slices is inferred as max(times) + 1.
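Raw timestamps such as publication years usually need to be mapped to 0-based contiguous indices first. The helpers below are a hypothetical preprocessing sketch, not part of topica: one converts arbitrary sortable time values to slice indices, the other checks the contiguity requirement and applies the max(times) + 1 rule.

```python
def to_slice_indices(values):
    """Map sortable time values to 0-based contiguous slice indices."""
    order = {v: i for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

def num_slices(times):
    """Validate 0-based contiguous indices and return max(times) + 1."""
    if not times:
        raise ValueError("times is empty")
    if min(times) < 0:
        raise ValueError("slice indices must be non-negative")
    n = max(times) + 1
    missing = sorted(set(range(n)) - set(times))
    if missing:
        raise ValueError(f"no documents in slices {missing}")
    return n

years = [2019, 2021, 2019, 2024]
times = to_slice_indices(years)
slices = num_slices(times)
```

Note that this mapping discards the size of the gaps between years; if unevenly spaced periods matter, bin the documents into equal-width periods before indexing.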
top_words
method descriptor
¶
Top n words for a topic at one time slice as (word, probability).
topic_word
method descriptor
¶
Topic-word matrix at time slice time, shape (num_topics, num_words);
rows sum to 1.
word_drift
method descriptor
¶
Which words inside topic drift most between two time slices.
For each word, the change in its probability within the topic from
from_time to to_time (defaults: the first and last slices) is
computed. Returns a dict with two keys, "rising" and "falling",
each a list of (word, delta) pairs (largest gain first; largest drop
first). This is how you see what makes a topic's vocabulary evolve, not
just that it does.
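The drift computation can be illustrated with two hand-written slices of one topic. This sketch returns the same dict shape described above; splitting words by the sign of their change and truncating to n per list are assumptions about details the text does not spell out.

```python
def word_drift(vocab, phi_from, phi_to, n=2):
    """Largest gains and largest drops in word probability between two slices."""
    deltas = [(w, after - before)
              for w, before, after in zip(vocab, phi_from, phi_to)]
    rising = sorted((d for d in deltas if d[1] > 0),
                    key=lambda pair: pair[1], reverse=True)[:n]
    falling = sorted((d for d in deltas if d[1] < 0),
                     key=lambda pair: pair[1])[:n]
    return {"rising": rising, "falling": falling}

# A hypothetical "communication" topic at its first and last slice.
vocab = ["telegraph", "radio", "television", "internet"]
phi_first = [0.50, 0.30, 0.15, 0.05]
phi_last = [0.05, 0.15, 0.30, 0.50]
drift = word_drift(vocab, phi_first, phi_last)
```

The topic keeps its identity while "internet" displaces "telegraph", which is the kind of vocabulary turnover the method is meant to surface.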
word_evolution
method descriptor
¶
Trajectory of a word's probability in a topic across slices, shape
(num_times,). word is a vocabulary string or its integer id.
topica.SupervisedLDA ¶
Supervised LDA (Blei & McAuliffe 2007): LDA in which each document carries a
real-valued response y_d ~ N(ηᵀ z̄_d, σ²) regressed on its topic usage.
Fitting is supervised by the response, so topics are shaped to be predictive
and the coefficients η report how each topic moves y. Fit by variational
EM; predict returns ŷ for new documents.
coefficients
property
¶
Regression coefficients η, shape (num_topics,) — how each topic moves
the response (in the response's units, per unit of topic frequency).
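To read the coefficients, it helps to see the linear predictor written out. The sketch below computes the mean of the response model, ηᵀ z̄, for a single document; the coefficient values and topic frequencies are invented, and the function is an illustration rather than the library's predict.

```python
def predict_response(eta, topic_freq):
    """Mean response: dot product of coefficients and topic frequencies."""
    return sum(e * z for e, z in zip(eta, topic_freq))

# Hypothetical 3-topic model.
eta = [2.0, -1.0, 0.5]
baseline = predict_response(eta, [0.50, 0.25, 0.25])

# Shift 0.1 of the document's tokens from topic 1 to topic 0.
shifted = predict_response(eta, [0.60, 0.15, 0.25])
effect = shifted - baseline
```

Because topic frequencies sum to 1, raising one topic's share means lowering another's, so the change in ŷ depends on the difference between the two coefficients (here 0.1 × (2.0 − (−1.0)) = 0.3).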
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.
fit
method descriptor
¶
Fit by variational EM. data is a :class:Corpus or list[list[str]];
y is the per-document real-valued response (length = number of docs).
predict
method descriptor
¶
Predict the response ŷ for new documents (list[list[str]] or a
:class:Corpus). Out-of-vocabulary words are ignored. Returns a 1-D array
of length = number of documents.
top_words
method descriptor
¶
Top n words per topic (or one topic) as (word, probability) pairs.
transform
method descriptor
¶
Infer topic proportions θ for new documents by collapsed Gibbs against
the fitted topic-word matrix. The response is not used; this is the
unsupervised E-step. data is a :class:Corpus or list[list[str]];
out-of-vocabulary (OOV) tokens are dropped. Returns an array of shape
(num_docs, num_topics). To predict the response for new documents, take
transform(data) @ coefficients (η), or call predict directly.
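The prediction rule described above is a plain matrix-vector product between the inferred topic proportions and the regression coefficients. A minimal sketch with NumPy, where theta and eta are made-up stand-ins for the outputs of transform and the coefficients property (no fitted model is involved):

```python
import numpy as np

# Hypothetical stand-ins: theta plays the role of model.transform(new_docs),
# eta the role of model.coefficients. The values are invented for illustration.
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])   # (num_docs, num_topics), rows sum to 1
eta = np.array([2.0, 0.0, -1.0])      # (num_topics,)

# Predicted response: one value per document.
y_hat = theta @ eta                   # shape (num_docs,)
```

A document dominated by a topic with a positive coefficient gets a higher predicted response, which is how the coefficients can be read as per-topic effects.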
topica.PT ¶
Pseudo-document Topic Model (Zuo et al. 2016) for short texts. Documents
are aggregated into num_pseudo pseudo-documents that carry the topic
distributions, so the topic structure is estimated from richer aggregated
statistics than individual short documents would provide. Collapsed Gibbs.
topica.GSDMM ¶
GSDMM — the "Movie Group Process" (Yin & Wang 2014). A mixture model for
short texts (tweets, survey answers, headlines) where each document
belongs to exactly one topic, not a mixture. You set an upper bound K on
the number of clusters; empty clusters die out during sampling, so the
effective num_topics is inferred from the data (≤ K). Handles the sparsity
of short documents far better than LDA.
cluster_count_history
property
¶
The cluster-discovery trajectory: (iteration, num_clusters) pairs over
the fit. The Movie Group Process starts from num_topics clusters and
empties most of them; watching the count collapse to a stable value is
its headline convergence check. Sampled every report_interval sweeps
(auto ≈ 50 points); empty if disabled.
doc_cluster
property
¶
Hard cluster assignment of each document, shape (num_docs,); values in
0..num_topics - 1. GSDMM gives each document a single cluster.
doc_topic
property
¶
Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.
log_likelihood_history
property
¶
The convergence trace: (iteration, per-token log-likelihood) pairs
(each document scored under its assigned cluster). Empty if disabled.
topic_word
property
¶
Topic-word matrix φ, shape (num_topics, num_words), restricted to the clusters still in use.
fit
method descriptor
¶
Fit by the Movie Group Process (collapsed Gibbs) for iters sweeps.
report_interval controls how often the traces (cluster_count_history /
log_likelihood_history) are sampled: 0 = auto (about 50 points); a
positive value n records a point every n sweeps.
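The convergence check on the cluster count can be automated once the trace is available. A minimal sketch, assuming the trace is a list of (iteration, num_clusters) pairs as described for cluster_count_history; the helper name clusters_stabilized and the window size are illustrative and not part of topica:

```python
def clusters_stabilized(history, window=5):
    """Return True if the last `window` recorded cluster counts are identical.

    `history` is a list of (iteration, num_clusters) pairs.
    """
    if len(history) < window:
        return False
    tail = [count for _, count in history[-window:]]
    return len(set(tail)) == 1


# Invented trace: the count collapses from 40 clusters and settles at 8.
history = [(0, 40), (10, 22), (20, 12), (30, 9), (40, 8),
           (50, 8), (60, 8), (70, 8), (80, 8)]
```

If the check fails at the end of a run, fitting for more sweeps is the usual next step.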
topica.SeededLDA ¶
Seeded LDA (guided topic modeling): you supply a few seed words per topic
and the model is steered so those topics form around them, while the rest of
each topic's vocabulary (and any residual unseeded topics) is still learned.
Useful when theory tells you which themes to expect (Jagarlamudi et al. 2012;
the seeding follows koheiw/seededlda — seed words get a weight × 100
prior pseudocount in their topic).
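One way to picture the seeding is as a topic-word prior matrix in which seed words receive an extra weight × 100 pseudocount in their own topic. The sketch below is an illustration only: the function name, the base value beta, the default weight, and the choice to add the pseudocount on top of the base prior are all assumptions, not topica's implementation.

```python
import numpy as np


def seeded_topic_word_prior(vocab, seeds, residual=1, beta=0.01, weight=0.01):
    """Build a (num_topics, num_words) prior with boosted seed words.

    `seeds` maps a topic name to its seed words. `beta` and `weight` are
    hypothetical defaults chosen for this example.
    """
    word_id = {w: i for i, w in enumerate(vocab)}
    # Seeded topics first, then the unseeded residual topics.
    names = list(seeds) + [f"residual_{i + 1}" for i in range(residual)]
    prior = np.full((len(names), len(vocab)), beta)
    for k, words in enumerate(seeds.values()):
        for w in words:
            if w in word_id:                      # skip seeds not in the vocabulary
                prior[k, word_id[w]] += weight * 100
    return names, prior


vocab = ["tax", "budget", "vote", "school"]
seeds = {"economy": ["tax", "budget"], "politics": ["vote"]}
names, prior = seeded_topic_word_prior(vocab, seeds)
```

Residual topics keep the flat base prior, so their vocabulary is learned from the data alone.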
topic_names
property
¶
The topic labels: the seed names you gave, then residual_1 … for any
unseeded topics.
fit
method descriptor
¶
Fit by collapsed Gibbs for iters sweeps. Seeded topics come first (in
the order given), then the residual topics.
doc_topic_prior (optional, (num_docs, num_topics)) supplies a
per-document asymmetric Dirichlet prior α_{d,k} that replaces the
symmetric alpha, biasing each document's topic mixture toward chosen
topics (e.g. from a document embedding). It is a prior, so the sampler
can still move a document away from it.
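A doc_topic_prior array has to be strictly positive with one row per document and one column per topic. The sketch below shows one possible way to derive it from per-document topic affinity scores (for example, similarities between a document embedding and topic anchors). The helper name, the softmax transformation, and the base_alpha and strength parameters are assumptions made for illustration.

```python
import numpy as np


def prior_from_scores(scores, base_alpha=0.1, strength=1.0):
    """Turn (num_docs, num_topics) affinity scores into a positive prior.

    Each row is passed through a softmax, scaled by `strength`, and added
    to a flat `base_alpha` so that no entry is ever zero.
    """
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return base_alpha + strength * probs


# Invented scores: document 0 leans toward topic 0, document 1 toward topic 2.
scores = [[2.0, 0.0, 0.0],
          [0.0, 0.0, 3.0]]
doc_topic_prior = prior_from_scores(scores)
```

A larger strength makes the prior more informative; keeping it small leaves the sampler freer to move a document away from the suggested topics.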
topica.KeyATM ¶
Keyword-Assisted Topic Model (keyATM Base). Like LDA, but some topics carry a researcher-supplied keyword list; a token in a keyword topic comes either from a distribution over only that topic's keywords or from the topic's full distribution. This anchors keyword topics to their keywords while still learning the rest of the vocabulary. Faithful to keyATM/keyATM.
alpha_history
property
¶
Trace of the estimated document-topic prior α as (iteration, alpha)
pairs, where alpha is the length-K asymmetric prior at that sweep —
keyATM's plot_alpha / values_iter$alpha_iter. Base model only;
empty for the covariate model (which traces λ) and dynamic model.
feature_effects
property
¶
Covariate model: learned DMR coefficients λ, shape (num_topics, F+1);
column 0 is the intercept. Raises if the model was fit without covariates.
feature_names
property
¶
Covariate model: names aligned with feature_effects columns
("intercept" first). Empty for the base model.
keyword_rate
property
¶
Per-topic keyword switch rate π_k (the share of a keyword topic's mass
drawn from its keyword distribution); 0 for regular topics.
log_likelihood_history
property
¶
Convergence trace as a list of (iteration, log_likelihood, perplexity)
triples — the three columns of keyATM's model_fit (plot_modelfit).
log_likelihood is the collapsed marginal log-likelihood and
perplexity is exp(-log_likelihood / total_weighted_tokens), both on
R keyATM's scale. Sampled every report_interval sweeps during
:meth:fit (auto ≈ 50 points). Empty if tracing was disabled.
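The perplexity column follows directly from the log-likelihood column through the formula above. A minimal sketch of that relation; the function name is illustrative and the numbers are invented:

```python
import math


def perplexity(log_likelihood, total_weighted_tokens):
    """exp(-log_likelihood / total_weighted_tokens), as in the trace."""
    return math.exp(-log_likelihood / total_weighted_tokens)


# Invented example: 100 weighted tokens, each contributing log(1/8).
ll = 100 * math.log(1 / 8)
ppl = perplexity(ll, 100)
```

Lower perplexity means the model assigns higher probability to the tokens; a trace that flattens out suggests the sampler has converged.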
pi_history
property
¶
Trace of the per-topic keyword switch rate π as (iteration, pi) pairs
(pi length K, 0 for regular topics) — keyATM's plot_pi /
values_iter$pi_iter. Empty for a keyword-free model.
time_labels
property
¶
Dynamic model: the distinct, sorted timestamp labels, one per time segment (length T). Empty for non-dynamic models.
time_prevalence
property
¶
Dynamic model: smoothed topic prevalence per time segment, shape
(T, num_topics), rows sum to 1, aligned with time_labels. Raises if
the model was fit without timestamps.
time_state
property
¶
Dynamic model: the latent HMM state (regime) of each time segment, length
T, aligned with time_labels. Empty for non-dynamic models.
transition_matrix
property
¶
Dynamic model: the left-to-right state transition matrix, shape
(num_states, num_states). Raises if fit without timestamps.
fit
method descriptor
¶
fit(data, *, iters=1500, covariates=None, feature_names=None, timestamps=None, num_states=5, weights='information-theory', num_threads=1, optimize_interval=50, burn_in=200, prior_variance=1.0, lbfgs_iters=20, report_interval=0, prior_offset=None)
Fit by collapsed Gibbs for iters sweeps. Keyword topics come first (in
the order given), then any regular topics.
Pass covariates (a (num_docs, F) array or list of float lists) for
the covariate keyATM: the document-topic prior becomes a
Dirichlet-multinomial regression, α_{d,k} = exp(x_d · λ_k) (an
intercept is prepended). feature_names (length F) labels the columns;
the learned λ is exposed as feature_effects. With no covariates,
this is the base keyATM, whose asymmetric α is estimated during fitting
(see alpha_history).
Pass timestamps (one value per document) for the dynamic keyATM: a
Chib (1998) change-point HMM lets topic prevalence shift over time across
num_states latent regimes. Documents are sorted by timestamp internally;
the smoothed prevalence path is exposed as time_prevalence (aligned with
time_labels) and the per-segment regime as time_state. timestamps
and covariates are mutually exclusive.
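The covariate prior α_{d,k} = exp(x_d · λ_k) can be reproduced from the covariate matrix and a coefficient matrix laid out like feature_effects (column 0 is the intercept). A minimal sketch with NumPy; the function name dmr_alpha and the example values are illustrative, not part of topica:

```python
import numpy as np


def dmr_alpha(covariates, lam):
    """Per-document Dirichlet prior from a Dirichlet-multinomial regression.

    covariates: (num_docs, F) array.
    lam:        (num_topics, F + 1) array, column 0 is the intercept.
    Returns an array of shape (num_docs, num_topics).
    """
    X = np.asarray(covariates, dtype=float)
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the intercept column
    return np.exp(X1 @ np.asarray(lam, dtype=float).T)


# Invented example: one binary covariate, two topics.
X = [[0.0], [1.0]]
lam = [[0.0, 0.0],            # topic 0: unaffected by the covariate
       [0.0, np.log(2.0)]]    # topic 1: prior doubles when the covariate is 1
alpha = dmr_alpha(X, lam)
```

Because of the exponential link, a coefficient acts multiplicatively on the prior weight of its topic.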
weighted_lda
staticmethod
¶
Weighted LDA — keyATM's weightedLDA: a keyword-free model with no
keyword topics, so it is plain LDA fit with keyATM's token weighting and
estimated asymmetric α (collapsed Gibbs). Use it as the unsupervised
baseline next to a keyword-assisted :class:KeyATM. fit it the same
way (the weights argument controls the token weighting); the
keyword-specific outputs (keyword_rate, pi_history) are empty.
topica.PA ¶
Pachinko Allocation Model (Li & McCallum 2006): a DAG of num_super
super-topics over num_sub shared sub-topics over words, capturing topic
correlations — super_sub reports which sub-topics each super-topic groups
together. Collapsed Gibbs over (super, sub) pairs.
super_sub
property
¶
Super-topic → sub-topic association, shape (num_super, num_sub); row s
shows which sub-topics super-topic s groups together (the correlations).
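Reading the strongest sub-topics off each row is a simple sort. A minimal sketch, assuming super_sub is a (num_super, num_sub) array in which larger values mean stronger association; the helper name and the matrix values are invented:

```python
import numpy as np


def top_subtopics(super_sub, n=2):
    """Indices of the n most strongly associated sub-topics per super-topic."""
    order = np.argsort(-np.asarray(super_sub, dtype=float), axis=1)
    return order[:, :n].tolist()


# Invented association matrix: 2 super-topics over 3 sub-topics.
super_sub = [[0.10, 0.60, 0.30],
             [0.50, 0.05, 0.45]]
groups = top_subtopics(super_sub, n=2)
```

Sub-topics that appear together in a row's top entries are the ones the model treats as correlated.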
topica.HLDA ¶
Hierarchical LDA (Blei, Griffiths & Jordan): topics organized in a tree of
fixed depth, inferred by the nested Chinese Restaurant Process. The root is
the shared (general) topic; deeper nodes are progressively more specific.
Each document follows a root-to-leaf path. Inspect the tree with
topic_word/node_levels/node_parents/doc_paths.
doc_paths
property
¶
Each document's root-to-leaf path (a list of node ids), length num_docs.
node_parents
property
¶
The parent node id of each node (-1 for the root), length num_nodes.
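A parent array like node_parents is enough to walk from any node back to the root. A minimal sketch in plain Python, using an invented tree; the helper name path_to_root is illustrative:

```python
def path_to_root(node, node_parents):
    """Return the root-to-node path, following parent ids until -1."""
    path = []
    while node != -1:
        path.append(node)
        node = node_parents[node]
    return path[::-1]


# Invented tree: node 0 is the root, nodes 1 and 2 are its children,
# nodes 3 and 4 hang under node 1, node 5 under node 2.
node_parents = [-1, 0, 0, 1, 1, 2]
path = path_to_root(4, node_parents)
depth = len(path) - 1     # number of edges between the root and node 4
```

The same walk gives the level of every node, which is useful when printing the tree with indentation.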
fit
method descriptor
¶
Fit by nested-CRP collapsed Gibbs sampling for iters sweeps.
top_words
method descriptor
¶
Top n words for one topic node as (word, probability) pairs.
topica.Corpus ¶
A preprocessed, integer-encoded document collection.
Build one from already-tokenised documents with
:meth:Corpus.from_documents, from a raw text file with
:meth:Corpus.from_text_file, or load a binary corpus written by the
preprocess CLI with :meth:Corpus.load.
kept_indices
property
¶
Original document indices that survived pruning, parallel to the rows of
this corpus. Use it to realign an external covariate array or DataFrame
to the documents the corpus actually kept: X = X[corpus.kept_indices].
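The realignment is ordinary integer-array indexing. A minimal sketch with NumPy, where kept_indices is a hard-coded stand-in for corpus.kept_indices and the covariate values are invented:

```python
import numpy as np

# One covariate row per ORIGINAL document (4 documents before pruning).
X = np.array([[0.1], [0.2], [0.3], [0.4]])

# Stand-in for corpus.kept_indices: document 1 was dropped during pruning.
kept_indices = [0, 2, 3]

# Keep only the rows for surviving documents, in corpus order.
X_aligned = X[kept_indices]
```

After this step X_aligned has exactly one row per corpus document, which is the shape the covariates argument of fit expects.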
metadata
property
¶
Optional per-document metadata, already aligned to the surviving rows
(set by :func:topica.from_dataframe, or assign your own). None if
unset.
from_documents
staticmethod
¶
from_documents(documents, *, doc_names=None, doc_labels=None, stopwords=None, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0)
Build a corpus from pre-tokenised documents.
documents is a sequence of token lists. Optional doc_names /
doc_labels (each the same length as documents) attach an id and a
label to every document. stopwords are dropped. Vocabulary is pruned by
min_doc_freq (minimum document frequency) and max_doc_fraction
(maximum fraction of documents), by min_cf (minimum collection/total
frequency), and by rm_top (drop the N most frequent words) — matching
tomotopy's min_df / min_cf / rm_top.
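The pruning rules can be illustrated in plain Python. This is a sketch of the filters as described, not topica's implementation: the function name, the order in which the filters are applied, and the tie-breaking for rm_top are assumptions.

```python
from collections import Counter


def prune_vocabulary(documents, stopwords=(), min_doc_freq=1,
                     max_doc_fraction=1.0, min_cf=0, rm_top=0):
    """Drop stopwords, then filter words by document and collection frequency."""
    stop = set(stopwords)
    docs = [[t for t in doc if t not in stop] for doc in documents]

    cf = Counter(t for doc in docs for t in doc)        # collection frequency
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    n_docs = len(docs)

    keep = {w for w in cf
            if df[w] >= min_doc_freq
            and df[w] <= max_doc_fraction * n_docs
            and cf[w] >= min_cf}

    # Remove the rm_top most frequent surviving words (ties broken alphabetically).
    top = sorted(keep, key=lambda w: (-cf[w], w))[:rm_top]
    keep -= set(top)

    return [[t for t in doc if t in keep] for doc in docs]


documents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
pruned = prune_vocabulary(documents, stopwords=["the"], min_doc_freq=2)
```

Documents can end up empty after pruning, which is why a corpus tracks which original documents survived.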
from_text_file
staticmethod
¶
from_text_file(path, *, format='plain', id_field=False, id_column=0, label_column=1, text_column=2, token_regex=None, stopwords=None, min_doc_freq=1, max_doc_fraction=1.0)
Load and tokenise a raw text file (MALLET-style), matching the
preprocess CLI.
format is "plain" (one document per line) or "tsv". In plain
mode, id_field=True treats the first whitespace token as the doc id.
In tsv mode, id_column/label_column/text_column select columns
(label_column=None disables labels).
load
staticmethod
¶
Load a binary corpus file written by the preprocess CLI or
:meth:save.
save
method descriptor
¶
Write this corpus to a binary file (the preprocess format), so it
can be reused by the CLI tools or reloaded with :meth:load.