Models

All models share the same API shape: construct the model with hyperparameters and a seed, call fit(documents, ...), then read topic_word (φ) and doc_topic (θ), query top_words(n) and coherence(n), and persist with save / load.

topica.LDA

SparseLDA topic model (the MALLET algorithm).

Construct with the hyperparameters, then call fit on a Corpus or a list of token lists. After fitting, the estimated distributions are available as topic_word (φ) and doc_topic (θ).

alpha property

alpha

Per-topic α after (optional) optimisation, shape (num_topics,).

beta property

beta

The (optimised) symmetric β.

doc_names property

doc_names

Document ids, parallel to the rows of doc_topic.

doc_topic property

doc_topic

Document-topic probability matrix θ, shape (num_docs, num_topics).

num_topics property

num_topics

topic_divergence property

topic_divergence

Pairwise Jensen-Shannon divergence between topic-word distributions, shape (num_topics, num_topics) (base 2, in [0, 1]; 0 on the diagonal). Low off-diagonal values flag near-duplicate topics.
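
To make the quantity concrete, here is a minimal pure-Python sketch of a base-2 Jensen-Shannon divergence matrix over the rows of a topic-word matrix. The helper names (js_divergence, pairwise_js) and the toy φ are illustrative assumptions, not part of topica; the library computes topic_divergence internally.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing (0 * log 0 is taken as 0).
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    # Mixture distribution m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pairwise_js(topic_word):
    """Square matrix of divergences between every pair of topic rows."""
    k = len(topic_word)
    return [[js_divergence(topic_word[i], topic_word[j]) for j in range(k)]
            for i in range(k)]

# Toy topic-word matrix (hypothetical values): 3 topics over 4 words.
phi = [
    [0.70, 0.20, 0.10, 0.00],
    [0.69, 0.21, 0.10, 0.00],  # near-duplicate of topic 0
    [0.00, 0.00, 0.50, 0.50],
]
div = pairwise_js(phi)
```

In this toy matrix, div[0][1] is close to 0, which is the pattern that flags topics 0 and 1 as near-duplicates, while div[0][2] is much larger.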

topic_word property

topic_word

Topic-word probability matrix φ, shape (num_topics, num_words).

vocabulary property

vocabulary

The vocabulary: the word for each column of topic_word.

coherence method descriptor

coherence(n=10)

UMass topic coherence for each topic, shape (num_topics,).

Intrinsic (no external corpus): for each topic's top-n words, Σ_{i>j} log[(codoc(w_i,w_j)+1)/docfreq(w_j)] over the training corpus. Higher (closer to 0) is more coherent. numpy.mean(...) gives the usual single-number summary.
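
The formula above can be sketched in a few lines of pure Python. This is an illustrative reimplementation on a toy corpus, not topica's code; the function name umass_coherence and the example documents are assumptions, and top_words is taken to be ordered from most to least probable.

```python
import math

def umass_coherence(top_words, documents):
    """Sum over i > j of log[(codoc(w_i, w_j) + 1) / docfreq(w_j)]."""
    doc_sets = [set(doc) for doc in documents]

    def docfreq(word):
        # Number of documents that contain the word at least once.
        return sum(1 for d in doc_sets if word in d)

    def codoc(a, b):
        # Number of documents that contain both words.
        return sum(1 for d in doc_sets if a in d and b in d)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((codoc(w_i, w_j) + 1) / docfreq(w_j))
    return score

# Toy training corpus (hypothetical).
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "bond", "market"],
    ["stock", "market"],
    ["cat", "market"],
]
coherent = umass_coherence(["cat", "dog"], docs)    # words that co-occur
mixed = umass_coherence(["cat", "stock"], docs)     # words that never co-occur
```

The pair that co-occurs scores closer to 0 than the pair that never shares a document, matching the "higher is more coherent" reading.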

diagnostics method descriptor

diagnostics(n=10)

Per-topic diagnostics (MALLET-style), one dict per topic, suitable for pandas.DataFrame(model.diagnostics()).

Keys mirror MALLET's topic diagnostics: topic, tokens (assignments to the topic), coherence (UMass), exclusivity (mean top-word share of φ vs. other topics; higher = more distinctive), effective_words (exp(H(φ_t)), MALLET's eff_num_words; lower = more focused), document_entropy (entropy of the topic's token allocation across documents), uniform_dist (KL of φ_t from uniform) and corpus_dist (KL of φ_t from the corpus word distribution), rank1_docs (documents whose dominant topic is this one), alpha, and top_words.
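
Two of these keys have simple closed forms. The sketch below assumes natural logarithms and reads "KL of φ_t from uniform" as KL(φ_t ‖ uniform); both are assumptions about conventions, and the helper names are illustrative rather than topica functions.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability entries are skipped."""
    return -sum(x * math.log(x) for x in p if x > 0)

def effective_words(p):
    """exp(H(p)): the size of a uniform distribution with the same entropy."""
    return math.exp(entropy(p))

def kl_from_uniform(p):
    """KL(p || uniform) = sum_w p_w * log(p_w * V) for a vocabulary of size V."""
    v = len(p)
    return sum(x * math.log(x * v) for x in p if x > 0)

flat = [0.25, 0.25, 0.25, 0.25]   # topic spread evenly over 4 words
peaked = [1.0, 0.0, 0.0, 0.0]     # topic concentrated on one word

flat_eff, flat_kl = effective_words(flat), kl_from_uniform(flat)
peaked_eff, peaked_kl = effective_words(peaked), kl_from_uniform(peaked)
```

A flat topic has as many effective words as the vocabulary and zero distance from uniform; a peaked topic has one effective word and the maximum distance, log V.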

evaluate method descriptor

evaluate(data, *, num_particles=10, seed=None)

Held-out evaluation via the Wallach et al. (2009) left-to-right estimator (the method MALLET's evaluate-topics uses).

data is a held-out Corpus or list[list[str]]; its tokens are matched to the training vocabulary by string (out-of-vocabulary tokens are dropped). Returns a dict with log_likelihood (total held-out log P(data)), perplexity (exp(-LL / num_tokens), lower is better), num_tokens (scored), and num_oov (dropped). Cost grows with the square of document length, so keep num_particles modest.
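
The relationship between the returned log_likelihood, num_tokens and perplexity can be checked by hand. The helper below is a small illustrative sketch (the name perplexity_from_ll is an assumption, not a topica function), using natural-log likelihoods as the exp(-LL / num_tokens) formula implies.

```python
import math

def perplexity_from_ll(log_likelihood, num_tokens):
    """Perplexity = exp(-LL / num_tokens) for a natural-log likelihood."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(-log_likelihood / num_tokens)

# Sanity check: a model that assigns every token probability 1/8
# has perplexity 8, regardless of how many tokens are scored.
num_tokens = 100
log_likelihood = num_tokens * math.log(1 / 8)
ppl = perplexity_from_ll(log_likelihood, num_tokens)
```

Perplexity can be read as the effective number of equally likely words the model is choosing between at each token.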

fit method descriptor

fit(data, *, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)

Run Gibbs sampling on data, then average num_samples snapshots (taken sample_interval iterations apart) into the final φ/θ estimates.

data may be a Corpus or a list of token lists (each a list of strings). When token lists are passed, an internal corpus is built with no frequency filtering; to filter by frequency, build a Corpus explicitly and pass that instead.

progress, if given, is called as progress(iteration, ll_per_token) every progress_interval iterations during the main loop.
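
Any callable with the progress(iteration, ll_per_token) signature works as a callback. The sketch below is one possible shape: a small recorder class (ProgressLog is a hypothetical name) that stores the reported values. The loop at the end only simulates the calls fit would make, so the example runs without a model.

```python
class ProgressLog:
    """Collects (iteration, ll_per_token) pairs reported during fitting."""

    def __init__(self):
        self.history = []

    def __call__(self, iteration, ll_per_token):
        self.history.append((iteration, ll_per_token))
        print(f"iter {iteration:>5}  LL/token {ll_per_token:.4f}")

log = ProgressLog()

# Stand-in for the calls fit would make. With a real model this would be
# something like: model.fit(documents, progress=log, progress_interval=50)
for iteration, ll in [(50, -9.12), (100, -8.47), (150, -8.31)]:
    log(iteration, ll)
```

Keeping the history on the object makes it easy to plot the log-likelihood trace after fitting.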

load staticmethod

load(path)

Load a model previously written by save.

load_state staticmethod

load_state(path)

Reconstruct a fitted model from a MALLET-format Gibbs state file (the inverse of save_state; MALLET's --input-state). The file may be gzip-compressed or plain text. The vocabulary, documents, per-token topic assignments, and the #alpha/#beta hyperparameters are read back, so the loaded model supports the full read-only surface (topic_word, doc_topic, top_words, …) and transform on new documents, and can re-emit the state with save_state.

log_likelihood method descriptor

log_likelihood()

MALLET-formula model log-likelihood of the final sampler state.

perplexity method descriptor

perplexity(data, *, num_particles=10, seed=None)

Held-out perplexity (lower is better); a convenience wrapper over evaluate. See evaluate for data/num_particles semantics.

save method descriptor

save(path)

Save the fitted model to path (compact binary). Reload with LDA.load.

save_doc_topic method descriptor

save_doc_topic(path)

Write document-topic probabilities to a TSV file (the train CLI format).

save_state method descriptor

save_state(path)

Write the token-level Gibbs state to a gzip-compressed file in MALLET's --output-state format: a header, the #alpha/#beta hyperparameter lines, then one row per token (doc source pos typeindex type topic) giving the final topic assignment of every token in the training corpus. The state file can feed custom visualizations (e.g. pyLDAvis) or corpus metrics.
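
As a rough illustration of consuming such a file, the sketch below parses a gzipped state held in memory and counts tokens per topic. It assumes whitespace-separated fields with no embedded spaces and that every non-data line starts with "#"; the sample content and the read_state helper are made up for the example and are not output from topica or MALLET.

```python
import gzip
import io
from collections import Counter

def read_state(fileobj):
    """Yield (doc, source, pos, typeindex, word, topic) for each token row."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as fh:
        for line in fh:
            # Skip the header and the #alpha / #beta hyperparameter lines.
            if line.startswith("#") or not line.strip():
                continue
            doc, source, pos, typeindex, word, topic = line.split()
            yield int(doc), source, int(pos), int(typeindex), word, int(topic)

# Hypothetical state file content, compressed in memory for the example.
SAMPLE = (
    "#doc source pos typeindex type topic\n"
    "#alpha : 0.5 0.5\n"
    "#beta : 0.01\n"
    "0 NA 0 0 cat 0\n"
    "0 NA 1 1 dog 0\n"
    "1 NA 0 2 stock 1\n"
    "1 NA 1 0 cat 1\n"
)
buffer = io.BytesIO(gzip.compress(SAMPLE.encode("utf-8")))

rows = list(read_state(buffer))
tokens_per_topic = Counter(topic for *_, topic in rows)
```

The same function accepts a path instead of a buffer, since gzip.open takes either.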

save_topic_word method descriptor

save_topic_word(path)

Write topic-word probabilities to a TSV file (the train CLI format).

similar_documents method descriptor

similar_documents(doc, n=10)

The n training documents most similar to document doc (by index), as (doc_name, divergence) pairs sorted by ascending Jensen-Shannon divergence of their document-topic distributions.

top_documents method descriptor

top_documents(topic, n=10)

The n training documents most strongly associated with topic, as (doc_name, weight) pairs sorted by descending θ for that topic.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs.

Returns a list of n-length lists (one per topic), or — when topic is given — just that topic's list.

transform method descriptor

transform(data, *, iterations=100, burn_in=10, num_samples=10, sample_interval=5, seed=None)

Infer document-topic distributions for new, unseen documents under the fitted model (sklearn-style transform). data is a Corpus or list[list[str]]; tokens are matched to the training vocabulary by string (OOV dropped). A document with no in-vocabulary tokens gets the prior θ. Returns an array of shape (num_new_docs, num_topics) whose rows sum to 1.
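
The vocabulary matching step can be pictured as follows. This is an illustrative sketch of matching by string and dropping OOV tokens, not topica's internal code; encode_documents and the toy vocabulary are assumed names and data.

```python
def encode_documents(documents, vocabulary):
    """Map tokens to vocabulary indices, dropping out-of-vocabulary tokens.

    Returns the encoded documents and the number of dropped tokens.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    encoded, num_oov = [], 0
    for doc in documents:
        ids = [index[token] for token in doc if token in index]
        num_oov += len(doc) - len(ids)
        encoded.append(ids)
    return encoded, num_oov

vocabulary = ["cat", "dog", "market", "stock"]   # toy training vocabulary
new_docs = [
    ["cat", "dog", "hamster"],   # "hamster" is OOV
    ["crypto", "nft"],           # nothing survives
]
encoded, num_oov = encode_documents(new_docs, vocabulary)
```

The second document ends up empty after filtering, which is the case where the description above says the prior θ is returned.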

topica.DMR

Dirichlet-Multinomial Regression topic model (Mimno & McCallum, 2008).

Like LDA, but the per-document topic prior is a log-linear function of document features: α_{d,t} = exp(λ_t · x_d). After fitting, the learned weights are available as feature_effects, which show how each covariate shifts each topic's prevalence.

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics).

feature_effects property

feature_effects

Learned feature weights λ, shape (num_topics, num_features) — how each feature (column 0 is the intercept) shifts each topic's log-prior. Positive ⇒ the feature raises that topic's prevalence.
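
To see how these weights turn into a document's prior, the sketch below evaluates the log-linear prior α_{d,t} = exp(λ_t · x_d) by hand, with a 1.0 prepended to the features for the intercept column. The function name dmr_prior and the toy weights are assumptions for illustration.

```python
import math

def dmr_prior(feature_effects, features):
    """Per-topic Dirichlet prior for one document.

    feature_effects: rows are topics, columns are features (column 0 = intercept).
    features: the document's covariates, without the intercept.
    """
    x = [1.0] + list(features)  # prepend the intercept term
    alphas = []
    for weights in feature_effects:
        if len(weights) != len(x):
            raise ValueError("feature count does not match feature_effects")
        alphas.append(math.exp(sum(w * v for w, v in zip(weights, x))))
    return alphas

# Toy weights: 2 topics, intercept plus one covariate (hypothetical values).
lam = [
    [0.0, 1.0],    # topic 0 becomes more prevalent as the covariate grows
    [0.0, -1.0],   # topic 1 becomes less prevalent
]
baseline = dmr_prior(lam, [0.0])
shifted = dmr_prior(lam, [1.0])
```

With the covariate at 0 both topics share the same prior; at 1 the prior for topic 0 is multiplied by e and the prior for topic 1 is divided by e.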

feature_names property

feature_names

Feature names aligned with the columns of feature_effects ("intercept" first).

num_topics property

num_topics

topic_word property

topic_word

Topic-word matrix φ, shape (num_topics, num_words).

vocabulary property

vocabulary

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,).

fit method descriptor

fit(data, features, *, feature_names=None, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)

Fit the model. data is a Corpus or list[list[str]]; features is a (num_docs, F) numpy array or list of float lists (an intercept column is prepended automatically). feature_names (length F) names the columns; an "intercept" name is prepended.

load staticmethod

load(path)

Load a model previously written by save.

save method descriptor

save(path)

Save the fitted model to path (compact binary). Reload with DMR.load.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words per topic as (word, probability) pairs (all topics, or one when topic is given).

transform method descriptor

transform(data, features=None, *, iterations=100, burn_in=10, num_samples=10, sample_interval=5, seed=None)

Infer topic proportions θ for new documents by collapsed Gibbs sampling against the fitted topic-word matrix. data is a Corpus or list[list[str]]; OOV tokens are dropped. features (optional; a (num_docs, F) covariate array with the same columns as in training, without the intercept) sets each document's Dirichlet prior α_{d,t} = exp(λ_t · x_d) from the learned feature_effects λ; if omitted, the intercept-only baseline prior is used. Returns an array of shape (num_docs, num_topics).

topica.LabeledLDA

Supervised topic model (Ramage et al., 2009): each document carries a set of labels, each label is a topic, and a document's tokens are constrained to its labels' topics. The number of topics is the number of distinct labels.

Documents with an empty label set are treated as unconstrained (all topics).

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); for each document only its label topics are non-zero, and rows sum to 1.

labels property

labels

The label name for each topic, in topic (column) order.

num_topics property

num_topics

topic_word property

topic_word

Topic-word matrix φ, shape (num_topics, num_words).

vocabulary property

vocabulary

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,).

fit method descriptor

fit(data, labels, *, label_names=None, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)

Fit the model. data is a Corpus or list[list[str]]; labels is a list (one per document) of label lists. The topic set is the union of all labels (or label_names, which also fixes topic order). An empty label list leaves that document unconstrained.
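
The label-to-topic constraint can be sketched as a per-document list of allowed topic indices. This is an illustration of the rule described above, not topica's implementation; allowed_topics is a hypothetical helper, and sorting the label union when label_names is not given is an assumption about ordering.

```python
def allowed_topics(labels, label_names=None):
    """Return (topic names, allowed topic indices per document)."""
    if label_names is None:
        # Assumption: without label_names, order the union alphabetically.
        label_names = sorted({label for doc in labels for label in doc})
    index = {name: k for k, name in enumerate(label_names)}
    masks = []
    for doc_labels in labels:
        if doc_labels:
            # Tokens may only be assigned to the document's own labels.
            masks.append(sorted(index[label] for label in set(doc_labels)))
        else:
            # Empty label list: the document is unconstrained.
            masks.append(list(range(len(label_names))))
    return list(label_names), masks

doc_labels = [["sports"], ["politics", "economy"], []]
names, masks = allowed_topics(doc_labels)
```

Passing label_names explicitly fixes the topic order, which keeps the columns of doc_topic stable across runs.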

load staticmethod

load(path)

Load a model previously written by save.

save method descriptor

save(path)

Save the fitted model to path. Reload with LabeledLDA.load.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words for one topic (by label name or index) or all topics.

transform method descriptor

transform(data, *, iterations=100, burn_in=10, num_samples=10, sample_interval=5, seed=None)

Infer label (topic) proportions θ for new documents by collapsed Gibbs against the fitted topic-word matrix, treating every label as available (unsupervised inference). data is a Corpus or list[list[str]]; OOV tokens are dropped. Returns (num_docs, num_topics); columns align with labels.

topica.SAGE

Content-covariate topic model (SAGE / the STM content model).

Topics are shared, but each topic's word distribution varies by a document-level group covariate, so you can read how a topic is worded differently across groups. Construct, then fit on documents plus a per-document group label.

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

groups property

groups

Group names, in the index order used by the second axis of topic_word.

num_groups property

num_groups

num_topics property

num_topics

topic_word property

topic_word

Topic-word distributions per group, shape (num_topics, num_groups, num_words).

topic_word_marginal property

topic_word_marginal

Group-averaged topic-word matrix, shape (num_topics, num_words).

vocabulary property

vocabulary

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic (group-averaged), shape (num_topics,).

fit method descriptor

fit(data, groups, *, group_names=None, iterations=1000, num_samples=5, sample_interval=25, progress=None, progress_interval=50)

Fit the model. data is a Corpus or list[list[str]]; groups is a per-document group label (strings or ints), one per document. group_names fixes the group order (defaults to sorted union).

load staticmethod

load(path)

Load a model previously written by save.

save method descriptor

save(path)

Save the fitted model to path. Reload with SAGE.load.

top_words method descriptor

top_words(topic, *, group=None, n=10)

Top n words for a topic. With group (name or index) given, uses that group's word distribution; otherwise the group-averaged distribution.

word_contrast method descriptor

word_contrast(topic, group_a, group_b, n=10)

Words that most distinguish how topic is worded in group_a vs group_b, by log-ratio of the two groups' word probabilities. Returns (word, log_ratio) — positive favours group_a.
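
The log-ratio itself is simple to compute from two groups' word distributions for one topic. The sketch below is illustrative only: contrast_words is a hypothetical helper, the probabilities are made up, and ranking by absolute log-ratio is an assumption about how "most distinguish" is ordered. It also assumes strictly positive (smoothed) probabilities.

```python
import math

def contrast_words(phi_a, phi_b, vocabulary, n=10):
    """Rank words by |log(p_a / p_b)|; positive values favour group a."""
    ratios = [
        (word, math.log(pa / pb))
        for word, pa, pb in zip(vocabulary, phi_a, phi_b)
    ]
    ratios.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return ratios[:n]

vocabulary = ["tax", "cut", "relief", "burden"]
# One topic's word distribution in two groups (hypothetical values).
phi_group_a = [0.40, 0.30, 0.20, 0.10]
phi_group_b = [0.40, 0.25, 0.05, 0.30]

top = contrast_words(phi_group_a, phi_group_b, vocabulary, n=2)
```

Here "relief" comes out with a positive log-ratio (favoured by group a) and "burden" with a negative one (favoured by group b), while "tax" has the same probability in both groups and does not rank.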

topica.CTM

Correlated Topic Model (Blei & Lafferty; the STM core). Topics are drawn from a logistic-normal prior with a full covariance, so they can correlate — unlike LDA's Dirichlet. Fit by variational EM (STM's Laplace E-step).

This is the engine STM builds on; prevalence/content covariates layer on top.

bound property

bound

Final variational bound (approximate ELBO) at convergence — the quantity R stm reports as convergence$bound.

bound_history property

bound_history

The variational bound after each EM iteration (the convergence trajectory). Its length is the number of iterations actually run.

converged property

converged

True if EM stopped on the em_tol criterion; False if it hit the em_iters cap first (the fit may not have converged).

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

eta_cov property

eta_cov

Per-document variational posterior covariances ν of η, shape (num_docs, num_topics-1, num_topics-1).

eta_mean property

eta_mean

Per-document variational posterior means λ of the logistic-normal η, shape (num_docs, num_topics-1). Pairs with eta_cov to sample θ draws (method-of-composition uncertainty).

num_topics property

num_topics

topic_correlation property

topic_correlation

Topic-correlation matrix from the logistic-normal Σ, shape (num_topics, num_topics). Off-diagonal entries are the estimated correlations between topics, which LDA's Dirichlet prior cannot represent.

topic_word property

topic_word

Topic-word matrix β, shape (num_topics, num_words).

vocabulary property

vocabulary

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,).

fit method descriptor

fit(data, *, em_iters=500, em_tol=1e-05)

Fit by variational EM. data is a Corpus or list[list[str]]. EM runs until the relative change in the variational bound drops below em_tol (R stm's emtol) or em_iters iterations are reached, whichever comes first. Pass em_tol=0 to always run em_iters steps. Check converged and bound afterward.
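
The stopping rule can be sketched as a loop around a generic EM step. This is a schematic illustration rather than topica's code: em_loop and make_step are hypothetical names, the step function fakes a bound that rises toward -1000, and the relative change is assumed to be |new - old| / |old|.

```python
def em_loop(step, em_iters=500, em_tol=1e-5):
    """Run step() until the bound's relative change falls below em_tol.

    Returns (bound_history, converged).
    """
    history = []
    for _ in range(em_iters):
        history.append(step())
        if len(history) >= 2 and history[-2] != 0:
            relative_change = abs(history[-1] - history[-2]) / abs(history[-2])
            if relative_change < em_tol:
                return history, True
    # The iteration cap was reached before the tolerance was met.
    return history, False

def make_step():
    """Fake EM step: the bound increases toward -1000, halving the gap each time."""
    t = 0
    def step():
        nonlocal t
        bound = -1000.0 * (1.0 + 0.5 ** t)
        t += 1
        return bound
    return step

history, converged = em_loop(make_step())
fixed_history, fixed_converged = em_loop(make_step(), em_iters=25, em_tol=0)
```

With em_tol=0 the strict comparison can never succeed, so the loop always runs the full em_iters steps and reports converged as False, mirroring the behaviour described above.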

load staticmethod

load(path)

Load a model previously written by save.

save method descriptor

save(path)

Save the fitted model to path. Reload with CTM.load.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform method descriptor

transform(data)

Infer topic proportions θ for new documents by the variational E-step against the fitted globals (β, logistic-normal prior μ, Σ). data is a Corpus or list[list[str]]; tokens outside the training vocabulary are dropped. Returns a (num_docs, num_topics) array.

topica.STM

Structural Topic Model (Roberts, Stewart & Tingley). The correlated-topic core (CTM) with prevalence covariates: a document's prior topic mean is a regression on its covariates, μ_d = X_d γ, so covariates shift which topics a document discusses. After fitting, prevalence_effects holds the learned γ; pair it with topica.stm.estimate_effect for inference.

bound property

bound

Final variational bound (approximate ELBO) at convergence — the quantity R stm reports as convergence$bound.

bound_history property

bound_history

The variational bound after each EM iteration (the convergence trajectory). Its length is the number of iterations actually run.

converged property

converged

True if EM stopped on the em_tol criterion; False if it hit the em_iters cap first (the fit may not have converged).

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

eta_cov property

eta_cov

Per-document variational posterior covariances ν of η, shape (num_docs, num_topics-1, num_topics-1).

eta_mean property

eta_mean

Per-document variational posterior means λ of η, shape (num_docs, num_topics-1). With eta_cov this is the logistic-normal posterior used to draw θ samples for method-of-composition uncertainty in estimate_effect.

feature_names property

feature_names

Covariate names aligned with the rows of prevalence_effects ("intercept" first).

groups property

groups

Content-covariate group names (axis-1 order of topic_word_by_group).

num_topics property

num_topics

prevalence_effects property

prevalence_effects

Prevalence coefficients γ, shape (num_features, num_topics-1) — how each covariate (row 0 is the intercept) shifts each topic's log-prior. The last topic is the softmax reference. For inference, prefer topica.stm.estimate_effect(model.doc_topic, X).
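
To make the shapes concrete, the sketch below computes a document's prior mean μ_d = X_d γ from a toy γ and maps it to topic proportions by a softmax with the last topic as reference. The helper names and values are illustrative assumptions, and fixing the reference logit at 0 is the usual logistic-normal convention rather than something stated here. The result is the θ implied by η = μ_d, not a posterior estimate.

```python
import math

def prior_mean(gamma, covariates):
    """mu_d = [1, x_d] @ gamma, with gamma of shape (num_features, num_topics - 1)."""
    x = [1.0] + list(covariates)  # row 0 of gamma is the intercept
    num_cols = len(gamma[0])
    return [sum(x[f] * gamma[f][k] for f in range(len(x))) for k in range(num_cols)]

def eta_to_theta(eta):
    """Softmax over eta with a reference logit of 0 appended for the last topic."""
    logits = list(eta) + [0.0]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy gamma: intercept plus one covariate, three topics (hypothetical values).
gamma = [
    [0.0, 0.0],    # intercept row
    [1.0, -1.0],   # covariate raises topic 0 and lowers topic 1
]
theta_baseline = eta_to_theta(prior_mean(gamma, [0.0]))
theta_shifted = eta_to_theta(prior_mean(gamma, [1.0]))
```

At the baseline all three topics are equally likely; setting the covariate to 1 moves mass toward topic 0 and away from topic 1, with the reference topic in between.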

topic_correlation property

topic_correlation

Topic-correlation matrix, shape (num_topics, num_topics).

topic_word property

topic_word

Topic-word matrix β, shape (num_topics, num_words).

topic_word_by_group property

topic_word_by_group

Per-group topic-word distributions, shape (num_topics, num_groups, num_words) — only available when fit with content covariates.

vocabulary property

vocabulary

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,).

fit method descriptor

fit(data, prevalence=None, *, prevalence_names=None, content=None, content_names=None, em_iters=500, em_tol=1e-05)

Fit the model. data is a Corpus or list[list[str]]. prevalence (optional, a (num_docs, F) covariate array) makes topic prevalence depend on covariates (μ_d = X_d γ); an intercept is prepended. content (optional, one group label per document) makes the topic-word distributions vary by group (the SAGE content model). At least one of prevalence/content should be given (otherwise use CTM).

EM runs until the relative change in the variational bound drops below em_tol (R stm's emtol) or em_iters iterations are reached, whichever comes first; this matches stm's convergence behavior rather than a fixed iteration count. Pass em_tol=0 to always run em_iters steps. Inspect converged and bound after fitting.

load staticmethod

load(path)

Load a model previously written by save.

save method descriptor

save(path)

Save the fitted model to path. Reload with STM.load.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform method descriptor

transform(data)

Infer topic proportions θ for new documents by the variational E-step against the fitted globals (β and the logistic-normal prior). data is a Corpus or list[list[str]]; out-of-vocabulary tokens are dropped. Returns a (num_docs, num_topics) array.

Note: the prior mean used is the covariate-free baseline μ learned at fit time (prevalence covariates for held-out docs are not applied here), and for a content model the marginal topic-word β is used. This is the same held-out inference stm's fitNewDocuments performs when no new covariate design is supplied.

word_contrast method descriptor

word_contrast(topic, group_a, group_b, n=10)

Words that most distinguish how topic is worded in group_a vs group_b (log word-probability ratio; positive favours group_a). Requires content covariates.

topica.HDP

Hierarchical Dirichlet Process topic model (Teh, Jordan, Beal & Blei 2006): LDA that infers the number of topics rather than fixing it. Fit by the direct-assignment Gibbs sampler (the Chinese Restaurant Franchise). The two concentration parameters alpha (document level) and gamma (corpus level) govern how readily new topics appear; by default both are resampled from the data (a faithful port of blei-lab/hdp), so you typically don't tune them.

alpha property

alpha

The fitted document-level concentration α0 (resampled if enabled).

concentration_history property

concentration_history

The learned-concentration trace: (iteration, alpha, gamma) triples sampled during fit (only informative when resample_conc=True). Empty if tracing was disabled.

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

gamma property

gamma

The fitted corpus-level concentration γ (resampled if enabled).

log_likelihood_history property

log_likelihood_history

The convergence trace: (iteration, per-token log-likelihood) pairs sampled during fit. Empty if tracing was disabled.

num_topics property

num_topics

The inferred number of topics K.

topic_count_history property

topic_count_history

The topic-discovery trajectory: (iteration, num_topics) pairs sampled during fit. Watching K stabilize is the nonparametric model's headline convergence check (it grows and shrinks before settling). Sampled every report_interval sweeps (auto ≈ 50 points); empty if disabled.

topic_word property

topic_word

Topic-word matrix β, shape (num_topics, num_words).

vocabulary property

vocabulary

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,).

fit method descriptor

fit(data, *, iters=150, report_interval=0)

Fit by Gibbs sampling for iters sweeps. data is a Corpus or list[list[str]]. The inferred topic count is available as num_topics.

load staticmethod

load(path)

Load a model previously written by save.

save method descriptor

save(path)

Save the fitted model to path. Reload with HDP.load.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform method descriptor

transform(data, *, iterations=100, burn_in=10, num_samples=10, sample_interval=5, seed=None)

Infer topic proportions θ for new documents over the discovered topics, by collapsed Gibbs against the fixed topic-word matrix. data is a :class:Corpus or list[list[str]]; OOV tokens are dropped. The document-level prior is symmetric with total mass equal to the learned concentration α. Returns a (num_docs, num_topics) array.

topica.DTM

Dynamic Topic Model (Blei & Lafferty 2006): topics whose word distributions evolve across time slices. Each topic-word chain follows a Gaussian state-space model; inference is variational with Kalman smoothing, a faithful port of Blei's C dtm / gensim's LdaSeqModel. After fitting, query a topic's word distribution at any slice with topic_word(time) and trace a word's trajectory with word_evolution(topic, word).

bound property

bound

The final variational bound (ELBO) reached during fitting.

num_times property

num_times

The number of time slices (available after fit).

num_topics property

num_topics

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, times, *, em_iters=20)

Fit by variational EM. data is a :class:Corpus or list[list[str]]; times gives each document's integer time-slice index (0-based, contiguous). The number of slices is inferred as max(times) + 1.
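Because times must be 0-based and contiguous, raw labels such as years usually need remapping first. The sketch below shows one way to do that with plain Python; to_time_slices is a hypothetical helper, not a topica function.

```python
def to_time_slices(labels):
    """Map sortable labels (e.g. years) to 0-based contiguous slice indices.

    Returns the per-document indices and the sorted distinct labels, so that
    slice i corresponds to ordered_labels[i].
    """
    ordered_labels = sorted(set(labels))
    index = {label: i for i, label in enumerate(ordered_labels)}
    return [index[label] for label in labels], ordered_labels


years = [2019, 2021, 2019, 2024, 2021]   # one entry per document
times, slice_labels = to_time_slices(years)
num_times = max(times) + 1               # same rule the fit description states
```

Gaps in the raw labels (no documents from 2020, 2022, or 2023 here) disappear in the remapped indices, so keep slice_labels around to interpret each slice.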

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with DTM.load.

top_words method descriptor

top_words(topic, time, n=10)

Top n words for a topic at one time slice as (word, probability).

topic_word method descriptor

topic_word(time)

Topic-word matrix at time slice time, shape (num_topics, num_words); rows sum to 1.

word_drift method descriptor

word_drift(topic, *, n=10, from_time=0, to_time=None)

Which words inside topic drift most between two time slices.

For each word, computes the change in its probability within the topic from from_time to to_time (defaults: the first and last slices). Returns a dict with two keys, "rising" and "falling", each a list of (word, delta) pairs; "rising" is sorted by largest gain first and "falling" by largest drop first. This shows which words drive a topic's evolution, not merely that the topic changes.
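The computation described above can be sketched on two word-probability dicts for the same topic. This is an illustration of the idea only; drift_between is a hypothetical helper, and details such as how ties or unchanged words are handled inside topica are assumptions.

```python
def drift_between(before, after, n=10):
    """Rank words by probability change between two slices of one topic.

    `before` and `after` map word -> probability within the topic.
    """
    words = set(before) | set(after)
    delta = {w: after.get(w, 0.0) - before.get(w, 0.0) for w in words}
    # Largest gain first.
    rising = sorted(((w, d) for w, d in delta.items() if d > 0),
                    key=lambda pair: -pair[1])[:n]
    # Largest drop (most negative delta) first.
    falling = sorted(((w, d) for w, d in delta.items() if d < 0),
                     key=lambda pair: pair[1])[:n]
    return {"rising": rising, "falling": falling}


first_slice = {"telegraph": 0.30, "radio": 0.50, "internet": 0.20}
last_slice = {"telegraph": 0.05, "radio": 0.35, "internet": 0.60}
drift = drift_between(first_slice, last_slice, n=2)
```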

word_evolution method descriptor

word_evolution(topic, word)

Trajectory of a word's probability in a topic across slices, shape (num_times,). word is a vocabulary string or its integer id.

topica.SupervisedLDA

Supervised LDA (Blei & McAuliffe 2007): LDA in which each document carries a real-valued response y_d ~ N(ηᵀ z̄_d, σ²) regressed on its topic usage. Fitting is supervised by the response, so topics are shaped to be predictive and the coefficients η report how each topic moves y. Fit by variational EM; predict returns ŷ for new documents.

coefficients property

coefficients

Regression coefficients η, shape (num_topics,) — how each topic moves the response (in the response's units, per unit of topic frequency).

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

num_topics property

num_topics

sigma2 property

sigma2

The fitted response variance σ².

topic_word property

topic_word

Topic-word matrix φ, shape (num_topics, num_words).

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

coherence method descriptor

coherence(n=10)

UMass topic coherence per topic, shape (num_topics,).

fit method descriptor

fit(data, y, *, em_iters=25, var_iters=15)

Fit by variational EM. data is a :class:Corpus or list[list[str]]; y is the per-document real-valued response (length = number of docs).

load staticmethod

load(path)

Load a model previously written by :meth:save.

predict method descriptor

predict(data, *, var_iters=20)

Predict the response ŷ for new documents (list[list[str]] or a :class:Corpus). Out-of-vocabulary words are ignored. Returns a 1-D array of length = number of documents.

save method descriptor

save(path)

Save the fitted model to path. Reload with SupervisedLDA.load.

top_words method descriptor

top_words(n=10, *, topic=None)

Top n words per topic (or one topic) as (word, probability) pairs.

transform method descriptor

transform(data, *, iterations=100, burn_in=10, num_samples=10, sample_interval=5, seed=None)

Infer topic proportions θ for new documents by collapsed Gibbs against the fitted topic-word matrix. The response is not used; this is the unsupervised E-step. data is a :class:Corpus or list[list[str]]; OOV tokens are dropped. Returns (num_docs, num_topics). To predict the response for new documents, call :meth:predict, or take transform(data) @ coefficients (the η vector).
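The matrix product described above reduces to one dot product per document: each row of θ against the coefficient vector η. A minimal sketch with plain lists follows; the numbers are made up and predict_response is a hypothetical helper, not the library's predict.

```python
def predict_response(theta, eta):
    """Return y_hat[d] = sum_k theta[d][k] * eta[k] for every document d."""
    return [sum(t * e for t, e in zip(row, eta)) for row in theta]


theta = [
    [0.7, 0.2, 0.1],   # document 0 is mostly topic 0
    [0.1, 0.1, 0.8],   # document 1 is mostly topic 2
]
eta = [2.0, 0.0, -1.0]  # topic 0 raises the response, topic 2 lowers it

y_hat = predict_response(theta, eta)
```

Because the rows of θ sum to 1, each prediction is a weighted average of the coefficients.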

topica.PT

Pseudo-document Topic Model (Zuo et al. 2016) for short texts. Documents are aggregated into num_pseudo pseudo-documents that carry the topic distributions, so the topic structure is estimated from richer aggregated statistics than individual short documents would provide. Collapsed Gibbs.

doc_names property

doc_names

doc_topic property

doc_topic

num_topics property

num_topics

topic_word property

topic_word

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, *, iters=1000)

Fit by collapsed Gibbs sampling for iters sweeps.

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with PT.load.

topica.GSDMM

GSDMM — the "Movie Group Process" (Yin & Wang 2014). A mixture model for short texts (tweets, survey answers, headlines) where each document belongs to exactly one topic, not a mixture. You set an upper bound K on the number of clusters; empty clusters die out during sampling, so the effective num_topics is inferred from the data (≤ K). Handles the sparsity of short documents far better than LDA.

cluster_count_history property

cluster_count_history

The cluster-discovery trajectory: (iteration, num_clusters) pairs over the fit. The Movie Group Process starts from the K clusters you set as the upper bound and empties most of them; watching the count collapse to a stable value is its headline convergence check. Sampled every report_interval sweeps (auto ≈ 50 points); empty if disabled.

doc_cluster property

doc_cluster

Hard cluster assignment of each document, shape (num_docs,); values in 0..num_topics-1. GSDMM gives each document a single cluster.

doc_names property

doc_names

doc_topic property

doc_topic

Document-topic matrix θ, shape (num_docs, num_topics); rows sum to 1.

log_likelihood_history property

log_likelihood_history

The convergence trace: (iteration, per-token log-likelihood) pairs (each document scored under its assigned cluster). Empty if disabled.

num_topics property

num_topics

The number of non-empty clusters discovered (≤ the K you set).

topic_word property

topic_word

Topic-word matrix φ, shape (num_topics, num_words) (used clusters only).

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, *, iters=30, report_interval=0)

Fit by the Movie Group Process (collapsed Gibbs) for iters sweeps. report_interval controls the cluster-discovery trace (cluster_count_history / log_likelihood_history): 0 = auto (~50 points), a positive value records every that-many sweeps.

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with GSDMM.load.

topica.SeededLDA

Seeded LDA (guided topic modeling): you supply a few seed words per topic and the model is steered so those topics form around them, while the rest of each topic's vocabulary (and any residual unseeded topics) is still learned. Useful when theory tells you which themes to expect (Jagarlamudi et al. 2012; the seeding follows koheiw/seededlda — seed words get a weight × 100 prior pseudocount in their topic).

doc_names property

doc_names

doc_topic property

doc_topic

num_topics property

num_topics

topic_names property

topic_names

The topic labels: the seed names you gave, then residual_1 … for any unseeded topics.

topic_word property

topic_word

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, *, iters=2000, doc_topic_prior=None)

Fit by collapsed Gibbs for iters sweeps. Seeded topics come first (in the order given), then the residual topics.

doc_topic_prior (optional, (num_docs, num_topics)) supplies a per-document asymmetric Dirichlet prior α_{d,k} that replaces the symmetric alpha, biasing each document's topic mixture toward chosen topics (e.g. from a document embedding). It is a prior, so the sampler can still move a document away from it.
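As an illustration of the expected shape, the sketch below builds a (num_docs, num_topics) matrix of positive prior weights from per-document topic hints. The helper name, the base and boost values, and the idea of encoding hints as topic indices are all assumptions made for this example; only the shape and the role of the matrix come from the description above.

```python
def build_doc_topic_prior(hints, num_topics, base=0.1, boost=1.0):
    """Return one row of Dirichlet weights per document.

    `hints[d]` lists the topic indices document d is believed to favour.
    Every entry starts at `base`; hinted topics get an extra `boost`.
    """
    prior = []
    for hinted_topics in hints:
        row = [base] * num_topics
        for k in hinted_topics:
            row[k] += boost
        prior.append(row)
    return prior


# Three documents, three topics: doc 0 leans to topic 0, doc 1 has no hint,
# doc 2 leans to topics 1 and 2.
prior = build_doc_topic_prior([[0], [], [1, 2]], num_topics=3)
```

Every entry stays strictly positive, so no topic is ruled out for any document.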

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with SeededLDA.load.

topica.KeyATM

Keyword-Assisted Topic Model (keyATM Base). Like LDA, but some topics carry a researcher-supplied keyword list; a token in a keyword topic comes either from a distribution over only that topic's keywords or from the topic's full distribution. This anchors keyword topics to their keywords while still learning the rest of the vocabulary. Faithful to keyATM/keyATM.

alpha_history property

alpha_history

Trace of the estimated document-topic prior α as (iteration, alpha) pairs, where alpha is the length-K asymmetric prior at that sweep — keyATM's plot_alpha / values_iter$alpha_iter. Base model only; empty for the covariate model (which traces λ) and dynamic model.

doc_names property

doc_names

doc_topic property

doc_topic

feature_effects property

feature_effects

Covariate model: learned DMR coefficients λ, shape (num_topics, F+1); column 0 is the intercept. Raises if the model was fit without covariates.

feature_names property

feature_names

Covariate model: names aligned with feature_effects columns ("intercept" first). Empty for the base model.

keyword_rate property

keyword_rate

Per-topic keyword switch rate π_k (the share of a keyword topic's mass drawn from its keyword distribution); 0 for regular topics.

log_likelihood_history property

log_likelihood_history

Convergence trace as a list of (iteration, log_likelihood, perplexity) triples — the three columns of keyATM's model_fit (plot_modelfit). log_likelihood is the collapsed marginal log-likelihood and perplexity is exp(-log_likelihood / total_weighted_tokens), both on R keyATM's scale. Sampled every report_interval sweeps during :meth:fit (auto ≈ 50 points). Empty if tracing was disabled.
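The perplexity formula quoted above is easy to reproduce from the other two quantities. The sketch below applies it to made-up numbers; the function name is illustrative.

```python
import math


def perplexity(log_likelihood, total_weighted_tokens):
    """exp(-log_likelihood / total_weighted_tokens), as stated in the text."""
    return math.exp(-log_likelihood / total_weighted_tokens)


# Made-up values: a better (less negative) log-likelihood on the same
# number of tokens gives a lower perplexity.
early = perplexity(-7000.0, 1000.0)
late = perplexity(-6000.0, 1000.0)
```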

num_topics property

num_topics

pi_history property

pi_history

Trace of the per-topic keyword switch rate π as (iteration, pi) pairs (pi length K, 0 for regular topics) — keyATM's plot_pi / values_iter$pi_iter. Empty for a keyword-free model.

time_labels property

time_labels

Dynamic model: the distinct, sorted timestamp labels, one per time segment (length T). Empty for non-dynamic models.

time_prevalence property

time_prevalence

Dynamic model: smoothed topic prevalence per time segment, shape (T, num_topics), rows sum to 1, aligned with time_labels. Raises if the model was fit without timestamps.

time_state property

time_state

Dynamic model: the latent HMM state (regime) of each time segment, length T, aligned with time_labels. Empty for non-dynamic models.

topic_names property

topic_names

topic_word property

topic_word

transition_matrix property

transition_matrix

Dynamic model: the left-to-right state transition matrix, shape (num_states, num_states). Raises if fit without timestamps.

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, *, iters=1500, covariates=None, feature_names=None, timestamps=None, num_states=5, weights='information-theory', num_threads=1, optimize_interval=50, burn_in=200, prior_variance=1.0, lbfgs_iters=20, report_interval=0, prior_offset=None)

Fit by collapsed Gibbs for iters sweeps. Keyword topics come first (in the order given), then any regular topics.

Pass covariates (a (num_docs, F) array or list of float lists) for the covariate keyATM: the document-topic prior becomes a Dirichlet-multinomial regression, α_{d,k} = exp(x_d · λ_k) (an intercept is prepended). feature_names (length F) labels the columns; the learned λ is exposed as feature_effects. With no covariates, this is the base keyATM, whose estimated α is traced in alpha_history.
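To make the regression prior concrete, the sketch below evaluates α_{d,k} = exp(x_d · λ_k) for one document, with the intercept prepended as column 0 to match the layout described for feature_effects. The coefficient values are made up and dmr_alpha is a hypothetical helper.

```python
import math


def dmr_alpha(x, lam):
    """Per-topic prior weights for one document.

    `x` holds the document's F covariates; `lam` has one row per topic with
    F + 1 coefficients, column 0 being the intercept.
    """
    x_aug = [1.0] + list(x)  # prepend the intercept term
    return [math.exp(sum(xi * li for xi, li in zip(x_aug, row))) for row in lam]


# Two topics, one covariate. Topic 0 responds to the covariate; topic 1
# only has a (negative) intercept.
lam = [
    [0.0, 1.0],
    [-1.0, 0.0],
]
alpha_low = dmr_alpha([0.0], lam)
alpha_high = dmr_alpha([2.0], lam)
```

Raising the covariate increases the prior weight of topic 0 and leaves topic 1 unchanged, which is how a covariate shifts a document's expected topic mix.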

Pass timestamps (one value per document) for the dynamic keyATM: a Chib (1998) change-point HMM lets topic prevalence shift over time across num_states latent regimes. Documents are sorted by timestamp internally; the smoothed prevalence path is exposed as time_prevalence (aligned with time_labels) and the per-segment regime as time_state. timestamps and covariates are mutually exclusive.

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with KeyATM.load.

weighted_lda staticmethod

weighted_lda(num_topics, *, alpha=0.1, beta=0.01, seed=42)

Weighted LDA — keyATM's weightedLDA: a keyword-free model with no keyword topics, so it is plain LDA fit with keyATM's token weighting and estimated asymmetric α (collapsed Gibbs). Use it as the unsupervised baseline next to a keyword-assisted :class:KeyATM. fit it the same way (the weights argument controls the token weighting); the keyword-specific outputs (keyword_rate, pi_history) are empty.

topica.PA

Pachinko Allocation Model (Li & McCallum 2006): a DAG of num_super super-topics over num_sub shared sub-topics over words, capturing topic correlations; super_sub reports which sub-topics each super-topic groups together. Collapsed Gibbs over (super, sub) pairs.

doc_names property

doc_names

doc_topic property

doc_topic

Document × sub-topic proportions, shape (num_docs, num_sub).

num_sub property

num_sub

num_super property

num_super

num_topics property

num_topics

Alias for num_sub (the word-level topics).

super_sub property

super_sub

Super-topic → sub-topic association, shape (num_super, num_sub); row s shows which sub-topics super-topic s groups together (the correlations).
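A matrix of this shape can be summarised by listing the strongest sub-topics in each row. The sketch below does that on a made-up 2 × 4 matrix; top_subtopics is a hypothetical helper, not a topica method.

```python
def top_subtopics(super_sub, n=2):
    """For each super-topic row, return the indices of its n largest entries."""
    return [
        sorted(range(len(row)), key=lambda j: -row[j])[:n]
        for row in super_sub
    ]


# Made-up association weights: 2 super-topics over 4 sub-topics.
super_sub = [
    [0.50, 0.40, 0.05, 0.05],
    [0.10, 0.05, 0.25, 0.60],
]
groups = top_subtopics(super_sub, n=2)
```

Sub-topics that appear together in a row's top list are the ones that super-topic tends to use together.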

topic_word property

topic_word

Sub-topic word distributions, shape (num_sub, num_words).

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, *, iters=1000)

Fit by collapsed Gibbs sampling for iters sweeps.

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with PA.load.

topica.HLDA

Hierarchical LDA (Blei, Griffiths & Jordan): topics organized in a tree of fixed depth, inferred by the nested Chinese Restaurant Process. The root is the shared (general) topic; deeper nodes are progressively more specific. Each document follows a root-to-leaf path. Inspect the tree with topic_word/node_levels/node_parents/doc_paths.

doc_paths property

doc_paths

Each document's root-to-leaf path (a list of node ids), length num_docs.

leaves property

leaves

The leaf node ids (nodes that are no node's parent).

node_levels property

node_levels

The tree level (0 = root) of each node, length num_nodes.

node_parents property

node_parents

The parent node id of each node (-1 for the root), length num_nodes.
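The parent array alone is enough to rebuild the tree. The sketch below derives children, leaves, levels, and a root-to-node path from a made-up node_parents list, using the conventions stated above (-1 marks the root, level 0 is the root); the helper names are illustrative.

```python
def build_children(node_parents):
    """Return {node: [child, ...]} from a parent-id array (-1 marks the root)."""
    children = {node: [] for node in range(len(node_parents))}
    for node, parent in enumerate(node_parents):
        if parent >= 0:
            children[parent].append(node)
    return children


def path_from_root(node, node_parents):
    """Return the list of node ids from the root down to `node`."""
    path = [node]
    while node_parents[node] != -1:
        node = node_parents[node]
        path.append(node)
    return path[::-1]


# Made-up tree of depth 3: root 0, two children, three leaves.
parents = [-1, 0, 0, 1, 1, 2]
children = build_children(parents)
leaves = [node for node, kids in children.items() if not kids]
levels = [len(path_from_root(node, parents)) - 1 for node in range(len(parents))]
```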

num_nodes property

num_nodes

The number of topic nodes in the inferred tree.

topic_word property

topic_word

Per-node word distributions, shape (num_nodes, num_words).

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

fit method descriptor

fit(data, *, iters=500)

Fit by nested-CRP collapsed Gibbs sampling for iters sweeps.

load staticmethod

load(path)

Load a model previously written by :meth:save.

save method descriptor

save(path)

Save the fitted model to path. Reload with HLDA.load.

top_words method descriptor

top_words(node, n=10)

Top n words for one topic node as (word, probability) pairs.

topica.Corpus

A preprocessed, integer-encoded document collection.

Build one from already-tokenised documents with :meth:Corpus.from_documents, from a raw text file with :meth:Corpus.from_text_file, or load a binary corpus written by the preprocess CLI with :meth:Corpus.load.

doc_labels property

doc_labels

doc_names property

doc_names

kept_indices property

kept_indices

Original document indices that survived pruning, parallel to the rows of this corpus. Use it to realign an external covariate array or DataFrame to the documents the corpus actually kept: X = X[corpus.kept_indices].
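With plain Python lists, the same realignment is an index lookup per surviving document. The values below are made up, and the list stands in for whatever external array holds one row per original document.

```python
def realign(rows, kept_indices):
    """Keep only the rows whose original index survived pruning, in corpus order."""
    return [rows[i] for i in kept_indices]


covariates = [[0.1], [0.2], [0.3], [0.4]]  # one row per original document
kept = [0, 2, 3]                           # document 1 was dropped (made-up)
aligned = realign(covariates, kept)
```

After this step the covariate rows line up one-to-one with the corpus rows.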

metadata property

metadata

Optional per-document metadata, already aligned to the surviving rows (set by :func:topica.from_dataframe, or assign your own). None if unset.

num_docs property

num_docs

num_words property

num_words

total_tokens property

total_tokens

vocabulary property

vocabulary

__new__ builtin

__new__(*args, **kwargs)

Create and return a new object. See help(type) for accurate signature.

__repr__ method descriptor

__repr__()

Return repr(self).

from_documents staticmethod

from_documents(documents, *, doc_names=None, doc_labels=None, stopwords=None, min_doc_freq=1, max_doc_fraction=1.0, min_cf=0, rm_top=0)

Build a corpus from pre-tokenised documents.

documents is a sequence of token lists. Optional doc_names / doc_labels (each the same length as documents) attach an id and a label to every document. stopwords are dropped. Vocabulary is pruned by min_doc_freq (minimum document frequency) and max_doc_fraction (maximum fraction of documents), by min_cf (minimum collection/total frequency), and by rm_top (drop the N most frequent words) — matching tomotopy's min_df / min_cf / rm_top.
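The pruning rules above can be sketched as a small filter over document and collection frequencies. The order in which the filters are applied, the tie-breaking for rm_top, and the exact boundary handling are assumptions of this sketch and may differ from what from_documents does internally.

```python
from collections import Counter


def prune_vocabulary(documents, *, stopwords=(), min_doc_freq=1,
                     max_doc_fraction=1.0, min_cf=0, rm_top=0):
    """Return the sorted list of words that survive all pruning rules."""
    stop = set(stopwords)
    # Collection frequency: total occurrences of each word.
    cf = Counter(w for doc in documents for w in doc if w not in stop)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in documents for w in set(doc) if w not in stop)
    max_df = max_doc_fraction * len(documents)

    keep = {w for w in cf
            if df[w] >= min_doc_freq and df[w] <= max_df and cf[w] >= min_cf}
    # Drop the rm_top most frequent survivors (ties broken alphabetically here).
    ranked = sorted(keep, key=lambda w: (-cf[w], w))
    keep -= set(ranked[:rm_top])
    return sorted(keep)


docs = [["the", "cat", "sat"], ["the", "dog", "sat"],
        ["the", "cat", "ran"], ["the", "bird"]]
vocab = prune_vocabulary(docs, min_doc_freq=2, max_doc_fraction=0.75)
vocab_rm = prune_vocabulary(docs, min_doc_freq=2, max_doc_fraction=0.75, rm_top=1)
```

Here "the" is dropped for appearing in every document, and the singletons are dropped by min_doc_freq.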

from_text_file staticmethod

from_text_file(path, *, format='plain', id_field=False, id_column=0, label_column=1, text_column=2, token_regex=None, stopwords=None, min_doc_freq=1, max_doc_fraction=1.0)

Load and tokenise a raw text file (MALLET-style), matching the preprocess CLI.

format is "plain" (one document per line) or "tsv". In plain mode, id_field=True treats the first whitespace token as the doc id. In tsv mode, id_column/label_column/text_column select columns (label_column=None disables labels).
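The two layouts can be pictured with a small line parser. This sketch only splits a line into id, label, and raw text following the rules stated above; tokenisation, stopword removal, and any edge-case handling in the real loader are not modelled, and parse_line is a hypothetical helper.

```python
def parse_line(line, *, format="plain", id_field=False,
               id_column=0, label_column=1, text_column=2):
    """Split one raw line into (doc_id, label, text)."""
    line = line.rstrip("\n")
    if format == "tsv":
        cols = line.split("\t")
        label = cols[label_column] if label_column is not None else None
        return cols[id_column], label, cols[text_column]
    if id_field:
        # Plain mode with ids: the first whitespace-delimited token is the id.
        parts = line.split(None, 1)
        doc_id = parts[0] if parts else None
        text = parts[1] if len(parts) > 1 else ""
        return doc_id, None, text
    # Plain mode without ids: the whole line is the document text.
    return None, None, line


tsv_row = parse_line("d1\tsports\tthe match ended", format="tsv")
plain_with_id = parse_line("d7 hello world", id_field=True)
plain_only = parse_line("hello world")
```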

load staticmethod

load(path)

Load a binary corpus file written by the preprocess CLI or :meth:save.

save method descriptor

save(path)

Write this corpus to a binary file (the preprocess format), so it can be reused by the CLI tools or reloaded with :meth:load.