Citing¶
Citing topica¶
If topica contributes to published work, please cite it:
@software{caren_topica,
author = {Caren, Neal},
title = {topica: fast, all-purpose topic modeling for Python},
year = {2026},
url = {https://github.com/nealcaren/topica}
}
topica reimplements published methods and is validated against their reference
implementations. It does not replace the original work: please also cite the
model you use, and, where you rely on a feature ported from a specific package
(for example estimateEffect from stm), that package too.
Cite the model you used¶
| Model | Primary citation |
|---|---|
LDA |
Blei, Ng & Jordan (2003); fast sampler: Yao, Mimno & McCallum (2009) |
DMR |
Mimno & McCallum (2008) |
LabeledLDA |
Ramage, Hall, Nallapati & Manning (2009) |
CTM |
Blei & Lafferty (2007) |
STM |
Roberts, Stewart & Tingley (2014, 2019) |
SAGE |
Eisenstein, Ahmed & Xing (2011) |
HDP |
Teh, Jordan, Beal & Blei (2006) |
DTM |
Blei & Lafferty (2006) |
SupervisedLDA |
Blei & McAuliffe (2008) |
PT |
Zuo et al. (2016) |
GSDMM |
Yin & Wang (2014) |
SeededLDA |
method: Jagarlamudi, Daumé III & Udupa (2012); package: Watanabe (2023) |
KeyATM |
Eshima, Imai & Sasaki (2024) |
PA |
Li & McCallum (2006) |
HLDA |
Griffiths, Jordan, Tenenbaum & Blei (2004) |
LightLDA (alias sampler) |
Yuan et al. (2015) |
BERTopic |
Grootendorst (2022) |
Top2Vec |
Angelov (2020) |
ETM |
Dieng, Ruiz & Blei (2020) |
FASTopic |
Wu et al. (2024) |
Full references¶
Models¶
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
- Yao, L., Mimno, D., & McCallum, A. (2009). Efficient methods for topic model inference on streaming document collections. KDD 2009, 937–946. doi:10.1145/1557019.1557121 (SparseLDA)
- Mimno, D., & McCallum, A. (2008). Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. UAI 2008.
- Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. EMNLP 2009, 248–256. doi:10.3115/1699510.1699543
- Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17–35. doi:10.1214/07-AOAS114
- Roberts, M. E., Stewart, B. M., & Airoldi, E. M. (2016). A model of text for experimentation in the social sciences. Journal of the American Statistical Association, 111(515), 988–1003. doi:10.1080/01621459.2016.1141684
- Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2), 1–40. doi:10.18637/jss.v091.i02
- Eisenstein, J., Ahmed, A., & Xing, E. P. (2011). Sparse additive generative models of text. ICML 2011, 1041–1048.
- Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581. doi:10.1198/016214506000000302
- Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. ICML 2006, 113–120. doi:10.1145/1143844.1143859
- Blei, D. M., & McAuliffe, J. D. (2008). Supervised topic models. NIPS 2007 (Advances in Neural Information Processing Systems 20), 121–128.
- Zuo, Y., Wu, J., Zhang, H., Lin, H., Wang, F., Xu, K., & Xiong, H. (2016). Topic modeling of short texts: A pseudo-document view. KDD 2016, 2105–2114. doi:10.1145/2939672.2939880
- Yin, J., & Wang, J. (2014). A Dirichlet multinomial mixture model-based approach for short text clustering. KDD 2014, 233–242. doi:10.1145/2623330.2623715 (GSDMM)
- Jagarlamudi, J., Daumé III, H., & Udupa, R. (2012). Incorporating lexical priors into topic models. EACL 2012, 204–213. (seeded-LDA method)
- Watanabe, K. (2023). seededlda: Seeded sequential LDA for topic modeling. R package.
- Eshima, S., Imai, K., & Sasaki, T. (2024). Keyword-assisted topic models. American Journal of Political Science, 68(2), 730–750. doi:10.1111/ajps.12779
- Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. ICML 2006, 577–584.
- Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS 2003 (Advances in Neural Information Processing Systems 16), 17–24.
- Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E. P., Liu, T.-Y., & Ma, W.-Y. (2015). LightLDA: Big topic models on modest computer clusters. WWW 2015, 1351–1361. doi:10.1145/2736277.2741115
- Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794.
- Angelov, D. (2020). Top2Vec: Distributed representations of topics. arXiv:2008.09470.
- Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439–453. doi:10.1162/tacl_a_00325
- Wu, X., Nguyen, T., Zhang, D. C., Wang, W. Y., & Luu, A. T. (2024). FASTopic: A fast, adaptive, stable, and transferable topic modeling paradigm. NeurIPS 2024. arXiv:2405.17978
Methods used in the diagnostics and effects tools¶
- Chib, S. (1998). Estimation and comparison of multiple change-point models. Journal of Econometrics, 86(2), 221–241. doi:10.1016/S0304-4076(97)00115-2 (dynamic keyATM HMM)
- Lau, J. H., Newman, D., & Baldwin, T. (2014). Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. EACL 2014, 530–539. doi:10.3115/v1/E14-1056 (coherence)
- Sievert, C., & Shirley, K. E. (2014). LDAvis: A method for visualizing and interpreting topics. Workshop on Interactive Language Learning, Visualization, and Interfaces (ACL 2014), 63–70. doi:10.3115/v1/W14-3110 (relevance)
- Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. NIPS 2009 (Advances in Neural Information Processing Systems 22), 288–296. (intrusion tests)
- Monroe, B. L., Colaresi, M. P., & Quinn, K. M. (2008). Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4), 372–403. doi:10.1093/pan/mpn018 (fighting words)
- Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. PAKDD 2013, 160–172. doi:10.1007/978-3-642-37456-2_14 (HDBSCAN)
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.
Reference implementations and libraries¶
topica is validated against, and in places ports features from, these projects:
- MALLET (McCallum, 2002) and RustMallet (Mimno)
- stm (Roberts, Stewart & Tingley)
- keyATM (Eshima, Imai & Sasaki)
- seededlda (Watanabe)
- lda-c / ctm-c / dtm / hdp (Blei lab)
- gensim (Řehůřek & Sojka, 2010)
- tomotopy (bab2min)
- BERTopic (Grootendorst) and Top2Vec (Angelov)
- ETM (Dieng, Ruiz & Blei) and FASTopic (Wu et al.)
- petal-clustering (HDBSCAN) and umap-rs (UMAP), both pure-Rust and BLAS-free