“How do your topics look so good?” I get this question sometimes when I present my research. I’ve also heard from friends and colleagues that they’ve tried topic modeling but got confusing or incoherent results and now are very skeptical of this method.
In this quick and practical guide, I’m going to share a set of steps that you can follow to get coherent topics from most datasets. You can think of this like a topic modeling recipe. These tips are partly based on my personal experience and partly on important research done by others!
Two caveats: I’ll be focusing on latent Dirichlet allocation (LDA), but the tips about evaluation apply more broadly. And some of these tips are English-centric, as preprocessing steps like stemming can have different effects in other languages.
There are many, many different kinds of topic models. However, topic model evaluation is really difficult (see below), and it’s particularly difficult to directly compare results across different models. LDA is an older model, first introduced to NLP in 2003 by David Blei, Andrew Ng, and Michael Jordan. It’s rare in modern NLP to use models from 20 years ago, but in this case, LDA still performs strongly, according to recent experiments by Harrando et al. (2021) and Hoyle et al. (2022). LDA is also more accessible than many other topic models: there are multiple public implementations, and it performs well even on small datasets. I’m not advising you not to use other models, but you shouldn’t feel pressure to use a newer model just because it’s newer.
However, given the unsupervised and exploratory goals of topic modeling, none of these automatic metrics can replace manual evaluation by experts with domain expertise, and there is recent work by Hoyle et al. (2021) showing that these metrics don’t correlate well with human judgments. I recommend using the tests suggested in the paper below, like word intruder and document intruder tests, to evaluate your topics. In a word intruder test, you insert a randomly chosen word from a different topic (the “intruder”) into the set of most probable words for a topic, and ask annotators to pick out the word that doesn’t belong; if they can reliably identify the intruder, the topic is considered coherent. These evaluations are kind of fun, like word puzzles!
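To make the setup concrete, here’s a minimal sketch of how you might construct a single word-intrusion task. This is plain Python with made-up toy topics; the function name and word lists are my own invention, not from the paper:

```python
import random

def build_intruder_task(topics, topic_id, n_top_words=5, seed=0):
    """Build one word-intrusion task: take a topic's top words and
    insert a high-probability word from a *different* topic."""
    rng = random.Random(seed)
    top_words = topics[topic_id][:n_top_words]
    # Pick the intruder from another topic, excluding any word that
    # also appears in this topic's word list.
    other_id = rng.choice([i for i in range(len(topics)) if i != topic_id])
    candidates = [w for w in topics[other_id][:n_top_words]
                  if w not in topics[topic_id]]
    intruder = rng.choice(candidates)
    task = top_words + [intruder]
    rng.shuffle(task)
    return task, intruder

# Toy topics: each is a list of most-probable words from a trained model.
topics = [
    ["game", "team", "season", "player", "coach", "score"],
    ["stock", "market", "price", "trade", "bank", "fund"],
]
task, intruder = build_intruder_task(topics, topic_id=0)
# Show `task` to an annotator; if they can spot `intruder`,
# that's evidence topic 0 is coherent.
```

In practice you’d generate one task per topic, show them to multiple annotators, and report the fraction of intruders correctly identified.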
If you use the word intruder test, you can also compare your results to prior work. As some benchmarks, I’ve observed performance of about 0.5 for Twitter data and 0.7 for Reddit data.
It’s unlikely you can reach proper statistical power using these evaluations (the paper below suggests at least 15 labelers, which is unrealistic for most research projects that require domain expertise). But manual evaluation can still provide important context about the coherence and reliability of your topics.
There’s no “right” number of topics. It depends on both your dataset and your research goals. For example, using the same training data, you might prefer a larger or smaller number of topics depending on whether you’re interested in high level themes or fine-grained discourses. In general, your best bet is to try different numbers and see what works best.
While there are automatic metrics you can use to select the number of topics (e.g., log likelihood, which measures statistical fit), I’d strongly suggest a qualitative evaluation instead (repeating a bit of what I wrote above about manual evaluation). Read your topics! Check both the most probable words as well as the documents with highest topic probability. If you notice that the topics are too broad and combining multiple themes, you might want to try a larger number of topics. If instead you see multiple topics on the same theme, you might want to try a smaller number of topics.
As a rule of thumb, I usually train models with 20, 30, 50, 100, and 200 topics to start.
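To illustrate what “reading your topics” looks like in practice, here’s a toy sketch where invented probability tables stand in for a trained model’s output; in a real workflow, these would come from your LDA implementation:

```python
# Toy model output: topic-word and document-topic probabilities.
# (The numbers here are made up for illustration.)
topic_word = {
    0: {"game": 0.30, "team": 0.25, "season": 0.20, "fans": 0.10},
    1: {"market": 0.35, "stock": 0.25, "price": 0.15, "fund": 0.10},
}
doc_topic = {
    "doc_a": {0: 0.9, 1: 0.1},
    "doc_b": {0: 0.2, 1: 0.8},
    "doc_c": {0: 0.7, 1: 0.3},
}

def top_words(topic_id, k=3):
    """Most probable words for a topic."""
    dist = topic_word[topic_id]
    return sorted(dist, key=dist.get, reverse=True)[:k]

def top_docs(topic_id, k=2):
    """Documents with the highest probability for a topic."""
    return sorted(doc_topic, key=lambda d: doc_topic[d][topic_id],
                  reverse=True)[:k]

for t in topic_word:
    print(t, top_words(t), top_docs(t))
```

Skimming both views side by side, for every topic and every candidate number of topics, is usually enough to spot topics that are too broad or too redundant.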
Very short documents like tweets are more difficult to topic model, and you’ll generally find more coherent topics using documents at least a paragraph in length. Documents that are too long (e.g., entire novels) can also produce incoherent topics, or topics that reflect different kinds of themes than those learned from shorter documents. In general, paragraphs, pages, or posts are good sizes with which to start experimenting.
That said, this doesn’t mean it’s impossible to get coherent topics from short documents. In my experience, you can still discover coherent topics from tweets as long as you follow the tips below (and especially, remove duplicates and curate the training set), though they won’t be as coherent as topics trained on other types of data.
Aside from the number of topics, there are a few other knobs you can turn when preparing a model for training. There are two Dirichlet parameters that control the distributions of topics and words (how “spiky” these distributions are). Prior work by Wallach et al. (2009) has found that it’s helpful to tune the parameter governing topics per document distribution but not the parameter governing the words per topic distribution. In other words, you should think carefully about how many topics you expect to be represented in each document (do you expect each document to be represented by one topic? multiple topics?), but you shouldn’t worry too much about how many words you expect to represent each topic.
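If the “spikiness” of a Dirichlet distribution feels abstract, here’s a small self-contained illustration using only the standard library. The concentration values 0.01 and 10.0 are arbitrary choices for the demonstration, not recommended settings:

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(42)
k = 10  # pretend we have 10 topics

# Small alpha -> "spiky" document-topic distributions (most of the
# probability mass lands on one or two topics per document);
# large alpha -> flatter distributions (many topics per document).
spiky = sample_dirichlet(0.01, k, rng)
flat = sample_dirichlet(10.0, k, rng)
print(max(spiky), max(flat))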
There are different ways to train an LDA model. Some implementations, like gensim, use variational inference and can more easily accommodate online learning. This is often faster than Gibbs sampling because it’s common to use fewer training iterations with variational inference (e.g., gensim makes just one pass over the corpus by default). However, Gibbs sampling produces more coherent results, particularly on smaller training sets. I don’t know of a published result establishing this, but it’s been observed anecdotally by many researchers.
Removing stopwords from a predefined list is generally not helpful; what does help is removing very frequent (and very infrequent) words. In work comparing models with and without stopword removal, Schofield et al. (2017) found that it’s best to use a high document-frequency threshold (e.g., removing words that appear in at least 50% of documents). Removing stopwords is usually a cosmetic change that helps you interpret the results, not one that impacts training, so you can do it before or after training.
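As a sketch of what a document-frequency filter might look like, here’s a small stdlib-only example; the thresholds and toy documents are illustrative, not the exact settings from the paper:

```python
from collections import Counter

def frequency_filter(docs, max_doc_frac=0.5, min_doc_count=2):
    """Drop words that appear in more than `max_doc_frac` of documents
    (corpus-specific stopwords) or in fewer than `min_doc_count`
    documents (words too rare to form topics)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    n = len(docs)
    keep = {w for w, df in doc_freq.items()
            if df / n <= max_doc_frac and df >= min_doc_count}
    return [[w for w in doc if w in keep] for doc in docs]

docs = [
    ["the", "team", "won", "the", "game"],
    ["the", "market", "fell", "sharply"],
    ["the", "team", "signed", "a", "player"],
    ["the", "market", "rallied"],
]
filtered = frequency_filter(docs)
# "the" appears in every document, so it's dropped; words occurring in
# only one document are dropped too.
```

A nice property of a frequency threshold is that it adapts to your corpus: it catches domain-specific “stopwords” (like a subreddit’s name in that subreddit’s posts) that no generic stopword list would contain.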
In work comparing different stemming and lemmatization methods in English, Schofield and Mimno (2016) found that these treatments are usually redundant and sometimes harmful. Stemming can reduce the readability of your topics (often the stems aren’t easily interpretable) as well as conflate terms with very different meanings. As noted in their paper, “In most cases, the topic model was already grouping together common words with the same root on its own, and gained little by better modeling rare words.”
Below a certain threshold, duplicates are handled easily by LDA: the duplicates get their own topic and don’t hurt the rest of the model. But when there are a lot of duplicates, they can really destroy your topics’ coherence. These principles are supported by work (again) from Schofield et al. (2017).
Aside from removing exact duplicates, make sure that you also check for near duplicates. In my experience, tweets are particularly prone to this issue, where people share close but not exact copies of other people’s content. For example, certain jokes have been retold so many times on Twitter that they can throw off an entire model. It might take some trial and error to capture all of these.
In general, if you get really weird looking topics from what otherwise seems like a reasonable dataset, the first thing you should do is check for duplicate documents in your training set. I’ve found that this is one of the most useful tricks that can turn nonsense topics into coherent topics.
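One simple way to hunt for near duplicates is to compare documents by their overlapping word n-grams. Here’s a rough sketch; the 0.5 similarity threshold and 3-word shingles are arbitrary starting points you’d tune for your data (and for large corpora you’d want something faster, like MinHash, rather than all-pairs comparison):

```python
def shingles(text, n=3):
    """The set of word n-grams in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, threshold=0.5):
    """Return index pairs of documents whose shingle sets overlap heavily."""
    sigs = [shingles(d) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if jaccard(sigs[i], sigs[j]) >= threshold]

docs = [
    "my dog ate my homework and then my lunch",
    "my dog ate my homework and then my dinner",
    "the meeting is rescheduled to friday afternoon",
]
pairs = near_duplicates(docs)  # flags the two joke variants
```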
Think carefully about which documents to include in your training set and how those documents help you answer your particular research questions. For example, when studying a particular online community, you could include all the posts and comments from that community, or you could include a subset. You could include only posts; only comments; only posts containing a keyword of interest; etc. Data curation is an important step before training that can significantly shape the resulting topics.
Sometimes the topic model works really well—it identifies real themes in our dataset—but those themes mirror metadata that we’ve already documented. For example, a topic model trained on novels might produce topics that map to specific authors, which isn’t particularly useful if (a) we already have a list of the authors mapped to novels and (b) we’re using the topic model to discover new, cross-cutting themes. Similarly, the model might mimic genre categories or other labels that we already know.
Luckily, there’s an easy preprocessing technique we can use to bias our topic models away from known metadata, introduced by Thompson and Mimno (2018). By probabilistically subsampling words associated with the metadata before training, we can reduce the topics’ mirroring of the metadata, without losing coherence. You can use the Authorless TMs package (also available via MALLET) to run this preprocessing on your own data. This has become a standard part of my topic modeling workflow, and I highly recommend giving it a try.
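To give a flavor of the idea, here’s a deliberately simplified toy sketch, not the actual Authorless TMs algorithm (see the paper and package for the real method): thin out tokens that are much more frequent within one author’s documents than in the corpus overall, so author-distinctive vocabulary stops dominating.

```python
import random
from collections import Counter

def subsample(docs_with_author, seed=0):
    """Toy sketch of metadata-aware subsampling: keep each token with
    probability proportional to how much *less* frequent the word is
    corpus-wide than within the author's own documents, so
    author-distinctive words get thinned out."""
    rng = random.Random(seed)
    corpus = Counter()
    by_author = {}
    for author, doc in docs_with_author:
        corpus.update(doc)
        by_author.setdefault(author, Counter()).update(doc)
    n_corpus = sum(corpus.values())
    out = []
    for author, doc in docs_with_author:
        counts = by_author[author]
        n_auth = sum(counts.values())
        kept = []
        for w in doc:
            p_corpus = corpus[w] / n_corpus   # overall frequency
            p_author = counts[w] / n_auth     # frequency for this author
            if rng.random() < min(1.0, p_corpus / p_author):
                kept.append(w)
        out.append((author, kept))
    return out
```

Words used at the same rate by everyone survive untouched, while words that one author uses far more than the rest of the corpus are probabilistically removed before training.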
While I’m working, I usually rely on the most probable words for each topic as their labels. However, when presenting results to others, I’ve found it helpful to replace these words with hand-annotated labels that represent both the top words and the top documents for each topic. In particular, if your paper is being read or reviewed by people unfamiliar with topic modeling, I’ve found this can reduce unnecessary concerns about stopwords, lemmatizing, etc.
This labeling should be guided by domain knowledge and done by more than one person. The labels should then be reconciled across different annotators.
Heatmaps are good, though they require your reader to take their time in examining the plot. You can improve the legibility of the heatmap by ranking your topics by entropy over subsets of the data or by Jensen-Shannon divergence from one topic. You might also want to normalize the columns, to avoid “stripey” patterns (where one topic dominates all the categories) and highlight differences across categories.
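Here’s a small sketch of both tricks: ranking topics by the entropy of their distribution over categories, and scaling each column by its maximum. The table values are invented for illustration, and in practice you’d do this with numpy or pandas rather than raw lists:

```python
import math

# Toy topic-by-category probability table (e.g., mean topic probability
# per genre); the numbers are illustrative.
table = {
    "topic_sports": [0.30, 0.02, 0.01],
    "topic_family": [0.10, 0.12, 0.11],
    "topic_war":    [0.01, 0.25, 0.03],
}

def entropy(row):
    """Entropy of a topic's (normalized) distribution over categories.
    Low entropy = concentrated in a few categories."""
    total = sum(row)
    probs = [v / total for v in row]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Rank topics so the most category-distinctive ones come first.
ranked = sorted(table, key=lambda t: entropy(table[t]))

def normalize_columns(table):
    """Scale each column so its max is 1, keeping within-category
    differences visible even for low-probability topics."""
    names = list(table)
    n_cols = len(next(iter(table.values())))
    col_max = [max(table[t][j] for t in names) for j in range(n_cols)]
    return {t: [v / col_max[j] for j, v in enumerate(table[t])]
            for t in names}

norm = normalize_columns(table)
```

Plotting `norm` with rows in `ranked` order puts the most distinctive topics at the top and prevents a single high-probability topic from washing out the rest of the heatmap.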
For example, in Figure 6 of this paper, we ranked the topics by entropy over book review genres, and we normalized the columns so that probability distinctions for topics with universally lower or higher probabilities are still interpretable.
That concludes my list of tips! If you’re curious to learn more about topic modeling, I’ve also included some favorite resources below.
One additional note: I think it’s possible to read through this list and come away with the impression that topic modeling is just a bunch of tricks to produce output that looks good but isn’t necessarily representative of the dataset. This is where I would point to more rigorous evaluation methods (like the hand-annotation techniques described above). And at a meta level, the point of topic modeling generally isn’t to discover one “correct” view of the data but to guide a qualitative exploration, discovering one of many possible interpretative lenses through which you can view the data.
Thank you to David Mimno and Matthew Wilkens for providing comments on earlier versions of this blog post. Thanks also to Alexandra Schofield, Laure Thompson, Alexander M. Hoyle, Hanna Wallach, David Mimno, and many others for the useful research that underlies most of these tips!
Updated March 19, 2023
July 27, 2022