So you want to be a hero topic modeler? It turns out it’s not that difficult, thanks to the work of those who created and maintain MALLET and the team behind The Programming Historian’s lesson on topic modeling (see more in the credits below).
The 5 Habits of Highly Effective Topic Modelers
1. Have me set up your computer
Over the years, I’ve discovered that one of the hardest things about doing topic modeling is getting the software installed—especially on Windows machines. Since the point of the class is not to learn to install arcane software and configure your computer but instead to learn how to think with the results of computational analysis, it seemed far more useful to have me install everything this time around. You’re welcome.
2. Create a MALLET corpus
Now you’re ready to start topic modeling. This is a two-step process. The first step is assembling all of your documents and then putting them into a single file for MALLET to read. It will have the extension .mallet. I’ll refer to this file as a MALLET corpus.
First, put all of your documents into one directory somewhere on your computer. Second, from the command line:
- make sure you’re in the MALLET directory
- run one of the following commands based on your operating system:
  - on a Mac: ./bin/mallet import-dir --input "directory" --output "name".mallet --keep-sequence
    - ./bin/mallet invokes MALLET
    - import-dir tells it that you’re importing a directory
    - --input plus the directory location tells it what text files you want to pull from
    - --output tells it what the MALLET corpus should be named
    - --keep-sequence: “This option preserves the document as a sequence of word features, rather than a vector of word feature counts. Use this option for sequence labeling tasks. The MALLET topic modeling toolkit also requires feature sequences rather than feature vectors.”
- Check that you have a new .mallet file in your directory with the ls command.
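Here is a rough sketch of the import step with the placeholders filled in. The folder name my-texts and the corpus name peanuts.mallet are made up for illustration, so swap in your own. The script prints the command and only runs it if it finds MALLET and the folder in the current directory.

```shell
#!/bin/sh
# Hypothetical names: replace these with your own folder and corpus name.
input_dir="my-texts"
corpus="peanuts.mallet"

# Build the import command from the pieces described above.
import_cmd="./bin/mallet import-dir --input $input_dir --output $corpus --keep-sequence"
echo "$import_cmd"

# Run it only when MALLET and the input folder are actually here.
# (Unquoted on purpose so the shell splits it into words; avoid spaces in names.)
if [ -x ./bin/mallet ] && [ -d "$input_dir" ]; then
    $import_cmd
    ls -l "$corpus"   # confirm the corpus file was created
fi
```

Printing the command before running it also gives you a line you can paste straight into your notes.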
Optional Commands for Stopwords
MALLET comes with a stopword list that you can call when creating the MALLET corpus. To do this, add the flag --remove-stopwords to the import command. To see what terms are included in the default MALLET stopwords list, go to the MALLET directory > stoplists > en.txt.
You cannot add words to the MALLET stopwords file. (Well, you can, but it won’t affect what MALLET does as the list is just there as a reference for an internal function.) Instead, you can create another text file (in TextEdit on Mac or Notepad on Windows) with one stopword per line. You should name and save this file to your MALLET directory (e.g., extra-stopwords.txt). When creating your MALLET corpus, you would use both --remove-stopwords and a new flag, --extra-stopwords [filename] (e.g., --remove-stopwords --extra-stopwords extra-stopwords.txt).
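If you would rather build the extra stopwords file from the command line than from a text editor, something like the following should work. The three character names and the folder and corpus names are invented for the example. The import command is only printed here, not run.

```shell
#!/bin/sh
# Hypothetical stopwords, written one word per line.
printf '%s\n' charlie snoopy lucy > extra-stopwords.txt

# Quick check: the line count should match the number of words you added.
wc -l < extra-stopwords.txt

# The import command with both stopword flags added (printed, not run).
# my-texts and peanuts-stop.mallet are made-up names.
echo "./bin/mallet import-dir --input my-texts --output peanuts-stop.mallet --keep-sequence --remove-stopwords --extra-stopwords extra-stopwords.txt"
```

Giving the stopworded corpus its own name keeps it from overwriting a corpus you built without stopwords.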
Remember, of course, that using stop words will radically change your output. Doing so can be good or bad; the one thing for certain is that it is never simply neutral.
3. Train your topics
Now that we have a MALLET corpus, we are ready to start the unsupervised machine learning. This is referred to as “training” your topics.
From the command line, make sure that you’re (still) in your MALLET directory.
Run the following command on a Mac: ./bin/mallet train-topics --input "name".mallet --num-topics "X" --output-topic-keys "name"-keys.txt --output-doc-topics "name"-composition.txt
- ./bin/mallet tells it what file will run the command
- train-topics is the command that you’re running
- --input plus the file name that follows it is the MALLET file you’ll train your models on
- --num-topics plus the integer you put after it is the number of topics you have the model find. Default is 10.
  - The number of topics that you should use is perhaps the most vexing question for topic modeling. Here’s a useful passage from Shawn Graham, Scott Weingart, and Ian Milligan’s Programming Historian tutorial: “How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run the train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, then we take that as a signal that we need to increase the number of topics.”
- The two --output flags (--output-topic-keys and --output-doc-topics) name the two different files that will contain our results from the training process. See below for more details.
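Since the advice above is to try several topic counts, a small loop can save some typing. This is only a sketch: the corpus name peanuts.mallet and the counts 10, 20, and 30 are assumptions, and the commands are printed rather than run unless MALLET and the corpus are found in the current directory.

```shell
#!/bin/sh
corpus="peanuts.mallet"   # hypothetical corpus name from the import step

# Train once per topic count, putting the count in each output file name
# so that the runs don't overwrite one another.
for k in 10 20 30; do
    train_cmd="./bin/mallet train-topics --input $corpus --num-topics $k --output-topic-keys peanuts-$k-keys.txt --output-doc-topics peanuts-$k-composition.txt"
    echo "$train_cmd"
    if [ -x ./bin/mallet ] && [ -f "$corpus" ]; then
        $train_cmd
    fi
done
```

Comparing the composition files from each run is one way to see whether most of your texts are piling into only a few topics.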
Optional Commands
Here are two more flags that you can add while training your topics: --num-iterations "X" --optimize-interval "X"
- --num-iterations plus the integer that follows tells your computer how many times it should run through the process before it yields the final models.
- --optimize-interval plus the integer you put after allows the model to learn from itself as it goes along. I find that I get better results when I include this. You need to set this to an integer. According to the MALLET website, optimizing every 10 iterations is reasonable.
4. Record Your Work
Like regular close reading, there is art to the process of distant reading. One of the advantages that the latter has over the former, however, is that it can be a lot simpler to record your process so your work could be replicated. That said, once you have created a series of topic models, it can be really easy to lose track of what exactly you’ve done. Therefore it’s important to keep a record as you work.
- Move your two output files to another directory. You might want to name the folder with something helpful like “peanuts-nostop-10topics.”
- You should also create a note of what you did to create these files. Create a text file (either from your GUI or from the command line using touch "name".txt). Go to the terminal and copy the commands that you used to create the file and then paste them into the text file. You can easily find the recent commands you’ve run by hitting the up arrow on your keyboard.
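The record-keeping steps can also be done entirely from the command line. The sketch below assumes output files named peanuts-keys.txt and peanuts-composition.txt and a notes file called notes.txt, all hypothetical. It moves the output files only if they exist, and the two commands in the notes are examples of what you might have run.

```shell
#!/bin/sh
# Hypothetical folder name describing the run: corpus, stopwords, topic count.
run_dir="peanuts-nostop-10topics"
mkdir -p "$run_dir"

# Move the two output files into the folder, if they are here.
for f in peanuts-keys.txt peanuts-composition.txt; do
    if [ -f "$f" ]; then
        mv "$f" "$run_dir/"
    fi
done

# Save the commands used for this run (example commands; paste in your own).
cat > "$run_dir/notes.txt" <<'EOF'
./bin/mallet import-dir --input my-texts --output peanuts.mallet --keep-sequence
./bin/mallet train-topics --input peanuts.mallet --num-topics 10 --output-topic-keys peanuts-keys.txt --output-doc-topics peanuts-composition.txt
EOF

cat "$run_dir/notes.txt"
```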
5. Interpret Your Results
You created two files from following this process:
- "name"-keys.txt (the topic-keys): a text document showing you what the “key”, or most likely, words are for each topic.
- "name"-composition.txt (the composition): a text file indicating the breakdown, by percentage, of each topic within each original text file you imported. This is normally best viewed in Excel.
Take a look at the topic-keys. See if there are multiple topics that seem to use the same terms; if so, you might want to re-run the model with fewer topics. If there doesn’t seem to be coherence among the topics, you might need to re-run the model with more topics. Consider whether there are terms that seem to be throwing off or clogging the model, including stopwords or character names. If you want to remove these, you will have to return to Step 2, add them as stopwords, and rebuild your MALLET corpus before training again.
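If you want a quick, rough check for terms shared across topics, a little awk can count them. This sketch assumes the topic-keys layout I have seen from MALLET (topic number, a weight, then the key words, separated by tabs), and it runs on a tiny made-up sample file so that you can try it without a real model.

```shell
#!/bin/sh
# Made-up sample in the assumed layout: topic number, weight, key words.
keys="sample-keys.txt"
printf '0\t0.5\tdog beagle house\n1\t0.5\tball kick football\n2\t0.5\tdog ball game\n' > "$keys"

# Count how many topics each key word appears in, and keep the words
# that show up in more than one topic.
shared=$(awk -F '\t' '
    { n = split($3, w, " "); for (i = 1; i <= n; i++) count[w[i]]++ }
    END { for (word in count) if (count[word] > 1) print count[word], word }
' "$keys" | sort -rn)

echo "$shared"
```

Point keys at your own topic-keys file to use it on real output. A long list of shared words may be a hint to try fewer topics or to add stopwords.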
Take a look at the composition in something like Excel. Add a row to the top, and title the columns; the first two can be titled file number and filename, with the others being numbered for topics, starting at 0. Using the topic-keys, find a topic that you’re interested in, and look for the number of the topic in the column at the top of the page. Sort the spreadsheet by that column and look at which documents have the highest percentage of being composed from that topic. Are the results expected? Do the documents seem to belong together?
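The same sort can be done without Excel. This sketch assumes the composition layout described above (file number, filename, then one tab-separated column per topic starting at topic 0) and uses a tiny made-up sample file. It relies on the -g option of sort, which GNU and macOS sort both provide and which copes with very small values written in scientific notation.

```shell
#!/bin/sh
# Made-up sample: file number, filename, then proportions for topics 0 and 1.
comp="sample-composition.txt"
printf '0\tfile:/texts/a.txt\t0.10\t0.90\n1\tfile:/texts/b.txt\t0.70\t0.30\n2\tfile:/texts/c.txt\t0.40\t0.60\n' > "$comp"

topic=0                   # the topic you are interested in
col=$((topic + 3))        # topic 0 sits in the third column
tab=$(printf '\t')

# Sort by that topic's column, highest proportion first, and keep only
# the filename and the proportion.
top_docs=$(sort -t "$tab" -k "${col},${col}gr" "$comp" | cut -f 2,"$col")
echo "$top_docs"
```

With the sample data, b.txt comes out on top for topic 0. Change topic and comp to look at your own results.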
Other Resources
David Blei, who is the lead author on the original paper about topic modeling, wrote a general introduction to topic modeling. Scott Weingart has a really good post on “Topic Modeling for Humanists,” which I think is a useful overview for anyone, regardless of the field they are in.
Credits
This guide is a very stripped-down and adapted version of “Getting Started with Topic Modeling and MALLET” by Shawn Graham, Scott Weingart, and Ian Milligan, which is just one of many fabulous lessons at The Programming Historian. This lesson and the others at Programming Historian are released under a CC BY license, for which I’m grateful.
I prepared a version of this guide at first for my own use and then deployed it in a number of workshops on topic modeling for Brown University Library’s Center for Digital Scholarship. I updated that guide in 2018 for my DigHT 315 class, making a number of improvements. Since then, I’ve continued to make changes, including big ones in 2022, when I finally decided it wasn’t super important to have students install the software themselves.