Topic Modeling Peanuts

Rationale

We recently finished our reading of the first volume of The Complete Peanuts. We’ve read some of it closely thanks to your daily comics discussions, and we’ve gotten an even closer look at a subset of those strips as we’ve marked them up with TEI and CBML. With all of this behind us, I’m guessing that we could offer some pretty in-depth takes on what Peanuts is “about.” And yet, we have only read 4% of Schulz’s 17,897 strips. So any pronouncements about what Peanuts is “about” at this point would require ignoring most of the available evidence.

Fortunately, digital humanities methods offer us, as Robert K. Nelson writes, “an additional method that allows us to examine and detect patterns within not a sampling but in the entirety of an archive.” One of these methods is topic modeling. From our reading, we know that topic modeling is, as Megan Brett puts it, “a way of identifying patterns in a corpus” by using an unsupervised machine-learning process to find “recurring pattern[s] of co-occurring words.” In other words, topic modeling should help us discover some of the underlying discourses within the whole Peanuts corpus, that thing we could never, ever read in a regular semester.

Of course, it needs to be said that topic modeling works with text and text only. And, as we all know at this point, Peanuts is about so much more than dialogue and diegetic text. This method cannot even pretend to account for the visual aspect of Schulz’s work. Nevertheless, in our quest to understand Peanuts—and especially how it might change over time—topic modeling provides us with a chance to consider patterns of linguistic affinity and difference that are invisible simply because of their scale. In short, while we know that topic modeling cannot be the final word about Peanuts, it’s still worth a shot!

Nitty Gritty

For this assignment, you will create at least two topic models of the text of Schulz’s strips and begin analyzing them. The assignment has one outcome: a 5-minute presentation on Tuesday, 19 March, about what you’ve learned.

You will each meet with me during our regular class time on Tuesday, 12 March to talk about this assignment. The appointments are:

  • 4:00-4:20, Elijah
  • 4:20-4:40, Scout
  • 4:40-5:00, Emily

You should all feel free to consult with me frequently during this project. This is something that’s probably new for most of you, and I don’t expect it to work perfectly. If something with MALLET isn’t working, let’s solve it. If you’ve got a model but don’t know what to do next, let’s talk about it. If you’re annoyed with the class, let’s fight! The point of all this is not to prove that you are a hacker; the point is to give us all a different pathway for thinking about Peanuts.

Step One: Get the Data

You will work with two data sets in our repository:

  • the GoComics data are in the gocomics/peanuts-clean folder; these data contain descriptive text summarizing what happens in each strip
  • the “Jesse” data are in the jesse_data/extracted-speech-strips folder; these data contain only the contents of speech and thought balloons

Please remember that these data are copyrighted: while we are using them in a fair-use context (a research project and an educational experience), they should not be distributed outside of the class.

Step Two: Predict the Data

Spend a little time thinking about what you expect to see in a topic model. Think about a particular character and/or types of activities that happen in Peanuts. Jot down your thoughts so you don’t lose track of them (you’ll need them later). In many ways, this is sort of like the “hypothesis” step in that science project you did in middle school.

Step Three: Model the Data

Using MALLET and the Step-by-Step Guide to Topic Modeling, make a model of each data set. Use the same options for both data sets so that the two models can be compared fairly. You need to make the following choices:

  • the number of topics
  • whether to remove stopwords; consider whether you need to supplement MALLET’s default stopword list (viewable at stoplists/en.txt inside the MALLET directory) with an additional stopwords file, such as one containing character names.
  • whether to use the --optimize-interval setting
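To give you a sense of the shape of the workflow, here is a sketch of what the two MALLET commands might look like for the GoComics data. The output file names, the 20-topic choice, and the names.txt extra-stopwords file are placeholders; substitute your own choices from the list above.

```shell
# Import the folder of strip files into MALLET's binary format.
# --remove-stopwords uses the default en.txt list; --extra-stopwords
# adds a supplemental file (e.g., character names) if you made one.
bin/mallet import-dir --input gocomics/peanuts-clean \
  --output peanuts.mallet \
  --keep-sequence --remove-stopwords --extra-stopwords names.txt

# Train the model. Record every setting you use here (you'll need
# them for Step Six).
bin/mallet train-topics --input peanuts.mallet \
  --num-topics 20 --optimize-interval 10 \
  --output-topic-keys peanuts_keys.txt \
  --output-doc-topics peanuts_composition.txt
```

Run both commands from inside the MALLET directory, and repeat the pair with identical settings for the Jesse data.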

Make sure that you keep records of the settings you use to create your models (you’ll need them later).

After making each model, look at the topic keys to see if there seem to be too few topics or too many. Too many topics tends to be the easier problem to spot, since two or more of your topics will look relatively similar to one another. For example, if you have three “baseball” topics, you should probably model the data with fewer topics. Too few topics can be spotted when there are no especially clear patterns in the topic keys. (That said, since we are working with relatively small documents, the appearance of “clear patterns” is not guaranteed.)

Step Four: Analyze the Data

Ask yourself which topics you are seeing that you had expected, which topics you hadn’t expected to appear, and which topics are “missing.” Consider which character names are prevalent in particular topics. Observe which words appear in multiple topics. Does one data set seem to better match your predictions?

Looking at the model of one data set, find a topic that is interesting to you. Open the topic composition file in Excel; insert a row at the top of the spreadsheet and name the columns to make them easier to navigate. See which documents (AKA strips) rank high for this topic. Do you see any patterns for the years the strips were published? Do any of these top strips seem like they don’t really belong in the topic?
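If you prefer the command line to Excel, the same ranking can be sketched with sort. The file below is a made-up three-strip sample, not real output, but recent versions of MALLET write the doc-topics file in the same tab-separated shape: document index, file name, then one proportion column per topic.

```shell
# A tiny, hypothetical doc-topics file: doc index, file name, then one
# proportion column per topic (here, three topics).
printf '0\t1950-10-02.txt\t0.10\t0.85\t0.05\n' >  sample_composition.txt
printf '1\t1950-10-03.txt\t0.70\t0.20\t0.10\n' >> sample_composition.txt
printf '2\t1950-10-04.txt\t0.30\t0.60\t0.10\n' >> sample_composition.txt

# Rank strips by their weight for topic 1 (the fourth column),
# highest first: the command-line version of sorting in Excel.
sort -k4,4 -gr sample_composition.txt
```

The first line of the output is the strip that ranks highest for that topic; with the real composition file, change -k4,4 to the column holding the topic you chose.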

See whether you can find a similar topic in the other data set. Do the same strips appear near the top of this “mirror topic”?

Step Five: Visualize the Data

Choose at least one of the following to visualize your data:

  • Pick one of the strips that you selected for our daily discussions. Identify the three most prominent topics for this strip, according to your model of the Jesse data. Using the topic keys, color-code the words in the strip that come from each of these three particular topics.
  • Pick one of the strips that you selected for our daily discussions. Using Excel and the guidelines in Miriam Posner’s “Very basic strategies for interpreting results from the Topic Modeling Tool,” make a pie chart showing which topics comprise the strip.
  • Ask me to visualize one or more of your topics over time (or run topics-over-time.py from the scripts folder in our repo).

Step Six: Present the Data

Give a five-minute presentation in class on 19 March about your topic models and what you learned. Your presentation should cover the following:

  • Your hypotheses from Step Two
  • A summary of the settings for the topic model you used for your analysis (how many topics, whether you removed stopwords, etc.)
  • What you learned during your analysis in Step Four
  • Your visualization(s)
  • What you learned from this approach that you didn’t learn through reading the strips or marking the strips up

This can be a relatively informal presentation.

Grading

This assignment is worth 50 points. You will be graded on whether you complete all of its requirements.

Credits

This assignment was designed in 2018 and owes debts to Miriam Posner and Lisa Rhody for their ideas for visualizing the output of a topic model. In 2019, I simplified portions of the assignment based on what I learned the previous year. In 2020, I adapted the assignment for Peanuts. In 2023, I changed this from a group to an individual assignment.