Rationale

In Nabokov’s Favorite Word is Mauve, Ben Blatt suggests that Hemingway would find his “mathematical analysis equal parts illuminating and outlandish.” But he also suggests that “it’s just as outlandish to focus on a small sample and never look at the whole picture.” We’ve read some Vonnegut—a whole 100 pages—so what right do we have to say that we know anything about his themes or style? How could we go about correcting this? After all, we only have one week of the semester remaining! More importantly, what could we learn if we actually managed to pull this off?

Strap in for one last crazy digital humanities experiment!

Part of this project will involve us once again confronting the friction of formats. Computational approaches require computable information, and it turns out that print books aren’t. Or at least, they aren’t yet. This project will present you with a second chance to think about what it takes to get material from one format into another.

The Nitty Gritty

Collectively, we will continue (with my 2021 class) building a dataset of all of Kurt Vonnegut’s stories, novels, and nonfiction. Individually, you will be scanning a number of pages from one or more of his books and then processing those scans with optical character recognition (OCR) software. You will finish this by Monday, December 12 at 11:59pm. Then you’ll play with the data during the final exam.

Scanning

It turns out that books can be kind of hard to scan since there’s all that pesky page-turning to deal with. We’re going to simplify things by making these books less book-like and removing their bindings.

  • You will scan the pages that you are assigned, turning the pages into PDFs. If you have selections from multiple books, please scan those pages in separate chunks, creating different files.
  • There are three printers around the JKB that have automatic feeders and that can scan both sides of the page. You can find them in 3003C, 4073 East, and 4016 West. There’s also a copier in the Office of Digital Humanities in 1163J JFSB. I estimate that it will take you 5 minutes to do the scanning. The scanners will email you the file.
  • Before scanning, you should fan the edge of your pages to make sure none are stuck to each other. Then, you will need to make four changes to the default settings on the printers:
    • The default is to only scan one side of the page. You need to make sure you scan 2 sides.
    • Set the scanner to work in grayscale rather than color.
    • The default on the scanners is 300 dpi, but you should set it to 200 dpi, as it will result in a smaller file.
    • The default file type on the scanners is a Compact PDF. This will produce unusable output. Instead, you need to change the file type to PDF. (You do not need to do this if you’re using the copier in 1163J JFSB.)
  • If for some reason you aren’t receiving the file, there’s a strong chance that it was too large to email. In that case, break your bundle of pages into two or three chunks and repeat the process for each.
  • When you have received the file, check it to make sure you scanned both sides of the pages. 
  • Rename your PDF to bookabbreviation_pages_lastname; for example, PS_73-300_croxall.pdf. If you’re working with different books, make sure you save these as different files. 
  • Upload a copy of the PDF(s) to this Google Drive folder.
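
If you want to double-check a file name before uploading, here is a minimal Python sketch of the bookabbreviation_pages_lastname convention. The function name `parse_scan_filename` and the pattern are my own illustration, not part of any tool we use in class. The sketch assumes a single start-end page range (Arabic or lowercase Roman numerals) and a last name made of letters and hyphens, so adjust it if your assignment has multiple ranges.

```python
import re
from typing import Optional

# Hypothetical pattern for bookabbreviation_pages_lastname.pdf (or .txt).
# Assumption: one start-end page range, in digits or lowercase Roman numerals.
FILENAME_PATTERN = re.compile(
    r"(?P<book>[A-Za-z]+)"
    r"_(?P<pages>(?:\d+|[ivxlcdm]+)-(?:\d+|[ivxlcdm]+))"
    r"_(?P<lastname>[A-Za-z-]+)"
    r"\.(?P<ext>pdf|txt)"
)


def parse_scan_filename(name: str) -> Optional[dict]:
    """Return the parts of a file name, or None if it breaks the convention."""
    match = FILENAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ["PS_73-300_croxall.pdf", "PS 73-300 croxall.pdf"]:
        print(candidate, "->", parse_scan_filename(candidate))
```

A name that comes back as `None` probably has a space where an underscore should be, or is missing one of the three parts.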

OCRing

  • To do the OCR work, you will need to bring your PDF(s) to one of several computers in the JFSB that have Prizmo installed on them.
    • All of the iMacs in the labs on the main floor of the JFSB (1131) have Prizmo installed. This lab is open Monday – Saturday, from the morning until the evening.
    • All of the iMacs in our classroom have Prizmo installed. You will be able to use it during our class periods next week—during which we will not be meeting. Additionally, I’ve reserved the room at the following times:
      • Friday, 2 December from 2-6 pm
      • Friday, 9 December from 9 am – 6 pm
    • IMPORTANT: Prizmo will not work properly if you log in to the computer with your normal Net ID and password. Instead, you should log in with the username macuser and the password that I will send you via email.
  • When opening Prizmo, choose “New Document…” and then drag and drop your PDF onto the window.
  • Select all of the pages in your file by clicking on a single page image in the leftmost bar and choosing Edit > Select All (⌘A).
  • Then click “Recognize” in Prizmo’s upper-right corner.
  • Sit back and relax as Prizmo processes all of your text.
  • As you look through each page of the text, do the following four things:
    • First, make sure that only the body of the text is selected on the page, removing the page numbers and/or page headers. You may have to rearrange the regions that Prizmo has automatically selected.
    • Second, make sure you don’t have dots in the text. If you do, it means you’re working on a copy of Prizmo that hasn’t been registered. Please email me before going any farther.
    • Third, check for misspellings and correct them. They will have red underlines; if the word is spelled correctly and will repeat regularly throughout your pages (like a character’s name, a place name, or a foreign word), you can right-click and tell Prizmo to ignore that word moving forward.
    • Fourth, check for random characters such as numbers or strange punctuation marks and correct them. Prizmo won’t catch these with red underlines most of the time. Just glance around the page. You’ll be surprised how good the human eye is at seeing misplaced characters.
  • Remember: I’m not asking you to read through each and every single word.
  • If you can’t finish all of this OCR work at once, you can save the work as a Prizmo file (.pzdoc). Make sure you save it to a flash drive or upload it to something like Box, Dropbox, or Google Drive.
  • When you’ve finished all of your pages, choose File > Export… and set the Format to “Regular Text”. Make sure you include all pages. Click “Export to File…” and then save the file.
  • Name your file bookabbreviation_pages_lastname; for example, TBQ_201-338_croxall.txt and email me the file. In the email, please let me know if you finished all of your pages or not.
  • If you have pages from more than one book, you will have to go through this process with your two different files. But don’t worry, I made the number of pages equal for everyone. Well, except for me. I got extra.
  • Important: Please do not spend more than five hours on the OCR, even if you don’t finish. Just let me know in your email what you got done and how long you spent on the project.
  • You need to get me all of your files no later than 11:59pm on Monday, 12 December. This will give me the chance to compile everything in time for the final.
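
The fourth check above (hunting for stray numbers and odd punctuation) can be partly automated once you have exported your .txt file. What follows is a rough Python sketch, not a Prizmo feature. The list of "suspicious" characters is purely my assumption about what OCR errors tend to look like, and `flag_suspicious_lines` is a hypothetical helper name. Expect false positives (an ordinal like "2nd" will be flagged), so treat the output as a list of places to glance at rather than a list of errors.

```python
import re
from typing import List, Tuple

# Assumption: a digit glued to a letter (e.g. "P1lgrim") or a symbol that is rare
# in prose is worth a second look. This character set is a guess, not a standard.
SUSPICIOUS = re.compile(r"[A-Za-z]\d|\d[A-Za-z]|[|\\^~`<>{}\[\]_#@*=+]")


def flag_suspicious_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that contain a suspicious pattern."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SUSPICIOUS.search(line):
            flagged.append((number, line))
    return flagged


if __name__ == "__main__":
    sample = "So it goes.\nBilly P1lgrim has come unstuck in t|me.\nListen:"
    for number, line in flag_suspicious_lines(sample):
        print(f"line {number}: {line}")
```

To run it on your own export, read the file with `open(path, encoding="utf-8").read()` and pass the result to `flag_suspicious_lines`.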
Final Exam

    During the final (Thursday, 15 December from 3-6 pm), you will work in groups to analyze our newly created Vonnegut corpus using the various text analysis tools provided by Voyant, which was developed by Stéfan Sinclair and Geoffrey Rockwell. You will use Voyant’s tools to help you practice digital humanities. Put differently, you’ll identify patterns and then interpret them.
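
    To give a feel for what "identifying patterns" can mean at its simplest, here is a tiny Python sketch that counts the most frequent words in a text. To be clear, this is not how Voyant works internally, and you will not need to write code during the final. It is only a rough stand-in for the kind of counting that sits underneath many text analysis tools. The function name `top_terms` and the tokenizing rule (lowercase letters and straight apostrophes only) are my own simplifying assumptions.

```python
import re
from collections import Counter
from typing import List, Tuple


def top_terms(text: str, n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent words in text, with their counts."""
    # Assumption: a "word" is a run of lowercase letters or straight apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


if __name__ == "__main__":
    print(top_terms("So it goes. So it. So.", n=2))
```

    The counting is the easy part. Deciding whether a frequent word tells us anything about Vonnegut is the interpretation, and that part is yours.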

    Working in groups, you will not read Vonnegut using at least two of the different tools in Voyant. Each group will have one tool assigned to them; you will be free to pick the other. Tools that I am known to favor include TextualArc, Mandala, MicroSearch, TermsRadio, Topics, Phrases, and Bubblelines.

    By the end of the final, your group will collectively write a 200-word per-person (minimum) Google Doc. (So if you have three people in your group, you’ll write 600 words minimum, and so forth.) Make sure that you share your document with b [dot] croxall [at] gmail [dot] com. Your document will discuss patterns you’ve found and how you interpret them, with the goal of increasing our understanding of Vonnegut’s work. Your Doc should include images that you’ve created in Voyant. You will have approximately 2 hours of the final to do your exploration and complete the writing. The final 45 minutes of the exam period will be devoted to each group sharing what they’ve found, as well as a few final words from me.

    Of course, it’s important to recognize that we might not learn anything earth-shattering—or even anything—by taking this approach. That’s okay. We are, to a certain extent, just screwing around. As Rockwell puts it in his essay “What is Text Analysis, Really?”: “Playful experimentation is a pragmatic approach of trying something, seeing if you obtain interesting results” (214, emphasis added). We’re out to have fun and see if we find anything interesting along the way.

    Then we’ll all high-five each other and ride off into the sunset.

    Final Exam Groups

    Group 1 (Collocates Graph)
    • Kira
    • Rachel
    • Kellsie
    Group 2 (TextualArc)
    • Maddie
    • Sophie
    • Jill
    Group 3 (Mandala)
    • Sophia
    • Maura
    • Ellie
    Group 4 (WordTree)
    • Tatum
    • Raistlyn
    • Megan
    Group 5 (MicroSearch)
    • William
    • Serena
    • Siera
    Group 6 (Dreamscape)
    • Luci
    • Hannah
    • Kaden

    Book Assignments

    Name      Book(s)             Pages
    Megan     CC                  61-264
    Raistlyn  B                   1-200
    Kaden     B / SoT             201-318 / 141-222
    Kellsie   SoT / HP            223-260 / vii-136
    Rachel    HP / T              137-322 / xiii-xvii, 1-10
    Kira      T                   11-210
    Jill      T / CS              211-250 / 7-86
    Tatum     CS                  87-168, 173-192
    Serena    CS                  193-269, 275-292
    William   CS                  293-365, 373-398
    Maddie    CS                  399-458, 463-501
    Sophia    CS                  502-604
    Maura     CS                  605-668, 673-702
    Hannah    CS                  703-801
    Sophie    CS                  807-852, 857-907
    Luci      HBWJ / WFaG         vii-185 / 123-139
    Ellie     GBYDK / WFaG / PS   11-92 / xiii-121 / 55-72
    Siera     WFaG / PS           141-288 / xi-54
    Croxall   BoC / SHF / PS      1-52 / 171-172 / 73-300

    Book Abbreviations

    Novels
    • B = Bluebeard (1987)
    • BoC = Breakfast of Champions (1973)
    • CC = Cat’s Cradle (1963)
    • DD = Deadeye Dick (1982)
    • G = Galápagos (1985)
    • GBYMR = God Bless You, Mr. Rosewater (1965)
    • HP = Hocus Pocus (1990)
    • JB = Jailbird (1979)
    • MN = Mother Night (1961)
    • PP = Player Piano (1952)
    • SHF = Slaughterhouse-Five (1969)
    • SoT = Sirens of Titan (1959)
    • SS = Slapstick (1976)
    • T = Timequake (1997)
    Short Stories
    • CS = Complete Short Stories (2017)
    Plays
    • HBWJ = Happy Birthday, Wanda June (1970)
    Novellas
    • GBYDK = God Bless You, Dr. Kevorkian (1999)
    Nonfiction
    • FWtD = Fates Worse than Death (1991)
    • ITINWI = If This Isn’t Nice, What Is? (2013)
    • KVL = Kurt Vonnegut: Letters (2014)
    • MWaC = Man Without a Country (2005)
    • PS = Palm Sunday (1981)
    • WFaG = Wampeters, Foma & Granfalloons (1974)

    Grading

    This project, which will include the final, is worth 15% of your grade in the class. Half of those 150 points are related to your completing the scanning and OCR work on your assigned pages (again, you should not go over 5 hours). The other 75 points will be awarded based on your group’s work during the final exam. Remember, because this is an experimental class project, you are not being graded on what you and your group find about Vonnegut’s work. After all, we simply don’t know what we’ll find, if anything.

    Instead, you’ll be graded on (1) whether you accomplish all the parts of the assignment (pass/fail); (2) how engaged you are with the work; and (3) how well you apply the method of screwing around / pattern recognition / interpretation we’ve been embracing throughout the semester.

    Credits

    This assignment was designed by Brian Croxall, originally in 2014 and with Hemingway in mind, and is licensed with a Creative Commons BY (CC BY 4.0) license. Special props to Stewart Varner for telling me to stop thinking about Whitman; David Mimno and Ted Underwood for encouragement; and Paul Fyfe and Jason B. Jones for an idea that I gleefully ripped off. None of this would be possible without the fantastic resource of Voyant Tools, which Stéfan Sinclair (RIP) and Geoffrey Rockwell have developed for years amid constant pestering from me for new features.