Last week, I had the very real pleasure of attending a Word Vectors for the Thoughtful Humanist workshop. This workshop was hosted by Northeastern’s Women Writers Project (WWP) and was sponsored by a grant from the NEH’s Office of Digital Humanities. The principal instructors throughout the week were Julia Flanders and Sarah Connell, respectively the WWP Director and Assistant Director.
I have a lot that I could say about the workshop, and I hope to collect some thoughts in a few short blog posts over the coming days. The idea is that if I try to write some short thoughts rather than say everything, I might end up saying something. But I’ve got to start somewhere, and where I want to start is trying to explain word vectors to myself. I expect that this exercise will help me retain some of what I learned last week, as well as prepare me to share the methodology with my students when I teach it in January 2022 and with my colleagues in BYU’s Office of Digital Humanities when I teach them later this summer.
Put very simply, word vectors are a way of representing words as points in multi-dimensional space so that you can calculate how similar they are to one another. The algorithm (normally word2vec or GloVe) looks at a word and its neighbors within a window that the researcher sets (e.g., 5 words to either side of the key word). Each token (each individual occurrence of a word) is collapsed into a type (which is to say that “marriage” only appears once in the model), and each type gets a position in vector space. (How many dimensions? That’s another parameter the researcher sets; a few hundred is typical.) These positions start out essentially random, but as the window moves across the document and the model “reads” a word along with the other words within the window, it makes small adjustments to where those words lie in the multidimensional space. Words that appear together within the window get nudged so that they are “closer” to one another. And with the magic of negative sampling, a certain number of words that are not within the current window get nudged a bit so that they are “farther” from the word the algorithm is currently looking at. Run this process across every term in the document, repeat it over many passes, and you end up with a vector space in which every type is represented, and the words that are used near each other end up being “close” in this high-dimensional space, at least as measured by cosine similarity. (Please, please do not ask me to explain cosine similarity. I last took math in 12th grade, to my DH-loving, everlasting shame.)
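To make those knobs a little more concrete, here is a minimal, hypothetical sketch in Python using the gensim library (not necessarily what the workshop used; the toy corpus and parameter values are my own inventions, and a corpus this tiny won’t produce meaningful results). The point is just to show where the window, the number of dimensions, negative sampling, and the repeated passes show up as settings.

```python
# A hypothetical sketch of training a word2vec model with gensim (4.x).
# The tiny corpus and the parameter values are stand-ins, not anything
# from the workshop; a real corpus would have many more sentences.
from gensim.models import Word2Vec

corpus = [
    ["the", "poxy", "knave", "stole", "my", "purse"],
    ["that", "poxy", "blackguard", "stole", "my", "horse"],
    ["what", "a", "syphilitic", "fool", "he", "is"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # how many dimensions the space has
    window=5,         # 5 words to either side of the key word
    negative=5,       # negative sampling: 5 "far away" nudges per update
    sg=1,             # skip-gram rather than CBOW
    min_count=1,      # keep every type, even ones that appear once
    epochs=50,        # how many passes to make over the corpus
    seed=42,          # for (mostly) reproducible results
)

# Cosine similarity between two types in the trained space.
print(model.wv.similarity("knave", "blackguard"))
```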
But what’s really, potentially magical about this approach is not just that words like “Brian” and “Croxall” end up close to each other in vector space because they tend to appear close to one another. It’s that the method places words that are used in similar discursive spaces close to one another, even when they never appear side by side. For example, if Shakespeare calls someone a “poxy knave” and calls someone else a “poxy blackguard,” not only will “knave” and “blackguard” each be near “poxy,” but they will also be near each other. What’s even more magical is that the types that make up “syphilitic fool” will also end up being near “poxy,” “knave,” and “blackguard” because they are part of, again, similar discursive spaces.
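If you wanted to see that kind of closeness for yourself, the usual move is to ask a trained model for a word’s nearest neighbors by cosine similarity. Continuing the hypothetical gensim sketch above (the saved model file here is made up, and this would only be worth doing on a real corpus), on a Shakespeare corpus you would hope to see words like “knave” and “blackguard” somewhere near the top of the list for “poxy.”

```python
# Nearest neighbors by cosine similarity: words used in similar contexts
# surface here whether or not they ever share a window.
from gensim.models import Word2Vec

# A hypothetical model trained and saved earlier on a real corpus.
model = Word2Vec.load("shakespeare_word2vec.model")

for word, score in model.wv.most_similar("poxy", topn=10):
    print(f"{word}\t{score:.3f}")
```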
So, there you have it: an initial pass at explaining what word vectors are. In the spirit of a long-ago post by Chris Forster, please tell me where I’m wrong.
I want to acknowledge that my thinking has been informed not just by the workshop but by some of what we were asked to read to prepare. In particular, I learned a lot from the following: two posts by Ryan Heuser (“Word Vectors in the Eighteenth Century” part 1 and part 2), one by Ben Schmidt, one by Gabriel Recchia (responding to Heuser), and one by Laura Johnson (one of the team at the workshop).