Talking myself through word embeddings, part 1

Last week, I had the very real pleasure of attending a Word Vectors for the Thoughtful Humanist workshop. This workshop was hosted by Northeastern’s Women Writers Project (WWP) and was sponsored by a grant from the NEH’s Office of Digital Humanities. The principal instructors throughout the week were Julia Flanders and Sarah Connell, respectively the WWP Director and Assistant Director.

I have a lot that I could say about the workshop, and I hope to collect some thoughts in some short blog posts over the coming days. The idea is that if I try to write some short thoughts rather than say everything, I might end up saying something. But I’ve got to start somewhere, and where I want to start is trying to explain word vectors to myself. I expect that this exercise will help me retain some of what I learned last week, as well as prepare me to share this methodology with my students when I teach it in January 2022 and with my colleagues in BYU’s Office of Digital Humanities later this summer.

Put very simply, word vectors are a means to represent linguistic data in multi-dimensional space and calculate their similarity. The algorithm—normally word2vec or GloVe—looks at a word and its neighbors within a window that the researcher sets (e.g., 5 words to either side of the key word). Each token (individual word) is converted into a type (which is to say that “marriage” only appears once in the model) and then placed in vector space. The initial placements are essentially random, but as the window moves across the document and the model “reads” a word and the other words within the window, it makes small adjustments to where these words lie in the multidimensional space. (How many dimensions? As many as there are types within the document.) Words that are within the window get adjusted within the vector space so that they are “closer” to one another. And with the magic of negative sampling, a certain number of words that are not within the current window get adjusted a bit so they are “farther” from the word the algorithm is currently looking at. Run this process across every term in the document, repeat it as many times as is practical, and you end up with a vector space in which all words are represented, and the words that are used near each other end up being “close” in this high-dimensional space, at least as measured by cosine similarity. (Please, please do not ask me to explain cosine similarity. I last took math in 12th grade, to my DH-loving, everlasting shame.)
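(All right, one small concession on cosine similarity: it is just a measure of the angle between two vectors, where 1 means they point in the same direction. Here is a minimal pure-Python sketch, using tiny made-up three-dimensional “embeddings” rather than anything a real model would produce:)

```python
from math import sqrt

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 means same direction, near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# toy vectors, invented purely for illustration
knave = [0.9, 0.1, 0.3]
blackguard = [0.8, 0.2, 0.4]
tea = [0.1, 0.9, 0.0]

cosine_similarity(knave, blackguard)  # high: similar discursive company
cosine_similarity(knave, tea)         # low: different company entirely
```

Real models work the same way, just with hundreds of dimensions instead of three.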

But what’s really, potentially magical about this approach is not just that words like “Brian” and “Croxall” end up close to each other in vector space because they tend to appear close to one another. Instead, it’s the fact that this method places words that are used in similar discursive spaces close to one another. For example, if Shakespeare calls someone a “poxy knave” and calls someone else a “poxy blackguard,” not only will “knave” and “blackguard” be near “poxy,” but they will be nearer each other. What’s even more magical is that the types that make up “syphilitic fool” will also end up being near “poxy,” “knave,” and “blackguard” because they are part of, again, similar discursive spaces.

So, there you have it: an initial pass at explaining what word vectors are. In the spirit of a long-ago post by Chris Forster, please tell me where I’m wrong.

I want to acknowledge that my thinking has been informed not just by the workshop but by some of what we were asked to read to prepare. In particular, I learned a lot from the following: two posts by Ryan Heuser (“Word Vectors in the Eighteenth Century” part 1 and part 2), one by Ben Schmidt, one by Gabriel Recchia (responding to Heuser), and one by Laura Johnson (one of the team at the workshop).

2 thoughts on “Talking myself through word embeddings, part 1”

  1. I received a note that you’d linked to my post through the magic of WordPress – it’s great to hear that people are still reading and getting something out of it!

    What a great idea to reinforce your knowledge by making public summaries of this kind. Re your bolded request toward the end, I would just point out a few nitpicky things:

    – “How many dimensions? As many as there are types within the document.” — In general, the number of dimensions in a vector space doesn’t have to be the same as the number of types within the document — though it certainly can be, and some implementations of some algorithms might necessitate this. Common choices for dimensionality include the number of types within the document that appear with some minimum frequency, or some other large, semi-arbitrarily chosen number. Or if someone is trying to create a vector space that optimizes some metric, they might select whatever dimensionality optimizes that metric on a validation set.

    – Yep, counting words as co-occurring if they both appear within a window is one option; some other algorithms don’t use a sliding window but instead say that words co-occur if they appear within the same ‘document’ (where a ‘document’ could be a sentence, or a paragraph, or some other logical unit).

    – “Words that are within the window get adjusted within the vector space so that they are “closer” to one another” – while words within the window do certainly end up closer to each other in the vector space, this is almost a byproduct of the adjustment rather than its primary aim. This seems a common misconception so I thought it worth pointing out. Consider the following algorithm, which is not exactly how most algorithms for constructing vector spaces actually work, but a good enough analogy to be useful: For every type T in a text, assign to that type a set S, to which all words appearing within (say) 3 words of T are added. We then determine how close two types are in this ‘set space’ by calculating how much overlap there is in their corresponding sets. Say that the corpus contains the sentences
    “What are you doing, you loose fish, you clapped-out poxy blackguard, you beggarly, lousy, beetle-headed knave!” (h/t to a Google search for “poxy blackguard…”)
    “You scullion! You rampallian! You poxy knave!”
    Then the set for ‘poxy’ is {knave, you, rampallian, blackguard, beggarly, clapped-out, fish}; the set for ‘knave’ is {poxy, you, rampallian, beggarly, lousy, beetle-headed}; the set for ‘blackguard’ is {you, beggarly, lousy, poxy, clapped-out}. So: ‘blackguard’ and ‘knave’ are close in this space by virtue of the fact that their sets both contain ‘poxy’, ‘you’, ‘beggarly’, and ‘lousy’. But also, ‘poxy’ and ‘knave’ are close in this space (both of their sets contain ‘you’, ‘rampallian’, and ‘beggarly’), as are ‘poxy’ and ‘blackguard’ (both their sets contain ‘you’, ‘beggarly’, and ‘clapped-out’). In other words: the algorithm is designed to create a space in which words that appear in similar contexts are ‘close together’; words that actually co-occur in the same window also end up close together in that space, almost as a side effect.

    (You might have noticed that this set-based algorithm is completely ignoring the number of times (beyond the first) that two types appear in the same 3-word window, so it won’t work well with a large corpus. If only we had a different data structure capable of keeping track of this – say, some kind of vector… ;))
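(For anyone who wants to play along at home, the set-based sketch above fits in a few lines of Python. This is just an illustration, not how word2vec itself works; tokenization choices, such as lowercasing and keeping hyphenated words like “clapped-out” whole, will change the exact sets:)

```python
import re

def tokenize(sentence):
    # lowercase; keep hyphenated words ("clapped-out") as single tokens
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())

def context_sets(sentences, window=3):
    # for every type T, collect all words appearing within `window` words of T
    ctx = {}
    for sentence in sentences:
        toks = tokenize(sentence)
        for i, t in enumerate(toks):
            neighbors = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            ctx.setdefault(t, set()).update(neighbors)
    return ctx

corpus = [
    "What are you doing, you loose fish, you clapped-out poxy blackguard, "
    "you beggarly, lousy, beetle-headed knave!",
    "You scullion! You rampallian! You poxy knave!",
]

ctx = context_sets(corpus)
overlap = ctx["poxy"] & ctx["knave"]  # shared context words, e.g. 'you' and 'rampallian'
```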

  2. Thanks for the post! This is all new to me–and sounds super cool to test out on some historical datasets.
