word2vec

Word embeddings offer a way to transform text into features. Instead of using vectors of word counts, words are now represented as positions in a latent multidimensional space. These positions are the weights of an underlying neural network model in which the use of a word is predicted from the words that surround it. The idea is that words with similar weights tend to appear in similar contexts, i.e. surrounded by the same words.

word2vec is a method to compute word embeddings developed by Google. There are others (e.g. GloVe, BERT), but word2vec is quite popular and we can use pre-trained models to speed up our analysis.

Let’s see what we can do with it using the rword2vec package in R. The examples here are based on the package materials, available here.

#library(devtools)
#install_github("mukul13/rword2vec")
#install.packages("lsa")
library(rword2vec)
library(lsa)
## Loading required package: SnowballC

This is how you would train the model. Note that this chunk of code will take a LONG time, so don’t run it. There are different ways to train the model (see ?word2vec for details).

model <- word2vec(
    train_file = "text8",     # training corpus (plain text file)
    output_file = "vec.bin",  # file where the trained vectors are saved
    binary = 1,               # save output in binary format
    num_threads = 3,          # number of threads to use
    debug_mode = 1)
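
If you do want to reproduce the training step, you first need the text8 corpus. A minimal sketch for obtaining it, assuming the standard Matt Mahoney mirror (http://mattmahoney.net/dc/text8.zip) is still online:

# download and unzip the text8 corpus (~100MB of cleaned Wikipedia text)
download.file("http://mattmahoney.net/dc/text8.zip", destfile = "text8.zip")
unzip("text8.zip")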

To speed up the process, I’m providing a pre-trained model, available in the file vec.bin. We can now use it to run some analyses.

We’ll start by computing the most similar words to a specific word, where “similar” means close in the latent multidimensional space, as measured by cosine similarity.
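
As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their norms. A quick illustration with two made-up vectors (the cosine_sim helper and the numbers are just for this example, not part of rword2vec):

# cosine similarity: dot product divided by the product of the norms
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine_sim(c(1, 2, 3), c(2, 4, 6))   # parallel vectors -> 1
cosine_sim(c(1, 0, 0), c(0, 1, 0))   # orthogonal vectors -> 0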

distance(file_name = "../data/vec.bin",
        search_word = "princess",
        num = 10)
## Entered word or sentence: princess
## 
## Word: princess  Position in vocabulary: 3419
##        word              dist
## 1   consort 0.734738826751709
## 2   heiress 0.718510031700134
## 3   duchess 0.715823769569397
## 4    prince 0.703364968299866
## 5   empress 0.690687596797943
## 6   matilda 0.688317775726318
## 7     queen 0.682406425476074
## 8  isabella 0.668479681015015
## 9  countess 0.665310502052307
## 10  dowager 0.662643551826477
distance(file_name = "../data/vec.bin",
    search_word = "terrible",
    num = 10)
## Entered word or sentence: terrible
## 
## Word: terrible  Position in vocabulary: 8301
##           word              dist
## 1       sorrow 0.621069073677063
## 2     ruthless 0.616687178611755
## 3        cruel 0.611717998981476
## 4  devastating 0.606187760829926
## 5     horrific 0.599025368690491
## 6      scourge 0.595880687236786
## 7        weary 0.586524903774261
## 8   pestilence 0.584030032157898
## 9       doomed 0.584006071090698
## 10   crippling 0.581335961818695
distance(file_name = "../data/vec.bin",
    search_word = "london",
    num = 10)
## Entered word or sentence: london
## 
## Word: london  Position in vocabulary: 339
##               word              dist
## 1        edinburgh 0.672667682170868
## 2          glasgow  0.65399569272995
## 3          croydon 0.635727107524872
## 4        southwark 0.630425989627838
## 5           dublin 0.617245435714722
## 6          bristol   0.6152104139328
## 7         brighton 0.614435136318207
## 8       birmingham  0.59646064043045
## 9  buckinghamshire 0.594625115394592
## 10      manchester 0.571323156356812
distance(file_name = "../data/vec.bin",
    search_word = "uk",
    num = 10)
## Entered word or sentence: uk
## 
## Word: uk  Position in vocabulary: 532
##          word              dist
## 1   australia 0.605582296848297
## 2      canada  0.52595591545105
## 3          us 0.521789014339447
## 4         bbc 0.502693831920624
## 5      charts 0.485292196273804
## 6          bt 0.477047115564346
## 7  australian 0.470468789339066
## 8         usa 0.469096928834915
## 9      london 0.468733191490173
## 10         eu 0.443375200033188
distance(file_name = "../data/vec.bin",
    search_word = "philosophy",
    num = 10)
## Entered word or sentence: philosophy
## 
## Word: philosophy  Position in vocabulary: 603
##             word              dist
## 1    metaphysics 0.835179328918457
## 2       idealism 0.742121577262878
## 3      discourse 0.725879728794098
## 4  philosophical 0.723901093006134
## 5       theology 0.718465328216553
## 6  jurisprudence 0.717357635498047
## 7    materialism 0.716643393039703
## 8     empiricism 0.713004291057587
## 9       humanism 0.705726206302643
## 10  epistemology 0.700498759746552

As an alternative way to think of embeddings, see this cool online visualization.

Where do these similarities come from? Let’s extract the underlying word vectors.

# Extracting word vectors
bin_to_txt("../data/vec.bin", "../data/vector.txt")

And now read them in R:

library(readr)
data <- read_delim("../data/vector.txt", 
    skip=1, delim=" ",
    col_names=c("word", paste0("V", 1:100)))
## Rows: 71291 Columns: 101
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr   (1): word
## dbl (100): V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data[1:10, 1:6]
## # A tibble: 10 × 6
##    word        V1       V2       V3       V4       V5
##    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 </s>   0.00400  0.00442 -0.00383 -0.00328  0.00137
##  2 the    0.778    1.08    -0.00492  0.436   -1.73   
##  3 of    -0.249   -0.0993  -0.685    1.56    -1.30   
##  4 and   -0.735    1.06     0.604    0.0723  -0.629  
##  5 one    1.33     0.0608  -0.385   -0.503    0.0646 
##  6 in    -0.947    0.845   -0.979    1.70     0.191  
##  7 a      1.42     1.27    -1.66     0.623   -2.01   
##  8 to     1.35    -0.377   -2.09     1.16     1.25   
##  9 zero   0.541   -1.07     0.715    0.218    0.0464 
## 10 nine   2.06     0.0170  -0.602   -1.58     0.457

That’s the value of each word on each of the first five dimensions. We can plot some of these to get a better sense of what we’re working with:

plot_words <- function(words, data){
  # empty plot
  plot(0, 0, xlim=c(-2.5, 2.5), ylim=c(-2.5,2.5), type="n",
       xlab="First dimension", ylab="Second dimension")
  for (word in words){
    # extract first two dimensions
    vector <- as.numeric(data[data$word==word,2:3])
    # add to plot
    text(vector[1], vector[2], labels=word)
  }
}

plot_words(c("good", "better", "bad", "worse"), data)

plot_words(c("microsoft", "yahoo", "apple", "mango", "peach"), data)

plot_words(c("atheist", "agnostic", "catholic", "buddhist", "protestant", "christian"), data)

plot_words(c("government", "economics", "sociology", 
             "philosophy", "law", "engineering", "astrophysics",
             "biology", "physics", "chemistry"), data)

Once we have the vectors for each word, we can compute the similarity between a pair of words:

similarity <- function(word1, word2){
    lsa::cosine(
        x=as.numeric(data[data$word==word1,2:101]),
        y=as.numeric(data[data$word==word2,2:101]))
}

similarity("australia", "england")
##           [,1]
## [1,] 0.6319489
similarity("australia", "canada")
##           [,1]
## [1,] 0.6800522
similarity("australia", "apple")
##           [,1]
## [1,] 0.0300495
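
With the full matrix of vectors we could also reproduce what distance() does and find a word’s nearest neighbors ourselves. A rough sketch (the nearest_words function is written for this illustration, not part of rword2vec); its output should roughly match the distance() results above:

# find the num closest words to a target word using the extracted vectors
nearest_words <- function(target, data, num = 10){
  mat <- as.matrix(data[, 2:101])
  # normalize rows so the dot product equals the cosine similarity
  mat <- mat / sqrt(rowSums(mat^2))
  target_vec <- mat[data$word == target, , drop = FALSE]
  sims <- as.vector(mat %*% t(target_vec))
  names(sims) <- data$word
  # drop the word itself and return the top matches
  sort(sims[names(sims) != target], decreasing = TRUE)[1:num]
}

nearest_words("princess", data)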

The final function provided by the package is word_analogy, which helps us find regularities in the word vector space:

word_analogy(file_name = "../data/vec.bin",
    search_words = "king queen man",
    num = 1)
## 
## Word: king  Position in vocabulary: 187
## 
## Word: queen  Position in vocabulary: 903
## 
## Word: man  Position in vocabulary: 243
##    word              dist
## 1 woman 0.670807123184204
word_analogy(file_name = "../data/vec.bin",
    search_words = "paris france berlin",
    num = 1)
## 
## Word: paris  Position in vocabulary: 1055
## 
## Word: france  Position in vocabulary: 303
## 
## Word: berlin  Position in vocabulary: 1360
##      word              dist
## 1 germany 0.818466305732727
word_analogy(file_name = "../data/vec.bin",
    search_words = "man woman uncle",
    num = 2)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: woman  Position in vocabulary: 1012
## 
## Word: uncle  Position in vocabulary: 4206
##    word              dist
## 1 niece 0.729662358760834
## 2  aunt 0.729258477687836
word_analogy(file_name = "../data/vec.bin",
    search_words = "building architect software",
    num = 1)
## 
## Word: building  Position in vocabulary: 672
## 
## Word: architect  Position in vocabulary: 3366
## 
## Word: software  Position in vocabulary: 404
##         word              dist
## 1 programmer 0.584205448627472
word_analogy(file_name = "../data/vec.bin",
    search_words = "man actor woman",
    num = 5)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: actor  Position in vocabulary: 461
## 
## Word: woman  Position in vocabulary: 1012
##          word              dist
## 1     actress 0.815776824951172
## 2      singer 0.705898344516754
## 3  comedienne 0.665390908718109
## 4  playwright 0.655908346176147
## 5 entertainer 0.655762135982513
word_analogy(file_name = "../data/vec.bin",
    search_words = "france paris uk",
    num = 1)
## 
## Word: france  Position in vocabulary: 303
## 
## Word: paris  Position in vocabulary: 1055
## 
## Word: uk  Position in vocabulary: 532
##     word              dist
## 1 london 0.532313704490662
word_analogy(file_name = "../data/vec.bin",
    search_words = "up down inside",
    num = 2)
## 
## Word: up  Position in vocabulary: 98
## 
## Word: down  Position in vocabulary: 310
## 
## Word: inside  Position in vocabulary: 1319
##      word              dist
## 1 beneath 0.573975384235382
## 2 outside 0.570115745067596
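
Under the hood these analogies are just vector arithmetic on the embeddings: for “king queen man”, we look for the word closest to vec(queen) - vec(king) + vec(man). Here is a sketch of that computation with our extracted vectors (the analogy function is an illustration, not the package’s exact implementation):

# analogy via vector arithmetic: a is to b as c is to ?
analogy <- function(a, b, c, data, num = 1){
  mat <- as.matrix(data[, 2:101])
  mat <- mat / sqrt(rowSums(mat^2))            # unit-length rows
  rownames(mat) <- data$word
  target <- mat[b, ] - mat[a, ] + mat[c, ]     # b - a + c
  sims <- as.vector(mat %*% target)
  names(sims) <- data$word
  sims <- sims[!names(sims) %in% c(a, b, c)]   # exclude the input words
  sort(sims, decreasing = TRUE)[1:num]
}

analogy("king", "queen", "man", data)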

And we can see some examples of algorithmic bias (but really, bias in the training data):

word_analogy(file_name = "../data/vec.bin",
    search_words = "man woman professor",
    num = 1)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: woman  Position in vocabulary: 1012
## 
## Word: professor  Position in vocabulary: 1750
##       word              dist
## 1 lecturer 0.671598970890045
word_analogy(file_name = "../data/vec.bin",
    search_words = "man doctor woman",
    num = 1)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: doctor  Position in vocabulary: 1907
## 
## Word: woman  Position in vocabulary: 1012
##    word              dist
## 1 nurse 0.520112752914429