Word embeddings offer a way to transform text into features. Instead of using vectors of word counts, words are now represented as positions in a latent multidimensional space. These positions are the weights of an underlying neural network model in which the use of each word is predicted from the words that surround it. The idea is that words with similar weights tend to appear surrounded by the same words.
word2vec is a method to compute word embeddings developed by Google. There are others (e.g. GloVe, BERT, etc.), but word2vec is quite popular and we can use pre-trained models to speed up our analysis.
Let’s see what we can do with it using the rword2vec package in R. The examples here are based on the package materials, available on its GitHub repository (mukul13/rword2vec).
# uncomment these lines to install the packages the first time:
#library(devtools)
#install_github("mukul13/rword2vec")
#install.packages("lsa")
library(rword2vec)
library(lsa)
## Loading required package: SnowballC
This is how you would train the model. Note that this chunk of code would take a LONG time to run, so don’t run it. There are different ways to train the model (see ?word2vec for the details).
model <- word2vec(
  train_file = "text8",     # training corpus
  output_file = "vec.bin",  # file where the trained vectors are saved
  binary = 1,               # save the vectors in binary format
  num_threads = 3,          # number of threads to use during training
  debug_mode = 1)           # print progress information while training
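If you did want to train the model yourself, you would first need the text8 file, a cleaned-up sample of English Wikipedia text. A quick way to get it (assuming the usual source URL is still available) is:
# only needed if you want to train from scratch: download and unzip the
# text8 corpus (~100MB uncompressed) into the working directory
download.file("http://mattmahoney.net/dc/text8.zip", destfile = "text8.zip")
unzip("text8.zip")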
To speed up the process, I’m providing a pre-trained model, available in the file vec.bin. We can now use it to run some analyses.
We’ll start by computing the most similar words to a specific word, where similar means how close they are in the latent multidimensional space (their cosine similarity).
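As a quick reminder of what cosine similarity is, here is the formula applied to two made-up three-dimensional vectors (the numbers are purely illustrative):
# cosine similarity = dot product divided by the product of the vector norms;
# values close to 1 mean the two vectors point in very similar directions
x <- c(0.2, -1.0, 0.5)
y <- c(0.1, -0.8, 0.9)
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))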
distance(file_name = "../data/vec.bin",
search_word = "princess",
num = 10)
## Entered word or sentence: princess
##
## Word: princess Position in vocabulary: 3419
## word dist
## 1 consort 0.734738826751709
## 2 heiress 0.718510031700134
## 3 duchess 0.715823769569397
## 4 prince 0.703364968299866
## 5 empress 0.690687596797943
## 6 matilda 0.688317775726318
## 7 queen 0.682406425476074
## 8 isabella 0.668479681015015
## 9 countess 0.665310502052307
## 10 dowager 0.662643551826477
distance(file_name = "../data/vec.bin",
search_word = "terrible",
num = 10)
## Entered word or sentence: terrible
##
## Word: terrible Position in vocabulary: 8301
## word dist
## 1 sorrow 0.621069073677063
## 2 ruthless 0.616687178611755
## 3 cruel 0.611717998981476
## 4 devastating 0.606187760829926
## 5 horrific 0.599025368690491
## 6 scourge 0.595880687236786
## 7 weary 0.586524903774261
## 8 pestilence 0.584030032157898
## 9 doomed 0.584006071090698
## 10 crippling 0.581335961818695
distance(file_name = "../data/vec.bin",
search_word = "london",
num = 10)
## Entered word or sentence: london
##
## Word: london Position in vocabulary: 339
## word dist
## 1 edinburgh 0.672667682170868
## 2 glasgow 0.65399569272995
## 3 croydon 0.635727107524872
## 4 southwark 0.630425989627838
## 5 dublin 0.617245435714722
## 6 bristol 0.6152104139328
## 7 brighton 0.614435136318207
## 8 birmingham 0.59646064043045
## 9 buckinghamshire 0.594625115394592
## 10 manchester 0.571323156356812
distance(file_name = "../data/vec.bin",
search_word = "uk",
num = 10)
## Entered word or sentence: uk
##
## Word: uk Position in vocabulary: 532
## word dist
## 1 australia 0.605582296848297
## 2 canada 0.52595591545105
## 3 us 0.521789014339447
## 4 bbc 0.502693831920624
## 5 charts 0.485292196273804
## 6 bt 0.477047115564346
## 7 australian 0.470468789339066
## 8 usa 0.469096928834915
## 9 london 0.468733191490173
## 10 eu 0.443375200033188
distance(file_name = "../data/vec.bin",
search_word = "philosophy",
num = 10)
## Entered word or sentence: philosophy
##
## Word: philosophy Position in vocabulary: 603
## word dist
## 1 metaphysics 0.835179328918457
## 2 idealism 0.742121577262878
## 3 discourse 0.725879728794098
## 4 philosophical 0.723901093006134
## 5 theology 0.718465328216553
## 6 jurisprudence 0.717357635498047
## 7 materialism 0.716643393039703
## 8 empiricism 0.713004291057587
## 9 humanism 0.705726206302643
## 10 epistemology 0.700498759746552
As an alternative way to think of embeddings, see this cool online visualization.
Where do these similarities come from? Let’s extract the underlying word vectors.
# Extracting word vectors
bin_to_txt("../data/vec.bin", "../data/vector.txt")
And now read them into R:
library(readr)
data <- read_delim("../data/vector.txt",
skip=1, delim=" ",
col_names=c("word", paste0("V", 1:100)))
## Rows: 71291 Columns: 101
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (1): word
## dbl (100): V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data[1:10, 1:6]
## # A tibble: 10 × 6
## word V1 V2 V3 V4 V5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 </s> 0.00400 0.00442 -0.00383 -0.00328 0.00137
## 2 the 0.778 1.08 -0.00492 0.436 -1.73
## 3 of -0.249 -0.0993 -0.685 1.56 -1.30
## 4 and -0.735 1.06 0.604 0.0723 -0.629
## 5 one 1.33 0.0608 -0.385 -0.503 0.0646
## 6 in -0.947 0.845 -0.979 1.70 0.191
## 7 a 1.42 1.27 -1.66 0.623 -2.01
## 8 to 1.35 -0.377 -2.09 1.16 1.25
## 9 zero 0.541 -1.07 0.715 0.218 0.0464
## 10 nine 2.06 0.0170 -0.602 -1.58 0.457
Those are the values of the first few words on each of the first five dimensions. We can plot some words on the first two dimensions to get a better sense of what we’re working with:
plot_words <- function(words, data){
# empty plot
plot(0, 0, xlim=c(-2.5, 2.5), ylim=c(-2.5,2.5), type="n",
xlab="First dimension", ylab="Second dimension")
for (word in words){
# extract first two dimensions
vector <- as.numeric(data[data$word==word,2:3])
# add to plot
text(vector[1], vector[2], labels=word)
}
}
plot_words(c("good", "better", "bad", "worse"), data)
plot_words(c("microsoft", "yahoo", "apple", "mango", "peach"), data)
plot_words(c("atheist", "agnostic", "catholic", "buddhist", "protestant", "christian"), data)
plot_words(c("government", "economics", "sociology",
"philosophy", "law", "engineering", "astrophysics",
"biology", "physics", "chemistry"), data)
Once we have the vectors for each word, we can compute the similarity between a pair of words:
similarity <- function(word1, word2){
lsa::cosine(
x=as.numeric(data[data$word==word1,2:101]),
y=as.numeric(data[data$word==word2,2:101]))
}
similarity("australia", "england")
## [,1]
## [1,] 0.6319489
similarity("australia", "canada")
## [,1]
## [1,] 0.6800522
similarity("australia", "apple")
## [,1]
## [1,] 0.0300495
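Since the dist column returned by distance() is also a cosine similarity, this manual computation should line up with the results we obtained earlier; for instance, the following call should return a value close to the 0.73 reported for “princess” and “consort” above:
# sanity check: should roughly reproduce the first value returned by
# distance(search_word = "princess") above
similarity("princess", "consort")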
The final function provided by the package is word_analogy, which helps us find regularities in the word vector space:
word_analogy(file_name = "../data/vec.bin",
search_words = "king queen man",
num = 1)
##
## Word: king Position in vocabulary: 187
##
## Word: queen Position in vocabulary: 903
##
## Word: man Position in vocabulary: 243
## word dist
## 1 woman 0.670807123184204
word_analogy(file_name = "../data/vec.bin",
search_words = "paris france berlin",
num = 1)
##
## Word: paris Position in vocabulary: 1055
##
## Word: france Position in vocabulary: 303
##
## Word: berlin Position in vocabulary: 1360
## word dist
## 1 germany 0.818466305732727
word_analogy(file_name = "../data/vec.bin",
search_words = "man woman uncle",
num = 2)
##
## Word: man Position in vocabulary: 243
##
## Word: woman Position in vocabulary: 1012
##
## Word: uncle Position in vocabulary: 4206
## word dist
## 1 niece 0.729662358760834
## 2 aunt 0.729258477687836
word_analogy(file_name = "../data/vec.bin",
search_words = "building architect software",
num = 1)
##
## Word: building Position in vocabulary: 672
##
## Word: architect Position in vocabulary: 3366
##
## Word: software Position in vocabulary: 404
## word dist
## 1 programmer 0.584205448627472
word_analogy(file_name = "../data/vec.bin",
search_words = "man actor woman",
num = 5)
##
## Word: man Position in vocabulary: 243
##
## Word: actor Position in vocabulary: 461
##
## Word: woman Position in vocabulary: 1012
## word dist
## 1 actress 0.815776824951172
## 2 singer 0.705898344516754
## 3 comedienne 0.665390908718109
## 4 playwright 0.655908346176147
## 5 entertainer 0.655762135982513
word_analogy(file_name = "../data/vec.bin",
search_words = "france paris uk",
num = 1)
##
## Word: france Position in vocabulary: 303
##
## Word: paris Position in vocabulary: 1055
##
## Word: uk Position in vocabulary: 532
## word dist
## 1 london 0.532313704490662
word_analogy(file_name = "../data/vec.bin",
search_words = "up down inside",
num = 2)
##
## Word: up Position in vocabulary: 98
##
## Word: down Position in vocabulary: 310
##
## Word: inside Position in vocabulary: 1319
## word dist
## 1 beneath 0.573975384235382
## 2 outside 0.570115745067596
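Under the hood, these analogies are just vector arithmetic: take the vector for queen, subtract the vector for king, add the vector for man, and look for the word whose vector is closest to the result. Here is a minimal sketch using the vectors we already read into data (word_analogy appears to normalize the vectors before combining them, so the exact scores may differ slightly):
# small helper to pull out the 100-dimensional vector for a word
vec <- function(w) as.numeric(data[data$word == w, 2:101])
# queen - king + man should land near "woman"
target <- vec("queen") - vec("king") + vec("man")
# cosine similarity between the target vector and every word in the vocabulary
mat <- as.matrix(data[, 2:101])
sims <- as.vector(mat %*% target) / (sqrt(rowSums(mat^2)) * sqrt(sum(target^2)))
results <- data.frame(word = data$word, dist = sims)
# drop the three input words and look at the closest remaining matches
results <- results[!results$word %in% c("queen", "king", "man"), ]
head(results[order(-results$dist), ], 3)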
And we can see some examples of algorithmic bias (but really, bias in the training data):
word_analogy(file_name = "../data/vec.bin",
search_words = "man woman professor",
num = 1)
##
## Word: man Position in vocabulary: 243
##
## Word: woman Position in vocabulary: 1012
##
## Word: professor Position in vocabulary: 1750
## word dist
## 1 lecturer 0.671598970890045
word_analogy(file_name = "../data/vec.bin",
search_words = "man doctor woman",
num = 1)
##
## Word: man Position in vocabulary: 243
##
## Word: doctor Position in vocabulary: 1907
##
## Word: woman Position in vocabulary: 1012
## word dist
## 1 nurse 0.520112752914429
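Finally, coming back to the opening point that embeddings transform text into features: a simple (if crude) way to use them downstream is to represent each document by the average of its word vectors. A minimal sketch with a toy “document” and the data object we loaded above (the tokenization here is deliberately naive):
# represent a short toy document as the average of its word vectors
doc <- c("the", "queen", "visited", "london")
doc_vectors <- data[data$word %in% doc, 2:101]
doc_embedding <- colMeans(doc_vectors)
length(doc_embedding) # 100 features that could be fed into any classifier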