Word embeddings are a way to transform text into features. Instead of using vectors of word counts, words are now represented as positions in a latent multidimensional space. These positions are weights from an underlying neural network model in which the use of each word is predicted from the words that surround it. The idea is that words with similar weights are likely to be used surrounded by the same words.
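To make the idea of predicting words from their neighbours more concrete, here is a minimal sketch of the (target, context) pairs that a skip-gram style model, one of the word2vec training schemes, would learn from. The function name context_pairs, the example sentence and the window size of 2 are assumptions made for illustration; this is not code from rword2vec.

```r
# Toy illustration (not rword2vec code): list the (target, context) pairs
# found within a window of `window` words around each position.
context_pairs <- function(tokens, window = 2) {
  out <- list()
  for (i in seq_along(tokens)) {
    # positions within `window` words of position i, excluding i itself
    idx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
    for (j in idx) {
      out[[length(out) + 1]] <- c(target = tokens[i], context = tokens[j])
    }
  }
  as.data.frame(do.call(rbind, out), stringsAsFactors = FALSE)
}

tokens <- strsplit("the queen spoke to the king", " ")[[1]]
pair_df <- context_pairs(tokens, window = 2)
head(pair_df)
```

Words that keep appearing with the same context words end up with similar weights, which is what the similarity queries later in this section rely on.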
word2vec is a method to compute word embeddings developed by Google. There are others (e.g. GloVe), but it is quite popular and we can use pre-trained models to speed up our analysis.
Let’s see what we can do with it using the rword2vec package in R. The examples here are based on the package materials, available here.
library(rword2vec)
library(lsa)
## Loading required package: SnowballC
This is how you would train the model. Note that this chunk of code takes a long time to run, so don’t run it. There are different ways to train the model (see ?word2vec for details).
model <- word2vec(
train_file = "text8",
output_file = "vec.bin",
binary=1,
num_threads=3,
debug_mode=1)
To speed up the process, I’m providing a pre-trained model, available in the file vec.bin. We can now use it to run some analyses.
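We’ll start by computing the most similar words to a specific word, where similar means how close they are in the latent multidimensional space. Despite its name, the dist column in the output below behaves as a similarity score: higher values mean closer words.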
distance(file_name = "vec.bin",
search_word = "princess",
num = 10)
## Entered word or sentence: princess
##
## Word: princess Position in vocabulary: 3419
## word dist
## 1 consort 0.734738826751709
## 2 heiress 0.718510031700134
## 3 duchess 0.715823769569397
## 4 prince 0.703364968299866
## 5 empress 0.690687596797943
## 6 matilda 0.688317775726318
## 7 queen 0.682406425476074
## 8 isabella 0.668479681015015
## 9 countess 0.665310502052307
## 10 dowager 0.662643551826477
distance(file_name = "vec.bin",
search_word = "terrible",
num = 10)
## Entered word or sentence: terrible
##
## Word: terrible Position in vocabulary: 8301
## word dist
## 1 sorrow 0.621069073677063
## 2 ruthless 0.616687178611755
## 3 cruel 0.611717998981476
## 4 devastating 0.606187760829926
## 5 horrific 0.599025368690491
## 6 scourge 0.595880687236786
## 7 weary 0.586524903774261
## 8 pestilence 0.584030032157898
## 9 doomed 0.584006071090698
## 10 crippling 0.581335961818695
distance(file_name = "vec.bin",
search_word = "london",
num = 10)
## Entered word or sentence: london
##
## Word: london Position in vocabulary: 339
## word dist
## 1 edinburgh 0.672667682170868
## 2 glasgow 0.65399569272995
## 3 croydon 0.635727107524872
## 4 southwark 0.630425989627838
## 5 dublin 0.617245435714722
## 6 bristol 0.6152104139328
## 7 brighton 0.614435136318207
## 8 birmingham 0.59646064043045
## 9 buckinghamshire 0.594625115394592
## 10 manchester 0.571323156356812
distance(file_name = "vec.bin",
search_word = "uk",
num = 10)
## Entered word or sentence: uk
##
## Word: uk Position in vocabulary: 532
## word dist
## 1 australia 0.605582296848297
## 2 canada 0.52595591545105
## 3 us 0.521789014339447
## 4 bbc 0.502693831920624
## 5 charts 0.485292196273804
## 6 bt 0.477047115564346
## 7 australian 0.470468789339066
## 8 usa 0.469096928834915
## 9 london 0.468733191490173
## 10 eu 0.443375200033188
distance(file_name = "vec.bin",
search_word = "philosophy",
num = 10)
## Entered word or sentence: philosophy
##
## Word: philosophy Position in vocabulary: 603
## word dist
## 1 metaphysics 0.835179328918457
## 2 idealism 0.742121577262878
## 3 discourse 0.725879728794098
## 4 philosophical 0.723901093006134
## 5 theology 0.718465328216553
## 6 jurisprudence 0.717357635498047
## 7 materialism 0.716643393039703
## 8 empiricism 0.713004291057587
## 9 humanism 0.705726206302643
## 10 epistemology 0.700498759746552
Where do these similarities come from? Let’s extract the underlying word vectors.
# Extracting word vectors
bin_to_txt("vec.bin", "vector.txt")
## $rfile_name
## [1] "vec.bin"
##
## $routput_file
## [1] "vector.txt"
library(readr)
data <- read_delim("vector.txt",
skip=1, delim=" ",
col_names=c("word", paste0("V", 1:100)))
## Parsed with column specification:
## cols(
## .default = col_double(),
## word = col_character()
## )
## See spec(...) for full column specifications.
data[1:10, 1:6]
## # A tibble: 10 x 6
## word V1 V2 V3 V4 V5
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 </s> 0.00400 0.00442 -0.00383 -0.00328 0.00137
## 2 the 0.778 1.08 -0.00492 0.436 -1.73
## 3 of -0.249 -0.0993 -0.685 1.56 -1.30
## 4 and -0.735 1.06 0.604 0.0723 -0.629
## 5 one 1.33 0.0608 -0.385 -0.503 0.0646
## 6 in -0.947 0.845 -0.979 1.70 0.191
## 7 a 1.42 1.27 -1.66 0.623 -2.01
## 8 to 1.35 -0.377 -2.09 1.16 1.25
## 9 zero 0.541 -1.07 0.715 0.218 0.0464
## 10 nine 2.06 0.0170 -0.602 -1.58 0.457
That’s the value of each word on each of the first five dimensions. We can plot some of these words on the first two dimensions to better understand what we’re working with:
plot_words <- function(words, data){
# empty plot
plot(0, 0, xlim=c(-2.5, 2.5), ylim=c(-2.5,2.5), type="n",
xlab="First dimension", ylab="Second dimension")
for (word in words){
# extract first two dimensions
vector <- as.numeric(data[data$word==word,2:3])
# add to plot
text(vector[1], vector[2], labels=word)
}
}
plot_words(c("good", "better", "bad", "worse"), data)
plot_words(c("microsoft", "yahoo", "apple", "mango", "peach"), data)
plot_words(c("atheist", "agnostic", "catholic", "buddhist", "protestant", "christian"), data)
plot_words(c("government", "economics", "sociology",
"philosophy", "law", "engineering", "astrophysics",
"biology", "physics", "chemistry"), data)
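The plots above use only the first two of the 100 dimensions, which are not ordered in any meaningful way. One possible alternative, sketched here with simulated vectors rather than the real embeddings, is to project the selected words onto their first two principal components with prcomp. The matrix emb is a stand-in; with the real data it could presumably be built from the numeric columns of data for the chosen words.

```r
# Sketch with simulated data: `emb` stands in for the 100 numeric columns
# of `data` for a handful of words (one row per word).
set.seed(1)
words <- c("good", "better", "bad", "worse")
emb <- matrix(rnorm(length(words) * 100), nrow = length(words),
              dimnames = list(words, paste0("V", 1:100)))

# principal components of the selected word vectors
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# empty plot, then add each word at its first two component scores
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords[, 1], coords[, 2], labels = rownames(coords))
```

Because the components are computed only from the words passed in, the picture changes with the word list, so this is a display aid rather than a property of the embedding space.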
Once we have the vectors for each word, we can compute the similarity between a pair of words:
similarity <- function(word1, word2){
lsa::cosine(
x=as.numeric(data[data$word==word1,2:101]),
y=as.numeric(data[data$word==word2,2:101]))
}
similarity("australia", "england")
## [,1]
## [1,] 0.6319489
similarity("australia", "canada")
## [,1]
## [1,] 0.6800522
similarity("australia", "apple")
## [,1]
## [1,] 0.0300495
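For reference, the cosine similarity that lsa::cosine returns for two vectors can be written in a few lines of base R. The sketch below uses a small made-up embedding matrix (the values are invented, not taken from vec.bin), and the helper names cosine_sim and nearest_words are hypothetical. It also shows how a ranking like the one printed by distance could be reproduced from a matrix of vectors.

```r
# cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# rank all other words by their cosine similarity to `word`
nearest_words <- function(word, emb, n = 3) {
  sims <- apply(emb, 1, cosine_sim, y = emb[word, ])
  sims <- sims[names(sims) != word]
  head(sort(sims, decreasing = TRUE), n)
}

# invented three-dimensional vectors, for illustration only
emb <- rbind(
  australia = c(1.0, 0.9, 0.1),
  canada    = c(0.9, 1.0, 0.2),
  england   = c(0.8, 0.6, 0.5),
  apple     = c(-0.1, 0.2, 1.0)
)
nearest_words("australia", emb, n = 2)
```

Cosine similarity depends only on the angle between two vectors, not on their length, so frequent words with large weights are not automatically closer to everything else.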
The final function provided by the package is word_analogy, which helps us find regularities in the word vector space:
word_analogy(file_name = "vec.bin",
search_words = "king queen man",
num = 1)
##
## Word: king Position in vocabulary: 187
##
## Word: queen Position in vocabulary: 903
##
## Word: man Position in vocabulary: 243
## word dist
## 1 woman 0.670807123184204
word_analogy(file_name = "vec.bin",
search_words = "paris france berlin",
num = 1)
##
## Word: paris Position in vocabulary: 1055
##
## Word: france Position in vocabulary: 303
##
## Word: berlin Position in vocabulary: 1360
## word dist
## 1 germany 0.818466305732727
word_analogy(file_name = "vec.bin",
search_words = "man woman uncle",
num = 2)
##
## Word: man Position in vocabulary: 243
##
## Word: woman Position in vocabulary: 1012
##
## Word: uncle Position in vocabulary: 4206
## word dist
## 1 niece 0.729662358760834
## 2 aunt 0.729258477687836
word_analogy(file_name = "vec.bin",
search_words = "building architect software",
num = 1)
##
## Word: building Position in vocabulary: 672
##
## Word: architect Position in vocabulary: 3366
##
## Word: software Position in vocabulary: 404
## word dist
## 1 programmer 0.584205448627472
word_analogy(file_name = "vec.bin",
search_words = "man actor woman",
num = 5)
##
## Word: man Position in vocabulary: 243
##
## Word: actor Position in vocabulary: 461
##
## Word: woman Position in vocabulary: 1012
## word dist
## 1 actress 0.815776824951172
## 2 singer 0.705898344516754
## 3 comedienne 0.665390908718109
## 4 playwright 0.655908346176147
## 5 entertainer 0.655762135982513
word_analogy(file_name = "vec.bin",
search_words = "france paris uk",
num = 1)
##
## Word: france Position in vocabulary: 303
##
## Word: paris Position in vocabulary: 1055
##
## Word: uk Position in vocabulary: 532
## word dist
## 1 london 0.532313704490662
word_analogy(file_name = "vec.bin",
search_words = "up down inside",
num = 2)
##
## Word: up Position in vocabulary: 98
##
## Word: down Position in vocabulary: 310
##
## Word: inside Position in vocabulary: 1319
## word dist
## 1 beneath 0.573975384235382
## 2 outside 0.570115745067596
And we can see some examples of algorithmic bias (but really, bias in the training data):
word_analogy(file_name = "vec.bin",
search_words = "man woman professor",
num = 1)
##
## Word: man Position in vocabulary: 243
##
## Word: woman Position in vocabulary: 1012
##
## Word: professor Position in vocabulary: 1750
## word dist
## 1 lecturer 0.671598970890045
word_analogy(file_name = "vec.bin",
search_words = "man doctor woman",
num = 1)
##
## Word: man Position in vocabulary: 243
##
## Word: doctor Position in vocabulary: 1907
##
## Word: woman Position in vocabulary: 1012
## word dist
## 1 nurse 0.520112752914429
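The intuition behind these analogy queries is vector arithmetic: for the words "a b c", take the vector for b, subtract the vector for a, add the vector for c, and look for the word closest to the result. The sketch below illustrates this with invented three-dimensional vectors; the function analogy is hypothetical, and the package's own implementation may differ in details such as how vectors are normalised.

```r
# invented vectors, for illustration only
emb <- rbind(
  king  = c(1, 1, 0),
  queen = c(1, 0, 1),
  man   = c(0, 1, 0),
  woman = c(0, 0, 1),
  apple = c(-1, 0.5, 0.5)
)

# "w1 is to w2 as w3 is to ?": nearest word to emb[w2] - emb[w1] + emb[w3]
analogy <- function(w1, w2, w3, emb) {
  target <- emb[w2, ] - emb[w1, ] + emb[w3, ]
  # cosine similarity between every word vector and the target vector
  sims <- apply(emb, 1, function(v) {
    sum(v * target) / sqrt(sum(v^2) * sum(target^2))
  })
  # the three query words are excluded from the candidates
  sims <- sims[!names(sims) %in% c(w1, w2, w3)]
  names(which.max(sims))
}

analogy("king", "queen", "man", emb)
```

Seen this way, the biased results above are not surprising: if the training text places "nurse" closer to "woman" than to "man", the arithmetic simply reproduces that pattern.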
Beyond this type of exploratory analysis, word embeddings can be very useful in analyses of large-scale text corpora in two different ways: to expand existing dictionaries and as a way to build features for a supervised learning classifier. The code below shows how to expand a dictionary of uncivil words. By looking for other words with semantic similarity to each of these terms, we can identify words that we may not have thought of in the first place, either because they’re slang, new words or just misspellings of existing words.
Here we will use a different set of pre-trained word embeddings, which were computed on a large corpus of public Facebook posts on the pages of US Members of Congress that we collected from the Graph API.
distance(file_name = "FBvec.bin",
search_word = "liberal",
num = 10)
## Entered word or sentence: liberal
##
## Word: liberal Position in vocabulary: 428
## word dist
## 1 leftist 0.875029563903809
## 2 lefty 0.808053195476532
## 3 lib 0.774020493030548
## 4 rightwing 0.768333077430725
## 5 progressive 0.766966998577118
## 6 left-wing 0.74224179983139
## 7 statist 0.741962492465973
## 8 right-wing 0.740352988243103
## 9 far-left 0.733825862407684
## 10 leftwing 0.715518414974213
distance(file_name = "FBvec.bin",
search_word = "crooked",
num = 10)
## Entered word or sentence: crooked
##
## Word: crooked Position in vocabulary: 2225
## word dist
## 1 corrupt 0.782054841518402
## 2 thieving 0.683514535427094
## 3 slimy 0.675886511802673
## 4 teflon 0.669225692749023
## 5 crook 0.660020768642426
## 6 corupt 0.651829242706299
## 7 dishonest 0.645328283309937
## 8 conniving 0.636701285839081
## 9 corporatist 0.629674255847931
## 10 untrustworthy 0.623017013072968
distance(file_name = "FBvec.bin",
search_word = "libtard",
num = 10)
## Entered word or sentence: libtard
##
## Word: libtard Position in vocabulary: 5753
## word dist
## 1 lib 0.798957586288452
## 2 lefty 0.771853387355804
## 3 libturd 0.762575328350067
## 4 teabagger 0.744283258914948
## 5 teabilly 0.715277075767517
## 6 liberal 0.709996342658997
## 7 retard 0.690707504749298
## 8 dumbass 0.690422177314758
## 9 rwnj 0.684058785438538
## 10 republitard 0.678197801113129
distance(file_name = "FBvec.bin",
search_word = "douchebag",
num = 10)
## Entered word or sentence: douchebag
##
## Word: douchebag Position in vocabulary: 9781
## word dist
## 1 scumbag 0.808189928531647
## 2 moron 0.80128538608551
## 3 hypocrite 0.787607729434967
## 4 jackass 0.783857941627502
## 5 shitbag 0.773443937301636
## 6 pos 0.76619291305542
## 7 dipshit 0.757693469524384
## 8 loser 0.756536900997162
## 9 coward 0.755453526973724
## 10 poser 0.750370919704437
distance(file_name = "FBvec.bin",
search_word = "idiot",
num = 10)
## Entered word or sentence: idiot
##
## Word: idiot Position in vocabulary: 646
## word dist
## 1 imbecile 0.867565214633942
## 2 asshole 0.848560094833374
## 3 moron 0.781079053878784
## 4 asshat 0.772150039672852
## 5 a-hole 0.765781462192535
## 6 ahole 0.760824918746948
## 7 asswipe 0.742586553096771
## 8 ignoramus 0.735219776630402
## 9 arsehole 0.732272684574127
## 10 idoit 0.720151424407959
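The queries above can be turned into a simple dictionary-expansion routine: start from a list of seed words and keep every word whose similarity to at least one seed is high enough. The following is a minimal sketch on invented vectors; the function expand_dictionary and the 0.9 threshold are assumptions for illustration, and in practice the candidate words would still need to be reviewed by hand.

```r
# invented vectors; in practice these would come from the trained embeddings
emb <- rbind(
  idiot   = c(1.0, 0.1, 0.0),
  moron   = c(0.9, 0.2, 0.1),
  idoit   = c(0.95, 0.05, 0.05),
  crooked = c(0.1, 1.0, 0.0),
  corrupt = c(0.2, 0.9, 0.1),
  budget  = c(0.0, 0.1, 1.0)
)

expand_dictionary <- function(seeds, emb, min_sim = 0.9) {
  # scale each row to unit length so that dot products are cosine similarities
  unit <- emb / sqrt(rowSums(emb^2))
  sims <- unit %*% t(unit[seeds, , drop = FALSE])
  # keep words whose similarity to at least one seed reaches the threshold
  keep <- rownames(sims)[apply(sims, 1, max) >= min_sim]
  sort(union(seeds, keep))
}

expand_dictionary(c("idiot", "crooked"), emb)
```

A threshold on similarity, rather than a fixed number of neighbours per seed, lets common seeds contribute many candidates and rare ones few, but the right cut-off depends on the corpus.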
We can also take the embeddings themselves as features at the word level and then aggregate them to the document level, as an alternative or complement to bag-of-words approaches.
Let’s see an example with the data we used in our most recent challenge:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
fb <- read.csv("~/data/incivility.csv", stringsAsFactors = FALSE)
fbcorpus <- corpus(fb$comment_message)
fbdfm <- dfm(fbcorpus, remove=stopwords("english"), verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,043 documents, 12,086 features
## ... removed 166 features
## ... created a 3,043 x 11,920 sparse dfm
## ... complete.
## Elapsed time: 0.521 seconds.
fbdfm <- dfm_trim(fbdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 documents: 6,444
## Total features removed: 6,444 (54.1%).
First, we will convert the word embeddings to a data frame, and then we will match the features from each document with their corresponding embeddings.
bin_to_txt("FBvec.bin", "FBvector.txt")
## $rfile_name
## [1] "FBvec.bin"
##
## $routput_file
## [1] "FBvector.txt"
# extracting word embeddings for words in corpus
w2v <- readr::read_delim("FBvector.txt",
skip=1, delim=" ", quote="",
col_names=c("word", paste0("V", 1:100)))
## Parsed with column specification:
## cols(
## .default = col_double(),
## word = col_character()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 2)
## Warning: 1 parsing failure.
## row # A tibble: 1 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 107385 <NA> 101 columns 46 columns 'FBvector.txt' file # A tibble: 1 x 5
w2v <- w2v[w2v$word %in% featnames(fbdfm),]
# creating new feature matrix for embeddings
embed <- matrix(NA, nrow=ndoc(fbdfm), ncol=100)
for (i in 1:ndoc(fbdfm)){
if (i %% 100 == 0) message(i, '/', ndoc(fbdfm))
# extract word counts
vec <- as.numeric(fbdfm[i,])
# keep words with counts of 1 or more
doc_words <- featnames(fbdfm)[vec>0]
# extract embeddings for those words
embed_vec <- w2v[w2v$word %in% doc_words, 2:101]
# aggregate from word- to document-level embeddings by taking AVG
embed[i,] <- colMeans(embed_vec, na.rm=TRUE)
# if no words in embeddings, simply set to 0
if (nrow(embed_vec)==0) embed[i,] <- 0
}
## 100/3043
## 200/3043
## 300/3043
## 400/3043
## 500/3043
## 600/3043
## 700/3043
## 800/3043
## 900/3043
## 1000/3043
## 1100/3043
## 1200/3043
## 1300/3043
## 1400/3043
## 1500/3043
## 1600/3043
## 1700/3043
## 1800/3043
## 1900/3043
## 2000/3043
## 2100/3043
## 2200/3043
## 2300/3043
## 2400/3043
## 2500/3043
## 2600/3043
## 2700/3043
## 2800/3043
## 2900/3043
## 3000/3043
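The loop above averages, for each document, the embeddings of the distinct words it contains. The same aggregation can be written as a matrix product, which avoids the explicit loop. The sketch below uses small invented matrices in base R: counts plays the role of the document-feature matrix and emb the role of the embeddings, so applying it to fbdfm and w2v would first require converting them to matrices with matching row and column names (an assumption not shown here).

```r
# toy stand-ins: `counts` for the document-feature matrix, `emb` for the embeddings
counts <- rbind(
  doc1 = c(idiot = 2, vote = 1, tax = 0, zzz = 0),
  doc2 = c(idiot = 0, vote = 1, tax = 1, zzz = 0),
  doc3 = c(idiot = 0, vote = 0, tax = 0, zzz = 3)
)
emb <- rbind(
  idiot = c(1, 0),
  vote  = c(0, 1),
  tax   = c(0.5, 0.5)
)

# keep only words that have an embedding, as 0/1 presence indicators
shared <- intersect(colnames(counts), rownames(emb))
present <- (counts[, shared, drop = FALSE] > 0) * 1

# sum the embeddings of the words in each document ...
doc_embed <- present %*% emb[shared, , drop = FALSE]
# ... and divide by the number of words found; documents with none stay at 0
n_words <- rowSums(present)
has_words <- n_words > 0
doc_embed[has_words, ] <- doc_embed[has_words, , drop = FALSE] / n_words[has_words]
doc_embed
```

Like the loop, this version ignores how many times a word appears in a document; replacing the presence indicators with the raw counts would give a count-weighted average instead.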
Let’s now try to replicate the lasso classifier we estimated earlier with this new feature set.
set.seed(123)
training <- sample(1:nrow(fb), floor(.80 * nrow(fb)))
test <- (1:nrow(fb))[1:nrow(fb) %in% training == FALSE]
# function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
library(glmnet)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
lasso <- cv.glmnet(embed[training,], fb$attacks[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, embed[test,], type="class")
# confusion matrix
table(preds, fb$attacks[test])
##
## preds 0 1
## 0 96 38
## 1 149 326
# performance metrics
accuracy(preds, fb$attacks[test])
## [1] 0.6929392
precision(preds==1, fb$attacks[test]==1)
## [1] 0.6863158
recall(preds==1, fb$attacks[test]==1)
## [1] 0.8956044
precision(preds==0, fb$attacks[test]==0)
## [1] 0.7164179
recall(preds==0, fb$attacks[test]==0)
## [1] 0.3918367
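The helper functions used above index the confusion table by position, which assumes that both classes appear among the predictions. A slightly more defensive variant, sketched below, fixes the class levels explicitly and also reports the F1 score. The function class_metrics is hypothetical, and the example rebuilds the predictions from the counts in the confusion matrix printed above rather than from the fitted model.

```r
# confusion-matrix metrics with the class levels fixed explicitly
class_metrics <- function(ypred, y, positive = 1, levels = c(0, 1)) {
  tab <- table(pred = factor(ypred, levels = levels),
               obs  = factor(y, levels = levels))
  pos <- as.character(positive)
  precision <- tab[pos, pos] / sum(tab[pos, ])
  recall    <- tab[pos, pos] / sum(tab[, pos])
  c(accuracy  = sum(diag(tab)) / sum(tab),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

# predictions and labels rebuilt from the confusion matrix counts
ypred <- rep(c(0, 0, 1, 1), times = c(96, 38, 149, 326))
y     <- rep(c(0, 1, 0, 1), times = c(96, 38, 149, 326))
round(class_metrics(ypred, y), 3)
```

Calling class_metrics(ypred, y, positive = 0) gives the corresponding figures for class 0.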
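With a much smaller set of features (100 embedding dimensions instead of several thousand word counts), the classifier reaches about 69% accuracy on the test set, although recall for class 0 is low (0.39). One downside of this approach is that the coefficients we get from the lasso regression are very hard to interpret.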
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## V1 V2 V3 V4 V5 V6
## 0.00000000 0.00000000 0.00000000 0.00000000 -0.03309439 0.00000000
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=FALSE)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 98 -0.05621984 V98
## 5 -0.03309439 V5
## 72 -0.02373680 V72
## 1 0.00000000 V1
## 2 0.00000000 V2
## 3 0.00000000 V3
## 4 0.00000000 V4
## 6 0.00000000 V6
## 7 0.00000000 V7
## 8 0.00000000 V8
## 9 0.00000000 V9
## 11 0.00000000 V11
## 13 0.00000000 V13
## 14 0.00000000 V14
## 15 0.00000000 V15
## 16 0.00000000 V16
## 18 0.00000000 V18
## 19 0.00000000 V19
## 21 0.00000000 V21
## 22 0.00000000 V22
## 23 0.00000000 V23
## 24 0.00000000 V24
## 25 0.00000000 V25
## 26 0.00000000 V26
## 27 0.00000000 V27
## 28 0.00000000 V28
## 29 0.00000000 V29
## 30 0.00000000 V30
## 31 0.00000000 V31
## 32 0.00000000 V32
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 83 0.54603922 V83
## 36 0.29928865 V36
## 43 0.12314401 V43
## 39 0.11865342 V39
## 20 0.08579542 V20
## 12 0.07094041 V12
## 10 0.05380056 V10
## 77 0.05272163 V77
## 86 0.03480883 V86
## 17 0.01549445 V17
## 1 0.00000000 V1
## 2 0.00000000 V2
## 3 0.00000000 V3
## 4 0.00000000 V4
## 6 0.00000000 V6
## 7 0.00000000 V7
## 8 0.00000000 V8
## 9 0.00000000 V9
## 11 0.00000000 V11
## 13 0.00000000 V13
## 14 0.00000000 V14
## 15 0.00000000 V15
## 16 0.00000000 V16
## 18 0.00000000 V18
## 19 0.00000000 V19
## 21 0.00000000 V21
## 22 0.00000000 V22
## 23 0.00000000 V23
## 24 0.00000000 V24
## 25 0.00000000 V25
head(w2v[order(w2v$V83, decreasing=TRUE),"word"], n=20)
## # A tibble: 20 x 1
## word
## <chr>
## 1 oath
## 2 usc
## 3 hour
## 4 minutes
## 5 job
## 6 man
## 7 tantrum
## 8 years
## 9 flag
## 10 politician
## 11 trillion
## 12 duty
## 13 cfr
## 14 senator
## 15 yrs
## 16 shall
## 17 feet
## 18 nose
## 19 unto
## 20 loud
head(w2v[order(w2v$V98),"word"], n=20)
## # A tibble: 20 x 1
## word
## <chr>
## 1 nobody's
## 2 cyber-security
## 3 gun
## 4 sense
## 5 partisan
## 6 civil
## 7 mongering
## 8 mongers
## 9 tired
## 10 nra
## 11 political
## 12 politicians
## 13 strong
## 14 real-time
## 15 quo
## 16 sure
## 17 foreign
## 18 noise
## 19 politician
## 20 decisions
Finally, to try to maximize performance, we can combine both bag-of-words and embedding features into a single matrix and let xgboost choose the best set of features for us. In the run shown here, however, the combined model reaches 0.66 accuracy on the test set, slightly below the 0.69 obtained by the lasso on the embeddings alone, so a gain from combining features is not guaranteed.
library(xgboost)
# converting matrix object
X <- as(cbind(fbdfm, embed), "dgCMatrix")
# parameters to explore
tryEta <- c(1,2)
tryDepths <- c(1,2,4)
# placeholders for now
bestEta=NA
bestDepth=NA
bestAcc=0
for(eta in tryEta){
for(dp in tryDepths){
bst <- xgb.cv(data = X[training,],
label = fb$attacks[training],
max.depth = dp,
eta = eta,
nthread = 4,
nround = 500,
nfold=5,
print_every_n = 100L,
objective = "binary:logistic")
# cross-validated accuracy
acc <- 1-mean(tail(bst$evaluation_log$test_error_mean))
cat("Results for eta=",eta," and depth=", dp, " : ",
acc," accuracy.\n",sep="")
if(acc>bestAcc){
bestEta=eta
bestAcc=acc
bestDepth=dp
}
}
}
## [1] train-error:0.330012+0.004394 test-error:0.336873+0.020236
## [101] train-error:0.154787+0.005935 test-error:0.316349+0.029044
## [201] train-error:0.092851+0.004448 test-error:0.317584+0.020830
## [301] train-error:0.061114+0.005362 test-error:0.318416+0.022765
## [401] train-error:0.042214+0.002491 test-error:0.315949+0.020295
## [500] train-error:0.034203+0.000838 test-error:0.320466+0.017845
## Results for eta=1 and depth=1 : 0.6801496 accuracy.
## [1] train-error:0.309573+0.012324 test-error:0.333597+0.014509
## [101] train-error:0.032765+0.002085 test-error:0.322910+0.036368
## [201] train-error:0.030505+0.001852 test-error:0.319211+0.033141
## [301] train-error:0.030505+0.001852 test-error:0.317162+0.027379
## [401] train-error:0.030505+0.001852 test-error:0.311821+0.025696
## [500] train-error:0.030505+0.001852 test-error:0.313874+0.025489
## Results for eta=1 and depth=2 : 0.6855766 accuracy.
## [1] train-error:0.254007+0.005466 test-error:0.324563+0.016406
## [101] train-error:0.030300+0.001973 test-error:0.321277+0.006829
## [201] train-error:0.030300+0.001973 test-error:0.322928+0.008565
## [301] train-error:0.030300+0.001973 test-error:0.320050+0.009146
## [401] train-error:0.030300+0.001973 test-error:0.319635+0.009648
## [500] train-error:0.030300+0.001973 test-error:0.319226+0.013174
## Results for eta=1 and depth=4 : 0.6807747 accuracy.
## [1] train-error:0.329704+0.005710 test-error:0.338543+0.011451
## [101] train-error:0.399938+0.105812 test-error:0.406803+0.095547
## [201] train-error:0.399938+0.105812 test-error:0.406803+0.095547
## [301] train-error:0.399938+0.105812 test-error:0.406803+0.095547
## [401] train-error:0.399938+0.105812 test-error:0.406803+0.095547
## [500] train-error:0.399938+0.105812 test-error:0.406803+0.095547
## Results for eta=2 and depth=1 : 0.593197 accuracy.
## [1] train-error:0.314707+0.013002 test-error:0.320055+0.015449
## [101] train-error:0.424516+0.081324 test-error:0.416968+0.082079
## [201] train-error:0.424516+0.081324 test-error:0.416968+0.082079
## [301] train-error:0.424516+0.081324 test-error:0.416968+0.082079
## [401] train-error:0.424516+0.081324 test-error:0.416968+0.082079
## [500] train-error:0.424516+0.081324 test-error:0.416968+0.082079
## Results for eta=2 and depth=2 : 0.5830322 accuracy.
## [1] train-error:0.254623+0.007139 test-error:0.323346+0.010880
## [101] train-error:0.385982+0.028828 test-error:0.411719+0.042873
## [201] train-error:0.385982+0.028828 test-error:0.411719+0.042873
## [301] train-error:0.385982+0.028828 test-error:0.411719+0.042873
## [401] train-error:0.385982+0.028828 test-error:0.411719+0.042873
## [500] train-error:0.385982+0.028828 test-error:0.411719+0.042873
## Results for eta=2 and depth=4 : 0.5882808 accuracy.
cat("Best model has eta=",bestEta," and depth=", bestDepth, " : ",
bestAcc," accuracy.\n",sep="")
## Best model has eta=1 and depth=2 : 0.6855766 accuracy.
# running best model
rf <- xgboost(data = X[training,],
label = fb$attacks[training],
max.depth = bestDepth,
eta = bestEta,
nthread = 4,
nround = 1000,
print_every_n=100L,
objective = "binary:logistic")
## [1] train-error:0.328677
## [101] train-error:0.048891
## [201] train-error:0.031224
## [301] train-error:0.031224
## [401] train-error:0.031224
## [501] train-error:0.031224
## [601] train-error:0.031224
## [701] train-error:0.031224
## [801] train-error:0.031224
## [901] train-error:0.031224
## [1000] train-error:0.031224
# out-of-sample accuracy
preds <- predict(rf, X[test,])
cat("\nAccuracy on test set=", round(accuracy(preds>.50, fb$attacks[test]),3))
##
## Accuracy on test set= 0.66
cat("\nPrecision(1) on test set=", round(precision(preds>.50, fb$attacks[test]),3))
##
## Precision(1) on test set= 0.689
cat("\nRecall(1) on test set=", round(recall(preds>.50, fb$attacks[test]),3))
##
## Recall(1) on test set= 0.786
cat("\nPrecision(0) on test set=", round(precision(preds<.50, fb$attacks[test]==0),3))
##
## Precision(0) on test set= 0.598
cat("\nRecall(0) on test set=", round(recall(preds<.50, fb$attacks[test]==0),3))
##
## Recall(0) on test set= 0.473