word2vec

Word embeddings are a way to transform text into features. Instead of using vectors of word counts, words are now represented as positions in a latent multidimensional space. These positions are weights from an underlying neural network model in which the use of each word is predicted based on the words that surround it. The idea is that words with similar weights are likely to be used surrounded by the same words.

word2vec is a method to compute word embeddings developed by Google. There are others (e.g. GloVe), but word2vec is quite popular, and we can use pre-trained models to speed up our analysis.

Let’s see what we can do with it using the rword2vec package in R. The examples here are based on the package materials, available here.

library(rword2vec)
library(lsa)
## Loading required package: SnowballC

This is how you would train the model. Note that this chunk of code takes a LONG time to run, so don’t run it now. There are different ways to train the model (see ?word2vec for details).

model <- word2vec(
    train_file = "text8",
    output_file = "vec.bin",
    binary=1,
    num_threads=3,
    debug_mode=1)

To speed up the process, I’m providing a pre-trained model, available in the file vec.bin. We can now use it to run some analyses.

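We’ll start by computing the most similar words to a specific word, where similarity means how close two words are in the latent multidimensional space. Note that the dist column in the output is a similarity score (higher values mean closer words), not a distance.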

distance(file_name = "vec.bin",
        search_word = "princess",
        num = 10)
## Entered word or sentence: princess
## 
## Word: princess  Position in vocabulary: 3419
##        word              dist
## 1   consort 0.734738826751709
## 2   heiress 0.718510031700134
## 3   duchess 0.715823769569397
## 4    prince 0.703364968299866
## 5   empress 0.690687596797943
## 6   matilda 0.688317775726318
## 7     queen 0.682406425476074
## 8  isabella 0.668479681015015
## 9  countess 0.665310502052307
## 10  dowager 0.662643551826477
distance(file_name = "vec.bin",
    search_word = "terrible",
    num = 10)
## Entered word or sentence: terrible
## 
## Word: terrible  Position in vocabulary: 8301
##           word              dist
## 1       sorrow 0.621069073677063
## 2     ruthless 0.616687178611755
## 3        cruel 0.611717998981476
## 4  devastating 0.606187760829926
## 5     horrific 0.599025368690491
## 6      scourge 0.595880687236786
## 7        weary 0.586524903774261
## 8   pestilence 0.584030032157898
## 9       doomed 0.584006071090698
## 10   crippling 0.581335961818695
distance(file_name = "vec.bin",
    search_word = "london",
    num = 10)
## Entered word or sentence: london
## 
## Word: london  Position in vocabulary: 339
##               word              dist
## 1        edinburgh 0.672667682170868
## 2          glasgow  0.65399569272995
## 3          croydon 0.635727107524872
## 4        southwark 0.630425989627838
## 5           dublin 0.617245435714722
## 6          bristol   0.6152104139328
## 7         brighton 0.614435136318207
## 8       birmingham  0.59646064043045
## 9  buckinghamshire 0.594625115394592
## 10      manchester 0.571323156356812
distance(file_name = "vec.bin",
    search_word = "uk",
    num = 10)
## Entered word or sentence: uk
## 
## Word: uk  Position in vocabulary: 532
##          word              dist
## 1   australia 0.605582296848297
## 2      canada  0.52595591545105
## 3          us 0.521789014339447
## 4         bbc 0.502693831920624
## 5      charts 0.485292196273804
## 6          bt 0.477047115564346
## 7  australian 0.470468789339066
## 8         usa 0.469096928834915
## 9      london 0.468733191490173
## 10         eu 0.443375200033188
distance(file_name = "vec.bin",
    search_word = "philosophy",
    num = 10)
## Entered word or sentence: philosophy
## 
## Word: philosophy  Position in vocabulary: 603
##             word              dist
## 1    metaphysics 0.835179328918457
## 2       idealism 0.742121577262878
## 3      discourse 0.725879728794098
## 4  philosophical 0.723901093006134
## 5       theology 0.718465328216553
## 6  jurisprudence 0.717357635498047
## 7    materialism 0.716643393039703
## 8     empiricism 0.713004291057587
## 9       humanism 0.705726206302643
## 10  epistemology 0.700498759746552
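
To make “closeness” more concrete, here is a minimal sketch of how such a ranking could be computed by hand. It assumes cosine similarity as the measure (the same measure we compute with lsa::cosine further down) and uses a made-up toy matrix rather than the real vectors; nearest_words is a hypothetical helper, not part of rword2vec.

```r
# Toy embeddings: 5 words in a 3-dimensional space (values are made up).
toy <- rbind(
  princess = c(0.9, 0.8, 0.1),
  queen    = c(0.8, 0.9, 0.2),
  prince   = c(0.9, 0.6, 0.3),
  london   = c(0.1, 0.2, 0.9),
  dublin   = c(0.2, 0.1, 0.8)
)

# Rank all other words by cosine similarity to `word` (hypothetical helper).
nearest_words <- function(emb, word, num = 3) {
  # scale each row to unit length, so that dot products equal cosines
  unit <- emb / sqrt(rowSums(emb^2))
  sims <- drop(unit %*% unit[word, ])
  # drop the search word itself and keep the `num` closest words
  sims <- sims[names(sims) != word]
  head(sort(sims, decreasing = TRUE), num)
}

nearest_words(toy, "princess", num = 2)
```

With these toy values, the two closest words to princess are queen and prince, while the city names end up far away.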

Where do these similarities come from? Let’s extract the underlying word vectors.

# Extracting word vectors
bin_to_txt("vec.bin", "vector.txt")
## $rfile_name
## [1] "vec.bin"
## 
## $routput_file
## [1] "vector.txt"
library(readr)
data <- read_delim("vector.txt", 
    skip=1, delim=" ",
    col_names=c("word", paste0("V", 1:100)))
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   word = col_character()
## )
## See spec(...) for full column specifications.
data[1:10, 1:6]
## # A tibble: 10 x 6
##    word        V1       V2       V3       V4       V5
##    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 </s>   0.00400  0.00442 -0.00383 -0.00328  0.00137
##  2 the    0.778    1.08    -0.00492  0.436   -1.73   
##  3 of    -0.249   -0.0993  -0.685    1.56    -1.30   
##  4 and   -0.735    1.06     0.604    0.0723  -0.629  
##  5 one    1.33     0.0608  -0.385   -0.503    0.0646 
##  6 in    -0.947    0.845   -0.979    1.70     0.191  
##  7 a      1.42     1.27    -1.66     0.623   -2.01   
##  8 to     1.35    -0.377   -2.09     1.16     1.25   
##  9 zero   0.541   -1.07     0.715    0.218    0.0464 
## 10 nine   2.06     0.0170  -0.602   -1.58     0.457

These are the values of each word on the first five dimensions. We can plot some words on the first two dimensions to get a better sense of what we’re working with:

plot_words <- function(words, data){
  # empty plot
  plot(0, 0, xlim=c(-2.5, 2.5), ylim=c(-2.5,2.5), type="n",
       xlab="First dimension", ylab="Second dimension")
  for (word in words){
    # extract first two dimensions
    vector <- as.numeric(data[data$word==word,2:3])
    # add to plot
    text(vector[1], vector[2], labels=word)
  }
}

plot_words(c("good", "better", "bad", "worse"), data)

plot_words(c("microsoft", "yahoo", "apple", "mango", "peach"), data)

plot_words(c("atheist", "agnostic", "catholic", "buddhist", "protestant", "christian"), data)

plot_words(c("government", "economics", "sociology", 
             "philosophy", "law", "engineering", "astrophysics",
             "biology", "physics", "chemistry"), data)

Once we have the vectors for each word, we can compute the similarity between a pair of words:

similarity <- function(word1, word2){
    lsa::cosine(
        x=as.numeric(data[data$word==word1,2:101]),
        y=as.numeric(data[data$word==word2,2:101]))

}

similarity("australia", "england")
##           [,1]
## [1,] 0.6319489
similarity("australia", "canada")
##           [,1]
## [1,] 0.6800522
similarity("australia", "apple")
##           [,1]
## [1,] 0.0300495
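
The function above relies on the global object data and on every word being in the vocabulary. As a hedged alternative, the sketch below writes out the cosine formula in base R, takes the embeddings as an argument, and returns NA for words that are not found. The name word_similarity and the toy values are assumptions made for illustration only.

```r
# Cosine similarity between two words. `emb` is assumed to have a `word`
# column followed by the numeric dimensions, like `data` above.
word_similarity <- function(emb, word1, word2) {
  i <- match(word1, emb$word)
  j <- match(word2, emb$word)
  # return NA if either word is not in the vocabulary
  if (is.na(i) || is.na(j)) return(NA_real_)
  x <- unlist(emb[i, -1], use.names = FALSE)
  y <- unlist(emb[j, -1], use.names = FALSE)
  # cosine = dot product divided by the product of the vector lengths
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Toy data with made-up values in two dimensions.
toy <- data.frame(
  word = c("australia", "canada", "apple"),
  V1 = c(1.0, 0.9, -0.2),
  V2 = c(0.5, 0.7, 0.9),
  stringsAsFactors = FALSE
)
word_similarity(toy, "australia", "canada")
word_similarity(toy, "australia", "mango")   # NA: not in the toy vocabulary
```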

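The final function provided by the package is word_analogy, which helps us find regularities in the word vector space. Given three search words “a b c”, it returns the words that relate to c in the way b relates to a (for example, king is to queen as man is to woman):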

word_analogy(file_name = "vec.bin",
    search_words = "king queen man",
    num = 1)
## 
## Word: king  Position in vocabulary: 187
## 
## Word: queen  Position in vocabulary: 903
## 
## Word: man  Position in vocabulary: 243
##    word              dist
## 1 woman 0.670807123184204
word_analogy(file_name = "vec.bin",
    search_words = "paris france berlin",
    num = 1)
## 
## Word: paris  Position in vocabulary: 1055
## 
## Word: france  Position in vocabulary: 303
## 
## Word: berlin  Position in vocabulary: 1360
##      word              dist
## 1 germany 0.818466305732727
word_analogy(file_name = "vec.bin",
    search_words = "man woman uncle",
    num = 2)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: woman  Position in vocabulary: 1012
## 
## Word: uncle  Position in vocabulary: 4206
##    word              dist
## 1 niece 0.729662358760834
## 2  aunt 0.729258477687836
word_analogy(file_name = "vec.bin",
    search_words = "building architect software",
    num = 1)
## 
## Word: building  Position in vocabulary: 672
## 
## Word: architect  Position in vocabulary: 3366
## 
## Word: software  Position in vocabulary: 404
##         word              dist
## 1 programmer 0.584205448627472
word_analogy(file_name = "vec.bin",
    search_words = "man actor woman",
    num = 5)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: actor  Position in vocabulary: 461
## 
## Word: woman  Position in vocabulary: 1012
##          word              dist
## 1     actress 0.815776824951172
## 2      singer 0.705898344516754
## 3  comedienne 0.665390908718109
## 4  playwright 0.655908346176147
## 5 entertainer 0.655762135982513
word_analogy(file_name = "vec.bin",
    search_words = "france paris uk",
    num = 1)
## 
## Word: france  Position in vocabulary: 303
## 
## Word: paris  Position in vocabulary: 1055
## 
## Word: uk  Position in vocabulary: 532
##     word              dist
## 1 london 0.532313704490662
word_analogy(file_name = "vec.bin",
    search_words = "up down inside",
    num = 2)
## 
## Word: up  Position in vocabulary: 98
## 
## Word: down  Position in vocabulary: 310
## 
## Word: inside  Position in vocabulary: 1319
##      word              dist
## 1 beneath 0.573975384235382
## 2 outside 0.570115745067596
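
The intuition behind these analogies is vector arithmetic. A common formulation, sketched below on made-up toy vectors, is to compute b - a + c and then look for the closest word to that point. This is a simplified illustration under those assumptions, not necessarily the exact computation that word_analogy performs; the analogy function is hypothetical.

```r
# Toy embeddings (made up). Dimensions: royalty, gender, other.
toy <- rbind(
  king  = c(1,  1, 0),
  queen = c(1, -1, 0),
  man   = c(0,  1, 0.2),
  woman = c(0, -1, 0.2),
  apple = c(0,  0, 1)
)

# "a is to b as c is to ?" (hypothetical helper)
analogy <- function(emb, a, b, c, num = 1) {
  # move from a to b, starting at c
  target <- emb[b, ] - emb[a, ] + emb[c, ]
  # cosine similarity between the target point and every word
  sims <- drop(emb %*% target) /
    (sqrt(rowSums(emb^2)) * sqrt(sum(target^2)))
  # the three query words are excluded from the candidates
  sims <- sims[!names(sims) %in% c(a, b, c)]
  head(sort(sims, decreasing = TRUE), num)
}

analogy(toy, "king", "queen", "man")
```

In this toy space the result is woman, because queen - king + man lands exactly on the woman vector.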

And we can see some examples of algorithmic bias (but really, bias in the training data):

word_analogy(file_name = "vec.bin",
    search_words = "man woman professor",
    num = 1)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: woman  Position in vocabulary: 1012
## 
## Word: professor  Position in vocabulary: 1750
##       word              dist
## 1 lecturer 0.671598970890045
word_analogy(file_name = "vec.bin",
    search_words = "man doctor woman",
    num = 1)
## 
## Word: man  Position in vocabulary: 243
## 
## Word: doctor  Position in vocabulary: 1907
## 
## Word: woman  Position in vocabulary: 1012
##    word              dist
## 1 nurse 0.520112752914429

Applications of word embeddings

Beyond this type of exploratory analysis, word embeddings can be very useful in analyses of large-scale text corpora in two different ways: to expand existing dictionaries and as a way to build features for a supervised learning classifier. The code below shows how to expand a dictionary of uncivil words. By looking for other words with semantic similarity to each of these terms, we can identify words that we may not have thought of in the first place, either because they’re slang, new words or just misspellings of existing words.

Here we will use a different set of pre-trained word embeddings, which were computed on a large corpus of public Facebook posts on the pages of US Members of Congress that we collected from the Graph API.

distance(file_name = "FBvec.bin",
        search_word = "liberal",
        num = 10)
## Entered word or sentence: liberal
## 
## Word: liberal  Position in vocabulary: 428
##           word              dist
## 1      leftist 0.875029563903809
## 2        lefty 0.808053195476532
## 3          lib 0.774020493030548
## 4    rightwing 0.768333077430725
## 5  progressive 0.766966998577118
## 6    left-wing  0.74224179983139
## 7      statist 0.741962492465973
## 8   right-wing 0.740352988243103
## 9     far-left 0.733825862407684
## 10    leftwing 0.715518414974213
distance(file_name = "FBvec.bin",
        search_word = "crooked",
        num = 10)
## Entered word or sentence: crooked
## 
## Word: crooked  Position in vocabulary: 2225
##             word              dist
## 1        corrupt 0.782054841518402
## 2       thieving 0.683514535427094
## 3          slimy 0.675886511802673
## 4         teflon 0.669225692749023
## 5          crook 0.660020768642426
## 6         corupt 0.651829242706299
## 7      dishonest 0.645328283309937
## 8      conniving 0.636701285839081
## 9    corporatist 0.629674255847931
## 10 untrustworthy 0.623017013072968
distance(file_name = "FBvec.bin",
        search_word = "libtard",
        num = 10)
## Entered word or sentence: libtard
## 
## Word: libtard  Position in vocabulary: 5753
##           word              dist
## 1          lib 0.798957586288452
## 2        lefty 0.771853387355804
## 3      libturd 0.762575328350067
## 4    teabagger 0.744283258914948
## 5     teabilly 0.715277075767517
## 6      liberal 0.709996342658997
## 7       retard 0.690707504749298
## 8      dumbass 0.690422177314758
## 9         rwnj 0.684058785438538
## 10 republitard 0.678197801113129
distance(file_name = "FBvec.bin",
        search_word = "douchebag",
        num = 10)
## Entered word or sentence: douchebag
## 
## Word: douchebag  Position in vocabulary: 9781
##         word              dist
## 1    scumbag 0.808189928531647
## 2      moron  0.80128538608551
## 3  hypocrite 0.787607729434967
## 4    jackass 0.783857941627502
## 5    shitbag 0.773443937301636
## 6        pos  0.76619291305542
## 7    dipshit 0.757693469524384
## 8      loser 0.756536900997162
## 9     coward 0.755453526973724
## 10     poser 0.750370919704437
distance(file_name = "FBvec.bin",
        search_word = "idiot",
        num = 10)
## Entered word or sentence: idiot
## 
## Word: idiot  Position in vocabulary: 646
##         word              dist
## 1   imbecile 0.867565214633942
## 2    asshole 0.848560094833374
## 3      moron 0.781079053878784
## 4     asshat 0.772150039672852
## 5     a-hole 0.765781462192535
## 6      ahole 0.760824918746948
## 7    asswipe 0.742586553096771
## 8  ignoramus 0.735219776630402
## 9   arsehole 0.732272684574127
## 10     idoit 0.720151424407959
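
Rather than querying one word at a time, the expansion could be automated. The sketch below is one possible approach, using made-up toy vectors in place of the Facebook embeddings: for each seed word it keeps every word whose cosine similarity is above a threshold. Both expand_dictionary and the 0.9 cutoff are assumptions for illustration, and in practice the candidates should still be reviewed by hand.

```r
# Toy embeddings standing in for the Facebook vectors (values are made up).
toy <- rbind(
  idiot  = c(0.9, 0.1, 0.0),
  moron  = c(0.8, 0.2, 0.1),
  idoit  = c(0.9, 0.2, 0.0),
  crook  = c(0.1, 0.9, 0.1),
  corupt = c(0.2, 0.9, 0.0),
  budget = c(0.0, 0.1, 0.9)
)

# Add to the seed list every word close enough to at least one seed
# (hypothetical helper; `min_sim` is an arbitrary threshold).
expand_dictionary <- function(emb, seeds, min_sim = 0.9) {
  unit <- emb / sqrt(rowSums(emb^2))                # unit-length rows
  seeds <- intersect(seeds, rownames(unit))         # ignore unknown seeds
  sims <- unit %*% t(unit[seeds, , drop = FALSE])   # words x seeds cosines
  best <- apply(sims, 1, max)                       # closest seed per word
  sort(union(seeds, names(best)[best >= min_sim]))
}

expand_dictionary(toy, c("idiot", "crook"))
```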

We can also take the embeddings themselves as features at the word level and then aggregate them to the document level, as an alternative or complement to bag-of-words approaches.

Let’s see an example with the data we used in our most recent challenge:

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
fb <- read.csv("~/data/incivility.csv", stringsAsFactors = FALSE)
fbcorpus <- corpus(fb$comment_message)
fbdfm <- dfm(fbcorpus, remove=stopwords("english"), verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,043 documents, 12,086 features
##    ... removed 166 features
##    ... created a 3,043 x 11,920 sparse dfm
##    ... complete. 
## Elapsed time: 0.521 seconds.
fbdfm <- dfm_trim(fbdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring: 
##   - in fewer than 2 documents: 6,444
##   Total features removed: 6,444 (54.1%).

First, we will convert the word embeddings to a data frame, and then we will match the features from each document with their corresponding embeddings.

bin_to_txt("FBvec.bin", "FBvector.txt")
## $rfile_name
## [1] "FBvec.bin"
## 
## $routput_file
## [1] "FBvector.txt"
# extracting word embeddings for words in corpus
w2v <- readr::read_delim("FBvector.txt", 
                  skip=1, delim=" ", quote="",
                  col_names=c("word", paste0("V", 1:100)))
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   word = col_character()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 2)
## Warning: 1 parsing failure.
## row    col   expected    actual     file
## 107385 <NA>  101 columns 46 columns 'FBvector.txt'
w2v <- w2v[w2v$word %in% featnames(fbdfm),]

# creating new feature matrix for embeddings
embed <- matrix(NA, nrow=ndoc(fbdfm), ncol=100)
for (i in 1:ndoc(fbdfm)){
  if (i %% 100 == 0) message(i, '/', ndoc(fbdfm))
  # extract word counts
  vec <- as.numeric(fbdfm[i,])
  # keep words with counts of 1 or more
  doc_words <- featnames(fbdfm)[vec>0]
  # extract embeddings for those words
  embed_vec <- w2v[w2v$word %in% doc_words, 2:101]
  # aggregate from word- to document-level embeddings by taking AVG
  embed[i,] <- colMeans(embed_vec, na.rm=TRUE)
  # if no words in embeddings, simply set to 0
  if (nrow(embed_vec)==0) embed[i,] <- 0
}
## 100/3043
## 200/3043
## 300/3043
## 400/3043
## 500/3043
## 600/3043
## 700/3043
## 800/3043
## 900/3043
## 1000/3043
## 1100/3043
## 1200/3043
## 1300/3043
## 1400/3043
## 1500/3043
## 1600/3043
## 1700/3043
## 1800/3043
## 1900/3043
## 2000/3043
## 2100/3043
## 2200/3043
## 2300/3043
## 2400/3043
## 2500/3043
## 2600/3043
## 2700/3043
## 2800/3043
## 2900/3043
## 3000/3043
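
The loop above is easy to follow but slow for large corpora. The same aggregation (an unweighted average over the distinct words in each document, and zeros when no word has an embedding) can also be written as a matrix product. The sketch below shows the idea on made-up toy data in base R; doc_embeddings is a hypothetical helper and the inputs are ordinary matrices rather than a dfm.

```r
# Toy document-feature counts: 3 documents x 4 words.
counts <- rbind(
  doc1 = c(good = 2, bad = 0, vote = 1, tax = 0),
  doc2 = c(good = 0, bad = 1, vote = 0, tax = 3),
  doc3 = c(good = 0, bad = 0, vote = 0, tax = 0)   # no words with embeddings
)
# Toy embeddings for the same 4 words in 2 dimensions (values are made up).
emb <- rbind(
  good = c(1, 0), bad = c(-1, 0), vote = c(0, 1), tax = c(0, -1)
)

# Average the embeddings of the distinct words present in each document.
doc_embeddings <- function(counts, emb) {
  # 1 if the word appears in the document, 0 otherwise
  present <- (counts[, rownames(emb), drop = FALSE] > 0) * 1
  n_words <- rowSums(present)
  # sum of embeddings per document, divided by the number of words;
  # pmax() avoids dividing by zero, so empty documents stay at 0
  (present %*% emb) / pmax(n_words, 1)
}

doc_embeddings(counts, emb)
```

Replacing the indicator matrix with the raw counts would instead give an average weighted by word frequency, which is a different design choice from the loop above.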

Let’s now try to replicate the lasso classifier we estimated earlier with this new feature set.

set.seed(123)
training <- sample(1:nrow(fb), floor(.80 * nrow(fb)))
test <- setdiff(1:nrow(fb), training)

# function to compute accuracy
accuracy <- function(ypred, y){
    tab <- table(ypred, y)
    return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
    tab <- table(ypred, y)
    return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
    tab <- table(ypred, y)
    return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
library(glmnet)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
lasso <- cv.glmnet(embed[training,], fb$attacks[training], 
    family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
    type.measure="class")

# computing predicted values
preds <- predict(lasso, embed[test,], type="class")
# confusion matrix
table(preds, fb$attacks[test])
##      
## preds   0   1
##     0  96  38
##     1 149 326
# performance metrics
accuracy(preds, fb$attacks[test])
## [1] 0.6929392
precision(preds==1, fb$attacks[test]==1)
## [1] 0.6863158
recall(preds==1, fb$attacks[test]==1)
## [1] 0.8956044
precision(preds==0, fb$attacks[test]==0)
## [1] 0.7164179
recall(preds==0, fb$attacks[test]==0)
## [1] 0.3918367
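
As a quick check, all of these metrics can be recovered from the four cells of the confusion matrix printed above. The sketch below re-enters those counts by hand; class_metrics is a hypothetical helper written for this illustration.

```r
# Confusion matrix reported above: rows are predictions, columns are truth.
tab <- matrix(c(96, 149, 38, 326), nrow = 2,
              dimnames = list(preds = c("0", "1"), truth = c("0", "1")))

# Precision and recall for one class (hypothetical helper).
class_metrics <- function(tab, class) {
  correct <- tab[class, class]
  c(precision = correct / sum(tab[class, ]),   # among predicted as `class`
    recall    = correct / sum(tab[, class]))   # among truly `class`
}

sum(diag(tab)) / sum(tab)    # accuracy
class_metrics(tab, "1")
class_metrics(tab, "0")
```

The gap between recall for class 1 (0.90) and class 0 (0.39) shows that the classifier labels most comments as attacks.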

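We find reasonable performance (about 69% accuracy on the test set) with a much smaller set of features: 100 embedding dimensions instead of thousands of word counts. Note, however, that recall for the negative class is low (0.39). One downside of this approach is that the coefficients we get from the lasso regression are very hard to interpret.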

best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
##          V1          V2          V3          V4          V5          V6 
##  0.00000000  0.00000000  0.00000000  0.00000000 -0.03309439  0.00000000
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=FALSE)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
##           coef word
## 98 -0.05621984  V98
## 5  -0.03309439   V5
## 72 -0.02373680  V72
## 1   0.00000000   V1
## 2   0.00000000   V2
## 3   0.00000000   V3
## 4   0.00000000   V4
## 6   0.00000000   V6
## 7   0.00000000   V7
## 8   0.00000000   V8
## 9   0.00000000   V9
## 11  0.00000000  V11
## 13  0.00000000  V13
## 14  0.00000000  V14
## 15  0.00000000  V15
## 16  0.00000000  V16
## 18  0.00000000  V18
## 19  0.00000000  V19
## 21  0.00000000  V21
## 22  0.00000000  V22
## 23  0.00000000  V23
## 24  0.00000000  V24
## 25  0.00000000  V25
## 26  0.00000000  V26
## 27  0.00000000  V27
## 28  0.00000000  V28
## 29  0.00000000  V29
## 30  0.00000000  V30
## 31  0.00000000  V31
## 32  0.00000000  V32
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
##          coef word
## 83 0.54603922  V83
## 36 0.29928865  V36
## 43 0.12314401  V43
## 39 0.11865342  V39
## 20 0.08579542  V20
## 12 0.07094041  V12
## 10 0.05380056  V10
## 77 0.05272163  V77
## 86 0.03480883  V86
## 17 0.01549445  V17
## 1  0.00000000   V1
## 2  0.00000000   V2
## 3  0.00000000   V3
## 4  0.00000000   V4
## 6  0.00000000   V6
## 7  0.00000000   V7
## 8  0.00000000   V8
## 9  0.00000000   V9
## 11 0.00000000  V11
## 13 0.00000000  V13
## 14 0.00000000  V14
## 15 0.00000000  V15
## 16 0.00000000  V16
## 18 0.00000000  V18
## 19 0.00000000  V19
## 21 0.00000000  V21
## 22 0.00000000  V22
## 23 0.00000000  V23
## 24 0.00000000  V24
## 25 0.00000000  V25
head(w2v[order(w2v$V83, decreasing=TRUE),"word"], n=20)
## # A tibble: 20 x 1
##    word      
##    <chr>     
##  1 oath      
##  2 usc       
##  3 hour      
##  4 minutes   
##  5 job       
##  6 man       
##  7 tantrum   
##  8 years     
##  9 flag      
## 10 politician
## 11 trillion  
## 12 duty      
## 13 cfr       
## 14 senator   
## 15 yrs       
## 16 shall     
## 17 feet      
## 18 nose      
## 19 unto      
## 20 loud
head(w2v[order(w2v$V98),"word"], n=20)
## # A tibble: 20 x 1
##    word          
##    <chr>         
##  1 nobody's      
##  2 cyber-security
##  3 gun           
##  4 sense         
##  5 partisan      
##  6 civil         
##  7 mongering     
##  8 mongers       
##  9 tired         
## 10 nra           
## 11 political     
## 12 politicians   
## 13 strong        
## 14 real-time     
## 15 quo           
## 16 sure          
## 17 foreign       
## 18 noise         
## 19 politician    
## 20 decisions

Finally, if we want to try to maximize performance, we can combine the bag-of-words and embedding features into a single matrix and let xgboost choose the best set of features for us. In principle this combination should perform best, although in the run below the test-set accuracy (0.66) does not improve on the lasso with embeddings alone (0.69).

library(xgboost)
# converting matrix object
X <- as(cbind(fbdfm, embed), "dgCMatrix")
# parameters to explore
tryEta <- c(1,2)
tryDepths <- c(1,2,4)
# placeholders for now
bestEta=NA
bestDepth=NA
bestAcc=0

for(eta in tryEta){
  for(dp in tryDepths){
    bst <- xgb.cv(data = X[training,],
                  label = fb$attacks[training],
                  max.depth = dp,
                  eta = eta,
                  nthread = 4,
                  nround = 500,
                  nfold = 5,
                  print_every_n = 100L,
                  objective = "binary:logistic")
    # cross-validated accuracy (mean over the last 6 rounds)
    acc <- 1-mean(tail(bst$evaluation_log$test_error_mean))
    cat("Results for eta=",eta," and depth=", dp, " : ",
        acc," accuracy.\n",sep="")
    if(acc>bestAcc){
      bestEta <- eta
      bestAcc <- acc
      bestDepth <- dp
    }
  }
}
## [1]  train-error:0.330012+0.004394   test-error:0.336873+0.020236 
## [101]    train-error:0.154787+0.005935   test-error:0.316349+0.029044 
## [201]    train-error:0.092851+0.004448   test-error:0.317584+0.020830 
## [301]    train-error:0.061114+0.005362   test-error:0.318416+0.022765 
## [401]    train-error:0.042214+0.002491   test-error:0.315949+0.020295 
## [500]    train-error:0.034203+0.000838   test-error:0.320466+0.017845 
## Results for eta=1 and depth=1 : 0.6801496 accuracy.
## [1]  train-error:0.309573+0.012324   test-error:0.333597+0.014509 
## [101]    train-error:0.032765+0.002085   test-error:0.322910+0.036368 
## [201]    train-error:0.030505+0.001852   test-error:0.319211+0.033141 
## [301]    train-error:0.030505+0.001852   test-error:0.317162+0.027379 
## [401]    train-error:0.030505+0.001852   test-error:0.311821+0.025696 
## [500]    train-error:0.030505+0.001852   test-error:0.313874+0.025489 
## Results for eta=1 and depth=2 : 0.6855766 accuracy.
## [1]  train-error:0.254007+0.005466   test-error:0.324563+0.016406 
## [101]    train-error:0.030300+0.001973   test-error:0.321277+0.006829 
## [201]    train-error:0.030300+0.001973   test-error:0.322928+0.008565 
## [301]    train-error:0.030300+0.001973   test-error:0.320050+0.009146 
## [401]    train-error:0.030300+0.001973   test-error:0.319635+0.009648 
## [500]    train-error:0.030300+0.001973   test-error:0.319226+0.013174 
## Results for eta=1 and depth=4 : 0.6807747 accuracy.
## [1]  train-error:0.329704+0.005710   test-error:0.338543+0.011451 
## [101]    train-error:0.399938+0.105812   test-error:0.406803+0.095547 
## [201]    train-error:0.399938+0.105812   test-error:0.406803+0.095547 
## [301]    train-error:0.399938+0.105812   test-error:0.406803+0.095547 
## [401]    train-error:0.399938+0.105812   test-error:0.406803+0.095547 
## [500]    train-error:0.399938+0.105812   test-error:0.406803+0.095547 
## Results for eta=2 and depth=1 : 0.593197 accuracy.
## [1]  train-error:0.314707+0.013002   test-error:0.320055+0.015449 
## [101]    train-error:0.424516+0.081324   test-error:0.416968+0.082079 
## [201]    train-error:0.424516+0.081324   test-error:0.416968+0.082079 
## [301]    train-error:0.424516+0.081324   test-error:0.416968+0.082079 
## [401]    train-error:0.424516+0.081324   test-error:0.416968+0.082079 
## [500]    train-error:0.424516+0.081324   test-error:0.416968+0.082079 
## Results for eta=2 and depth=2 : 0.5830322 accuracy.
## [1]  train-error:0.254623+0.007139   test-error:0.323346+0.010880 
## [101]    train-error:0.385982+0.028828   test-error:0.411719+0.042873 
## [201]    train-error:0.385982+0.028828   test-error:0.411719+0.042873 
## [301]    train-error:0.385982+0.028828   test-error:0.411719+0.042873 
## [401]    train-error:0.385982+0.028828   test-error:0.411719+0.042873 
## [500]    train-error:0.385982+0.028828   test-error:0.411719+0.042873 
## Results for eta=2 and depth=4 : 0.5882808 accuracy.
cat("Best model has eta=",bestEta," and depth=", bestDepth, " : ",
    bestAcc," accuracy.\n",sep="")
## Best model has eta=1 and depth=2 : 0.6855766 accuracy.
# running best model
rf <- xgboost(data = X[training,],
              label = fb$attacks[training],
              max.depth = bestDepth,
              eta = bestEta,
              nthread = 4,
              nround = 1000,
              print_every_n = 100L,
              objective = "binary:logistic")
## [1]  train-error:0.328677 
## [101]    train-error:0.048891 
## [201]    train-error:0.031224 
## [301]    train-error:0.031224 
## [401]    train-error:0.031224 
## [501]    train-error:0.031224 
## [601]    train-error:0.031224 
## [701]    train-error:0.031224 
## [801]    train-error:0.031224 
## [901]    train-error:0.031224 
## [1000]   train-error:0.031224
# out-of-sample accuracy
preds <- predict(rf, X[test,])


cat("\nAccuracy on test set=", round(accuracy(preds>.50, fb$attacks[test]),3))
## 
## Accuracy on test set= 0.66
cat("\nPrecision(1) on test set=", round(precision(preds>.50, fb$attacks[test]),3))
## 
## Precision(1) on test set= 0.689
cat("\nRecall(1) on test set=", round(recall(preds>.50, fb$attacks[test]),3))
## 
## Recall(1) on test set= 0.786
cat("\nPrecision(0) on test set=", round(precision(preds<.50, fb$attacks[test]==0),3))
## 
## Precision(0) on test set= 0.598
cat("\nRecall(0) on test set=", round(recall(preds<.50, fb$attacks[test]==0),3))
## 
## Recall(0) on test set= 0.473