Our running example will be a random sample of nearly 5,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We will be using a variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (just sending a message, without trying to elicit a response).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates avoid engaging with voters on Twitter because they are exposed to mostly impolite messages.
Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging or not.
library(quanteda)
## quanteda version 0.9.9.65
## Using 3 of 4 cores for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
We’ll do some cleaning as well, substituting user handles with a generic @. Why? We want to prevent overfitting: a specific handle is unlikely to reappear in new data, whereas the fact that a tweet mentions another user is a useful signal.
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
As we discussed last week, before we can do any type of automated text analysis, we need to go through several “preprocessing” steps so that the text can be passed to a statistical model. We’ll use the quanteda package here.
The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 4566 documents, showing 100 documents.
##
## Text Types Tokens Sentences
## text1 8 10 2
## text2 17 23 3
## text3 15 18 2
## text4 15 19 3
## text5 14 15 1
## text6 3 3 1
## text7 9 9 1
## text8 20 21 2
## text9 20 23 5
## text10 8 9 1
## text11 15 20 1
## text12 23 27 1
## text13 16 27 1
## text14 22 26 2
## text15 12 19 1
## text16 21 23 3
## text17 9 14 1
## text18 19 23 1
## text19 18 20 1
## text20 11 14 1
## text21 6 6 1
## text22 14 15 1
## text23 8 10 1
## text24 6 7 2
## text25 29 33 1
## text26 21 23 3
## text27 16 18 2
## text28 19 21 1
## text29 15 21 1
## text30 19 21 1
## text31 20 22 1
## text32 17 17 1
## text33 7 7 1
## text34 22 23 2
## text35 18 20 2
## text36 16 16 1
## text37 9 9 1
## text38 24 27 2
## text39 17 18 1
## text40 19 22 2
## text41 22 23 2
## text42 26 30 3
## text43 23 27 3
## text44 6 8 1
## text45 6 6 1
## text46 18 23 1
## text47 17 18 2
## text48 18 21 2
## text49 5 7 1
## text50 20 25 1
## text51 3 11 1
## text52 10 12 1
## text53 14 16 1
## text54 24 31 1
## text55 15 18 1
## text56 11 12 1
## text57 19 22 1
## text58 19 20 2
## text59 11 12 1
## text60 19 22 1
## text61 14 16 1
## text62 13 14 1
## text63 7 8 1
## text64 12 12 2
## text65 17 22 1
## text66 10 13 1
## text67 22 23 1
## text68 16 16 2
## text69 13 16 2
## text70 17 17 3
## text71 14 16 2
## text72 23 25 2
## text73 26 30 3
## text74 21 23 2
## text75 18 21 1
## text76 12 14 1
## text77 13 16 1
## text78 8 8 1
## text79 7 8 1
## text80 25 28 1
## text81 8 8 1
## text82 9 9 3
## text83 24 26 2
## text84 16 20 1
## text85 18 20 1
## text86 12 16 1
## text87 18 19 1
## text88 21 22 2
## text89 19 21 1
## text90 11 13 1
## text91 19 21 3
## text92 27 31 3
## text93 15 17 1
## text94 5 5 1
## text95 4 4 1
## text96 28 29 2
## text97 5 6 1
## text98 22 24 1
## text99 15 21 1
## text100 24 27 1
##
## Source: /Users/pablobarbera/git/POIR613/code/* on x86_64 by pablobarbera
## Created: Tue Sep 26 16:06:33 2017
## Notes:
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 4,566 documents, 48,721 features
## ...
## dfm_select removed 19,434 features and 0 documents, padding 0s for 0 features and 0 documents.
## ... created a 4,566 x 29,287 sparse dfm
## ... complete.
## Elapsed time: 25.9 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 document: 22,883
## Total features removed: 22,883 (78.1%).
Note that other preprocessing options are available, such as stemming, removing punctuation and numbers, or changing the n-gram range; see the sketch below. You can read more in the dfm and tokens help pages.
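For instance, here is a minimal sketch of a more aggressive preprocessing setup (the argument names follow the quanteda version loaded above and may differ in newer releases; this alternative DFM is not used in the rest of the analysis):
# sketch: alternative preprocessing with stemming and punctuation/number removal
twdfm_alt <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
                 remove_punct=TRUE, remove_numbers=TRUE, stem=TRUE,
                 ngrams=1, verbose=TRUE)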
Once we have the DFM, we split it into a training and a test set. We’ll go with 80% for training and 20% for testing. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience at least), and it has cross-validation already built-in, so we don’t need to code it from scratch. We’ll start with a ridge regression:
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-10
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
plot(ridge)
We can now compute the performance metrics on the test set.
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 11 2
## 1 182 719
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.7986871
precision(preds, tweets$engaging[test])
## [1] 0.7980022
recall(preds, tweets$engaging[test])
## [1] 0.9972261
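Note that the precision and recall functions above are defined for the positive class (engaging tweets) only. Given the confusion matrix above, the picture for the broadcasting class is much less rosy; a quick sketch of the equivalent metrics for class 0:
# sketch: precision and recall for the negative (broadcasting) class
tab <- table(preds, tweets$engaging[test])
tab[1,1] / sum(tab[1,]) # precision for class 0: true 0s among predicted 0s
tab[1,1] / sum(tab[,1]) # recall for class 0: predicted 0s among actual 0s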
Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## @ @_@ . , ! -
## 0.029413077 0.025527048 0.001848582 -0.003627634 0.009469734 -0.015374536
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 3997 -0.3502883 reverse
## 5170 -0.3501177 knocking
## 4859 -0.3423031 managed
## 4787 -0.3409223 eu_law
## 6235 -0.3395205 cunt_.
## 3161 -0.3388851 posters
## 5529 -0.3371467 @_doing
## 3943 -0.3366187 beacon
## 5275 -0.3360461 ._they're
## 4351 -0.3352499 nationalism
## 5281 -0.3348163 populist_parties
## 4864 -0.3340288 determined
## 2584 -0.3331052 leafleting
## 5922 -0.3329338 euroscepticism
## 6283 -0.3317718 moving_!
## 6282 -0.3317718 lovely_messages
## 5060 -0.3317194 fame
## 5009 -0.3314492 cleaning
## 4338 -0.3302880 zone
## 3796 -0.3301211 #yes
## 2654 -0.3290435 nw_.
## 3709 -0.3289839 laughed
## 4339 -0.3285087 defends
## 5903 -0.3283701 effective
## 3781 -0.3271190 god_,
## 4440 -0.3270190 scottish_independence
## 4377 -0.3265807 dick
## 5212 -0.3260675 one_person
## 5635 -0.3256312 tonbridge
## 5215 -0.3249450 earn
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, managed, eu_law, cunt_., posters, @_doing, beacon, ._they're, nationalism, populist_parties, determined, leafleting, euroscepticism, moving_!, lovely_messages, fame, cleaning, zone, #yes, nw_., laughed, defends, effective, god_,, scottish_independence, dick, one_person, tonbridge, earn"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 5069 0.2059479 town_hall
## 2055 0.1647179 !_rt
## 2874 0.1637953 western
## 5024 0.1637902 west_green
## 4116 0.1633214 vote_-
## 4673 0.1556281 insignificant
## 4622 0.1498899 dt
## 4624 0.1498884 dt_@
## 2741 0.1465876 re_:
## 6232 0.1454278 seat_.
## 5762 0.1379470 :_new
## 3020 0.1362461 ;_thnx
## 3323 0.1328590 @_bbc
## 4568 0.1312390 ,_much
## 3235 0.1303059 result_,
## 4908 0.1298449 @_yougov
## 5795 0.1286261 team_!
## 3332 0.1272661 increase
## 5796 0.1270953 !_&
## 1427 0.1264018 prove
## 5401 0.1252117 compliments
## 4207 0.1245012 leave_eu
## 6248 0.1242690 ._congratulations
## 2674 0.1240234 great_piece
## 5919 0.1235130 @_we've
## 5372 0.1228772 #ukip_mep
## 2882 0.1227531 addition
## 4767 0.1227420 #badgercull
## 3784 0.1201844 "_don't
## 3063 0.1201835 charges
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, western, west_green, vote_-, insignificant, dt, dt_@, re_:, seat_., :_new, ;_thnx, @_bbc, ,_much, result_,, @_yougov, team_!, increase, !_&, prove, compliments, leave_eu, ._congratulations, great_piece, @_we've, #ukip_mep, addition, #badgercull, \"_don't, charges"
We can easily modify our code to experiment with Lasso or ElasticNet models:
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 44 6
## 1 149 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8304158
precision(preds, tweets$engaging[test])
## [1] 0.8275463
recall(preds, tweets$engaging[test])
## [1] 0.9916782
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## @ @_@ . , ! -
## 0.6748941 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 203 -1.8155406 on_@
## 3633 -1.7398987 #voteni2014
## 106 -1.6224775 via_@
## 853 -1.5724760 #votelabour
## 149 -1.3057004 with_@
## 113 -1.2967362 to_@
## 129 -1.1918898 and_@
## 367 -1.0251529 hustings
## 70 -0.9180099 #ep2014
## 3620 -0.8247825 far_@
## 506 -0.7098486 hacked
## 191 -0.6606128 rt_@
## 3313 -0.6589735 (_@
## 383 -0.6568307 #labourdoorstep
## 64 -0.6379266 today
## 1518 -0.5809371 just_voted
## 889 -0.5727822 #votelab14
## 71 -0.5666851 green
## 77 -0.5616970 @_is
## 2450 -0.5568338 rd
## 1091 -0.5375330 at_@
## 177 -0.5161206 '_s
## 442 -0.5150840 -_@
## 118 -0.5009806 #votegreen2014
## 119 -0.4516696 elections
## 611 -0.4492005 meeting
## 926 -0.3999965 :_"
## 91 -0.3766497 euro
## 150 -0.3680399 campaigning
## 259 -0.3677628 ;_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 22 0.682689751 @_i
## 1 0.674894097 @
## 13 0.452484535 thanks
## 44 0.293620973 thank
## 7 0.191427258 ?
## 141 0.173218648 @_good
## 74 0.148929010 :_-
## 25 0.090139103 @_thanks
## 86 0.076451338 @_yes
## 23 0.071765702 good
## 60 0.005819786 @_you
## 2 0.000000000 @_@
## 3 0.000000000 .
## 4 0.000000000 ,
## 5 0.000000000 !
## 6 0.000000000 -
## 8 0.000000000 ._.
## 11 0.000000000 ;
## 12 0.000000000 &
## 14 0.000000000 '
## 15 0.000000000 amp
## 16 0.000000000 &_amp
## 17 0.000000000 amp_;
## 18 0.000000000 ukip
## 19 0.000000000 )
## 20 0.000000000 will
## 24 0.000000000 vote
## 26 0.000000000 just
## 28 0.000000000 people
## 29 0.000000000 ._@
We now see that the coefficients for some features actually became zero. Finally, let’s try an elastic net, which combines the ridge (L2) and lasso (L1) penalties; here we set alpha to 0.50:
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# NOTE: this will not cross-validate across values of alpha
# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 73 18
## 1 120 703
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8490153
precision(preds, tweets$engaging[test])
## [1] 0.854192
recall(preds, tweets$engaging[test])
## [1] 0.9750347
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
## @ @_@ . , ! -
## 0.6571334 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 2654 -2.710506 nw_.
## 5781 -2.174327 @_#labourdoorstep
## 853 -2.134503 #votelabour
## 203 -2.118042 on_@
## 6275 -2.100751 meps_:
## 3633 -2.065081 #voteni2014
## 6299 -2.048960 meps_@
## 3620 -1.918778 far_@
## 3659 -1.875366 @_w
## 3313 -1.844906 (_@
## 3021 -1.756611 tries
## 2763 -1.733223 only_@
## 2535 -1.716515 password
## 367 -1.547533 hustings
## 106 -1.523086 via_@
## 1518 -1.517925 just_voted
## 113 -1.474411 to_@
## 506 -1.464165 hacked
## 3831 -1.447655 on_#marr
## 2939 -1.364172 @_event
## 149 -1.348147 with_@
## 129 -1.331389 and_@
## 5048 -1.302778 @_leaflets
## 889 -1.235518 #votelab14
## 6019 -1.191481 secured
## 4612 -1.183088 one_can
## 3014 -1.174641 out_@
## 3858 -1.164352 election_-
## 2922 -1.120758 britain_.
## 1002 -1.100790 bus
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 22 0.94753021 @_i
## 1 0.65713339 @
## 74 0.55519984 :_-
## 86 0.49546459 @_yes
## 13 0.46661871 thanks
## 141 0.46064031 @_good
## 863 0.41931120 #votegreen
## 60 0.40343912 @_you
## 25 0.40022237 @_thanks
## 404 0.38919132 @_great
## 44 0.34696637 thank
## 57 0.34442865 @_thank
## 68 0.34223620 @_the
## 7 0.30753244 ?
## 172 0.29925516 @_not
## 2874 0.28642753 western
## 463 0.26233208 @_please
## 4568 0.22983992 ,_much
## 187 0.22440384 @_i'm
## 23 0.21131087 good
## 56 0.17467517 need
## 5024 0.16302678 west_green
## 3235 0.15409671 result_,
## 88 0.14001035 good_luck
## 37 0.13699515 many
## 135 0.13101317 @_well
## 982 0.11846787 out_&
## 228 0.09655674 every
## 109 0.08829573 really
## 92 0.07923897 :_)
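As the comment in the elastic net code above notes, cv.glmnet only cross-validates over lambda for a fixed value of alpha. If we also wanted to tune alpha, one option (a minimal sketch, not part of the original analysis) is to loop over a small grid of alpha values while keeping the fold assignments fixed so that the comparison is fair:
# sketch: grid search over alpha with fixed fold assignments
alphas <- seq(0, 1, by=0.25)
foldid <- sample(rep(1:5, length.out=length(training)))
cv.errors <- sapply(alphas, function(a){
  cv <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                  family="binomial", alpha=a, foldid=foldid,
                  type.measure="class")
  min(cv$cvm) # lowest cross-validated misclassification error for this alpha
})
alphas[which.min(cv.errors)] # best-performing alpha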
If we really want the best performance at a low computational cost, a cutting-edge method many people are using is distributed gradient boosting, which builds on the same ideas as boosted trees and random forests and is implemented in the xgboost package.
First, let’s prepare the data…
library(quanteda)
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
# clean text and create DFM
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
twcorpus <- corpus(tweets$text)
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 4,566 documents, 48,721 features
## ...
## dfm_select removed 19,434 features and 0 documents, padding 0s for 0 features and 0 documents.
## ... created a 4,566 x 29,287 sparse dfm
## ... complete.
## Elapsed time: 27.3 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 document: 22,883
## Total features removed: 22,883 (78.1%).
# training and test sets
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
Now we can train the model:
library(xgboost)
# converting the dfm into a sparse matrix class (dgCMatrix) that xgboost accepts
X <- as(twdfm, "dgCMatrix")
# parameters to explore
tryEta <- c(1,2)
tryDepths <- c(1,2,4)
# placeholders for now
bestEta=NA
bestDepth=NA
bestAcc=0
for(eta in tryEta){
for(dp in tryDepths){
bst <- xgb.cv(data = X[training,],
label = tweets$engaging[training],
max.depth = dp,
eta = eta,
nthread = 4,
nround = 500,
nfold=5,
print_every_n = 100L,
objective = "binary:logistic")
# cross-validated accuracy
acc <- 1-mean(tail(bst$evaluation_log$test_error_mean))
cat("Results for eta=",eta," and depth=", dp, " : ",
acc," accuracy.\n",sep="")
if(acc>bestAcc){
bestEta=eta
bestAcc=acc
bestDepth=dp
}
}
}
## [1] train-error:0.143484+0.003302 test-error:0.143492+0.013204
## [101] train-error:0.086734+0.001456 test-error:0.115010+0.008680
## [201] train-error:0.073179+0.002765 test-error:0.112546+0.009483
## [301] train-error:0.068319+0.003528 test-error:0.111176+0.010121
## [401] train-error:0.066882+0.002997 test-error:0.113916+0.012035
## [500] train-error:0.063322+0.002072 test-error:0.117201+0.011626
## Results for eta=1 and depth=1 : 0.8829353 accuracy.
## [1] train-error:0.141840+0.003106 test-error:0.143204+0.008415
## [101] train-error:0.063732+0.001551 test-error:0.112542+0.005309
## [201] train-error:0.047850+0.001489 test-error:0.114185+0.008953
## [301] train-error:0.037993+0.001295 test-error:0.117469+0.009495
## [401] train-error:0.029915+0.001522 test-error:0.118839+0.007003
## [500] train-error:0.024986+0.002205 test-error:0.120209+0.008912
## Results for eta=1 and depth=2 : 0.8804297 accuracy.
## [1] train-error:0.131914+0.001697 test-error:0.139377+0.009563
## [101] train-error:0.037787+0.003154 test-error:0.118295+0.015582
## [201] train-error:0.019304+0.002656 test-error:0.117197+0.015071
## [301] train-error:0.012117+0.001539 test-error:0.119113+0.013914
## [401] train-error:0.009310+0.001212 test-error:0.118019+0.013796
## [500] train-error:0.008146+0.000821 test-error:0.121307+0.014865
## Results for eta=1 and depth=4 : 0.8800171 accuracy.
## [1] train-error:0.143483+0.002022 test-error:0.143482+0.008082
## [101] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [201] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [301] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [401] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [500] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## Results for eta=2 and depth=1 : 0.5665086 accuracy.
## [1] train-error:0.143483+0.000963 test-error:0.143484+0.003853
## [101] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [201] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [301] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [401] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [500] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## Results for eta=2 and depth=2 : 0.708116 accuracy.
## [1] train-error:0.130202+0.003691 test-error:0.142107+0.013936
## [101] train-error:0.186615+0.077187 test-error:0.207264+0.040891
## [201] train-error:0.182577+0.085251 test-error:0.206716+0.041891
## [301] train-error:0.182029+0.086344 test-error:0.205894+0.043398
## [401] train-error:0.181824+0.086754 test-error:0.206716+0.041891
## [500] train-error:0.181756+0.086891 test-error:0.206990+0.041390
## Results for eta=2 and depth=4 : 0.7930559 accuracy.
cat("Best model has eta=",bestEta," and depth=", bestDepth, " : ",
bestAcc," accuracy.\n",sep="")
## Best model has eta=1 and depth=1 : 0.8829353 accuracy.
How well does it perform out-of-sample?
# running best model
rf <- xgboost(data = X[training,],
label = tweets$engaging[training],
max.depth = bestDepth,
eta = bestEta,
nthread = 4,
nround = 1000,
print_every_n=100L,
objective = "binary:logistic")
## [1] train-error:0.143483
## [101] train-error:0.086802
## [201] train-error:0.075575
## [301] train-error:0.073111
## [401] train-error:0.066539
## [501] train-error:0.063253
## [601] train-error:0.064896
## [701] train-error:0.061336
## [801] train-error:0.061062
## [901] train-error:0.061062
## [1000] train-error:0.061336
# out-of-sample accuracy
preds <- predict(rf, X[test,])
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
cat("\nAccuracy on test set=", round(accuracy(preds>.50, tweets$engaging[test]),3))
##
## Accuracy on test set= 0.869
cat("\nPrecision on test set=", round(precision(preds>.50, tweets$engaging[test]),3))
##
## Precision on test set= 0.907
cat("\nRecall on test set=", round(recall(preds>.50, tweets$engaging[test]),3))
##
## Recall on test set= 0.929
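For comparison with the glmnet models above, we can also look at the confusion matrix; since xgboost returns predicted probabilities, we threshold them at 0.5 first (a quick sketch):
# sketch: confusion matrix for the xgboost predictions
table(preds > .50, tweets$engaging[test])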
What we sacrifice is interpretability (yet again!). We can check feature importance, but it’s often hard to tell what’s going on exactly. Why? We only learn which features “matter”, but not why!
# feature importance
labels <- dimnames(X)[[2]]
importance <- xgb.importance(labels, model = rf, data=X, label=tweets$engaging)
## Warning in `[.data.table`(result, , `:=`("RealCover", as.numeric(vec)), :
## with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
importance <- importance[order(importance$Gain, decreasing=TRUE),]
head(importance, n=20)
## Feature Split Gain Cover Frequency RealCover
## 1: @ 18 0.317075959 0.005701276 0.003 979
## 2: @_@ 16 0.070935824 0.002328724 0.001 530
## 3: @_: 4 0.039597797 0.009091385 0.008 30
## 4: via_@ -9.53674e-07 0.037202199 0.003077324 0.002 4
## 5: with_@ 4 0.026151768 0.001795029 0.001 3
## 6: to_@ -9.53674e-07 0.023263293 0.003530451 0.003 10
## 7: thanks 1.5 0.017029328 0.004719396 0.004 128
## 8: #ep2014 -9.53674e-07 0.016931430 0.002705811 0.002 9
## 9: @_is 4 0.015453980 0.002405848 0.002 12
## 10: on_@ 4 0.014059682 0.006254939 0.006 2
## 11: @ 8.5 0.013889075 0.001662281 0.001 991
## 12: ? 16 0.011642680 0.001452536 0.001 217
## 13: @_i -9.53674e-07 0.010909844 0.006573625 0.006 99
## 14: @_' -9.53674e-07 0.010605283 0.001436444 0.001 8
## 15: today 4 0.010334038 0.002378009 0.002 15
## 16: and_@ 4 0.007915031 0.001422331 0.001 7
## 17: #votelabour -9.53674e-07 0.007698537 0.003470392 0.003 0
## 18: green 1.5 0.006968371 0.001379948 0.001 13
## 19: hacked -9.53674e-07 0.006263610 0.004351157 0.004 2
## 20: " 1.5 0.005490267 0.001356821 0.001 26
## RealCover %
## 1: 0.2751545812
## 2: 0.1489600899
## 3: 0.0084317032
## 4: 0.0011242271
## 5: 0.0008431703
## 6: 0.0028105677
## 7: 0.0359752670
## 8: 0.0025295110
## 9: 0.0033726813
## 10: 0.0005621135
## 11: 0.2785272625
## 12: 0.0609893198
## 13: 0.0278246206
## 14: 0.0022484542
## 15: 0.0042158516
## 16: 0.0019673974
## 17: 0.0000000000
## 18: 0.0036537381
## 19: 0.0005621135
## 20: 0.0073074761
# adding the sign: for each feature, which class (0 or 1) uses it more often
sums <- list()
for (v in 0:1){
sums[[v+1]] <- colSums(X[tweets[,"engaging"]==v,])
}
sums <- do.call(cbind, sums)
sign <- apply(sums, 1, which.max)
df <- data.frame(
Feature = labels,
sign = sign-1,
stringsAsFactors=F)
importance <- merge(importance, df, by="Feature")
## best predictors
for (v in 0:1){
cat("\n\n")
cat("value==", v)
importance <- importance[order(importance$Gain, decreasing=TRUE),]
print(head(importance[importance$sign==v,], n=50))
cat("\n")
cat(paste(unique(head(importance$Feature[importance$sign==v], n=50)), collapse=", "))
}
##
##
## value== 0 Feature Split Gain Cover Frequency
## 1: via_@ -9.53674e-07 0.0372021989 0.003077324 0.002
## 2: with_@ 4 0.0261517677 0.001795029 0.001
## 3: to_@ -9.53674e-07 0.0232632932 0.003530451 0.003
## 4: #ep2014 -9.53674e-07 0.0169314299 0.002705811 0.002
## 5: on_@ 4 0.0140596818 0.006254939 0.006
## 6: @_' -9.53674e-07 0.0106052830 0.001436444 0.001
## 7: today 4 0.0103340378 0.002378009 0.002
## 8: and_@ 4 0.0079150306 0.001422331 0.001
## 9: #votelabour -9.53674e-07 0.0076985374 0.003470392 0.003
## 10: green 1.5 0.0069683712 0.001379948 0.001
## 11: hacked -9.53674e-07 0.0062636103 0.004351157 0.004
## 12: meet 4 0.0051424163 0.003338398 0.003
## 13: elections -9.53674e-07 0.0049795409 0.002374194 0.002
## 14: #votegreen2014 4 0.0049374267 0.004299185 0.004
## 15: #labourdoorstep -9.53674e-07 0.0038669121 0.004206947 0.004
## 16: rt_@ -9.53674e-07 0.0032562307 0.001323676 0.001
## 17: candidates 1.5 0.0032194462 0.002327759 0.002
## 18: at_@ -9.53674e-07 0.0031727733 0.002280168 0.002
## 19: -_@ -9.53674e-07 0.0030844565 0.005115433 0.005
## 20: ,_@ 1.5 0.0030382871 0.004094624 0.004
## 21: seat 4 0.0027930090 0.001241858 0.001
## 22: :_" -9.53674e-07 0.0027303949 0.001160337 0.001
## 23: from_@ 4 0.0026258916 0.002124374 0.002
## 24: hustings -9.53674e-07 0.0025360397 0.002288552 0.002
## 25: english 1.5 0.0024157110 0.010450661 0.011
## 26: european 1.5 0.0023950405 0.004098199 0.004
## 27: with_@ 1.5 0.0022739461 0.002120355 0.002
## 28: campaigning -9.53674e-07 0.0021413964 0.004040229 0.004
## 29: #eu -9.53674e-07 0.0021397238 0.003148474 0.003
## 30: mp -9.53674e-07 0.0020997256 0.002184421 0.002
## 31: #bbcqt 4 0.0020329772 0.003059013 0.003
## 32: for_@ 1.5 0.0018614026 0.003019753 0.003
## 33: of_@ -9.53674e-07 0.0018029113 0.003092100 0.003
## 34: rt -9.53674e-07 0.0017769030 0.005928090 0.006
## 35: meeting -9.53674e-07 0.0016835141 0.004041613 0.004
## 36: in_@ -9.53674e-07 0.0016784296 0.002229748 0.002
## 37: candidate -9.53674e-07 0.0016500573 0.003015176 0.003
## 38: out_! -9.53674e-07 0.0014571668 0.004899975 0.005
## 39: | 6 0.0013339202 0.016360403 0.018
## 40: ._" -9.53674e-07 0.0012311108 0.003926545 0.004
## 41: | 2.5 0.0011420356 0.016292804 0.018
## 42: england -9.53674e-07 0.0011096390 0.002122868 0.002
## 43: community -9.53674e-07 0.0010519142 0.001114778 0.001
## 44: that_@ -9.53674e-07 0.0010118648 0.004877524 0.005
## 45: britain -9.53674e-07 0.0009239397 0.002965073 0.003
## 46: the_@ 4 0.0007959244 0.002002425 0.002
## 47: mep_@ -9.53674e-07 0.0007164653 0.003902668 0.004
## 48: delighted -9.53674e-07 0.0006982868 0.003893584 0.004
## 49: london 4 0.0006417953 0.002924526 0.003
## 50: green_party -9.53674e-07 0.0006182929 0.001096317 0.001
## Feature Split Gain Cover Frequency
## RealCover RealCover % sign
## 1: 4 0.0011242271 0
## 2: 3 0.0008431703 0
## 3: 10 0.0028105677 0
## 4: 9 0.0025295110 0
## 5: 2 0.0005621135 0
## 6: 8 0.0022484542 0
## 7: 15 0.0042158516 0
## 8: 7 0.0019673974 0
## 9: 0 0.0000000000 0
## 10: 13 0.0036537381 0
## 11: 2 0.0005621135 0
## 12: 4 0.0011242271 0
## 13: 8 0.0022484542 0
## 14: 8 0.0022484542 0
## 15: 0 0.0000000000 0
## 16: 4 0.0011242271 0
## 17: 5 0.0014052839 0
## 18: 0 0.0000000000 0
## 19: 4 0.0011242271 0
## 20: 8 0.0022484542 0
## 21: 5 0.0014052839 0
## 22: 0 0.0000000000 0
## 23: 3 0.0008431703 0
## 24: 0 0.0000000000 0
## 25: 6 0.0016863406 0
## 26: 13 0.0036537381 0
## 27: 9 0.0025295110 0
## 28: 5 0.0014052839 0
## 29: 3 0.0008431703 0
## 30: 4 0.0011242271 0
## 31: 4 0.0011242271 0
## 32: 12 0.0033726813 0
## 33: 5 0.0014052839 0
## 34: 8 0.0022484542 0
## 35: 1 0.0002810568 0
## 36: 1 0.0002810568 0
## 37: 6 0.0016863406 0
## 38: 2 0.0005621135 0
## 39: 3 0.0008431703 0
## 40: 7 0.0019673974 0
## 41: 0 0.0000000000 0
## 42: 4 0.0011242271 0
## 43: 4 0.0011242271 0
## 44: 2 0.0005621135 0
## 45: 3 0.0008431703 0
## 46: 5 0.0014052839 0
## 47: 0 0.0000000000 0
## 48: 1 0.0002810568 0
## 49: 3 0.0008431703 0
## 50: 2 0.0005621135 0
## RealCover RealCover % sign
##
## via_@, with_@, to_@, #ep2014, on_@, @_', today, and_@, #votelabour, green, hacked, meet, elections, #votegreen2014, #labourdoorstep, rt_@, candidates, at_@, -_@, ,_@, seat, :_", from_@, hustings, english, european, campaigning, #eu, mp, #bbcqt, for_@, of_@, rt, meeting, in_@, candidate, out_!, |, ._", england, community, that_@, britain, the_@, mep_@, delighted, london, green_party
##
## value== 1 Feature Split Gain Cover Frequency
## 1: @ 18 0.317075959 0.005701276 0.003
## 2: @_@ 16 0.070935824 0.002328724 0.001
## 3: @_: 4 0.039597797 0.009091385 0.008
## 4: thanks 1.5 0.017029328 0.004719396 0.004
## 5: @_is 4 0.015453980 0.002405848 0.002
## 6: @ 8.5 0.013889075 0.001662281 0.001
## 7: ? 16 0.011642680 0.001452536 0.001
## 8: @_i -9.53674e-07 0.010909844 0.006573625 0.006
## 9: " 1.5 0.005490267 0.001356821 0.001
## 10: . 7.5 0.005013800 0.003382743 0.003
## 11: good 1.5 0.004944777 0.001374732 0.001
## 12: thank -9.53674e-07 0.004879654 0.002298679 0.002
## 13: :_- -9.53674e-07 0.004744827 0.005176608 0.005
## 14: meps -9.53674e-07 0.004412182 0.001341181 0.001
## 15: another -9.53674e-07 0.004164543 0.004139356 0.004
## 16: really 1.5 0.003953506 0.004331584 0.004
## 17: ( 6 0.003858481 0.002257590 0.002
## 18: yes 2.5 0.003764573 0.004214826 0.004
## 19: #ukip 1.5 0.003509819 0.001318302 0.001
## 20: ! 1.5 0.003384069 0.003232096 0.003
## 21: need 4 0.003104709 0.005149799 0.005
## 22: eu 4 0.003100187 0.002237712 0.002
## 23: @_on -9.53674e-07 0.003080565 0.003129560 0.003
## 24: congratulations 4 0.003005913 0.004200734 0.004
## 25: please -9.53674e-07 0.002958356 0.006984084 0.007
## 26: @_great -9.53674e-07 0.002876869 0.002250571 0.002
## 27: @ 7.5 0.002846132 0.001305125 0.001
## 28: can 1.5 0.002764517 0.002200581 0.002
## 29: @ 1.5 0.002729959 0.008812900 0.009
## 30: !_@ -9.53674e-07 0.002620017 0.006951994 0.007
## 31: uk 6 0.002525966 0.003110272 0.003
## 32: anti -9.53674e-07 0.002274024 0.002130588 0.002
## 33: twitter 1.5 0.002228139 0.003145225 0.003
## 34: council 4 0.002161979 0.004051947 0.004
## 35: campaign -9.53674e-07 0.002150206 0.001219883 0.001
## 36: @_the -9.53674e-07 0.002113128 0.005072627 0.005
## 37: far 1.5 0.001910249 0.003095117 0.003
## 38: real 4 0.001868129 0.002087940 0.002
## 39: @_for -9.53674e-07 0.001844937 0.001203005 0.001
## 40: many 1.5 0.001842180 0.003145021 0.003
## 41: actually -9.53674e-07 0.001819911 0.002241564 0.002
## 42: parties 4 0.001771041 0.005011465 0.005
## 43: @_not -9.53674e-07 0.001729829 0.002224320 0.002
## 44: lib 1.5 0.001710898 0.004000720 0.004
## 45: night 4 0.001629950 0.001186950 0.001
## 46: sad -9.53674e-07 0.001600430 0.003936113 0.004
## 47: ._not 4 0.001555707 0.008516418 0.009
## 48: @_no -9.53674e-07 0.001555048 0.004049651 0.004
## 49: @_was -9.53674e-07 0.001546615 0.002081523 0.002
## 50: @_good -9.53674e-07 0.001539965 0.001185114 0.001
## Feature Split Gain Cover Frequency
## RealCover RealCover % sign
## 1: 979 0.2751545812 1
## 2: 530 0.1489600899 1
## 3: 30 0.0084317032 1
## 4: 128 0.0359752670 1
## 5: 12 0.0033726813 1
## 6: 991 0.2785272625 1
## 7: 217 0.0609893198 1
## 8: 99 0.0278246206 1
## 9: 26 0.0073074761 1
## 10: 491 0.1379988758 1
## 11: 73 0.0205171445 1
## 12: 45 0.0126475548 1
## 13: 28 0.0078695897 1
## 14: 9 0.0025295110 1
## 15: 7 0.0019673974 1
## 16: 19 0.0053400787 1
## 17: 24 0.0067453626 1
## 18: 31 0.0087127600 1
## 19: 46 0.0129286116 1
## 20: 249 0.0699831366 1
## 21: 37 0.0103991006 1
## 22: 58 0.0163012929 1
## 23: 13 0.0036537381 1
## 24: 17 0.0047779651 1
## 25: 18 0.0050590219 1
## 26: 8 0.0022484542 1
## 27: 1028 0.2889263631 1
## 28: 55 0.0154581225 1
## 29: 1003 0.2818999438 1
## 30: 5 0.0014052839 1
## 31: 31 0.0087127600 1
## 32: 11 0.0030916245 1
## 33: 10 0.0028105677 1
## 34: 6 0.0016863406 1
## 35: 15 0.0042158516 1
## 36: 30 0.0084317032 1
## 37: 9 0.0025295110 1
## 38: 8 0.0022484542 1
## 39: 4 0.0011242271 1
## 40: 55 0.0154581225 1
## 41: 11 0.0030916245 1
## 42: 16 0.0044969084 1
## 43: 18 0.0050590219 1
## 44: 6 0.0016863406 1
## 45: 8 0.0022484542 1
## 46: 2 0.0005621135 1
## 47: 3 0.0008431703 1
## 48: 36 0.0101180438 1
## 49: 2 0.0005621135 1
## 50: 17 0.0047779651 1
## RealCover RealCover % sign
##
## @, @_@, @_:, thanks, @_is, ?, @_i, ", ., good, thank, :_-, meps, another, really, (, yes, #ukip, !, need, eu, @_on, congratulations, please, @_great, can, !_@, uk, anti, twitter, council, campaign, @_the, far, real, @_for, many, actually, parties, @_not, lib, night, sad, ._not, @_no, @_was, @_good