Our running example will be a random sample of nearly 5,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We will be using a variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (just sending a message, without trying to elicit a response).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates avoid engaging with voters on Twitter because they are exposed to mostly impolite messages.
Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging or not.
library(quanteda)
## quanteda version 0.9.9.65
## Using 3 of 4 cores for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
We’ll do some cleaning as well, substituting user handles with a generic @. Why? We want to prevent overfitting: a specific handle is unlikely to reappear in new data, whereas the fact that a tweet mentions another user is a useful signal.
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
As we discussed last week, before we can do any type of automated text analysis, we need to go through several “preprocessing” steps so that the text can be passed to a statistical model. We’ll use the quanteda package here.
The basic unit of work for the quanteda package is called a corpus, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use summary to get some information about your corpus.
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 4566 documents, showing 100 documents.
##
## Text Types Tokens Sentences
## text1 8 10 2
## text2 17 23 3
## text3 15 18 2
## text4 15 19 3
## text5 14 15 1
## text6 3 3 1
## text7 9 9 1
## text8 20 21 2
## text9 20 23 5
## text10 8 9 1
## text11 15 20 1
## text12 23 27 1
## text13 16 27 1
## text14 22 26 2
## text15 12 19 1
## text16 21 23 3
## text17 9 14 1
## text18 19 23 1
## text19 18 20 1
## text20 11 14 1
## text21 6 6 1
## text22 14 15 1
## text23 8 10 1
## text24 6 7 2
## text25 29 33 1
## text26 21 23 3
## text27 16 18 2
## text28 19 21 1
## text29 15 21 1
## text30 19 21 1
## text31 20 22 1
## text32 17 17 1
## text33 7 7 1
## text34 22 23 2
## text35 18 20 2
## text36 16 16 1
## text37 9 9 1
## text38 24 27 2
## text39 17 18 1
## text40 19 22 2
## text41 22 23 2
## text42 26 30 3
## text43 23 27 3
## text44 6 8 1
## text45 6 6 1
## text46 18 23 1
## text47 17 18 2
## text48 18 21 2
## text49 5 7 1
## text50 20 25 1
## text51 3 11 1
## text52 10 12 1
## text53 14 16 1
## text54 24 31 1
## text55 15 18 1
## text56 11 12 1
## text57 19 22 1
## text58 19 20 2
## text59 11 12 1
## text60 19 22 1
## text61 14 16 1
## text62 13 14 1
## text63 7 8 1
## text64 12 12 2
## text65 17 22 1
## text66 10 13 1
## text67 22 23 1
## text68 16 16 2
## text69 13 16 2
## text70 17 17 3
## text71 14 16 2
## text72 23 25 2
## text73 26 30 3
## text74 21 23 2
## text75 18 21 1
## text76 12 14 1
## text77 13 16 1
## text78 8 8 1
## text79 7 8 1
## text80 25 28 1
## text81 8 8 1
## text82 9 9 3
## text83 24 26 2
## text84 16 20 1
## text85 18 20 1
## text86 12 16 1
## text87 18 19 1
## text88 21 22 2
## text89 19 21 1
## text90 11 13 1
## text91 19 21 3
## text92 27 31 3
## text93 15 17 1
## text94 5 5 1
## text95 4 4 1
## text96 28 29 2
## text97 5 6 1
## text98 22 24 1
## text99 15 21 1
## text100 24 27 1
##
## Source: /Users/pablobarbera/git/POIR613/code/* on x86_64 by pablobarbera
## Created: Tue Sep 26 16:06:33 2017
## Notes:
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 4,566 documents, 48,721 features
## ...
## dfm_select removed 19,434 features and 0 documents, padding 0s for 0 features and 0 documents.
## ... created a 4,566 x 29,287 sparse dfm
## ... complete.
## Elapsed time: 25.9 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 document: 22,883
## Total features removed: 22,883 (78.1%).
Note that other preprocessing options are available, such as stemming, removing punctuation and numbers, or changing the n-gram range; see the sketch below. You can read more in the dfm and tokens help pages.
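For instance, here is a minimal sketch of a more aggressive preprocessing setup (the argument names follow the quanteda version loaded above and may differ in newer releases; this alternative DFM is not used in the rest of the analysis):
# sketch: alternative preprocessing with stemming and punctuation/number removal
twdfm_alt <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
                 remove_punct=TRUE, remove_numbers=TRUE, stem=TRUE,
                 ngrams=1, verbose=TRUE)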
Once we have the DFM, we split it into a training and a test set. We’ll go with 80% for training and 20% for testing. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience at least), and it has cross-validation already built-in, so we don’t need to code it from scratch. We’ll start with a ridge regression:
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-10
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
plot(ridge)
We can now compute the performance metrics on the test set.
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 11 2
## 1 182 719
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.7986871
precision(preds, tweets$engaging[test])
## [1] 0.7980022
recall(preds, tweets$engaging[test])
## [1] 0.9972261
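Note that the precision and recall functions above are defined for the positive class (engaging tweets) only. Given the confusion matrix above, the picture for the broadcasting class is much less rosy; a quick sketch of the equivalent metrics for class 0:
# sketch: precision and recall for the negative (broadcasting) class
tab <- table(preds, tweets$engaging[test])
tab[1,1] / sum(tab[1,]) # precision for class 0: true 0s among predicted 0s
tab[1,1] / sum(tab[,1]) # recall for class 0: predicted 0s among actual 0s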
Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## @ @_@ . , ! -
## 0.029413077 0.025527048 0.001848582 -0.003627634 0.009469734 -0.015374536
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 3997 -0.3502883 reverse
## 5170 -0.3501177 knocking
## 4859 -0.3423031 managed
## 4787 -0.3409223 eu_law
## 6235 -0.3395205 cunt_.
## 3161 -0.3388851 posters
## 5529 -0.3371467 @_doing
## 3943 -0.3366187 beacon
## 5275 -0.3360461 ._they're
## 4351 -0.3352499 nationalism
## 5281 -0.3348163 populist_parties
## 4864 -0.3340288 determined
## 2584 -0.3331052 leafleting
## 5922 -0.3329338 euroscepticism
## 6283 -0.3317718 moving_!
## 6282 -0.3317718 lovely_messages
## 5060 -0.3317194 fame
## 5009 -0.3314492 cleaning
## 4338 -0.3302880 zone
## 3796 -0.3301211 #yes
## 2654 -0.3290435 nw_.
## 3709 -0.3289839 laughed
## 4339 -0.3285087 defends
## 5903 -0.3283701 effective
## 3781 -0.3271190 god_,
## 4440 -0.3270190 scottish_independence
## 4377 -0.3265807 dick
## 5212 -0.3260675 one_person
## 5635 -0.3256312 tonbridge
## 5215 -0.3249450 earn
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, managed, eu_law, cunt_., posters, @_doing, beacon, ._they're, nationalism, populist_parties, determined, leafleting, euroscepticism, moving_!, lovely_messages, fame, cleaning, zone, #yes, nw_., laughed, defends, effective, god_,, scottish_independence, dick, one_person, tonbridge, earn"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 5069 0.2059479 town_hall
## 2055 0.1647179 !_rt
## 2874 0.1637953 western
## 5024 0.1637902 west_green
## 4116 0.1633214 vote_-
## 4673 0.1556281 insignificant
## 4622 0.1498899 dt
## 4624 0.1498884 dt_@
## 2741 0.1465876 re_:
## 6232 0.1454278 seat_.
## 5762 0.1379470 :_new
## 3020 0.1362461 ;_thnx
## 3323 0.1328590 @_bbc
## 4568 0.1312390 ,_much
## 3235 0.1303059 result_,
## 4908 0.1298449 @_yougov
## 5795 0.1286261 team_!
## 3332 0.1272661 increase
## 5796 0.1270953 !_&
## 1427 0.1264018 prove
## 5401 0.1252117 compliments
## 4207 0.1245012 leave_eu
## 6248 0.1242690 ._congratulations
## 2674 0.1240234 great_piece
## 5919 0.1235130 @_we've
## 5372 0.1228772 #ukip_mep
## 2882 0.1227531 addition
## 4767 0.1227420 #badgercull
## 3784 0.1201844 "_don't
## 3063 0.1201835 charges
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, western, west_green, vote_-, insignificant, dt, dt_@, re_:, seat_., :_new, ;_thnx, @_bbc, ,_much, result_,, @_yougov, team_!, increase, !_&, prove, compliments, leave_eu, ._congratulations, great_piece, @_we've, #ukip_mep, addition, #badgercull, \"_don't, charges"
We can easily modify our code to experiment with Lasso or ElasticNet models:
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 44 6
## 1 149 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8304158
precision(preds, tweets$engaging[test])
## [1] 0.8275463
recall(preds, tweets$engaging[test])
## [1] 0.9916782
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## @ @_@ . , ! -
## 0.6748941 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 203 -1.8155406 on_@
## 3633 -1.7398987 #voteni2014
## 106 -1.6224775 via_@
## 853 -1.5724760 #votelabour
## 149 -1.3057004 with_@
## 113 -1.2967362 to_@
## 129 -1.1918898 and_@
## 367 -1.0251529 hustings
## 70 -0.9180099 #ep2014
## 3620 -0.8247825 far_@
## 506 -0.7098486 hacked
## 191 -0.6606128 rt_@
## 3313 -0.6589735 (_@
## 383 -0.6568307 #labourdoorstep
## 64 -0.6379266 today
## 1518 -0.5809371 just_voted
## 889 -0.5727822 #votelab14
## 71 -0.5666851 green
## 77 -0.5616970 @_is
## 2450 -0.5568338 rd
## 1091 -0.5375330 at_@
## 177 -0.5161206 '_s
## 442 -0.5150840 -_@
## 118 -0.5009806 #votegreen2014
## 119 -0.4516696 elections
## 611 -0.4492005 meeting
## 926 -0.3999965 :_"
## 91 -0.3766497 euro
## 150 -0.3680399 campaigning
## 259 -0.3677628 ;_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 22 0.682689751 @_i
## 1 0.674894097 @
## 13 0.452484535 thanks
## 44 0.293620973 thank
## 7 0.191427258 ?
## 141 0.173218648 @_good
## 74 0.148929010 :_-
## 25 0.090139103 @_thanks
## 86 0.076451338 @_yes
## 23 0.071765702 good
## 60 0.005819786 @_you
## 2 0.000000000 @_@
## 3 0.000000000 .
## 4 0.000000000 ,
## 5 0.000000000 !
## 6 0.000000000 -
## 8 0.000000000 ._.
## 11 0.000000000 ;
## 12 0.000000000 &
## 14 0.000000000 '
## 15 0.000000000 amp
## 16 0.000000000 &_amp
## 17 0.000000000 amp_;
## 18 0.000000000 ukip
## 19 0.000000000 )
## 20 0.000000000 will
## 24 0.000000000 vote
## 26 0.000000000 just
## 28 0.000000000 people
## 29 0.000000000 ._@
We now see that the coefficients for some features actually became zero. Finally, let’s try an elastic net, which combines the ridge (L2) and lasso (L1) penalties; here we set alpha to 0.50:
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# NOTE: this will not cross-validate across values of alpha
# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 73 18
## 1 120 703
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8490153
precision(preds, tweets$engaging[test])
## [1] 0.854192
recall(preds, tweets$engaging[test])
## [1] 0.9750347
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
## @ @_@ . , ! -
## 0.6571334 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 2654 -2.710506 nw_.
## 5781 -2.174327 @_#labourdoorstep
## 853 -2.134503 #votelabour
## 203 -2.118042 on_@
## 6275 -2.100751 meps_:
## 3633 -2.065081 #voteni2014
## 6299 -2.048960 meps_@
## 3620 -1.918778 far_@
## 3659 -1.875366 @_w
## 3313 -1.844906 (_@
## 3021 -1.756611 tries
## 2763 -1.733223 only_@
## 2535 -1.716515 password
## 367 -1.547533 hustings
## 106 -1.523086 via_@
## 1518 -1.517925 just_voted
## 113 -1.474411 to_@
## 506 -1.464165 hacked
## 3831 -1.447655 on_#marr
## 2939 -1.364172 @_event
## 149 -1.348147 with_@
## 129 -1.331389 and_@
## 5048 -1.302778 @_leaflets
## 889 -1.235518 #votelab14
## 6019 -1.191481 secured
## 4612 -1.183088 one_can
## 3014 -1.174641 out_@
## 3858 -1.164352 election_-
## 2922 -1.120758 britain_.
## 1002 -1.100790 bus
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 22 0.94753021 @_i
## 1 0.65713339 @
## 74 0.55519984 :_-
## 86 0.49546459 @_yes
## 13 0.46661871 thanks
## 141 0.46064031 @_good
## 863 0.41931120 #votegreen
## 60 0.40343912 @_you
## 25 0.40022237 @_thanks
## 404 0.38919132 @_great
## 44 0.34696637 thank
## 57 0.34442865 @_thank
## 68 0.34223620 @_the
## 7 0.30753244 ?
## 172 0.29925516 @_not
## 2874 0.28642753 western
## 463 0.26233208 @_please
## 4568 0.22983992 ,_much
## 187 0.22440384 @_i'm
## 23 0.21131087 good
## 56 0.17467517 need
## 5024 0.16302678 west_green
## 3235 0.15409671 result_,
## 88 0.14001035 good_luck
## 37 0.13699515 many
## 135 0.13101317 @_well
## 982 0.11846787 out_&
## 228 0.09655674 every
## 109 0.08829573 really
## 92 0.07923897 :_)
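As the comment in the elastic net code above notes, cv.glmnet only cross-validates over lambda for a fixed value of alpha. If we also wanted to tune alpha, one option (a minimal sketch, not part of the original analysis) is to loop over a small grid of alpha values while keeping the fold assignments fixed so that the comparison is fair:
# sketch: grid search over alpha with fixed fold assignments
alphas <- seq(0, 1, by=0.25)
foldid <- sample(rep(1:5, length.out=length(training)))
cv.errors <- sapply(alphas, function(a){
  cv <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                  family="binomial", alpha=a, foldid=foldid,
                  type.measure="class")
  min(cv$cvm) # lowest cross-validated misclassification error for this alpha
})
alphas[which.min(cv.errors)] # best-performing alpha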
If we really want the best performance at a low computational cost, a cutting-edge method many people are using is distributed gradient boosting, which builds on the same ideas as boosted trees and random forests and is implemented in the xgboost package.
First, let’s prepare the data…
library(quanteda)
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
# clean text and create DFM
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
twcorpus <- corpus(tweets$text)
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 4,566 documents, 48,721 features
## ...
## dfm_select removed 19,434 features and 0 documents, padding 0s for 0 features and 0 documents.
## ... created a 4,566 x 29,287 sparse dfm
## ... complete.
## Elapsed time: 27.3 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 document: 22,883
## Total features removed: 22,883 (78.1%).
# training and test sets
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
Now we can train the model:
library(xgboost)
# converting the dfm into a sparse matrix class (dgCMatrix) that xgboost accepts
X <- as(twdfm, "dgCMatrix")
# parameters to explore
tryEta <- c(1,2)
tryDepths <- c(1,2,4)
# placeholders for now
bestEta=NA
bestDepth=NA
bestAcc=0
for(eta in tryEta){
for(dp in tryDepths){
bst <- xgb.cv(data = X[training,],
label = tweets$engaging[training],
max.depth = dp,
eta = eta,
nthread = 4,
nround = 500,
nfold=5,
print_every_n = 100L,
objective = "binary:logistic")
# cross-validated accuracy
acc <- 1-mean(tail(bst$evaluation_log$test_error_mean))
cat("Results for eta=",eta," and depth=", dp, " : ",
acc," accuracy.\n",sep="")
if(acc>bestAcc){
bestEta=eta
bestAcc=acc
bestDepth=dp
}
}
}
## [1] train-error:0.143484+0.003302 test-error:0.143492+0.013204
## [101] train-error:0.086734+0.001456 test-error:0.115010+0.008680
## [201] train-error:0.073179+0.002765 test-error:0.112546+0.009483
## [301] train-error:0.068319+0.003528 test-error:0.111176+0.010121
## [401] train-error:0.066882+0.002997 test-error:0.113916+0.012035
## [500] train-error:0.063322+0.002072 test-error:0.117201+0.011626
## Results for eta=1 and depth=1 : 0.8829353 accuracy.
## [1] train-error:0.141840+0.003106 test-error:0.143204+0.008415
## [101] train-error:0.063732+0.001551 test-error:0.112542+0.005309
## [201] train-error:0.047850+0.001489 test-error:0.114185+0.008953
## [301] train-error:0.037993+0.001295 test-error:0.117469+0.009495
## [401] train-error:0.029915+0.001522 test-error:0.118839+0.007003
## [500] train-error:0.024986+0.002205 test-error:0.120209+0.008912
## Results for eta=1 and depth=2 : 0.8804297 accuracy.
## [1] train-error:0.131914+0.001697 test-error:0.139377+0.009563
## [101] train-error:0.037787+0.003154 test-error:0.118295+0.015582
## [201] train-error:0.019304+0.002656 test-error:0.117197+0.015071
## [301] train-error:0.012117+0.001539 test-error:0.119113+0.013914
## [401] train-error:0.009310+0.001212 test-error:0.118019+0.013796
## [500] train-error:0.008146+0.000821 test-error:0.121307+0.014865
## Results for eta=1 and depth=4 : 0.8800171 accuracy.
## [1] train-error:0.143483+0.002022 test-error:0.143482+0.008082
## [101] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [201] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [301] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [401] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## [500] train-error:0.438246+0.109140 test-error:0.433491+0.108165
## Results for eta=2 and depth=1 : 0.5665086 accuracy.
## [1] train-error:0.143483+0.000963 test-error:0.143484+0.003853
## [101] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [201] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [301] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [401] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## [500] train-error:0.289844+0.086089 test-error:0.291884+0.094108
## Results for eta=2 and depth=2 : 0.708116 accuracy.
## [1] train-error:0.130202+0.003691 test-error:0.142107+0.013936
## [101] train-error:0.186615+0.077187 test-error:0.207264+0.040891
## [201] train-error:0.182577+0.085251 test-error:0.206716+0.041891
## [301] train-error:0.182029+0.086344 test-error:0.205894+0.043398
## [401] train-error:0.181824+0.086754 test-error:0.206716+0.041891
## [500] train-error:0.181756+0.086891 test-error:0.206990+0.041390
## Results for eta=2 and depth=4 : 0.7930559 accuracy.
cat("Best model has eta=",bestEta," and depth=", bestDepth, " : ",
bestAcc," accuracy.\n",sep="")
## Best model has eta=1 and depth=1 : 0.8829353 accuracy.
How well does it perform out-of-sample?
# running best model
rf <- xgboost(data = X[training,],
label = tweets$engaging[training],
max.depth = bestDepth,
eta = bestEta,
nthread = 4,
nround = 1000,
print_every_n=100L,
objective = "binary:logistic")
## [1] train-error:0.143483
## [101] train-error:0.086802
## [201] train-error:0.075575
## [301] train-error:0.073111
## [401] train-error:0.066539
## [501] train-error:0.063253
## [601] train-error:0.064896
## [701] train-error:0.061336
## [801] train-error:0.061062
## [901] train-error:0.061062
## [1000] train-error:0.061336
# out-of-sample accuracy
preds <- predict(rf, X[test,])
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
cat("\nAccuracy on test set=", round(accuracy(preds>.50, tweets$engaging[test]),3))
##
## Accuracy on test set= 0.869
cat("\nPrecision on test set=", round(precision(preds>.50, tweets$engaging[test]),3))
##
## Precision on test set= 0.907
cat("\nRecall on test set=", round(recall(preds>.50, tweets$engaging[test]),3))
##
## Recall on test set= 0.929
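For comparison with the glmnet models above, we can also look at the confusion matrix; since xgboost returns predicted probabilities, we threshold them at 0.5 first (a quick sketch):
# sketch: confusion matrix for the xgboost predictions
table(preds > .50, tweets$engaging[test])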
What we sacrifice is interpretability (yet again!). We can check feature importance, but it’s often hard to tell what’s going on exactly. Why? We only learn which features “matter”, but not why!
# feature importance
labels <- dimnames(X)[[2]]
importance <- xgb.importance(labels, model = rf, data=X, label=tweets$engaging)
## Warning in `[.data.table`(result, , `:=`("RealCover", as.numeric(vec)), :
## with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
importance <- importance[order(importance$Gain, decreasing=TRUE),]
head(importance, n=20)
## Feature Split Gain Cover Frequency RealCover
## 1: @ 18 0.317075959 0.005701276 0.003 979
## 2: @_@ 16 0.070935824 0.002328724 0.001 530
## 3: @_: 4 0.039597797 0.009091385 0.008 30
## 4: via_@ -9.53674e-07 0.037202199 0.003077324 0.002 4
## 5: with_@ 4 0.026151768 0.001795029 0.001 3
## 6: to_@ -9.53674e-07 0.023263293 0.003530451 0.003 10
## 7: thanks 1.5 0.017029328 0.004719396 0.004 128
## 8: #ep2014 -9.53674e-07 0.016931430 0.002705811 0.002 9
## 9: @_is 4 0.015453980 0.002405848 0.002 12
## 10: on_@ 4 0.014059682 0.006254939 0.006 2
## 11: @ 8.5 0.013889075 0.001662281 0.001 991
## 12: ? 16 0.011642680 0.001452536 0.001 217
## 13: @_i -9.53674e-07 0.010909844 0.006573625 0.006 99
## 14: @_' -9.53674e-07 0.010605283 0.001436444 0.001 8
## 15: today 4 0.010334038 0.002378009 0.002 15
## 16: and_@ 4 0.007915031 0.001422331 0.001 7
## 17: #votelabour -9.53674e-07 0.007698537 0.003470392 0.003 0
## 18: green 1.5 0.006968371 0.001379948 0.001 13
## 19: hacked -9.53674e-07 0.006263610 0.004351157 0.004 2
## 20: " 1.5 0.005490267 0.001356821 0.001 26
## RealCover %
## 1: 0.2751545812
## 2: 0.1489600899
## 3: 0.0084317032
## 4: 0.0011242271
## 5: 0.0008431703
## 6: 0.0028105677
## 7: 0.0359752670
## 8: 0.0025295110
## 9: 0.0033726813
## 10: 0.0005621135
## 11: 0.2785272625
## 12: 0.0609893198
## 13: 0.0278246206
## 14: 0.0022484542
## 15: 0.0042158516
## 16: 0.0019673974
## 17: 0.0000000000
## 18: 0.0036537381
## 19: 0.0005621135
## 20: 0.0073074761
# adding the sign: for each feature, which class (0 or 1) uses it more often
sums <- list()
for (v in 0:1){
sums[[v+1]] <- colSums(X[tweets[,"engaging"]==v,])
}
sums <- do.call(cbind, sums)
sign <- apply(sums, 1, which.max)
df <- data.frame(
Feature = labels,
sign = sign-1,
stringsAsFactors=F)
importance <- merge(importance, df, by="Feature")
## best predictors
for (v in 0:1){
cat("\n\n")
cat("value==", v)
importance <- importance[order(importance$Gain, decreasing=TRUE),]
print(head(importance[importance$sign==v,], n=50))
cat("\n")
cat(paste(unique(head(importance$Feature[importance$sign==v], n=50)), collapse=", "))
}
##
##
## value== 0 Feature Split Gain Cover Frequency
## 1: via_@ -9.53674e-07 0.0372021989 0.003077324 0.002
## 2: with_@ 4 0.0261517677 0.001795029 0.001
## 3: to_@ -9.53674e-07 0.0232632932 0.003530451 0.003
## 4: #ep2014 -9.53674e-07 0.0169314299 0.002705811 0.002
## 5: on_@ 4 0.0140596818 0.006254939 0.006
## 6: @_' -9.53674e-07 0.0106052830 0.001436444 0.001
## 7: today 4 0.0103340378 0.002378009 0.002
## 8: and_@ 4 0.0079150306 0.001422331 0.001
## 9: #votelabour -9.53674e-07 0.0076985374 0.003470392 0.003
## 10: green 1.5 0.0069683712 0.001379948 0.001
## 11: hacked -9.53674e-07 0.0062636103 0.004351157 0.004
## 12: meet 4 0.0051424163 0.003338398 0.003
## 13: elections -9.53674e-07 0.0049795409 0.002374194 0.002
## 14: #votegreen2014 4 0.0049374267 0.004299185 0.004
## 15: #labourdoorstep -9.53674e-07 0.0038669121 0.004206947 0.004
## 16: rt_@ -9.53674e-07 0.0032562307 0.001323676 0.001
## 17: candidates 1.5 0.0032194462 0.002327759 0.002
## 18: at_@ -9.53674e-07 0.0031727733 0.002280168 0.002
## 19: -_@ -9.53674e-07 0.0030844565 0.005115433 0.005
## 20: ,_@ 1.5 0.0030382871 0.004094624 0.004
## 21: seat 4 0.0027930090 0.001241858 0.001
## 22: :_" -9.53674e-07 0.0027303949 0.001160337 0.001
## 23: from_@ 4 0.0026258916 0.002124374 0.002
## 24: hustings -9.53674e-07 0.0025360397 0.002288552 0.002
## 25: english 1.5 0.0024157110 0.010450661 0.011
## 26: european 1.5 0.0023950405 0.004098199 0.004
## 27: with_@ 1.5 0.0022739461 0.002120355 0.002
## 28: campaigning -9.53674e-07 0.0021413964 0.004040229 0.004
## 29: #eu -9.53674e-07 0.0021397238 0.003148474 0.003
## 30: mp -9.53674e-07 0.0020997256 0.002184421 0.002
## 31: #bbcqt 4 0.0020329772 0.003059013 0.003
## 32: for_@ 1.5 0.0018614026 0.003019753 0.003
## 33: of_@ -9.53674e-07 0.0018029113 0.003092100 0.003
## 34: rt -9.53674e-07 0.0017769030 0.005928090 0.006
## 35: meeting -9.53674e-07 0.0016835141 0.004041613 0.004
## 36: in_@ -9.53674e-07 0.0016784296 0.002229748 0.002
## 37: candidate -9.53674e-07 0.0016500573 0.003015176 0.003
## 38: out_! -9.53674e-07 0.0014571668 0.004899975 0.005
## 39: | 6 0.0013339202 0.016360403 0.018
## 40: ._" -9.53674e-07 0.0012311108 0.003926545 0.004
## 41: | 2.5 0.0011420356 0.016292804 0.018
## 42: england -9.53674e-07 0.0011096390 0.002122868 0.002
## 43: community -9.53674e-07 0.0010519142 0.001114778 0.001
## 44: that_@ -9.53674e-07 0.0010118648 0.004877524 0.005
## 45: britain -9.53674e-07 0.0009239397 0.002965073 0.003
## 46: the_@ 4 0.0007959244 0.002002425 0.002
## 47: mep_@ -9.53674e-07 0.0007164653 0.003902668 0.004
## 48: delighted -9.53674e-07 0.0006982868 0.003893584 0.004
## 49: london 4 0.0006417953 0.002924526 0.003
## 50: green_party -9.53674e-07 0.0006182929 0.001096317 0.001
## Feature Split Gain Cover Frequency
## RealCover RealCover % sign
## 1: 4 0.0011242271 0
## 2: 3 0.0008431703 0
## 3: 10 0.0028105677 0
## 4: 9 0.0025295110 0
## 5: 2 0.0005621135 0
## 6: 8 0.0022484542 0
## 7: 15 0.0042158516 0
## 8: 7 0.0019673974 0
## 9: 0 0.0000000000 0
## 10: 13 0.0036537381 0
## 11: 2 0.0005621135 0
## 12: 4 0.0011242271 0
## 13: 8 0.0022484542 0
## 14: 8 0.0022484542 0
## 15: 0 0.0000000000 0
## 16: 4 0.0011242271 0
## 17: 5 0.0014052839 0
## 18: 0 0.0000000000 0
## 19: 4 0.0011242271 0
## 20: 8 0.0022484542 0
## 21: 5 0.0014052839 0
## 22: 0 0.0000000000 0
## 23: 3 0.0008431703 0
## 24: 0 0.0000000000 0
## 25: 6 0.0016863406 0
## 26: 13 0.0036537381 0
## 27: 9 0.0025295110 0
## 28: 5 0.0014052839 0
## 29: 3 0.0008431703 0
## 30: 4 0.0011242271 0
## 31: 4 0.0011242271 0
## 32: 12 0.0033726813 0
## 33: 5 0.0014052839 0
## 34: 8 0.0022484542 0
## 35: 1 0.0002810568 0
## 36: 1 0.0002810568 0
## 37: 6 0.0016863406 0
## 38: 2 0.0005621135 0
## 39: 3 0.0008431703 0
## 40: 7 0.0019673974 0
## 41: 0 0.0000000000 0
## 42: 4 0.0011242271 0
## 43: 4 0.0011242271 0
## 44: 2 0.0005621135 0
## 45: 3 0.0008431703 0
## 46: 5 0.0014052839 0
## 47: 0 0.0000000000 0
## 48: 1 0.0002810568 0
## 49: 3 0.0008431703 0
## 50: 2 0.0005621135 0
## RealCover RealCover % sign
##
## via_@, with_@, to_@, #ep2014, on_@, @_', today, and_@, #votelabour, green, hacked, meet, elections, #votegreen2014, #labourdoorstep, rt_@, candidates, at_@, -_@, ,_@, seat, :_", from_@, hustings, english, european, campaigning, #eu, mp, #bbcqt, for_@, of_@, rt, meeting, in_@, candidate, out_!, |, ._", england, community, that_@, britain, the_@, mep_@, delighted, london, green_party
##
## value== 1 Feature Split Gain Cover Frequency
## 1: @ 18 0.317075959 0.005701276 0.003
## 2: @_@ 16 0.070935824 0.002328724 0.001
## 3: @_: 4 0.039597797 0.009091385 0.008
## 4: thanks 1.5 0.017029328 0.004719396 0.004
## 5: @_is 4 0.015453980 0.002405848 0.002
## 6: @ 8.5 0.013889075 0.001662281 0.001
## 7: ? 16 0.011642680 0.001452536 0.001
## 8: @_i -9.53674e-07 0.010909844 0.006573625 0.006
## 9: " 1.5 0.005490267 0.001356821 0.001
## 10: . 7.5 0.005013800 0.003382743 0.003
## 11: good 1.5 0.004944777 0.001374732 0.001
## 12: thank -9.53674e-07 0.004879654 0.002298679 0.002
## 13: :_- -9.53674e-07 0.004744827 0.005176608 0.005
## 14: meps -9.53674e-07 0.004412182 0.001341181 0.001
## 15: another -9.53674e-07 0.004164543 0.004139356 0.004
## 16: really 1.5 0.003953506 0.004331584 0.004
## 17: ( 6 0.003858481 0.002257590 0.002
## 18: yes 2.5 0.003764573 0.004214826 0.004
## 19: #ukip 1.5 0.003509819 0.001318302 0.001
## 20: ! 1.5 0.003384069 0.003232096 0.003
## 21: need 4 0.003104709 0.005149799 0.005
## 22: eu 4 0.003100187 0.002237712 0.002
## 23: @_on -9.53674e-07 0.003080565 0.003129560 0.003
## 24: congratulations 4 0.003005913 0.004200734 0.004
## 25: please -9.53674e-07 0.002958356 0.006984084 0.007
## 26: @_great -9.53674e-07 0.002876869 0.002250571 0.002
## 27: @ 7.5 0.002846132 0.001305125 0.001
## 28: can 1.5 0.002764517 0.002200581 0.002
## 29: @ 1.5 0.002729959 0.008812900 0.009
## 30: !_@ -9.53674e-07 0.002620017 0.006951994 0.007
## 31: uk 6 0.002525966 0.003110272 0.003
## 32: anti -9.53674e-07 0.002274024 0.002130588 0.002
## 33: twitter 1.5 0.002228139 0.003145225 0.003
## 34: council 4 0.002161979 0.004051947 0.004
## 35: campaign -9.53674e-07 0.002150206 0.001219883 0.001
## 36: @_the -9.53674e-07 0.002113128 0.005072627 0.005
## 37: far 1.5 0.001910249 0.003095117 0.003
## 38: real 4 0.001868129 0.002087940 0.002
## 39: @_for -9.53674e-07 0.001844937 0.001203005 0.001
## 40: many 1.5 0.001842180 0.003145021 0.003
## 41: actually -9.53674e-07 0.001819911 0.002241564 0.002
## 42: parties 4 0.001771041 0.005011465 0.005
## 43: @_not -9.53674e-07 0.001729829 0.002224320 0.002
## 44: lib 1.5 0.001710898 0.004000720 0.004
## 45: night 4 0.001629950 0.001186950 0.001
## 46: sad -9.53674e-07 0.001600430 0.003936113 0.004
## 47: ._not 4 0.001555707 0.008516418 0.009
## 48: @_no -9.53674e-07 0.001555048 0.004049651 0.004
## 49: @_was -9.53674e-07 0.001546615 0.002081523 0.002
## 50: @_good -9.53674e-07 0.001539965 0.001185114 0.001
## Feature Split Gain Cover Frequency
## RealCover RealCover % sign
## 1: 979 0.2751545812 1
## 2: 530 0.1489600899 1
## 3: 30 0.0084317032 1
## 4: 128 0.0359752670 1
## 5: 12 0.0033726813 1
## 6: 991 0.2785272625 1
## 7: 217 0.0609893198 1
## 8: 99 0.0278246206 1
## 9: 26 0.0073074761 1
## 10: 491 0.1379988758 1
## 11: 73 0.0205171445 1
## 12: 45 0.0126475548 1
## 13: 28 0.0078695897 1
## 14: 9 0.0025295110 1
## 15: 7 0.0019673974 1
## 16: 19 0.0053400787 1
## 17: 24 0.0067453626 1
## 18: 31 0.0087127600 1
## 19: 46 0.0129286116 1
## 20: 249 0.0699831366 1
## 21: 37 0.0103991006 1
## 22: 58 0.0163012929 1
## 23: 13 0.0036537381 1
## 24: 17 0.0047779651 1
## 25: 18 0.0050590219 1
## 26: 8 0.0022484542 1
## 27: 1028 0.2889263631 1
## 28: 55 0.0154581225 1
## 29: 1003 0.2818999438 1
## 30: 5 0.0014052839 1
## 31: 31 0.0087127600 1
## 32: 11 0.0030916245 1
## 33: 10 0.0028105677 1
## 34: 6 0.0016863406 1
## 35: 15 0.0042158516 1
## 36: 30 0.0084317032 1
## 37: 9 0.0025295110 1
## 38: 8 0.0022484542 1
## 39: 4 0.0011242271 1
## 40: 55 0.0154581225 1
## 41: 11 0.0030916245 1
## 42: 16 0.0044969084 1
## 43: 18 0.0050590219 1
## 44: 6 0.0016863406 1
## 45: 8 0.0022484542 1
## 46: 2 0.0005621135 1
## 47: 3 0.0008431703 1
## 48: 36 0.0101180438 1
## 49: 2 0.0005621135 1
## 50: 17 0.0047779651 1
## RealCover RealCover % sign
##
## @, @_@, @_:, thanks, @_is, ?, @_i, ", ., good, thank, :_-, meps, another, really, (, yes, #ukip, !, need, eu, @_on, congratulations, please, @_great, can, !_@, uk, anti, twitter, council, campaign, @_the, far, real, @_for, many, actually, parties, @_not, lib, night, sad, ._not, @_no, @_was, @_good