To learn how to apply supervised machine learning to social media text, we will use a random sample of nearly 5,000 tweets mentioning the names of candidates in the 2014 European Parliament elections in the UK. We will be analyzing the variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (just sending a message, without trying to elicit a response).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates avoid engaging with voters on Twitter because they are exposed to mostly impolite messages.
Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("~/data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
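Before fitting any models, it helps to know how balanced the two classes are. A quick check (output omitted, as this was not part of the original script):
prop.table(table(tweets$engaging)) # share of broadcasting (0) vs engaging (1) tweets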
We’ll do some cleaning as well, replacing all Twitter handles with a generic @ symbol. Why? We want to prevent overfitting: the model should learn that mentioning another user is predictive, rather than memorizing the specific usernames that happen to appear in the training set.
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
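To see what this substitution does, here is a made-up example tweet (not from the dataset):
gsub('@[0-9_A-Za-z]+', '@', "thanks @johndoe for the RT!")
## expected output: [1] "thanks @ for the RT!"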
As we discussed earlier today, before we can do any type of automated text analysis the text needs to go through several “preprocessing” steps so that it can be passed to a statistical model. We’ll use the quanteda package here.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n = 10)
## Corpus consisting of 4566 documents, showing 10 documents:
##
## Text Types Tokens Sentences
## text1 8 10 2
## text2 17 23 3
## text3 15 18 2
## text4 15 19 3
## text5 14 15 1
## text6 3 3 1
## text7 9 9 1
## text8 20 21 2
## text9 20 23 5
## text10 8 9 1
##
## Source: /Users/pablobarbera/git/ECPR-SC105/code/* on x86_64 by pablobarbera
## Created: Thu Aug 9 11:08:16 2018
## Notes:
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 4,566 documents, 48,657 features
## ... removed 169 features
## ... created a 4,566 x 48,488 sparse dfm
## ... complete.
## Elapsed time: 1.28 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 documents: 38,258
## Total features removed: 38,258 (78.9%).
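At this point we could also take a quick look at the most frequent features in the trimmed DFM (output omitted, as this check was not part of the original script):
topfeatures(twdfm, n = 20) # 20 most common features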
Note that there are other preprocessing options, such as stemming, removing punctuation and numbers, or converting all text to lowercase. You can read more in the dfm and tokens help pages; a few of these options are sketched below.
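For instance, a minimal sketch of what some of these options look like (not run here; twdfm_alt is an illustrative name, and the arguments are those of quanteda 1.3):
twdfm_alt <- dfm(twcorpus,
                 tolower = TRUE,         # lowercase all text (the default)
                 stem = TRUE,            # reduce words to their stems
                 remove_punct = TRUE,    # drop punctuation
                 remove_numbers = TRUE,  # drop numbers
                 remove = stopwords("english"), verbose = TRUE)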
Once we have the DFM, we split it into a training and a test set. We’ll go with 80% training and 20% test. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- setdiff(1:nrow(tweets), training) # rows not used for training
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience at least), and it has cross-validation already built-in, so we don’t need to code it from scratch. We’ll start with a ridge regression:
library(glmnet)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
plot(ridge)
The plot shows the cross-validated misclassification error across values of the regularization parameter lambda; the two vertical dotted lines mark lambda.min and lambda.1se. We can now compute the performance metrics on the test set.
# function to compute accuracy: proportion of predictions that are correct
accuracy <- function(ypred, y){
  tab <- table(ypred, y)
  return(sum(diag(tab))/sum(tab))
}
# function to compute precision: true positives over all predicted positives
precision <- function(ypred, y){
  tab <- table(ypred, y)
  return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall: true positives over all actual positives
recall <- function(ypred, y){
  tab <- table(ypred, y)
  return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
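As a quick sanity check of these helpers, consider made-up vectors: predicting c(1, 1, 0) against true labels c(1, 0, 0) gets two of three cases right, with one true positive out of two predicted positives and one true positive out of one actual positive.
accuracy(c(1,1,0), c(1,0,0))  # expected: 0.667
precision(c(1,1,0), c(1,0,0)) # expected: 0.5
recall(c(1,1,0), c(1,0,0))    # expected: 1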
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 17 3
## 1 176 718
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8041575
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.803132
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9958391
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.85
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.0880829
It is often very useful to inspect the actual estimated coefficients and see which features have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@
## 0.026491469 0.057415856 0.009114663 -0.003353636 0.023129622
## @_thank
## 0.067938135
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 4412 -0.3378288 reverse
## 9671 -0.3369787 knocking
## 6265 -0.3347340 that_man
## 5764 -0.3330700 and_share
## 262 -0.3303776 and_its
## 2332 -0.3252179 god_,
## 4026 -0.3211980 beacon
## 8676 -0.3204813 posters
## 4703 -0.3184335 eu_law
## 6448 -0.3178059 zone
## 6449 -0.3170302 defends
## 6507 -0.3170106 tonbridge
## 8310 -0.3162415 political_class
## 6072 -0.3145641 the_weather
## 8932 -0.3129769 on_being
## 2595 -0.3128641 #yes
## 555 -0.3118844 cunts
## 8719 -0.3117144 would_make
## 6769 -0.3114111 initiative
## 9254 -0.3086579 debates
## 8356 -0.3073173 determined
## 6316 -0.3070599 tweets_,
## 7878 -0.3069591 entertainment
## 9263 -0.3054351 cleaning
## 7021 -0.3035772 earn
## 4694 -0.3034597 they_know
## 8701 -0.3020196 from_today
## 10036 -0.3011251 interview_i
## 8660 -0.3010277 twitter_account
## 2061 -0.3008453 scottish_independence
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, that_man, and_share, and_its, god_,, beacon, posters, eu_law, zone, defends, tonbridge, political_class, the_weather, on_being, #yes, cunts, would_make, initiative, debates, determined, tweets_,, entertainment, cleaning, earn, they_know, from_today, interview_i, twitter_account, scottish_independence"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9429 0.1849087 town_hall
## 3412 0.1487278 !_rt
## 7699 0.1486876 dt_@
## 7697 0.1486873 dt
## 9319 0.1483975 west_green
## 6136 0.1483925 western
## 8671 0.1467153 seat_.
## 5143 0.1426298 insignificant
## 4626 0.1354776 re_:
## 5275 0.1307437 vote_-
## 6326 0.1289008 ;_thnx
## 9866 0.1287744 is_big
## 7468 0.1264568 ,_much
## 8435 0.1224417 to_prove
## 9516 0.1223285 of_scotland's
## 2203 0.1221726 :_new
## 8227 0.1208279 for_tonight
## 6211 0.1191033 leader_,
## 1826 0.1188151 bank_of
## 8869 0.1186131 @_yougov
## 8640 0.1155231 result_,
## 987 0.1154847 @_bbc
## 8883 0.1144142 ._congratulations
## 7085 0.1143405 to_labour
## 2786 0.1134104 !_&
## 2785 0.1126346 team_!
## 3747 0.1126003 great_piece
## 3711 0.1118065 /_eu
## 6185 0.1114531 addition
## 9514 0.1102726 compliments
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, dt_@, dt, west_green, western, seat_., insignificant, re_:, vote_-, ;_thnx, is_big, ,_much, to_prove, of_scotland's, :_new, for_tonight, leader_,, bank_of, @_yougov, result_,, @_bbc, ._congratulations, to_labour, !_&, team_!, great_piece, /_eu, addition, compliments"
We can easily modify our code to experiment with lasso or elastic net models:
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 42 6
## 1 151 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8282276
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8256351
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9916782
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.875
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2176166
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.6783611 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 601 -1.8190566 on_@
## 9434 -1.8044280 #voteni2014
## 2074 -1.6229927 via_@
## 5326 -1.5450336 #votelabour
## 1375 -1.3048701 to_@
## 282 -1.2989784 with_@
## 1266 -1.2099612 and_@
## 7537 -1.0223134 with_his
## 228 -0.9689621 #ep2014
## 1089 -0.8800491 hustings
## 8786 -0.7803956 far_@
## 109 -0.6992762 hacked
## 9790 -0.6859966 (_@
## 498 -0.6475886 #labourdoorstep
## 669 -0.6445125 today
## 1004 -0.6263757 rt_@
## 1922 -0.6211318 #votelab14
## 844 -0.5889416 green
## 7881 -0.5575388 at_@
## 5928 -0.5570703 just_voted
## 456 -0.5563673 @_is
## 2504 -0.5491547 rd
## 1386 -0.5211930 elections
## 851 -0.5132192 #votegreen2014
## 471 -0.5061936 '_s
## 1700 -0.4984246 -_@
## 1497 -0.4602545 meeting
## 1950 -0.3944299 :_"
## 969 -0.3754333 ;_@
## 200 -0.3648572 campaigning
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 558 0.68922703 @_i
## 1 0.67836113 @
## 398 0.46034281 thanks
## 8 0.41625360 thank_you
## 72 0.19778527 ?
## 614 0.17325089 @_good
## 54 0.15182300 :_-
## 399 0.09752179 @_thanks
## 1186 0.07863422 @_yes
## 587 0.07487055 good
## 515 0.01586562 @_you
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 7 0.00000000 back
## 9 0.00000000 you_!
## 10 0.00000000 tomorrow
## 11 0.00000000 well
## 12 0.00000000 daily
## 13 0.00000000 politics
## 15 0.00000000 -
## 16 0.00000000 )
## 17 0.00000000 seems
## 18 0.00000000 !_will
## 19 0.00000000 people
## 20 0.00000000 keep
## 21 0.00000000 will_have
## 22 0.00000000 have_a
We now see that the lasso shrank the coefficients for many features to exactly zero; this is the effect of the L1 penalty. Finally, we can fit an elastic net, which combines the two penalties (here with alpha=0.50):
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# NOTE: this will not cross-validate across values of alpha
# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 56 10
## 1 137 711
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8391685
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8384434
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9861304
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.8484848
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2901554
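As the note above says, cv.glmnet only cross-validates lambda for a fixed value of alpha. A minimal sketch (not run here) of how one could also tune alpha with a manual grid search, fixing the fold assignments so that every run is evaluated on the same folds:
# fix fold assignments so every value of alpha is evaluated on the same folds
foldid <- sample(rep(1:5, length.out = length(training)))
alphas <- c(0, 0.25, 0.50, 0.75, 1)
cv.errors <- sapply(alphas, function(a){
  fit <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                   family="binomial", alpha=a, foldid=foldid,
                   parallel=TRUE, type.measure="class")
  min(fit$cvm) # lowest cross-validated misclassification error for this alpha
})
alphas[which.min(cv.errors)] # value of alpha with the best CV performance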
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.5962622 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9434 -1.9397130 #voteni2014
## 601 -1.8435300 on_@
## 3459 -1.7475230 nw_.
## 7537 -1.7255967 with_his
## 5326 -1.7242325 #votelabour
## 2514 -1.4582612 @_#labourdoorstep
## 9790 -1.4446327 (_@
## 2074 -1.3407118 via_@
## 1375 -1.3175990 to_@
## 8786 -1.2491845 far_@
## 282 -1.2335795 with_@
## 1089 -1.1967385 hustings
## 1266 -1.1953930 and_@
## 5928 -1.1647694 just_voted
## 109 -1.1537820 hacked
## 95 -1.1458042 password
## 6667 -1.0784478 @_event
## 1922 -1.0753018 #votelab14
## 4878 -1.0619328 only_@
## 228 -0.9851047 #ep2014
## 9153 -0.8968774 meps_:
## 495 -0.8729090 cameron_on
## 1497 -0.8693616 meeting
## 1950 -0.8454411 :_"
## 6265 -0.8273472 that_man
## 1700 -0.8232255 -_@
## 8298 -0.8151523 on_friday
## 669 -0.8048479 today
## 9205 -0.7875784 starting_to
## 7881 -0.7608702 at_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 558 0.81795613 @_i
## 1 0.59626220 @
## 8 0.58092430 thank_you
## 398 0.46442642 thanks
## 54 0.37830055 :_-
## 614 0.36000883 @_good
## 1186 0.34125006 @_yes
## 399 0.28964424 @_thanks
## 72 0.25662864 ?
## 515 0.25328467 @_you
## 5630 0.20551387 #votegreen
## 1074 0.17914739 @_the
## 1899 0.15842653 @_great
## 688 0.13645408 @_not
## 587 0.13181617 good
## 1065 0.10523721 good_luck
## 1343 0.05205937 need
## 1018 0.04141076 @_i'm
## 1270 0.03609250 any_of
## 975 0.02718622 many
## 632 0.01708541 :_)
## 781 0.01002222 @_please
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 7 0.00000000 back
## 9 0.00000000 you_!
## 10 0.00000000 tomorrow