To learn how to do supervised machine learning applied to social media text, we will use a random sample of nearly 5,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We will be analyzing the variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (just sending a message, without trying to elicit a response).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates avoid engaging with voters on Twitter because they are exposed to mostly impolite messages.
Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
We’ll do some cleaning as well, substituting all Twitter handles with a generic @. Why? We want to prevent overfitting to specific user names.
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
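For instance, applying the same substitution to a made-up tweet (the handle here is hypothetical):
gsub('@[0-9_A-Za-z]+', '@', "@SomeCandidate thanks for your support!")
# returns "@ thanks for your support!"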
As we discussed earlier today, before we can do any type of automated text analysis, the text needs to go through several “preprocessing” steps before it can be passed to a statistical model. We’ll use the quanteda package here.
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 4566 documents, showing 100 documents:
##
## Text Types Tokens Sentences
## text1 8 10 2
## text2 17 23 3
## text3 15 18 2
## text4 15 19 3
## text5 14 15 1
## text6 3 3 1
## text7 9 9 1
## text8 20 21 2
## text9 20 23 5
## text10 8 9 1
## text11 15 20 1
## text12 23 27 1
## text13 16 27 1
## text14 22 26 2
## text15 12 19 1
## text16 21 23 3
## text17 9 14 1
## text18 19 23 1
## text19 16 18 1
## text20 11 14 1
## text21 6 6 1
## text22 14 15 1
## text23 8 10 1
## text24 6 7 2
## text25 29 33 1
## text26 21 23 3
## text27 16 18 2
## text28 17 19 1
## text29 15 21 1
## text30 19 21 1
## text31 20 22 1
## text32 15 15 1
## text33 7 7 1
## text34 22 23 2
## text35 18 20 2
## text36 16 16 1
## text37 9 9 1
## text38 24 27 2
## text39 17 18 1
## text40 19 22 2
## text41 22 23 2
## text42 26 30 3
## text43 23 27 3
## text44 6 8 1
## text45 6 6 1
## text46 18 23 1
## text47 17 18 2
## text48 18 21 2
## text49 5 7 1
## text50 20 25 1
## text51 3 11 1
## text52 10 12 1
## text53 14 16 1
## text54 24 31 1
## text55 15 18 1
## text56 11 12 1
## text57 19 22 1
## text58 19 20 2
## text59 11 12 1
## text60 19 22 1
## text61 14 16 1
## text62 13 14 1
## text63 7 8 1
## text64 12 12 2
## text65 17 22 1
## text66 10 13 1
## text67 20 21 1
## text68 16 16 2
## text69 13 16 2
## text70 17 17 3
## text71 14 16 2
## text72 23 25 2
## text73 26 30 3
## text74 21 23 2
## text75 18 21 1
## text76 12 14 1
## text77 13 16 1
## text78 8 8 1
## text79 7 8 1
## text80 23 26 1
## text81 8 8 1
## text82 9 9 3
## text83 24 26 2
## text84 16 20 1
## text85 18 20 1
## text86 12 16 1
## text87 18 19 1
## text88 21 22 2
## text89 17 19 1
## text90 11 13 1
## text91 19 21 3
## text92 27 31 3
## text93 15 17 1
## text94 5 5 1
## text95 4 4 1
## text96 28 29 2
## text97 5 6 1
## text98 22 24 1
## text99 15 21 1
## text100 24 27 1
##
## Source: /Users/pablobarbera/git/social-media-upf/code/* on x86_64 by pablobarbera
## Created: Tue Jul 3 10:45:36 2018
## Notes:
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 4,566 documents, 48,657 features
## ... removed 169 features
## ... created a 4,566 x 48,488 sparse dfm
## ... complete.
## Elapsed time: 0.992 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 documents: 38,258
## Total features removed: 38,258 (78.9%).
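Before moving on, it can be useful to get a sense of the trimmed matrix, for example its dimensions and most frequent features; a quick sketch:
dim(twdfm)
topfeatures(twdfm, 20)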
Note that other preprocessing options are available at this stage, such as stemming or removing punctuation, numbers, and symbols (see the sketch below). You can read more in the dfm and tokens help pages.
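As an illustration, here is a hedged sketch of a dfm call with a few of these options turned on (stemming plus removal of punctuation and numbers; these are not the settings used in the rest of this analysis):
twdfm_alt <- dfm(twcorpus, remove=stopwords("english"), stem=TRUE,
    remove_punct=TRUE, remove_numbers=TRUE, verbose=TRUE)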
Once we have the DFM, we split it into a training and a test set. We’ll go with 80% for training and 20% for testing. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
Our first step is to train the classifier using cross-validation. There are many packages in R for running machine learning models. For regularized regression, glmnet is, in my opinion, the best: it’s much faster than caret or mlr (in my experience, at least), and it has cross-validation built in, so we don’t need to code it from scratch. We’ll start with a ridge regression:
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
plot(ridge)
We can now compute the performance metrics on the test set.
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
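# we can also combine precision and recall into a single F1 score
# (their harmonic mean); a small sketch building on the functions above
f1 <- function(ypred, y){
  prec <- precision(ypred, y)
  rec <- recall(ypred, y)
  return(2 * prec * rec / (prec + rec))
}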
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 17 3
## 1 176 718
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8041575
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.803132
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9958391
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.85
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.0880829
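As a quick sanity check, these numbers are worth comparing against the baseline of always predicting the majority class (engaging); a minimal sketch:
# share of each class in the test set (baseline for accuracy)
prop.table(table(tweets$engaging[test]))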
Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@
## 0.026491467 0.057415857 0.009114665 -0.003353644 0.023129622
## @_thank
## 0.067938129
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 4159 -0.3378288 reverse
## 9566 -0.3369787 knocking
## 6203 -0.3347340 that_man
## 5612 -0.3330700 and_share
## 223 -0.3303776 and_its
## 1908 -0.3252179 god_,
## 3658 -0.3211980 beacon
## 8580 -0.3204813 posters
## 4876 -0.3184335 eu_law
## 6418 -0.3178059 zone
## 6419 -0.3170302 defends
## 6531 -0.3170106 tonbridge
## 8231 -0.3162415 political_class
## 6167 -0.3145641 the_weather
## 9012 -0.3129769 on_being
## 2139 -0.3128640 #yes
## 501 -0.3118844 cunts
## 8619 -0.3117144 would_make
## 6787 -0.3114111 initiative
## 9386 -0.3086579 debates
## 8453 -0.3073173 determined
## 6366 -0.3070599 tweets_,
## 7890 -0.3069591 entertainment
## 9129 -0.3054351 cleaning
## 6958 -0.3035772 earn
## 4859 -0.3034597 they_know
## 8825 -0.3020196 from_today
## 9992 -0.3011251 interview_i
## 8785 -0.3010277 twitter_account
## 2366 -0.3008453 scottish_independence
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, that_man, and_share, and_its, god_,, beacon, posters, eu_law, zone, defends, tonbridge, political_class, the_weather, on_being, #yes, cunts, would_make, initiative, debates, determined, tweets_,, entertainment, cleaning, earn, they_know, from_today, interview_i, twitter_account, scottish_independence"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9310 0.1849088 town_hall
## 2887 0.1487279 !_rt
## 7713 0.1486876 dt_@
## 7711 0.1486873 dt
## 9193 0.1483949 west_green
## 6059 0.1483919 western
## 8798 0.1467153 seat_.
## 5397 0.1426298 insignificant
## 4437 0.1354776 re_:
## 5029 0.1307438 vote_-
## 6374 0.1289006 ;_thnx
## 9819 0.1287744 is_big
## 7515 0.1264567 ,_much
## 8511 0.1224417 to_prove
## 9610 0.1223285 of_scotland's
## 2528 0.1221726 :_new
## 8292 0.1208279 for_tonight
## 6145 0.1191033 leader_,
## 2094 0.1188151 bank_of
## 8768 0.1186134 @_yougov
## 8765 0.1155230 result_,
## 1121 0.1154849 @_bbc
## 8967 0.1144141 ._congratulations
## 7135 0.1143405 to_labour
## 3184 0.1134104 !_&
## 3183 0.1126346 team_!
## 3317 0.1126004 great_piece
## 3275 0.1118064 /_eu
## 6118 0.1114532 addition
## 9608 0.1102726 compliments
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, dt_@, dt, west_green, western, seat_., insignificant, re_:, vote_-, ;_thnx, is_big, ,_much, to_prove, of_scotland's, :_new, for_tonight, leader_,, bank_of, @_yougov, result_,, @_bbc, ._congratulations, to_labour, !_&, team_!, great_piece, /_eu, addition, compliments"
We can easily modify our code to experiment with Lasso or ElasticNet models:
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 42 6
## 1 151 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8282276
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8256351
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9916782
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.875
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2176166
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.6783597 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 676 -1.8190662 on_@
## 9560 -1.8044093 #voteni2014
## 1733 -1.6229837 via_@
## 5549 -1.5450174 #votelabour
## 1572 -1.3048526 to_@
## 297 -1.2989550 with_@
## 1092 -1.2099696 and_@
## 7565 -1.0223143 with_his
## 195 -0.9689625 #ep2014
## 947 -0.8800502 hustings
## 8908 -0.7804123 far_@
## 121 -0.6992853 hacked
## 9724 -0.6860014 (_@
## 570 -0.6475644 #labourdoorstep
## 606 -0.6445173 today
## 1147 -0.6263880 rt_@
## 1588 -0.6211406 #votelab14
## 747 -0.5889406 green
## 7868 -0.5575370 at_@
## 6049 -0.5570636 just_voted
## 402 -0.5563883 @_is
## 2876 -0.5491715 rd
## 1182 -0.5211947 elections
## 754 -0.5132174 #votegreen2014
## 414 -0.5060993 '_s
## 1407 -0.4984036 -_@
## 1724 -0.4602532 meeting
## 2240 -0.3943573 :_"
## 1103 -0.3754607 ;_@
## 182 -0.3648502 campaigning
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 503 0.68922257 @_i
## 1 0.67835966 @
## 445 0.46034214 thanks
## 7 0.41625078 thank_you
## 75 0.19778461 ?
## 553 0.17321136 @_good
## 56 0.15185118 :_-
## 447 0.09750679 @_thanks
## 1350 0.07860851 @_yes
## 549 0.07487025 good
## 584 0.01588899 @_you
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 8 0.00000000 you_!
## 9 0.00000000 back
## 10 0.00000000 !_will
## 11 0.00000000 will_have
## 12 0.00000000 have_a
## 13 0.00000000 a_look
## 14 0.00000000 tomorrow
## 15 0.00000000 well
## 16 0.00000000 nice
## 17 0.00000000 daily
## 18 0.00000000 politics
## 19 0.00000000 admit
## 21 0.00000000 -
## 22 0.00000000 supporters
We now see that the coefficients for some features actually became zero.
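A quick way to quantify this sparsity is to count how many coefficients are exactly zero at the selected value of lambda; a minimal sketch reusing the beta vector extracted above:
sum(beta != 0)   # features the lasso keeps
mean(beta == 0)  # share of features shrunk exactly to zero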
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# NOTE: this will not cross-validate across values of alpha
# (see the sketch at the end of this section)
# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 56 10
## 1 137 711
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8391685
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8384434
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9861304
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.8484848
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2901554
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.5962636 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9560 -1.9397670 #voteni2014
## 676 -1.8435486 on_@
## 2954 -1.7475130 nw_.
## 7565 -1.7256060 with_his
## 5549 -1.7242362 #votelabour
## 2888 -1.4585419 @_#labourdoorstep
## 9724 -1.4446374 (_@
## 1733 -1.3415249 via_@
## 1572 -1.3175980 to_@
## 8908 -1.2490577 far_@
## 297 -1.2335535 with_@
## 947 -1.1967375 hustings
## 1092 -1.1953927 and_@
## 6049 -1.1647580 just_voted
## 121 -1.1537672 hacked
## 99 -1.1458006 password
## 6674 -1.0784753 @_event
## 1588 -1.0753073 #votelab14
## 4673 -1.0619661 only_@
## 195 -0.9850781 #ep2014
## 9283 -0.8968599 meps_:
## 430 -0.8728814 cameron_on
## 1724 -0.8694604 meeting
## 2240 -0.8453640 :_"
## 6203 -0.8273457 that_man
## 1407 -0.8232311 -_@
## 8216 -0.8151511 on_friday
## 606 -0.8048400 today
## 9333 -0.7875681 starting_to
## 7868 -0.7608485 at_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 503 0.81796307 @_i
## 1 0.59626363 @
## 7 0.58092619 thank_you
## 445 0.46444452 thanks
## 56 0.37833801 :_-
## 553 0.36001448 @_good
## 1350 0.34125432 @_yes
## 447 0.28963311 @_thanks
## 75 0.25662721 ?
## 584 0.25329533 @_you
## 5805 0.20606005 #votegreen
## 1225 0.17915089 @_the
## 2184 0.15845517 @_great
## 767 0.13645318 @_not
## 549 0.13182246 good
## 925 0.10522658 good_luck
## 1531 0.05203757 need
## 1162 0.04144368 @_i'm
## 1438 0.03536373 any_of
## 1107 0.02717977 many
## 569 0.01709664 :_)
## 859 0.01002803 @_please
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 8 0.00000000 you_!
## 9 0.00000000 back
## 10 0.00000000 !_will
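Finally, as noted in the comment above, cv.glmnet only cross-validates over lambda for a fixed value of alpha. A hedged sketch of how one could compare a few alpha values manually (the grid of values is arbitrary, and this was not part of the original analysis):
# try a small grid of alpha values and store the cross-validated error
alphas <- c(0, 0.25, 0.50, 0.75, 1)
cv_error <- sapply(alphas, function(a){
  cv <- cv.glmnet(twdfm[training,], tweets$engaging[training],
    family="binomial", alpha=a, nfolds=5, parallel=TRUE,
    type.measure="class")
  min(cv$cvm) # misclassification error at the best lambda for this alpha
})
data.frame(alpha=alphas, cv_error=cv_error)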