Regularized regression

To learn how to apply supervised machine learning to social media text, we will use a random sample of nearly 5,000 tweets mentioning the names of the candidates in the 2014 European Parliament (EP) elections in the UK. We will analyze a variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (a tweet that just sends out a message, without trying to elicit a response).

The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates fail to engage with voters on Twitter because they are exposed to mostly impolite messages.

Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging:

library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
tweets <- read.csv("~/data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]

We’ll do some cleaning as well, replacing all user handles with a generic @. Why? Handles of specific accounts are unlikely to generalize beyond this sample, so removing them helps prevent overfitting.

tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)

As we discussed earlier today, before we can do any type of automated text analysis, the text needs to go through several “preprocessing” steps before it can be passed to a statistical model. We’ll use the quanteda package here.

twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 4566 documents, showing 100 documents:
## 
##     Text Types Tokens Sentences
##    text1     8     10         2
##    text2    17     23         3
##    text3    15     18         2
##    text4    15     19         3
##    text5    14     15         1
##    text6     3      3         1
##    text7     9      9         1
##    text8    20     21         2
##    text9    20     23         5
##   text10     8      9         1
##   text11    15     20         1
##   text12    23     27         1
##   text13    16     27         1
##   text14    22     26         2
##   text15    12     19         1
##   text16    21     23         3
##   text17     9     14         1
##   text18    19     23         1
##   text19    16     18         1
##   text20    11     14         1
##   text21     6      6         1
##   text22    14     15         1
##   text23     8     10         1
##   text24     6      7         2
##   text25    29     33         1
##   text26    21     23         3
##   text27    16     18         2
##   text28    17     19         1
##   text29    15     21         1
##   text30    19     21         1
##   text31    20     22         1
##   text32    15     15         1
##   text33     7      7         1
##   text34    22     23         2
##   text35    18     20         2
##   text36    16     16         1
##   text37     9      9         1
##   text38    24     27         2
##   text39    17     18         1
##   text40    19     22         2
##   text41    22     23         2
##   text42    26     30         3
##   text43    23     27         3
##   text44     6      8         1
##   text45     6      6         1
##   text46    18     23         1
##   text47    17     18         2
##   text48    18     21         2
##   text49     5      7         1
##   text50    20     25         1
##   text51     3     11         1
##   text52    10     12         1
##   text53    14     16         1
##   text54    24     31         1
##   text55    15     18         1
##   text56    11     12         1
##   text57    19     22         1
##   text58    19     20         2
##   text59    11     12         1
##   text60    19     22         1
##   text61    14     16         1
##   text62    13     14         1
##   text63     7      8         1
##   text64    12     12         2
##   text65    17     22         1
##   text66    10     13         1
##   text67    20     21         1
##   text68    16     16         2
##   text69    13     16         2
##   text70    17     17         3
##   text71    14     16         2
##   text72    23     25         2
##   text73    26     30         3
##   text74    21     23         2
##   text75    18     21         1
##   text76    12     14         1
##   text77    13     16         1
##   text78     8      8         1
##   text79     7      8         1
##   text80    23     26         1
##   text81     8      8         1
##   text82     9      9         3
##   text83    24     26         2
##   text84    16     20         1
##   text85    18     20         1
##   text86    12     16         1
##   text87    18     19         1
##   text88    21     22         2
##   text89    17     19         1
##   text90    11     13         1
##   text91    19     21         3
##   text92    27     31         3
##   text93    15     17         1
##   text94     5      5         1
##   text95     4      4         1
##   text96    28     29         2
##   text97     5      6         1
##   text98    22     24         1
##   text99    15     21         1
##  text100    24     27         1
## 
## Source: /Users/pablobarbera/git/text-analysis-vienna/code/* on x86_64 by pablobarbera
## Created: Tue Oct 16 23:03:18 2018
## Notes:

We can then convert the corpus into a document-feature matrix (DFM) using the dfm function, and trim it to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.

twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE, 
             ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 4,566 documents, 48,657 features
##    ... removed 169 features
##    ... created a 4,566 x 48,488 sparse dfm
##    ... complete. 
## Elapsed time: 1.1 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring: 
##   - in fewer than 2 documents: 38,258
##   Total features removed: 38,258 (78.9%).

Note that other preprocessing options are available, such as stemming (stem=TRUE), removing punctuation (remove_punct=TRUE), and removing numbers (remove_numbers=TRUE). You can read more in the dfm and tokens help pages.
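
For instance, here is a sketch of a more aggressive setup (not the one used below, since we deliberately keep punctuation here; twdfm_alt is just an illustrative name):

twdfm_alt <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
             remove_punct=TRUE, remove_numbers=TRUE, stem=TRUE)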

Once we have the DFM, we split it into training and test sets. We’ll go with 80% training and 20% test. Note the use of a random seed to make sure our results are replicable.

set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- setdiff(1:nrow(tweets), training)
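
As a quick sanity check (not in the original code), we can confirm that the split worked as expected: 3,652 training tweets and 914 test tweets, with no overlap.

length(training)                    # 3652
length(test)                        # 914
length(intersect(training, test))   # 0 -- no tweet is in both sets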

Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models; for regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience at least), and it has cross-validation built in, so we don’t need to code it from scratch. We’ll start with a ridge regression:

library(glmnet)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training], 
    family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
    type.measure="class")
plot(ridge)
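
The plot shows the cross-validated misclassification error across values of lambda, with vertical lines marking lambda.min and lambda.1se. We can extract both values directly:

ridge$lambda.min  # value of lambda with the lowest cross-validated error
ridge$lambda.1se  # largest lambda within one standard error of that minimum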

We can now compute the performance metrics on the test set.

# function to compute accuracy
accuracy <- function(ypred, y){
    tab <- table(ypred, y)
    return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
    tab <- table(ypred, y)
    return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
    tab <- table(ypred, y)
    return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
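# optional addition (not in the original helpers): F1, the harmonic
# mean of precision and recall
f1 <- function(ypred, y){
    p <- precision(ypred, y)
    r <- recall(ypred, y)
    return(2 * p * r / (p + r))
}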
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##      
## preds   0   1
##     0  17   3
##     1 176 718
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8041575
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.803132
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9958391
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.85
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.0880829
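
Note how unbalanced these metrics are: the ridge model classifies almost every tweet in the test set as engaging, so recall for the engaging (1) class is nearly perfect while recall for the broadcasting (0) class is below 9%.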

Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:

# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
##            @        thank         back            !     tomorrow 
##  0.026491467  0.057415867  0.011189273  0.009114665 -0.017390637 
##         well 
##  0.021270635
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
##             coef                  word
## 4138  -0.3378288               reverse
## 9599  -0.3369787              knocking
## 6198  -0.3347340              that_man
## 5584  -0.3330700             and_share
## 214   -0.3303776               and_its
## 1939  -0.3252179                 god_,
## 3644  -0.3211980                beacon
## 8636  -0.3204813               posters
## 4917  -0.3184335                eu_law
## 6489  -0.3178059                  zone
## 6490  -0.3170302               defends
## 6446  -0.3170106             tonbridge
## 8288  -0.3162415       political_class
## 6179  -0.3145641           the_weather
## 8977  -0.3129769              on_being
## 2165  -0.3128641                  #yes
## 581   -0.3118844                 cunts
## 8687  -0.3117144            would_make
## 6838  -0.3114111            initiative
## 9354  -0.3086579               debates
## 8396  -0.3073173            determined
## 6366  -0.3070599              tweets_,
## 7826  -0.3069591         entertainment
## 9169  -0.3054351              cleaning
## 6906  -0.3035772                  earn
## 4899  -0.3034597             they_know
## 8729  -0.3020196            from_today
## 10007 -0.3011251           interview_i
## 8693  -0.3010277       twitter_account
## 2343  -0.3008454 scottish_independence
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, that_man, and_share, and_its, god_,, beacon, posters, eu_law, zone, defends, tonbridge, political_class, the_weather, on_being, #yes, cunts, would_make, initiative, debates, determined, tweets_,, entertainment, cleaning, earn, they_know, from_today, interview_i, twitter_account, scottish_independence"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
##           coef              word
## 9350 0.1849089         town_hall
## 2882 0.1487279              !_rt
## 7778 0.1486876              dt_@
## 7776 0.1486873                dt
## 9228 0.1483949        west_green
## 6033 0.1483920           western
## 8702 0.1467153            seat_.
## 5436 0.1426296     insignificant
## 4407 0.1354776              re_:
## 5013 0.1307438            vote_-
## 6372 0.1289006            ;_thnx
## 9840 0.1287744            is_big
## 7578 0.1264567            ,_much
## 8462 0.1224417          to_prove
## 9587 0.1223285     of_scotland's
## 2520 0.1221726             :_new
## 8245 0.1208279       for_tonight
## 6130 0.1191033          leader_,
## 2072 0.1188151           bank_of
## 8831 0.1186134          @_yougov
## 8671 0.1155230          result_,
## 1054 0.1154848             @_bbc
## 8922 0.1144141 ._congratulations
## 7194 0.1143405         to_labour
## 3204 0.1134104               !_&
## 3203 0.1126346            team_!
## 3303 0.1126004       great_piece
## 3261 0.1118064              /_eu
## 6099 0.1114532          addition
## 9585 0.1102726       compliments
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, dt_@, dt, west_green, western, seat_., insignificant, re_:, vote_-, ;_thnx, is_big, ,_much, to_prove, of_scotland's, :_new, for_tonight, leader_,, bank_of, @_yougov, result_,, @_bbc, ._congratulations, to_labour, !_&, team_!, great_piece, /_eu, addition, compliments"

We can easily modify our code to experiment with lasso or elastic net models:

lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training], 
    family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
    type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##      
## preds   0   1
##     0  42   6
##     1 151 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8282276
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8256351
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9916782
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.875
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2176166
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
##         @     thank      back         !  tomorrow      well 
## 0.6783606 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
##            coef            word
## 586  -1.8190575            on_@
## 9520 -1.8044096     #voteni2014
## 1770 -1.6229863           via_@
## 5583 -1.5450208     #votelabour
## 1525 -1.3048501            to_@
## 338  -1.2989997          with_@
## 1153 -1.2099682           and_@
## 7638 -1.0223167        with_his
## 193  -0.9689723         #ep2014
## 1015 -0.8800475        hustings
## 8846 -0.7804095           far_@
## 131  -0.6992836          hacked
## 9752 -0.6860012             (_@
## 461  -0.6475536 #labourdoorstep
## 685  -0.6445072           today
## 1076 -0.6263706            rt_@
## 1642 -0.6211360      #votelab14
## 819  -0.5889407           green
## 7927 -0.5575486            at_@
## 6073 -0.5570695      just_voted
## 494  -0.5563689            @_is
## 2880 -0.5491800              rd
## 1230 -0.5211952       elections
## 830  -0.5132158  #votegreen2014
## 502  -0.5061432             '_s
## 1458 -0.4984102             -_@
## 1678 -0.4602522         meeting
## 2209 -0.3943764             :_"
## 1033 -0.3754483             ;_@
## 176  -0.3648419     campaigning
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
##            coef       word
## 584  0.68922187        @_i
## 1    0.67836062          @
## 341  0.46033939     thanks
## 15   0.41625233  thank_you
## 79   0.19778512          ?
## 626  0.17325267     @_good
## 46   0.15185734        :_-
## 343  0.09751149   @_thanks
## 1285 0.07860876      @_yes
## 565  0.07486636       good
## 474  0.01586118      @_you
## 2    0.00000000      thank
## 3    0.00000000       back
## 4    0.00000000          !
## 5    0.00000000   tomorrow
## 6    0.00000000       well
## 7    0.00000000       look
## 8    0.00000000        @_@
## 9    0.00000000      daily
## 10   0.00000000   politics
## 12   0.00000000    @_thank
## 13   0.00000000          -
## 14   0.00000000          )
## 16   0.00000000      seems
## 17   0.00000000     people
## 18   0.00000000      you_!
## 19   0.00000000       keep
## 20   0.00000000     seeing
## 21   0.00000000 everywhere
## 22   0.00000000     moment

We now see that the coefficients for some features are exactly zero: unlike ridge, the lasso penalty shrinks many coefficients all the way to zero, effectively performing feature selection.
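
We can quantify that sparsity directly (a quick check on the beta vector we just extracted):

sum(beta != 0)   # number of features the lasso keeps
mean(beta == 0)  # proportion of features dropped entirely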

enet <- cv.glmnet(twdfm[training,], tweets$engaging[training], 
    family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
    type.measure="class")
# NOTE: this will not cross-validate across values of alpha (see the sketch below)
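
A minimal sketch of how one could tune alpha as well, reusing the objects above (a hypothetical loop; fixing foldid keeps the fold assignments identical across values of alpha):

# compare cross-validated error across a grid of alpha values
foldid <- sample(rep(1:5, length.out=length(training)))
alphas <- seq(0, 1, by=0.25)
cv.error <- sapply(alphas, function(a){
    cv <- cv.glmnet(twdfm[training,], tweets$engaging[training],
        family="binomial", alpha=a, foldid=foldid, parallel=TRUE,
        type.measure="class")
    min(cv$cvm)  # lowest misclassification error for this alpha
})
alphas[which.min(cv.error)]  # best-performing value of alpha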

# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##      
## preds   0   1
##     0  56  10
##     1 137 711
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8391685
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8384434
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9861304
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.8484848
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2901554
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
##         @     thank      back         !  tomorrow      well 
## 0.5962621 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
##            coef              word
## 9520 -1.9397086       #voteni2014
## 586  -1.8435345              on_@
## 2945 -1.7475249              nw_.
## 7638 -1.7255964          with_his
## 5583 -1.7242298       #votelabour
## 2897 -1.4583091 @_#labourdoorstep
## 9752 -1.4446369               (_@
## 1770 -1.3407251             via_@
## 1525 -1.3175995              to_@
## 8846 -1.2491866             far_@
## 338  -1.2335697            with_@
## 1015 -1.1967364          hustings
## 1153 -1.1953932             and_@
## 6073 -1.1647631        just_voted
## 131  -1.1537809            hacked
## 111  -1.1458023          password
## 6709 -1.0784496           @_event
## 1642 -1.0753043        #votelab14
## 4645 -1.0619340            only_@
## 193  -0.9850949           #ep2014
## 9241 -0.8968665            meps_:
## 516  -0.8729016        cameron_on
## 1678 -0.8693618           meeting
## 2209 -0.8454162               :_"
## 6198 -0.8273467          that_man
## 1458 -0.8232270               -_@
## 8268 -0.8151539         on_friday
## 685  -0.8048469             today
## 9295 -0.7875773       starting_to
## 7927 -0.7608666              at_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
##            coef       word
## 584  0.81795629        @_i
## 1    0.59626209          @
## 15   0.58092415  thank_you
## 341  0.46442734     thanks
## 46   0.37831107        :_-
## 626  0.36001317     @_good
## 1285 0.34124918      @_yes
## 343  0.28964252   @_thanks
## 79   0.25662852          ?
## 474  0.25328455      @_you
## 5824 0.20552723 #votegreen
## 1167 0.17914718      @_the
## 2153 0.15843101    @_great
## 674  0.13645280      @_not
## 565  0.13181937       good
## 992  0.10522722  good_luck
## 1477 0.05205723       need
## 1095 0.04141222      @_i'm
## 1393 0.03607920     any_of
## 1037 0.02718496       many
## 642  0.01709674        :_)
## 785  0.01002262   @_please
## 2    0.00000000      thank
## 3    0.00000000       back
## 4    0.00000000          !
## 5    0.00000000   tomorrow
## 6    0.00000000       well
## 7    0.00000000       look
## 8    0.00000000        @_@
## 9    0.00000000      daily