To learn how to apply supervised machine learning to social media text, we will use a random sample of nearly 5,000 tweets mentioning the names of candidates in the 2014 European Parliament elections in the UK. We will be analyzing the variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (just sending a message, without trying to elicit a response).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates avoid engaging with voters on Twitter because they are exposed to mostly impolite messages.
Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("~/data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
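Before fitting any models, it helps to know how balanced the two classes are. A quick check (output omitted, as this was not part of the original script):
prop.table(table(tweets$engaging)) # share of broadcasting (0) vs engaging (1) tweets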
We’ll do some cleaning as well, replacing all Twitter handles with a generic @ symbol. Why? We want to prevent overfitting: the model should learn that mentioning another user is predictive, rather than memorizing the specific usernames that happen to appear in the training set.
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
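To see what this substitution does, here is a made-up example tweet (not from the dataset):
gsub('@[0-9_A-Za-z]+', '@', "thanks @johndoe for the RT!")
## expected output: [1] "thanks @ for the RT!"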
As we discussed earlier today, before we can do any type of automated text analysis the text needs to go through several “preprocessing” steps so that it can be passed to a statistical model. We’ll use the quanteda package here.
twcorpus <- corpus(tweets$text)
summary(twcorpus, n = 10)
## Corpus consisting of 4566 documents, showing 10 documents:
##
## Text Types Tokens Sentences
## text1 8 10 2
## text2 17 23 3
## text3 15 18 2
## text4 15 19 3
## text5 14 15 1
## text6 3 3 1
## text7 9 9 1
## text8 20 21 2
## text9 20 23 5
## text10 8 9 1
##
## Source: /Users/pablobarbera/git/ECPR-SC105/code/* on x86_64 by pablobarbera
## Created: Thu Aug 9 11:08:16 2018
## Notes:
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 4,566 documents, 48,657 features
## ... removed 169 features
## ... created a 4,566 x 48,488 sparse dfm
## ... complete.
## Elapsed time: 1.28 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 documents: 38,258
## Total features removed: 38,258 (78.9%).
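At this point we could also take a quick look at the most frequent features in the trimmed DFM (output omitted, as this check was not part of the original script):
topfeatures(twdfm, n = 20) # 20 most common features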
Note that there are other preprocessing options, such as stemming, removing punctuation and numbers, or converting all text to lowercase. You can read more in the dfm and tokens help pages; a few of these options are sketched below.
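For instance, a minimal sketch of what some of these options look like (not run here; twdfm_alt is an illustrative name, and the arguments are those of quanteda 1.3):
twdfm_alt <- dfm(twcorpus,
                 tolower = TRUE,         # lowercase all text (the default)
                 stem = TRUE,            # reduce words to their stems
                 remove_punct = TRUE,    # drop punctuation
                 remove_numbers = TRUE,  # drop numbers
                 remove = stopwords("english"), verbose = TRUE)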
Once we have the DFM, we split it into a training and a test set. We’ll go with 80% training and 20% test. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- setdiff(1:nrow(tweets), training) # rows not used for training
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience at least), and it has cross-validation already built-in, so we don’t need to code it from scratch. We’ll start with a ridge regression:
library(glmnet)
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.4.4
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
plot(ridge)
The plot shows the cross-validated misclassification error across values of the regularization parameter lambda; the two vertical dotted lines mark lambda.min and lambda.1se. We can now compute the performance metrics on the test set.
# function to compute accuracy: proportion of predictions that are correct
accuracy <- function(ypred, y){
  tab <- table(ypred, y)
  return(sum(diag(tab))/sum(tab))
}
# function to compute precision: true positives over all predicted positives
precision <- function(ypred, y){
  tab <- table(ypred, y)
  return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall: true positives over all actual positives
recall <- function(ypred, y){
  tab <- table(ypred, y)
  return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
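As a quick sanity check of these helpers, consider made-up vectors: predicting c(1, 1, 0) against true labels c(1, 0, 0) gets two of three cases right, with one true positive out of two predicted positives and one true positive out of one actual positive.
accuracy(c(1,1,0), c(1,0,0))  # expected: 0.667
precision(c(1,1,0), c(1,0,0)) # expected: 0.5
recall(c(1,1,0), c(1,0,0))    # expected: 1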
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 17 3
## 1 176 718
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8041575
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.803132
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9958391
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.85
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.0880829
It is often very useful to inspect the actual estimated coefficients and see which features have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@
## 0.026491469 0.057415856 0.009114663 -0.003353636 0.023129622
## @_thank
## 0.067938135
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 4412 -0.3378288 reverse
## 9671 -0.3369787 knocking
## 6265 -0.3347340 that_man
## 5764 -0.3330700 and_share
## 262 -0.3303776 and_its
## 2332 -0.3252179 god_,
## 4026 -0.3211980 beacon
## 8676 -0.3204813 posters
## 4703 -0.3184335 eu_law
## 6448 -0.3178059 zone
## 6449 -0.3170302 defends
## 6507 -0.3170106 tonbridge
## 8310 -0.3162415 political_class
## 6072 -0.3145641 the_weather
## 8932 -0.3129769 on_being
## 2595 -0.3128641 #yes
## 555 -0.3118844 cunts
## 8719 -0.3117144 would_make
## 6769 -0.3114111 initiative
## 9254 -0.3086579 debates
## 8356 -0.3073173 determined
## 6316 -0.3070599 tweets_,
## 7878 -0.3069591 entertainment
## 9263 -0.3054351 cleaning
## 7021 -0.3035772 earn
## 4694 -0.3034597 they_know
## 8701 -0.3020196 from_today
## 10036 -0.3011251 interview_i
## 8660 -0.3010277 twitter_account
## 2061 -0.3008453 scottish_independence
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, that_man, and_share, and_its, god_,, beacon, posters, eu_law, zone, defends, tonbridge, political_class, the_weather, on_being, #yes, cunts, would_make, initiative, debates, determined, tweets_,, entertainment, cleaning, earn, they_know, from_today, interview_i, twitter_account, scottish_independence"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9429 0.1849087 town_hall
## 3412 0.1487278 !_rt
## 7699 0.1486876 dt_@
## 7697 0.1486873 dt
## 9319 0.1483975 west_green
## 6136 0.1483925 western
## 8671 0.1467153 seat_.
## 5143 0.1426298 insignificant
## 4626 0.1354776 re_:
## 5275 0.1307437 vote_-
## 6326 0.1289008 ;_thnx
## 9866 0.1287744 is_big
## 7468 0.1264568 ,_much
## 8435 0.1224417 to_prove
## 9516 0.1223285 of_scotland's
## 2203 0.1221726 :_new
## 8227 0.1208279 for_tonight
## 6211 0.1191033 leader_,
## 1826 0.1188151 bank_of
## 8869 0.1186131 @_yougov
## 8640 0.1155231 result_,
## 987 0.1154847 @_bbc
## 8883 0.1144142 ._congratulations
## 7085 0.1143405 to_labour
## 2786 0.1134104 !_&
## 2785 0.1126346 team_!
## 3747 0.1126003 great_piece
## 3711 0.1118065 /_eu
## 6185 0.1114531 addition
## 9514 0.1102726 compliments
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, dt_@, dt, west_green, western, seat_., insignificant, re_:, vote_-, ;_thnx, is_big, ,_much, to_prove, of_scotland's, :_new, for_tonight, leader_,, bank_of, @_yougov, result_,, @_bbc, ._congratulations, to_labour, !_&, team_!, great_piece, /_eu, addition, compliments"
We can easily modify our code to experiment with lasso or elastic net models:
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 42 6
## 1 151 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8282276
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8256351
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9916782
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.875
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2176166
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.6783611 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 601 -1.8190566 on_@
## 9434 -1.8044280 #voteni2014
## 2074 -1.6229927 via_@
## 5326 -1.5450336 #votelabour
## 1375 -1.3048701 to_@
## 282 -1.2989784 with_@
## 1266 -1.2099612 and_@
## 7537 -1.0223134 with_his
## 228 -0.9689621 #ep2014
## 1089 -0.8800491 hustings
## 8786 -0.7803956 far_@
## 109 -0.6992762 hacked
## 9790 -0.6859966 (_@
## 498 -0.6475886 #labourdoorstep
## 669 -0.6445125 today
## 1004 -0.6263757 rt_@
## 1922 -0.6211318 #votelab14
## 844 -0.5889416 green
## 7881 -0.5575388 at_@
## 5928 -0.5570703 just_voted
## 456 -0.5563673 @_is
## 2504 -0.5491547 rd
## 1386 -0.5211930 elections
## 851 -0.5132192 #votegreen2014
## 471 -0.5061936 '_s
## 1700 -0.4984246 -_@
## 1497 -0.4602545 meeting
## 1950 -0.3944299 :_"
## 969 -0.3754333 ;_@
## 200 -0.3648572 campaigning
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 558 0.68922703 @_i
## 1 0.67836113 @
## 398 0.46034281 thanks
## 8 0.41625360 thank_you
## 72 0.19778527 ?
## 614 0.17325089 @_good
## 54 0.15182300 :_-
## 399 0.09752179 @_thanks
## 1186 0.07863422 @_yes
## 587 0.07487055 good
## 515 0.01586562 @_you
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 7 0.00000000 back
## 9 0.00000000 you_!
## 10 0.00000000 tomorrow
## 11 0.00000000 well
## 12 0.00000000 daily
## 13 0.00000000 politics
## 15 0.00000000 -
## 16 0.00000000 )
## 17 0.00000000 seems
## 18 0.00000000 !_will
## 19 0.00000000 people
## 20 0.00000000 keep
## 21 0.00000000 will_have
## 22 0.00000000 have_a
We now see that the lasso shrank the coefficients for many features to exactly zero; this is the effect of the L1 penalty. Finally, we can fit an elastic net, which combines the two penalties (here with alpha=0.50):
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# NOTE: this will not cross-validate across values of alpha
# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 56 10
## 1 137 711
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8391685
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8384434
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9861304
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.8484848
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2901554
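As the note above says, cv.glmnet only cross-validates lambda for a fixed value of alpha. A minimal sketch (not run here) of how one could also tune alpha with a manual grid search, fixing the fold assignments so that every run is evaluated on the same folds:
# fix fold assignments so every value of alpha is evaluated on the same folds
foldid <- sample(rep(1:5, length.out = length(training)))
alphas <- c(0, 0.25, 0.50, 0.75, 1)
cv.errors <- sapply(alphas, function(a){
  fit <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                   family="binomial", alpha=a, foldid=foldid,
                   parallel=TRUE, type.measure="class")
  min(fit$cvm) # lowest cross-validated misclassification error for this alpha
})
alphas[which.min(cv.errors)] # value of alpha with the best CV performance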
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.5962622 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
# identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9434 -1.9397130 #voteni2014
## 601 -1.8435300 on_@
## 3459 -1.7475230 nw_.
## 7537 -1.7255967 with_his
## 5326 -1.7242325 #votelabour
## 2514 -1.4582612 @_#labourdoorstep
## 9790 -1.4446327 (_@
## 2074 -1.3407118 via_@
## 1375 -1.3175990 to_@
## 8786 -1.2491845 far_@
## 282 -1.2335795 with_@
## 1089 -1.1967385 hustings
## 1266 -1.1953930 and_@
## 5928 -1.1647694 just_voted
## 109 -1.1537820 hacked
## 95 -1.1458042 password
## 6667 -1.0784478 @_event
## 1922 -1.0753018 #votelab14
## 4878 -1.0619328 only_@
## 228 -0.9851047 #ep2014
## 9153 -0.8968774 meps_:
## 495 -0.8729090 cameron_on
## 1497 -0.8693616 meeting
## 1950 -0.8454411 :_"
## 6265 -0.8273472 that_man
## 1700 -0.8232255 -_@
## 8298 -0.8151523 on_friday
## 669 -0.8048479 today
## 9205 -0.7875784 starting_to
## 7881 -0.7608702 at_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 558 0.81795613 @_i
## 1 0.59626220 @
## 8 0.58092430 thank_you
## 398 0.46442642 thanks
## 54 0.37830055 :_-
## 614 0.36000883 @_good
## 1186 0.34125006 @_yes
## 399 0.28964424 @_thanks
## 72 0.25662864 ?
## 515 0.25328467 @_you
## 5630 0.20551387 #votegreen
## 1074 0.17914739 @_the
## 1899 0.15842653 @_great
## 688 0.13645408 @_not
## 587 0.13181617 good
## 1065 0.10523721 good_luck
## 1343 0.05205937 need
## 1018 0.04141076 @_i'm
## 1270 0.03609250 any_of
## 975 0.02718622 many
## 632 0.01708541 :_)
## 781 0.01002222 @_please
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 7 0.00000000 back
## 9 0.00000000 you_!
## 10 0.00000000 tomorrow