To learn how to do supervised machine learning applied to social media text, we will use a random sample of nearly 5,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We will be analyzing the variable named communication, which indicates whether each tweet was hand-coded as engaging (a tweet that tries to engage with the audience of the account) or broadcasting (just sending a message, without trying to elicit a response).
The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. Our goal was to understand to what extent candidates avoid engaging with voters on Twitter because they are exposed to mostly impolite messages.
Let’s start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
We’ll do some cleaning as well, substituting all Twitter handles with a generic @. Why? We want to prevent overfitting to specific user names.
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
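For instance, applying the same substitution to a made-up tweet (the handle here is hypothetical):
gsub('@[0-9_A-Za-z]+', '@', "@SomeCandidate thanks for your support!")
# returns "@ thanks for your support!"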
As we discussed earlier today, before we can do any type of automated text analysis, the text needs to go through several “preprocessing” steps before it can be passed to a statistical model. We’ll use the quanteda package here.
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 4566 documents, showing 100 documents:
##
## Text Types Tokens Sentences
## text1 8 10 2
## text2 17 23 3
## text3 15 18 2
## text4 15 19 3
## text5 14 15 1
## text6 3 3 1
## text7 9 9 1
## text8 20 21 2
## text9 20 23 5
## text10 8 9 1
## text11 15 20 1
## text12 23 27 1
## text13 16 27 1
## text14 22 26 2
## text15 12 19 1
## text16 21 23 3
## text17 9 14 1
## text18 19 23 1
## text19 16 18 1
## text20 11 14 1
## text21 6 6 1
## text22 14 15 1
## text23 8 10 1
## text24 6 7 2
## text25 29 33 1
## text26 21 23 3
## text27 16 18 2
## text28 17 19 1
## text29 15 21 1
## text30 19 21 1
## text31 20 22 1
## text32 15 15 1
## text33 7 7 1
## text34 22 23 2
## text35 18 20 2
## text36 16 16 1
## text37 9 9 1
## text38 24 27 2
## text39 17 18 1
## text40 19 22 2
## text41 22 23 2
## text42 26 30 3
## text43 23 27 3
## text44 6 8 1
## text45 6 6 1
## text46 18 23 1
## text47 17 18 2
## text48 18 21 2
## text49 5 7 1
## text50 20 25 1
## text51 3 11 1
## text52 10 12 1
## text53 14 16 1
## text54 24 31 1
## text55 15 18 1
## text56 11 12 1
## text57 19 22 1
## text58 19 20 2
## text59 11 12 1
## text60 19 22 1
## text61 14 16 1
## text62 13 14 1
## text63 7 8 1
## text64 12 12 2
## text65 17 22 1
## text66 10 13 1
## text67 20 21 1
## text68 16 16 2
## text69 13 16 2
## text70 17 17 3
## text71 14 16 2
## text72 23 25 2
## text73 26 30 3
## text74 21 23 2
## text75 18 21 1
## text76 12 14 1
## text77 13 16 1
## text78 8 8 1
## text79 7 8 1
## text80 23 26 1
## text81 8 8 1
## text82 9 9 3
## text83 24 26 2
## text84 16 20 1
## text85 18 20 1
## text86 12 16 1
## text87 18 19 1
## text88 21 22 2
## text89 17 19 1
## text90 11 13 1
## text91 19 21 3
## text92 27 31 3
## text93 15 17 1
## text94 5 5 1
## text95 4 4 1
## text96 28 29 2
## text97 5 6 1
## text98 22 24 1
## text99 15 21 1
## text100 24 27 1
##
## Source: /Users/pablobarbera/git/social-media-upf/code/* on x86_64 by pablobarbera
## Created: Tue Jul 3 10:45:36 2018
## Notes:
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more tweets. Note that we keep punctuation – it turns out it can be quite informative.
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
ngrams=1:2, verbose=TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 4,566 documents, 48,657 features
## ... removed 169 features
## ... created a 4,566 x 48,488 sparse dfm
## ... complete.
## Elapsed time: 0.992 seconds.
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 documents: 38,258
## Total features removed: 38,258 (78.9%).
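Before moving on, it can be useful to get a sense of the trimmed matrix, for example its dimensions and most frequent features; a quick sketch:
dim(twdfm)
topfeatures(twdfm, 20)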
Note that other preprocessing options are available at this stage, such as stemming or removing punctuation, numbers, and symbols (see the sketch below). You can read more in the dfm and tokens help pages.
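As an illustration, here is a hedged sketch of a dfm call with a few of these options turned on (stemming plus removal of punctuation and numbers; these are not the settings used in the rest of this analysis):
twdfm_alt <- dfm(twcorpus, remove=stopwords("english"), stem=TRUE,
    remove_punct=TRUE, remove_numbers=TRUE, verbose=TRUE)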
Once we have the DFM, we split it into a training and a test set. We’ll go with 80% for training and 20% for testing. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
Our first step is to train the classifier using cross-validation. There are many packages in R for running machine learning models. For regularized regression, glmnet is, in my opinion, the best: it’s much faster than caret or mlr (in my experience, at least), and it has cross-validation built in, so we don’t need to code it from scratch. We’ll start with a ridge regression:
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
require(doMC)
## Loading required package: doMC
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
plot(ridge)
We can now compute the performance metrics on the test set.
## function to compute accuracy
accuracy <- function(ypred, y){
tab <- table(ypred, y)
return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
tab <- table(ypred, y)
return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
tab <- table(ypred, y)
return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
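# we can also combine precision and recall into a single F1 score
# (their harmonic mean); a small sketch building on the functions above
f1 <- function(ypred, y){
  prec <- precision(ypred, y)
  rec <- recall(ypred, y)
  return(2 * prec * rec / (prec + rec))
}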
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 17 3
## 1 176 718
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8041575
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.803132
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9958391
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.85
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.0880829
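As a quick sanity check, these numbers are worth comparing against the baseline of always predicting the majority class (engaging); a minimal sketch:
# share of each class in the test set (baseline for accuracy)
prop.table(table(tweets$engaging[test]))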
Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@
## 0.026491467 0.057415857 0.009114665 -0.003353644 0.023129622
## @_thank
## 0.067938129
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 4159 -0.3378288 reverse
## 9566 -0.3369787 knocking
## 6203 -0.3347340 that_man
## 5612 -0.3330700 and_share
## 223 -0.3303776 and_its
## 1908 -0.3252179 god_,
## 3658 -0.3211980 beacon
## 8580 -0.3204813 posters
## 4876 -0.3184335 eu_law
## 6418 -0.3178059 zone
## 6419 -0.3170302 defends
## 6531 -0.3170106 tonbridge
## 8231 -0.3162415 political_class
## 6167 -0.3145641 the_weather
## 9012 -0.3129769 on_being
## 2139 -0.3128640 #yes
## 501 -0.3118844 cunts
## 8619 -0.3117144 would_make
## 6787 -0.3114111 initiative
## 9386 -0.3086579 debates
## 8453 -0.3073173 determined
## 6366 -0.3070599 tweets_,
## 7890 -0.3069591 entertainment
## 9129 -0.3054351 cleaning
## 6958 -0.3035772 earn
## 4859 -0.3034597 they_know
## 8825 -0.3020196 from_today
## 9992 -0.3011251 interview_i
## 8785 -0.3010277 twitter_account
## 2366 -0.3008453 scottish_independence
paste(df$word[1:30], collapse=", ")
## [1] "reverse, knocking, that_man, and_share, and_its, god_,, beacon, posters, eu_law, zone, defends, tonbridge, political_class, the_weather, on_being, #yes, cunts, would_make, initiative, debates, determined, tweets_,, entertainment, cleaning, earn, they_know, from_today, interview_i, twitter_account, scottish_independence"
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9310 0.1849088 town_hall
## 2887 0.1487279 !_rt
## 7713 0.1486876 dt_@
## 7711 0.1486873 dt
## 9193 0.1483949 west_green
## 6059 0.1483919 western
## 8798 0.1467153 seat_.
## 5397 0.1426298 insignificant
## 4437 0.1354776 re_:
## 5029 0.1307438 vote_-
## 6374 0.1289006 ;_thnx
## 9819 0.1287744 is_big
## 7515 0.1264567 ,_much
## 8511 0.1224417 to_prove
## 9610 0.1223285 of_scotland's
## 2528 0.1221726 :_new
## 8292 0.1208279 for_tonight
## 6145 0.1191033 leader_,
## 2094 0.1188151 bank_of
## 8768 0.1186134 @_yougov
## 8765 0.1155230 result_,
## 1121 0.1154849 @_bbc
## 8967 0.1144141 ._congratulations
## 7135 0.1143405 to_labour
## 3184 0.1134104 !_&
## 3183 0.1126346 team_!
## 3317 0.1126004 great_piece
## 3275 0.1118064 /_eu
## 6118 0.1114532 addition
## 9608 0.1102726 compliments
paste(df$word[1:30], collapse=", ")
## [1] "town_hall, !_rt, dt_@, dt, west_green, western, seat_., insignificant, re_:, vote_-, ;_thnx, is_big, ,_much, to_prove, of_scotland's, :_new, for_tonight, leader_,, bank_of, @_yougov, result_,, @_bbc, ._congratulations, to_labour, !_&, team_!, great_piece, /_eu, addition, compliments"
We can easily modify our code to experiment with Lasso or ElasticNet models:
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 42 6
## 1 151 715
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
## [1] 0.8282276
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8256351
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9916782
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.875
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2176166
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.6783597 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 676 -1.8190662 on_@
## 9560 -1.8044093 #voteni2014
## 1733 -1.6229837 via_@
## 5549 -1.5450174 #votelabour
## 1572 -1.3048526 to_@
## 297 -1.2989550 with_@
## 1092 -1.2099696 and_@
## 7565 -1.0223143 with_his
## 195 -0.9689625 #ep2014
## 947 -0.8800502 hustings
## 8908 -0.7804123 far_@
## 121 -0.6992853 hacked
## 9724 -0.6860014 (_@
## 570 -0.6475644 #labourdoorstep
## 606 -0.6445173 today
## 1147 -0.6263880 rt_@
## 1588 -0.6211406 #votelab14
## 747 -0.5889406 green
## 7868 -0.5575370 at_@
## 6049 -0.5570636 just_voted
## 402 -0.5563883 @_is
## 2876 -0.5491715 rd
## 1182 -0.5211947 elections
## 754 -0.5132174 #votegreen2014
## 414 -0.5060993 '_s
## 1407 -0.4984036 -_@
## 1724 -0.4602532 meeting
## 2240 -0.3943573 :_"
## 1103 -0.3754607 ;_@
## 182 -0.3648502 campaigning
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 503 0.68922257 @_i
## 1 0.67835966 @
## 445 0.46034214 thanks
## 7 0.41625078 thank_you
## 75 0.19778461 ?
## 553 0.17321136 @_good
## 56 0.15185118 :_-
## 447 0.09750679 @_thanks
## 1350 0.07860851 @_yes
## 549 0.07487025 good
## 584 0.01588899 @_you
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 8 0.00000000 you_!
## 9 0.00000000 back
## 10 0.00000000 !_will
## 11 0.00000000 will_have
## 12 0.00000000 have_a
## 13 0.00000000 a_look
## 14 0.00000000 tomorrow
## 15 0.00000000 well
## 16 0.00000000 nice
## 17 0.00000000 daily
## 18 0.00000000 politics
## 19 0.00000000 admit
## 21 0.00000000 -
## 22 0.00000000 supporters
We now see that the coefficients for some features actually became zero.
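A quick way to quantify this sparsity is to count how many coefficients are exactly zero at the selected value of lambda; a minimal sketch reusing the beta vector extracted above:
sum(beta != 0)   # features the lasso keeps
mean(beta == 0)  # share of features shrunk exactly to zero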
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
family="binomial", alpha=0.50, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# NOTE: this will not cross-validate across values of alpha
# (see the sketch at the end of this section)
# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
##
## preds 0 1
## 0 56 10
## 1 137 711
# performance metrics
accuracy(preds, tweets$engaging[test])
## [1] 0.8391685
precision(preds==1, tweets$engaging[test]==1)
## [1] 0.8384434
recall(preds==1, tweets$engaging[test]==1)
## [1] 0.9861304
precision(preds==0, tweets$engaging[test]==0)
## [1] 0.8484848
recall(preds==0, tweets$engaging[test]==0)
## [1] 0.2901554
best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)
## @ thank ! look @_@ @_thank
## 0.5962636 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 9560 -1.9397670 #voteni2014
## 676 -1.8435486 on_@
## 2954 -1.7475130 nw_.
## 7565 -1.7256060 with_his
## 5549 -1.7242362 #votelabour
## 2888 -1.4585419 @_#labourdoorstep
## 9724 -1.4446374 (_@
## 1733 -1.3415249 via_@
## 1572 -1.3175980 to_@
## 8908 -1.2490577 far_@
## 297 -1.2335535 with_@
## 947 -1.1967375 hustings
## 1092 -1.1953927 and_@
## 6049 -1.1647580 just_voted
## 121 -1.1537672 hacked
## 99 -1.1458006 password
## 6674 -1.0784753 @_event
## 1588 -1.0753073 #votelab14
## 4673 -1.0619661 only_@
## 195 -0.9850781 #ep2014
## 9283 -0.8968599 meps_:
## 430 -0.8728814 cameron_on
## 1724 -0.8694604 meeting
## 2240 -0.8453640 :_"
## 6203 -0.8273457 that_man
## 1407 -0.8232311 -_@
## 8216 -0.8151511 on_friday
## 606 -0.8048400 today
## 9333 -0.7875681 starting_to
## 7868 -0.7608485 at_@
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 503 0.81796307 @_i
## 1 0.59626363 @
## 7 0.58092619 thank_you
## 445 0.46444452 thanks
## 56 0.37833801 :_-
## 553 0.36001448 @_good
## 1350 0.34125432 @_yes
## 447 0.28963311 @_thanks
## 75 0.25662721 ?
## 584 0.25329533 @_you
## 5805 0.20606005 #votegreen
## 1225 0.17915089 @_the
## 2184 0.15845517 @_great
## 767 0.13645318 @_not
## 549 0.13182246 good
## 925 0.10522658 good_luck
## 1531 0.05203757 need
## 1162 0.04144368 @_i'm
## 1438 0.03536373 any_of
## 1107 0.02717977 many
## 569 0.01709664 :_)
## 859 0.01002803 @_please
## 2 0.00000000 thank
## 3 0.00000000 !
## 4 0.00000000 look
## 5 0.00000000 @_@
## 6 0.00000000 @_thank
## 8 0.00000000 you_!
## 9 0.00000000 back
## 10 0.00000000 !_will
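Finally, as noted in the comment above, cv.glmnet only cross-validates over lambda for a fixed value of alpha. A hedged sketch of how one could compare a few alpha values manually (the grid of values is arbitrary, and this was not part of the original analysis):
# try a small grid of alpha values and store the cross-validated error
alphas <- c(0, 0.25, 0.50, 0.75, 1)
cv_error <- sapply(alphas, function(a){
  cv <- cv.glmnet(twdfm[training,], tweets$engaging[training],
    family="binomial", alpha=a, nfolds=5, parallel=TRUE,
    type.measure="class")
  min(cv$cvm) # misclassification error at the best lambda for this alpha
})
data.frame(alpha=alphas, cv_error=cv_error)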