Regularized regression

To learn how to build a supervised machine learning model, we will use a classic computer science dataset of movie reviews (Pang and Lee 2004). The corpus has a sentiment attribute that labels each text as either pos or neg, derived from the star rating in the original review from imdb.com’s archived newsgroup reviews.

Let’s start by reading the dataset and creating a dummy variable equal to 1 for positive reviews and 0 for negative ones.

library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
movies <- read.csv("../data/movie-reviews.csv", stringsAsFactors = FALSE)
movies$pos <- ifelse(movies$sentiment=="pos", 1, 0)

As we have discussed multiple times in the course, before we can do any type of automated text analysis, we need to take the raw text through several “preprocessing” steps so that it can be passed to a statistical model.

corp <- corpus(movies)
summary(corp, n=10)
## Corpus consisting of 2000 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences sentiment pos
##   text1   354    841         9       neg   0
##   text2   156    278         1       neg   0
##   text3   276    553         3       neg   0
##   text4   313    555         2       neg   0
##   text5   380    841         2       neg   0
##   text6   328    747         1       neg   0
##   text7   331    641         5       neg   0
##   text8   325    673         6       neg   0
##   text9   441    794        10       neg   0
##  text10   401    965        23       neg   0
toks <- tokens(corp)
head(toks)
## Tokens consisting of 6 documents and 2 docvars.
## text1 :
##  [1] "plot"    ":"       "two"     "teen"    "couples" "go"      "to"     
##  [8] "a"       "church"  "party"   ","       "drink"  
## [ ... and 829 more ]
## 
## text2 :
##  [1] "the"       "happy"     "bastard's" "quick"     "movie"     "review"   
##  [7] "damn"      "that"      "y2k"       "bug"       "."         "it's"     
## [ ... and 266 more ]
## 
## text3 :
##  [1] "it"       "is"       "movies"   "like"     "these"    "that"    
##  [7] "make"     "a"        "jaded"    "movie"    "viewer"   "thankful"
## [ ... and 541 more ]
## 
## text4 :
##  [1] "\""             "quest"          "for"            "camelot"       
##  [5] "\""             "is"             "warner"         "bros"          
##  [9] "."              "'"              "first"          "feature-length"
## [ ... and 543 more ]
## 
## text5 :
##  [1] "synopsis"      ":"             "a"             "mentally"     
##  [5] "unstable"      "man"           "undergoing"    "psychotherapy"
##  [9] "saves"         "a"             "boy"           "from"         
## [ ... and 829 more ]
## 
## text6 :
##  [1] "capsule" ":"       "in"      "2176"    "on"      "the"     "planet" 
##  [8] "mars"    "police"  "taking"  "into"    "custody"
## [ ... and 735 more ]

We can then convert the tokens object into a document-feature matrix using the dfm function. After removing English stopwords, we trim the matrix to keep only tokens that appear in two or more reviews.

mdfm <- dfm(toks, verbose=TRUE)
## Creating a dfm from a tokens input...
##  ...lowercasing
##  ...found 2,000 documents, 48,362 features
##  ...complete, elapsed time: 0.348 seconds.
## Finished constructing a 2,000 x 48,362 sparse dfm.
mdfm <- dfm_remove(mdfm, stopwords("english"))
mdfm <- dfm_trim(mdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
##   - in fewer than 2 documents: 22,178
##   Total features removed: 22,178 (46.0%).
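
If we want a quick look at what survives trimming, quanteda’s topfeatures function lists the most frequent features in the matrix:

topfeatures(mdfm, n=10)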

Once we have the DFM, we split it into training and test sets. We’ll go with 80% for training and 20% for the test set. Note the use of a random seed to make sure our results are replicable.

set.seed(123)
# 80% random sample of rows
training <- sample(1:nrow(movies), floor(.80 * nrow(movies)))
# the remaining 20% will be the test set
test <- setdiff(1:nrow(movies), training)
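
As a quick check that the random split preserved the roughly balanced class distribution, we can tabulate the outcome in each subset:

table(movies$pos[training])
table(movies$pos[test])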

Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models; for regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience, at least), and it has cross-validation built in, so we don’t need to code it from scratch. We’ll start with a lasso regression:

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
lasso <- cv.glmnet(x=mdfm[training,], y=movies$pos[training], 
    family="binomial", alpha=1, nfolds=5, parallel=TRUE,
    intercept=TRUE, type.measure="class")
## Warning: executing %dopar% sequentially: no parallel backend registered
plot(lasso)
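
Note the warning above: parallel=TRUE only takes effect if a parallel backend has been registered first; otherwise cv.glmnet silently falls back to sequential execution. One way to register a backend, assuming the doParallel package is installed, is:

library(doParallel)
registerDoParallel(cores=4) # adjust to the number of cores available

After this, re-running cv.glmnet with parallel=TRUE will distribute the cross-validation folds across cores.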

We can now compute the performance metrics on the test set.

## function to compute performance metrics from a confusion matrix
## (rows = predicted class, columns = actual class)
performance <- function(ypred, y){
    tab <- table(ypred, y)
    accuracy <- (tab[1,1]+tab[2,2])/sum(tab)  # correctly classified / total
    precision <- tab[2,2]/(tab[2,1]+tab[2,2]) # true positives / predicted positives
    recall <- tab[2,2]/(tab[1,2]+tab[2,2])    # true positives / actual positives
    message("Accuracy = ", round(accuracy, 2), "\n",
            "Precision = ", round(precision, 2), "\n",
            "Recall = ", round(recall, 2))
}
# computing predicted values
preds <- predict(lasso, mdfm[test,], type="class")
# confusion matrix
table(preds, movies$pos[test])
##      
## preds   0   1
##     0 164  37
##     1  31 168
# performance metrics for positive category
performance(preds, movies$pos[test])
## Accuracy = 0.83
## Precision = 0.84
## Recall = 0.82
# performance metrics for negative category
performance(preds==0, movies$pos[test]==0)
## Accuracy = 0.83
## Precision = 0.82
## Recall = 0.84
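
Since the two classes are balanced here, accuracy is informative on its own, but if we wanted a single number combining precision and recall we could also compute the F1 score, their harmonic mean. A minimal sketch, separate from the helper above:

# F1 score: harmonic mean of precision and recall
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)
f1(0.84, 0.82) # positive category in the lasso results above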

It is often very useful to look at the actual estimated coefficients and see which features have the highest or lowest values:

# from the different values of lambda, let's pick the largest one whose
# cross-validated error is within one standard error of the minimum (the
# "one-standard-error" rule: it favors the most parsimonious model)
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
##        plot           :         two        teen     couples          go 
## -0.02231017  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=FALSE)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
##              coef          word
## 1130  -0.62766216    ridiculous
## 503   -0.51222707         awful
## 328   -0.47048894        wasted
## 884   -0.44379135        boring
## 469   -0.42532017        poorly
## 8698  -0.38673896        forgot
## 50    -0.36910698          mess
## 2330  -0.36640677         worst
## 1136  -0.33863465         waste
## 10052 -0.30613024   degenerates
## 36    -0.29335604           bad
## 551   -0.29282454          dull
## 1290  -0.28158498       unfunny
## 2766  -0.25500535          bore
## 2383  -0.25385417          lame
## 1517  -0.24943734 unimaginative
## 1098  -0.23560863      supposed
## 4073  -0.23310665     ludicrous
## 5091  -0.20984502        sloppy
## 1305  -0.20085050        stupid
## 440   -0.19122127 unfortunately
## 5002  -0.18212467         lousy
## 629   -0.17399124          none
## 5079  -0.11888499        random
## 3979  -0.11874039  embarrassing
## 415   -0.11849907       nothing
## 2280  -0.10441423          poor
## 1941  -0.10020055      material
## 2106  -0.09840352        script
## 9346  -0.08378153         fares
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
##             coef         word
## 7561  0.58025426       finest
## 18414 0.28356310  outstanding
## 8297  0.27920661       allows
## 4429  0.27452309    perfectly
## 7682  0.24709632    fantastic
## 6047  0.20161028    hilarious
## 487   0.19746314    memorable
## 10551 0.19298947         slip
## 4248  0.18864490    excellent
## 3391  0.17860220        flaws
## 12436 0.17458902     chilling
## 215   0.16907678      overall
## 7353  0.16519511   refreshing
## 2561  0.16205054    effective
## 1526  0.15424044 performances
## 1044  0.14077573        great
## 10176 0.12779893  wonderfully
## 10534 0.11935178 breathtaking
## 1727  0.11620400          fun
## 21    0.10700684         life
## 4317  0.10551512          war
## 2440  0.09587107    wonderful
## 179   0.08198933    different
## 5065  0.08147989     terrific
## 17510 0.07499402        naval
## 1482  0.07477818         true
## 12883 0.07473119     religion
## 240   0.06861690         also
## 786   0.06595291        quite
## 398   0.06214336         best
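
As an aside, glmnet can extract the same coefficients more directly: coef() applied to a cv.glmnet object accepts s="lambda.1se" (or s="lambda.min"), which saves us the manual indexing into glmnet.fit:

# sparse matrix of coefficients at lambda.1se; the first row is the intercept
beta.1se <- coef(lasso, s="lambda.1se")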

We can easily modify our code to experiment with ridge (alpha = 0) or elastic net (0 < alpha < 1) models. Let’s start with ridge; an elastic net sketch follows at the end:

ridge <- cv.glmnet(x=mdfm[training,], y=movies$pos[training], 
    family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
    type.measure="class")
# computing predicted values
preds <- predict(ridge, mdfm[test,], type="class")
# confusion matrix
table(preds, movies$pos[test])
##      
## preds   0   1
##     0 159  61
##     1  36 144
# performance metrics (slightly worse)
performance(preds, movies$pos[test])
## Accuracy = 0.76
## Precision = 0.8
## Recall = 0.7
performance(preds==0, movies$pos[test]==0)
## Accuracy = 0.76
## Precision = 0.72
## Recall = 0.82
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
##          plot             :           two          teen       couples 
## -2.963434e-03 -1.323379e-05  2.523335e-04 -2.289880e-03 -2.345417e-03 
##            go 
##  5.152863e-04
## identifying predictive features -- note that with ridge,
## none of the coefficients is exactly zero
df <- data.frame(coef = as.numeric(beta),
                word = names(beta), stringsAsFactors=FALSE)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
##              coef             word
## 19136 -0.03224894         labour's
## 22529 -0.03208518           julius
## 10440 -0.03080366        obscenely
## 21746 -0.03073966          crudely
## 15294 -0.03072414        wild-eyed
## 12171 -0.03057219            ray's
## 7226  -0.03021453        logistics
## 19524 -0.03015526         enslaved
## 17469 -0.03006717          takeshi
## 15480 -0.02988309         sentient
## 18563 -0.02987381            hazel
## 18568 -0.02985666    unwillingness
## 12046 -0.02963506        terrorise
## 12064 -0.02962371        housemaid
## 12069 -0.02958821          desiree
## 12076 -0.02955843           talbot
## 12077 -0.02953857         rothwell
## 12081 -0.02952221 double-entendres
## 22256 -0.02943862         stinking
## 22257 -0.02942045   indestructible
## 6173  -0.02936606        targetted
## 11895 -0.02935359      condolences
## 19475 -0.02934545              mar
## 19190 -0.02931492       occurances
## 14965 -0.02927048   dehumanization
## 11321 -0.02925344         subtract
## 16413 -0.02921493      accomplices
## 3581  -0.02919908         founding
## 17898 -0.02918410       fistfights
## 16206 -0.02917443          ungodly
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
##             coef          word
## 24427 0.03517332     advisable
## 25498 0.03515558     coincides
## 18262 0.03513583       aplenty
## 25280 0.03509691        devane
## 20409 0.03507061        leeper
## 25540 0.03487283        cyborg
## 8659  0.03431398       hackers
## 19844 0.03425050      swooping
## 20533 0.03420042     scapegoat
## 25024 0.03419593    contractor
## 23161 0.03392933       gallons
## 22216 0.03391008       bigelow
## 25544 0.03390200      peddling
## 16055 0.03382480    detonating
## 17049 0.03340608     guitarist
## 1162  0.03327806   specialized
## 16065 0.03327544      amounted
## 20171 0.03321750          hugs
## 23112 0.03312815         delta
## 24198 0.03312370      bogart's
## 19473 0.03304250      finale's
## 20403 0.03302025           jun
## 25577 0.03301965     deceiving
## 23614 0.03295575       rumored
## 23669 0.03284718 complimentary
## 23612 0.03280681    neccessary
## 22759 0.03280587       wire-fu
## 25765 0.03267184  cartoon-like
## 5384  0.03263584     speedboat
## 22110 0.03262776        konner
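
For the elastic net version, the only change is setting alpha strictly between 0 and 1. A minimal sketch with alpha=0.50 as an arbitrary middle ground (to tune alpha itself, we would loop over candidate values and compare the cross-validated errors):

enet <- cv.glmnet(x=mdfm[training,], y=movies$pos[training],
    family="binomial", alpha=0.50, nfolds=5, intercept=TRUE,
    type.measure="class")
# computing predicted values and performance on the test set
preds <- predict(enet, mdfm[test,], type="class")
performance(preds, movies$pos[test])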