To learn how to build a supervised machine learning model, we will
use a classic computer science dataset of movie reviews, (Pang and
Lee 2004). The movies corpus has an attribute Sentiment
that labels each text as either pos
or neg
according to the original imdb.com archived newspaper review star
rating.
Let’s start by reading the dataset and creating a dummy variable indicating whether each review is positive or negative.
library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
movies <- read.csv("../data/movie-reviews.csv", stringsAsFactors = FALSE)
movies$pos <- ifelse(movies$sentiment=="pos", 1, 0)
As we have discussed multiple times in the course, before we can do any type of automated text analysis, we will need to go through several “preprocessing” steps before it can be passed to a statistical model.
corp <- corpus(movies)
summary(corp, n=10)
## Corpus consisting of 2000 documents, showing 10 documents:
##
## Text Types Tokens Sentences sentiment pos
## text1 354 841 9 neg 0
## text2 156 278 1 neg 0
## text3 276 553 3 neg 0
## text4 313 555 2 neg 0
## text5 380 841 2 neg 0
## text6 328 747 1 neg 0
## text7 331 641 5 neg 0
## text8 325 673 6 neg 0
## text9 441 794 10 neg 0
## text10 401 965 23 neg 0
toks <- tokens(corp)
head(toks)
## Tokens consisting of 6 documents and 2 docvars.
## text1 :
## [1] "plot" ":" "two" "teen" "couples" "go" "to"
## [8] "a" "church" "party" "," "drink"
## [ ... and 829 more ]
##
## text2 :
## [1] "the" "happy" "bastard's" "quick" "movie" "review"
## [7] "damn" "that" "y2k" "bug" "." "it's"
## [ ... and 266 more ]
##
## text3 :
## [1] "it" "is" "movies" "like" "these" "that"
## [7] "make" "a" "jaded" "movie" "viewer" "thankful"
## [ ... and 541 more ]
##
## text4 :
## [1] "\"" "quest" "for" "camelot"
## [5] "\"" "is" "warner" "bros"
## [9] "." "'" "first" "feature-length"
## [ ... and 543 more ]
##
## text5 :
## [1] "synopsis" ":" "a" "mentally"
## [5] "unstable" "man" "undergoing" "psychotherapy"
## [9] "saves" "a" "boy" "from"
## [ ... and 829 more ]
##
## text6 :
## [1] "capsule" ":" "in" "2176" "on" "the" "planet"
## [8] "mars" "police" "taking" "into" "custody"
## [ ... and 735 more ]
We can then convert a corpus into a document-feature matrix using the dfm function. We will then trim it in order to keep only tokens that appear in 2 or more reviews
mdfm <- dfm(toks, verbose=TRUE)
## Creating a dfm from a tokens input...
## ...lowercasing
## ...found 2,000 documents, 48,362 features
## ...complete, elapsed time: 0.348 seconds.
## Finished constructing a 2,000 x 48,362 sparse dfm.
mdfm <- dfm_remove(mdfm, stopwords("english"))
mdfm <- dfm_trim(mdfm, min_docfreq = 2, verbose=TRUE)
## Removing features occurring:
## - in fewer than 2 documents: 22,178
## Total features removed: 22,178 (46.0%).
Once we have the DFM, we split it into training and test set. We’ll go with 80% training and 20% set. Note the use of a random seed to make sure our results are replicable.
set.seed(123)
# 80% random sample of rows
training <- sample(1:nrow(movies), floor(.80 * nrow(movies)))
# the remaining 20% will be test
test <- (1:nrow(movies))[1:nrow(movies) %in% training == FALSE]
Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best. It’s much faster than caret or mlr (in my experience at least), and it has cross-validation already built-in, so we don’t need to code it from scratch. We’ll start with a lasso regression:
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
lasso <- cv.glmnet(x=mdfm[training,], y=movies$pos[training],
family="binomial", alpha=1, nfolds=5, parallel=TRUE,
intercept=TRUE, type.measure="class")
## Warning: executing %dopar% sequentially: no parallel backend registered
plot(lasso)
We can now compute the performance metrics on the test set.
## function to compute performance metrics
performance <- function(ypred, y){
tab <- table(ypred, y)
accuracy <- (tab[1,1]+tab[2,2])/sum(tab)
precision <- (tab[2,2])/(tab[2,1]+tab[2,2])
recall <- tab[2,2]/(tab[1,2]+tab[2,2])
message("Accuracy = ", round(accuracy, 2), "\n",
"Precision = ", round(precision, 2), "\n",
"Recall = ", round(recall, 2))
}
# computing predicted values
preds <- predict(lasso, mdfm[test,], type="class")
# confusion matrix
table(preds, movies$pos[test])
##
## preds 0 1
## 0 164 37
## 1 31 168
# performance metrics for positive category
performance(preds, movies$pos[test])
## Accuracy = 0.83
## Precision = 0.84
## Recall = 0.82
# performance metrics for negative category
performance(preds==0, movies$pos[test]==0)
## Accuracy = 0.83
## Precision = 0.82
## Recall = 0.84
Something that is often very useful is to look at the actual estimated coefficients and see which of these have the highest or lowest values:
# from the different values of lambda, let's pick the highest one that is
# within one standard error of the best one (why? see "one-standard-error"
# rule -- maximizes parsimony)
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)
## plot : two teen couples go
## -0.02231017 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 1130 -0.62766216 ridiculous
## 503 -0.51222707 awful
## 328 -0.47048894 wasted
## 884 -0.44379135 boring
## 469 -0.42532017 poorly
## 8698 -0.38673896 forgot
## 50 -0.36910698 mess
## 2330 -0.36640677 worst
## 1136 -0.33863465 waste
## 10052 -0.30613024 degenerates
## 36 -0.29335604 bad
## 551 -0.29282454 dull
## 1290 -0.28158498 unfunny
## 2766 -0.25500535 bore
## 2383 -0.25385417 lame
## 1517 -0.24943734 unimaginative
## 1098 -0.23560863 supposed
## 4073 -0.23310665 ludicrous
## 5091 -0.20984502 sloppy
## 1305 -0.20085050 stupid
## 440 -0.19122127 unfortunately
## 5002 -0.18212467 lousy
## 629 -0.17399124 none
## 5079 -0.11888499 random
## 3979 -0.11874039 embarrassing
## 415 -0.11849907 nothing
## 2280 -0.10441423 poor
## 1941 -0.10020055 material
## 2106 -0.09840352 script
## 9346 -0.08378153 fares
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 7561 0.58025426 finest
## 18414 0.28356310 outstanding
## 8297 0.27920661 allows
## 4429 0.27452309 perfectly
## 7682 0.24709632 fantastic
## 6047 0.20161028 hilarious
## 487 0.19746314 memorable
## 10551 0.19298947 slip
## 4248 0.18864490 excellent
## 3391 0.17860220 flaws
## 12436 0.17458902 chilling
## 215 0.16907678 overall
## 7353 0.16519511 refreshing
## 2561 0.16205054 effective
## 1526 0.15424044 performances
## 1044 0.14077573 great
## 10176 0.12779893 wonderfully
## 10534 0.11935178 breathtaking
## 1727 0.11620400 fun
## 21 0.10700684 life
## 4317 0.10551512 war
## 2440 0.09587107 wonderful
## 179 0.08198933 different
## 5065 0.08147989 terrific
## 17510 0.07499402 naval
## 1482 0.07477818 true
## 12883 0.07473119 religion
## 240 0.06861690 also
## 786 0.06595291 quite
## 398 0.06214336 best
We can easily modify our code to experiment with Ridge or ElasticNet models:
ridge <- cv.glmnet(x=mdfm[training,], y=movies$pos[training],
family="binomial", alpha=0, nfolds=5, parallel=TRUE, intercept=TRUE,
type.measure="class")
# computing predicted values
preds <- predict(ridge, mdfm[test,], type="class")
# confusion matrix
table(preds, movies$pos[test])
##
## preds 0 1
## 0 159 61
## 1 36 144
# performance metrics (slightly worse)
performance(preds, movies$pos[test])
## Accuracy = 0.76
## Precision = 0.8
## Recall = 0.7
performance(preds==0, movies$pos[test]==0)
## Accuracy = 0.76
## Precision = 0.72
## Recall = 0.82
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)
## plot : two teen couples
## -2.963434e-03 -1.323379e-05 2.523335e-04 -2.289880e-03 -2.345417e-03
## go
## 5.152863e-04
## identifying predictive features - note that here
## none of the coefficients is zero
df <- data.frame(coef = as.numeric(beta),
word = names(beta), stringsAsFactors=F)
df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
## coef word
## 19136 -0.03224894 labour's
## 22529 -0.03208518 julius
## 10440 -0.03080366 obscenely
## 21746 -0.03073966 crudely
## 15294 -0.03072414 wild-eyed
## 12171 -0.03057219 ray's
## 7226 -0.03021453 logistics
## 19524 -0.03015526 enslaved
## 17469 -0.03006717 takeshi
## 15480 -0.02988309 sentient
## 18563 -0.02987381 hazel
## 18568 -0.02985666 unwillingness
## 12046 -0.02963506 terrorise
## 12064 -0.02962371 housemaid
## 12069 -0.02958821 desiree
## 12076 -0.02955843 talbot
## 12077 -0.02953857 rothwell
## 12081 -0.02952221 double-entendres
## 22256 -0.02943862 stinking
## 22257 -0.02942045 indestructible
## 6173 -0.02936606 targetted
## 11895 -0.02935359 condolences
## 19475 -0.02934545 mar
## 19190 -0.02931492 occurances
## 14965 -0.02927048 dehumanization
## 11321 -0.02925344 subtract
## 16413 -0.02921493 accomplices
## 3581 -0.02919908 founding
## 17898 -0.02918410 fistfights
## 16206 -0.02917443 ungodly
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
## coef word
## 24427 0.03517332 advisable
## 25498 0.03515558 coincides
## 18262 0.03513583 aplenty
## 25280 0.03509691 devane
## 20409 0.03507061 leeper
## 25540 0.03487283 cyborg
## 8659 0.03431398 hackers
## 19844 0.03425050 swooping
## 20533 0.03420042 scapegoat
## 25024 0.03419593 contractor
## 23161 0.03392933 gallons
## 22216 0.03391008 bigelow
## 25544 0.03390200 peddling
## 16055 0.03382480 detonating
## 17049 0.03340608 guitarist
## 1162 0.03327806 specialized
## 16065 0.03327544 amounted
## 20171 0.03321750 hugs
## 23112 0.03312815 delta
## 24198 0.03312370 bogart's
## 19473 0.03304250 finale's
## 20403 0.03302025 jun
## 25577 0.03301965 deceiving
## 23614 0.03295575 rumored
## 23669 0.03284718 complimentary
## 23612 0.03280681 neccessary
## 22759 0.03280587 wire-fu
## 25765 0.03267184 cartoon-like
## 5384 0.03263584 speedboat
## 22110 0.03262776 konner