### Regularized regression

To learn how to do supervised machine learning applied to social media text, we will use a random sample of nearly 5,000 tweets mentioning the names of the candidates in the 2014 EP elections in the UK. We will be analyzing the variable named `communication`, which indicates whether each tweet was hand-coded as __engaging__ (a tweet that tries to engage with the audience of the account) or __broadcasting__ (just sending a message, without trying to elicit a response).

The source of the dataset is an article co-authored with Yannis Theocharis, Zoltan Fazekas, and Sebastian Popa, published in the Journal of Communication. The link is [here](http://onlinelibrary.wiley.com/doi/10.1111/jcom.12259/abstract). Our goal was to understand to what extent candidates are not engaging voters on Twitter because they're exposed to mostly impolite messages.

Let's start by reading the dataset and creating a dummy variable indicating whether each tweet is engaging.

```{r}
library(quanteda)
tweets <- read.csv("../data/UK-tweets.csv", stringsAsFactors=F)
tweets$engaging <- ifelse(tweets$communication=="engaging", 1, 0)
tweets <- tweets[!is.na(tweets$engaging),]
```

We'll do some cleaning as well -- substituting user handles with a generic `@`. Why? We want to prevent overfitting to specific usernames.

```{r}
tweets$text <- gsub('@[0-9_A-Za-z]+', '@', tweets$text)
```

As we discussed earlier today, before we can do any type of automated text analysis, we need to go through several "preprocessing" steps before the text can be passed to a statistical model. We'll use the quanteda package here.

```{r}
twcorpus <- corpus(tweets$text)
summary(twcorpus)
```

We can then convert the corpus into a document-feature matrix using the `dfm` function. We then trim it in order to keep only tokens that appear in two or more tweets. Note that we keep punctuation -- it turns out it can be quite informative.

```{r}
twdfm <- dfm(twcorpus, remove=stopwords("english"), remove_url=TRUE,
             ngrams=1:2, verbose=TRUE)
twdfm <- dfm_trim(twdfm, min_docfreq = 2, verbose=TRUE)
```

Note that other preprocessing options are:

- remove_numbers
- remove_punct
- remove_twitter
- remove_symbols
- remove_separators

You can read more in the `dfm` and `tokens` help pages.

Once we have the DFM, we split it into a training and a test set. We'll go with 80% for training and 20% for testing. Note the use of a random seed to make sure our results are replicable.

```{r}
set.seed(123)
training <- sample(1:nrow(tweets), floor(.80 * nrow(tweets)))
test <- (1:nrow(tweets))[1:nrow(tweets) %in% training == FALSE]
```

Our first step is to train the classifier using cross-validation. There are many packages in R to run machine learning models. For regularized regression, glmnet is in my opinion the best. It's much faster than caret or mlr (in my experience, at least), and it has cross-validation already built in, so we don't need to code it from scratch. We'll start with a ridge regression:

```{r}
library(glmnet)
require(doMC)
registerDoMC(cores=3)
ridge <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                   family="binomial", alpha=0, nfolds=5, parallel=TRUE,
                   intercept=TRUE, type.measure="class")
plot(ridge)
```
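Before evaluating the model, it can be helpful to look at the values of lambda that cross-validation selected. A minimal sketch, using the fields that `cv.glmnet` returns: `lambda.min` is the value that minimizes the cross-validated error, and `lambda.1se` is the largest value within one standard error of it.

```{r}
# lambda minimizing cross-validated misclassification error
ridge$lambda.min
# largest lambda within one standard error of the minimum (more parsimonious)
ridge$lambda.1se
```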
We can now compute the performance metrics on the test set.

```{r}
## function to compute accuracy
accuracy <- function(ypred, y){
  tab <- table(ypred, y)
  return(sum(diag(tab))/sum(tab))
}
# function to compute precision
precision <- function(ypred, y){
  tab <- table(ypred, y)
  return((tab[2,2])/(tab[2,1]+tab[2,2]))
}
# function to compute recall
recall <- function(ypred, y){
  tab <- table(ypred, y)
  return(tab[2,2]/(tab[1,2]+tab[2,2]))
}
# computing predicted values
preds <- predict(ridge, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
# performance metrics
accuracy(preds, tweets$engaging[test])
precision(preds==1, tweets$engaging[test]==1)
recall(preds==1, tweets$engaging[test]==1)
precision(preds==0, tweets$engaging[test]==0)
recall(preds==0, tweets$engaging[test]==0)
```

Something that is often very useful is to look at the actual estimated coefficients and see which of them have the highest or lowest values:

```{r}
# from the different values of lambda, let's pick the largest one that is
# within one standard error of the best one (why? see the "one-standard-error"
# rule -- it maximizes parsimony)
best.lambda <- which(ridge$lambda==ridge$lambda.1se)
beta <- ridge$glmnet.fit$beta[,best.lambda]
head(beta)

## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                 word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
paste(df$word[1:30], collapse=", ")
```

We can easily modify our code to experiment with lasso or elastic net models:

```{r}
lasso <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                   family="binomial", alpha=1, nfolds=5, parallel=TRUE,
                   intercept=TRUE, type.measure="class")
```

```{r}
# computing predicted values
preds <- predict(lasso, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
# performance metrics (slightly better!)
accuracy(preds, tweets$engaging[test])
precision(preds==1, tweets$engaging[test]==1)
recall(preds==1, tweets$engaging[test]==1)
precision(preds==0, tweets$engaging[test]==0)
recall(preds==0, tweets$engaging[test]==0)
```

```{r}
best.lambda <- which(lasso$lambda==lasso$lambda.1se)
beta <- lasso$glmnet.fit$beta[,best.lambda]
head(beta)

## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                 word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
```

We now see that the coefficients for some features actually became zero.
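As a quick check (a minimal sketch reusing the `beta` vector extracted above), we can count how many features the lasso dropped entirely:

```{r}
# number and share of features whose lasso coefficient is exactly zero
sum(beta == 0)
mean(beta == 0)
# number of features the lasso retains
sum(beta != 0)
```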
Finally, we can fit an elastic net model, which combines the ridge and lasso penalties (here with alpha = 0.50):

```{r}
enet <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                  family="binomial", alpha=0.50, nfolds=5, parallel=TRUE,
                  intercept=TRUE, type.measure="class")
# NOTE: this will not cross-validate across values of alpha

# computing predicted values
preds <- predict(enet, twdfm[test,], type="class")
# confusion matrix
table(preds, tweets$engaging[test])
# performance metrics
accuracy(preds, tweets$engaging[test])
precision(preds==1, tweets$engaging[test]==1)
recall(preds==1, tweets$engaging[test]==1)
precision(preds==0, tweets$engaging[test]==0)
recall(preds==0, tweets$engaging[test]==0)

best.lambda <- which(enet$lambda==enet$lambda.1se)
beta <- enet$glmnet.fit$beta[,best.lambda]
head(beta)

## identifying predictive features
df <- data.frame(coef = as.numeric(beta),
                 word = names(beta), stringsAsFactors=F)

df <- df[order(df$coef),]
head(df[,c("coef", "word")], n=30)
df <- df[order(df$coef, decreasing=TRUE),]
head(df[,c("coef", "word")], n=30)
```
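As the note in the chunk above points out, `cv.glmnet` only cross-validates over lambda for a fixed alpha. A minimal sketch of how one could also compare values of alpha (the grid below is arbitrary, and we fix the fold assignments with the `foldid` argument so that every value of alpha is evaluated on the same splits):

```{r}
# fix the cross-validation folds so they are identical across fits
set.seed(123)
foldid <- sample(rep(1:5, length.out = length(training)))
alphas <- c(0, 0.25, 0.50, 0.75, 1)
cv.errors <- sapply(alphas, function(a){
  fit <- cv.glmnet(twdfm[training,], tweets$engaging[training],
                   family="binomial", alpha=a, foldid=foldid, parallel=TRUE,
                   intercept=TRUE, type.measure="class")
  # cross-validated misclassification error at lambda.1se
  fit$cvm[fit$lambda == fit$lambda.1se]
})
data.frame(alpha = alphas, cv.error = cv.errors)
```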