While supervised learning is used when we know the categories we want to produce, unsupervised learning (including topic modeling) is used when we do not know the categories. In topic modeling, documents are not assumed to belong to a single topic or category; instead, each document simultaneously belongs to several topics, and the topic proportions vary across documents.
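To build intuition before fitting anything, here is a minimal sketch (not part of the original pipeline) of LDA's generative story: each document draws its own mixture over topics from a Dirichlet distribution.
# minimal sketch of LDA's generative intuition (illustration only):
# each document draws its own distribution over topics from a Dirichlet,
# sampled here via the gamma representation to avoid extra dependencies
set.seed(123)
rdirich <- function(n, alpha) {
  x <- matrix(rgamma(n * length(alpha), shape=alpha), nrow=n, byrow=TRUE)
  x / rowSums(x)
}
theta <- rdirich(3, alpha=rep(0.5, 4)) # 3 documents, 4 topics
round(theta, 2) # each row sums to 1: one topic mixture per document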
The workhorse function for the topic model is LDA, which stands for Latent Dirichlet Allocation, the technical name for this particular kind of model.
We will now use a dataset that contains the lead paragraph of around 5,000 articles about the economy published in the New York Times between 1980 and 2014. As before, we will preprocess the text using the standard set of techniques.
The number of topics in a topic model is somewhat arbitrary, so you need to experiment with it to see whether you get anything more meaningful; we will use cross-validation to compare different choices later on. We start here with 50 topics.
library(topicmodels)
# reading data and preparing corpus object
nyt <- read.csv("data/nytimes.csv", stringsAsFactors = FALSE)
library(quanteda)
nytcorpus <- corpus(nyt$lead_paragraph)
nytdfm <- dfm(nytcorpus, remove=stopwords("english"), verbose=TRUE,
              remove_punct=TRUE, remove_numbers=TRUE)
cdfm <- dfm_trim(nytdfm, min_docfreq = 2)
# we now export to a format that we can run the topic model with
dtm <- convert(cdfm, to="topicmodels")
# estimate LDA with K topics
K <- 50
lda <- LDA(dtm, k = K, method = "Gibbs",
           control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))
## K = 50; V = 21298; M = 5000
## Sampling 600 iterations!
## Iteration 25 ...
## Iteration 50 ...
## ...
## Iteration 575 ...
## Iteration 600 ...
## Gibbs sampling completed!
We can use get_terms to get the top n terms from the topic model, and get_topics to find the most likely topic for each document. This will help us interpret the results of the model.
terms <- get_terms(lda, 15)
terms[,5]
## [1] "business" "companies" "corporate" "investment"
## [5] "american" "executives" "industries" "capital"
## [9] "profits" "many" "pay" "earnings"
## [13] "large" "businesses" "corporations"
topics <- get_topics(lda, 1)
head(topics)
## text1 text2 text3 text4 text5 text6
## 25 23 26 11 41 34
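Before examining individual topics, a quick extra check (not in the original code) is to see how often each topic appears as a document's most likely topic, which can flag dominant or residual topics:
# frequency of each topic as the most likely topic (illustration only)
head(sort(table(topics), decreasing=TRUE))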
Let’s take a closer look at some of these topics. To help us interpret the output, we can look at the words associated with each topic and read a random sample of documents that are highly associated with it.
# Topic 5
paste(terms[,5], collapse=", ")
## [1] "business, companies, corporate, investment, american, executives, industries, capital, profits, many, pay, earnings, large, businesses, corporations"
sample(nyt$lead_paragraph[topics==5], 1)
## [1] "THE steep recession, which battered the American economy in the final months of the old year, poses the greatest problem for business in the new. And the recession also poses the greatest domestic challenge to President Reagan who, having led the nation into this slump, to his Administration's proclaimed surprise, now hopes to lead it out without reviving inflation or sending interest rates skyward again. For many businesses, the issue is not how to cope but how to survive. Fina ncial disarray and weak and declining demand have imposed a liq uidity squeeze on housing, the thrift industry, the autoindustry, rub ber, chemicals and many other old-line industries. Many companies caught in the liquidity squeeze are trying to escape it by cutting capital spending for new plant and equipment to conserve cash, cutting inventories to avoid heavy interest charges and laying off employees, both blue-collar and white-collar."
# Topic 9
paste(terms[,9], collapse=", ")
## [1] "americans, income, people, poverty, according, years, middle, personal, less, poor, study, living, class, incomes, bureau"
sample(nyt$lead_paragraph[topics==9], 1)
## [1] "To the Editor: William P. O'Hare's arguments on the measurement of poverty (''Poverty's Bottom Line,'' Op-Ed Aug. 19) are reasonable enough, but his recommendation is unnecessarily complex. I suggest that the poverty level be more simply defined as the lowest income decile. Besides simplification, this definition has three other advantages: It would immediately cut poverty by one-third; it would eliminate any job-security fears that welfare administrators and social workers might have, and it would satisfy the old description ''The poor, ye shall always have with ye.'' CLIFTON B. RODES Louisville, Ky., Aug. 19, 1983"
# Topic 14
paste(terms[,14], collapse=", ")
## [1] "mr, president, bush, obama, clinton, campaign, senator, republican, john, presidential, voters, speech, bush's, political, democratic"
sample(nyt$lead_paragraph[topics==14], 1)
## [1] "President Bush on Tuesday urged Saudi Arabia and other members of OPEC to consider the strain the high cost of oil was having on the American economy, addressing an issue that has begun to color the last year of his presidency and dominate the presidential election campaign. Speaking to a group of Saudi entrepreneurs and later to reporters, Mr. Bush expressed concern about the economy in some of his starkest language yet, saying that rising oil costs and gasoline prices were causing hardship for American families. He vowed to raise the issue with the Saudi leader, King Abdullah, during a meeting and dinner at the king's lush horse farm in the desert outside of Riyadh, the capital."
# Topic 16
paste(terms[,16], collapse=", ")
## [1] "jobs, unemployment, workers, labor, job, work, benefits, number, employment, force, statistics, people, added, lost, jobless"
sample(nyt$lead_paragraph[topics==16], 1)
## [1] "Cambridge, Mass. ON Thursday, President Obama will deliver a major speech on America's employment crisis. But too often, what is lost in the call for job creation is a clear idea of what jobs we want to create."
You will often find that some topics do not make much sense. They just capture the remaining clusters of words, and often correspond to stopwords. For example:
# Topic 1
paste(terms[,1], collapse=", ")
## [1] "end, america, often, book, recent, economics, made, seen, ideas, however, like, century, line, hand, wrong"
sample(nyt$lead_paragraph[topics==1], 1)
## [1] "For me, the most frightening news in The Times on Sunday was not about North Korea's stepping up its nuclear program, but an article about how American kids are stepping up their use of digital devices: ''Allison Miller, 14, sends and receives 27,000 texts in a month, her fingers clicking at a blistering pace as she carries on as many as seven text conversations at a time. She texts between classes, at the moment soccer practice ends, while being driven to and from school and, often, while studying. But this proficiency comes at a cost: She blames multitasking for the three B's on her recent progress report. ''I'll be reading a book for homework and I'll get a text message and pause my reading and put down the book, pick up the phone to reply to the text message, and then 20 minutes later realize, 'Oh, I forgot to do my homework.' '' I don't want to pick on Miller. I highlight her words only because they're integral to a much larger point: Our unemployment today is not only because of the financial crisis. There are some deeper problems. If we're going to get more Americans back to work, we will need more stimulus from the U.S.G. -- the U.S. government -- from the top down. But we will also need more stimulus from the P.T.A.'s -- the Parent Teacher Associations -- from the bottom up."
# Topic 4
paste(terms[,4], collapse=", ")
## [1] "just, days, day, took, around, decade, seemed, came, part, began, morning, two, office, come, announced"
sample(nyt$lead_paragraph[topics==4], 1)
## [1] "SHORTLY after the planes slammed into the World Trade Center and the Pentagon last week, a reporter called to ask, ''What effect will the terrorist attacks have on the economy?'' ''That's the least of it,'' I replied. How can one even begin to consider the economic costs before the human toll is known? The reporter sheepishly confessed that his assignment was to cover ''the least of it.'' Indeed, he himself was supposed to have been covering a conference in the World Trade Center that fateful morning and was lucky to have overslept and missed it. In the aftermath of the tragedies of Sept. 11, I find myself compelled to address ''the least of it'' as well."
When the documents have timestamps, looking at the evolution of certain topics over time can also help us interpret their meaning. Let’s look for example at Topic 41, which appears to be related to the stock market.
# Topic 41
paste(terms[,41], collapse=", ")
## [1] "market, stock, investors, wall, street, yesterday, stocks, markets, average, points, bond, dow, day, industrial, jones"
sample(nyt$lead_paragraph[topics==41], 1)
## [1] "The stock and bond markets rallied strongly yesterday, halting for now the sharpest decline in years, as some traders were encouraged by new signs that the economy was not growing as rapidly as had been thought. In a surge of buying and bargain hunting, traders bid up stock prices, lifting the Dow Jones industrial average 82.06 points, or 2.3 percent, to 3,675.41, the biggest one-day rally in more than two years."
# add predicted topic to dataset
nyt$pred_topic <- topics
nyt$year <- substr(nyt$datetime, 1, 4) # extract year
# frequency table with articles about stock market, per year
tab <- table(nyt$year[nyt$pred_topic==41])
plot(tab)
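Note that raw counts can be misleading if the total number of articles varies across years. As a hedged extension (not in the original code), we can plot the proportion of each year's articles assigned to this topic instead:
# proportion of each year's articles assigned to topic 41 (illustration only)
years <- as.numeric(names(tab))
prop <- as.numeric(tab) / as.numeric(table(nyt$year)[names(tab)])
plot(years, prop, type="l", xlab="Year",
     ylab="Prop. of articles assigned to topic 41")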
But we can actually do better than this. LDA is a probabilistic model, which means that for each document it computes a distribution over topics. In other words, each document is considered to be a mixture of topics.
This information is included in the matrix gamma in the LDA object (theta in the notation we used in the slides). For example, article 1 is 16% about topic 25, 8% about topic 31, 6% about topic 9, and less than 5% about each of the rest.
round(lda@gamma[1,], 2)
## [1] 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.02 0.06 0.01 0.02 0.02 0.01 0.01
## [15] 0.01 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.16 0.01 0.01 0.01
## [29] 0.02 0.01 0.08 0.01 0.01 0.01 0.02 0.01 0.01 0.02 0.02 0.01 0.01 0.01
## [43] 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.01
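As a quick sanity check (not shown in the original output), each row of gamma is a probability distribution over the K topics, so it should sum to one, and its largest entries should match the topics mentioned above:
# each row of gamma sums to 1 (sanity check, illustration only)
sum(lda@gamma[1,])
# the three largest probabilities for article 1
head(sort(lda@gamma[1,], decreasing=TRUE), 3)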
So we can take the information in this matrix and aggregate it to compute the average probability that an article published in a given year is about a particular topic. Let’s now choose Topic 15, which appears to be related to the financial crisis.
# Topic 15
paste(terms[,15], collapse=", ")
## [1] "financial, crisis, banks, credit, markets, debt, bank, system, mortgage, enough, loans, biggest, bankers, firms, bailout"
# add probability to df
nyt$prob_topic_15 <- lda@gamma[,15]
# now aggregate at the year level
agg <- aggregate(nyt$prob_topic_15, by=list(year=nyt$year), FUN=mean)
# and plot it
plot(agg$year, agg$x, type="l", xlab="Year", ylab="Avg. prob. of article about topic 15",
main="Estimated proportion of articles about the financial crisis")
This is the code to generate the figure in the slides. Many moving parts here…
# install.packages("cvTools")
require(cvTools)
## Loading required package: cvTools
## Loading required package: lattice
## Loading required package: robustbase
cvLDA <- function(Ntopics, dtm, K=10) {
  # K-fold cross-validation: fit LDA on the training folds, then compute
  # the perplexity on the held-out fold (and store the training log-likelihood)
  folds <- cvFolds(nrow(dtm), K, 1)
  perplex <- rep(NA, K)
  llk <- rep(NA, K)
  for (i in unique(folds$which)) {
    cat(i, " ")
    which.test <- folds$subsets[folds$which == i]
    which.train <- (1:nrow(dtm))[-which.test]
    dtm.train <- dtm[which.train, ]
    dtm.test <- dtm[which.test, ]
    lda.fit <- LDA(dtm.train, k=Ntopics, method="Gibbs",
                   control=list(verbose=50L, iter=100))
    perplex[i] <- perplexity(lda.fit, dtm.test)
    llk[i] <- logLik(lda.fit)
  }
  return(list(K=Ntopics, perplexity=perplex, logLik=llk))
}
K <- c(20, 30, 40, 50, 60, 70, 80)
results <- list()
i <- 1
for (k in K){
  cat("\n\n\n##########\n ", k, "topics", "\n")
  res <- cvLDA(k, dtm)
  results[[i]] <- res
  i <- i + 1
}
##
## ##########
##  20 topics
## 1 K = 20; V = 21298; M = 4500
## Sampling 100 iterations!
## Iteration 50 ...
## Iteration 100 ...
## Gibbs sampling completed!
## K = 20; V = 21298; M = 500
## Sampling 100 iterations!
## Iteration 50 ...
## Iteration 100 ...
## Gibbs sampling completed!
## (analogous output for folds 2-10 and for K = 30, 40, 50, 60, 70, 80 omitted)
# plot
df <- data.frame(
  k = rep(K, each=10),
  perp = unlist(lapply(results, '[[', 'perplexity')),
  loglk = unlist(lapply(results, '[[', 'logLik')),
  stringsAsFactors=F)
min(df$perp)
## [1] 2424.736
df$ratio_perp <- df$perp / max(df$perp)
df$ratio_lk <- df$loglk / min(df$loglk)
df <- data.frame(cbind(
  aggregate(df$ratio_perp, by=list(df$k), FUN=mean),
  aggregate(df$ratio_perp, by=list(df$k), FUN=sd)$x,
  aggregate(df$ratio_lk, by=list(df$k), FUN=mean)$x,
  aggregate(df$ratio_lk, by=list(df$k), FUN=sd)$x),
  stringsAsFactors=F)
names(df) <- c("k", "ratio_perp", "sd_perp", "ratio_lk", "sd_lk")
library(reshape)
pd <- melt(df[,c("k","ratio_perp", "ratio_lk")], id.vars="k")
pd2 <- melt(df[,c("k","sd_perp", "sd_lk")], id.vars="k")
pd$sd <- pd2$value
levels(pd$variable) <- c("Perplexity", "LogLikelihood")
library(ggplot2)
library(grid)
p <- ggplot(pd, aes(x=k, y=value, linetype=variable))
pq <- p + geom_line() +
  geom_point(aes(shape=variable), fill="white", shape=21, size=1.40) +
  geom_errorbar(aes(ymax=value+sd, ymin=value-sd), width=4) +
  scale_y_continuous("Ratio wrt worst value") +
  scale_x_continuous("Number of topics", breaks=K) +
  theme_bw()
pq
To explore an unsupervised approach to ideological scaling, let’s come back to our previous example of tweets by Members of Congress. Can we recover a latent ideological dimension based on the text of their tweets?
cong <- read.csv("data/congress-tweets.csv", stringsAsFactors=F)
# creating the corpus and dfm objects
ccorpus <- corpus(cong$text)
docnames(ccorpus) <- cong$screen_name
cdfm <- dfm(ccorpus, remove=c(stopwords("english"),
            "t.co", "https", "rt", "amp", "http", "t.c", "can"),
            remove_punct=TRUE, remove_numbers=TRUE, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 100 documents, 182,525 features
## ...
## dfm_select removed 179 features and 0 documents, padding 0s for 0 features and 0 documents.
## ... created a 100 x 182,333 sparse dfm
## ... complete.
## Elapsed time: 0.719 seconds.
# note heavy feature selection!
cdfm <- dfm_trim(cdfm, min_docfreq = 25)
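# a quick check (not in the original): how many documents and features
# survive the heavy trimming above?
dim(cdfm)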
# running wordfish
wf <- textmodel(cdfm, dir=c(10, 8), model="wordfish")
wf
## Fitted wordfish model:
## Call:
## textmodel_wordfish(x = x, dir = ..1)
##
## Estimated document positions:
##
## Documents theta SE lower upper
## 1 usreprodney -0.426124133 0.076108127 -0.57529606 -0.27695220
## 2 reprichmond -0.548302711 0.034840093 -0.61658929 -0.48001613
## 3 rephanabusa -0.130297388 0.039962486 -0.20862386 -0.05197092
## 4 repstevestivers 0.883539497 0.022766048 0.83891804 0.92816095
## 5 chrisvanhollen -0.455747072 0.034913891 -0.52417830 -0.38731585
## 6 marshablackburn 1.145505361 0.016321301 1.11351561 1.17749511
## 7 repjohnlewis -1.055334478 0.035426601 -1.12477062 -0.98589834
## 8 repstutzman 1.206978600 0.025364172 1.15726482 1.25669238
## 9 replipinski -0.260334919 0.057378270 -0.37279633 -0.14787351
## 10 jacksonleetx18 -0.891611786 0.027327623 -0.94517393 -0.83804965
## 11 repveasey -1.138069027 0.041005078 -1.21843898 -1.05769907
## 12 senatorisakson 0.375410448 0.043225922 0.29068764 0.46013326
## 13 repmurphyfl -0.893571385 0.041246546 -0.97441462 -0.81272815
## 14 replankford 1.478047098 0.013112687 1.45234623 1.50374797
## 15 juliabrownley26 -1.290442279 0.046946462 -1.38245734 -1.19842721
## 16 repterrisewell -0.950526323 0.033240170 -1.01567706 -0.88537559
## 17 repsusandavis -1.260017504 0.041850609 -1.34204470 -1.17799031
## 18 senatorbaldwin -1.141657387 0.024933110 -1.19052628 -1.09278849
## 19 reploisfrankel -1.412051374 0.033567654 -1.47784398 -1.34625877
## 20 repmarktakano -0.564450155 0.043030859 -0.64879064 -0.48010967
## 21 repjohnsarbanes -1.090940294 0.055986995 -1.20067480 -0.98120578
## 22 vancemcallister 0.126319170 0.129149016 -0.12681290 0.37945124
## 23 reptimmurphy 0.832151864 0.023795186 0.78551330 0.87879043
## 24 senatorwicker 0.615973497 0.025067905 0.56684040 0.66510659
## 25 repgwenmoore -1.286185435 0.019236729 -1.32388942 -1.24848145
## 26 gracenapolitano -0.944060071 0.040630130 -1.02369513 -0.86442502
## 27 senatorleahy -1.034499837 0.020896004 -1.07545600 -0.99354367
## 28 senmikelee 1.143885410 0.015017299 1.11445150 1.17331932
## 29 repdonnaedwards -0.961523156 0.024075897 -1.00871191 -0.91433440
## 30 repcummings -0.167601968 0.034105896 -0.23444952 -0.10075441
## 31 sentoomey 0.633776131 0.019623200 0.59531466 0.67223760
## 32 replindasanchez -1.275276797 0.030019549 -1.33411511 -1.21643848
## 33 repmickmulvaney 1.197141335 0.024601568 1.14892226 1.24536041
## 34 senatorreid -0.569640265 0.020711617 -0.61023503 -0.52904549
## 35 blumenauermedia 0.331738284 0.061451671 0.21129301 0.45218356
## 36 darrellissa 1.696121586 0.009600592 1.67730443 1.71493875
## 37 reptomprice 1.440636580 0.010582005 1.41989585 1.46137731
## 38 repcartwright -0.673807970 0.031087970 -0.73474039 -0.61287555
## 39 jefffortenberry 0.867274201 0.024837471 0.81859276 0.91595564
## 40 repcardenas -0.907145731 0.024882986 -0.95591638 -0.85837508
## 41 repjohnconyers -0.960955490 0.035803546 -1.03113044 -0.89078054
## 42 senorrinhatch 1.080661123 0.019506398 1.04242858 1.11889366
## 43 repwalberg 1.373107406 0.018046083 1.33773708 1.40847773
## 44 repraulgrijalva -0.270526256 0.026253319 -0.32198276 -0.21906975
## 45 senatorcollins -0.220175484 0.035496364 -0.28974836 -0.15060261
## 46 howardcoble 0.554092945 0.043482817 0.46886662 0.63931927
## 47 repkevincramer 0.797122902 0.037076095 0.72445376 0.86979205
## 48 rephultgren 0.792875401 0.020064507 0.75354897 0.83220183
## 49 sendonnelly -0.453100365 0.031010895 -0.51388172 -0.39231901
## 50 drphilroe 1.146751634 0.021805261 1.10401332 1.18948995
## 51 johnboozman 0.769705634 0.024997152 0.72071122 0.81870005
## 52 greggharper 1.073561856 0.017602255 1.03906144 1.10806228
## 53 repjaredpolis -1.025949305 0.038355360 -1.10112581 -0.95077280
## 54 kencalvert 1.231657224 0.014690019 1.20286479 1.26044966
## 55 repderekkilmer -0.813649702 0.054917730 -0.92128845 -0.70601095
## 56 jasoninthehouse 1.372412699 0.013863603 1.34524004 1.39958536
## 57 senatorcardin -1.224528203 0.019387436 -1.26252758 -1.18652883
## 58 senbillnelson 0.280439522 0.046905326 0.18850508 0.37237396
## 59 toddrokita 0.931270636 0.016291225 0.89933983 0.96320144
## 60 repgarrett 1.186863516 0.034411106 1.11941775 1.25430928
## 61 buckmckeon 1.196273193 0.019726934 1.15760840 1.23493798
## 62 replarrybucshon 1.204912859 0.014273291 1.17693721 1.23288851
## 63 repdinatitus -1.368530166 0.024615984 -1.41677750 -1.32028284
## 64 mariodb 0.098971478 0.026679976 0.04667872 0.15126423
## 65 repgarypeters -1.110375955 0.029909034 -1.16899766 -1.05175425
## 66 repbrianhiggins -0.916847829 0.022566531 -0.96107823 -0.87261743
## 67 repvisclosky -0.311605772 0.034370473 -0.37897190 -0.24423965
## 68 repdavid 0.922041920 0.016697795 0.88931424 0.95476960
## 69 billowensny 0.205960882 0.025904370 0.15518832 0.25673345
## 70 reptipton 0.914725350 0.026256360 0.86326288 0.96618781
## 71 repthomasmassie 0.990004590 0.026208750 0.93863544 1.04137374
## 72 repstephenlynch -0.317879238 0.046056259 -0.40814951 -0.22760897
## 73 repmikemichaud -0.002881386 0.045721135 -0.09249481 0.08673204
## 74 robert_aderholt 1.089802448 0.024464723 1.04185159 1.13775330
## 75 senatorhagan -0.724802264 0.021776046 -0.76748331 -0.68212121
## 76 repmikecoffman 1.234157587 0.018042675 1.19879394 1.26952123
## 77 repkaygranger 0.918620990 0.025547309 0.86854826 0.96869372
## 78 repsandylevin -0.682259657 0.031711180 -0.74441357 -0.62010574
## 79 repfitzpatrick 0.266732040 0.019828800 0.22786759 0.30559649
## 80 repandybarr 0.675198624 0.024482776 0.62721238 0.72318486
## 81 repchriscollins 1.011163145 0.018571672 0.97476267 1.04756362
## 82 waxmanclimate -3.422404806 0.003516834 -3.42929780 -3.41551181
## 83 reptomgraves 1.230606116 0.015676924 1.19987934 1.26133289
## 84 repmullin 0.984745619 0.026829964 0.93215889 1.03733235
## 85 senblumenthal -1.160633749 0.020936240 -1.20166878 -1.11959872
## 86 reptimgriffin 1.130330871 0.014822558 1.10127866 1.15938308
## 87 repbobbyscott -0.448188815 0.038291817 -0.52324078 -0.37313685
## 88 cbrangel -0.938018772 0.022632294 -0.98237807 -0.89365947
## 89 repmarkpocan -0.978710060 0.029754455 -1.03702879 -0.92039133
## 90 repjoecrowley -1.263367033 0.030592837 -1.32332899 -1.20340507
## 91 benniegthompson -0.027375959 0.065576478 -0.15590586 0.10115394
## 92 jimlangevin -1.005576593 0.024563816 -1.05372167 -0.95743151
## 93 repsamfarr -0.824325640 0.037380292 -0.89759101 -0.75106027
## 94 jimpressoffice 1.150565099 0.063672941 1.02576613 1.27536406
## 95 repduckworth -0.593662457 0.041908539 -0.67580319 -0.51152172
## 96 judgecarter 1.314609000 0.012435863 1.29023471 1.33898329
## 97 repdanmaffei -0.334249149 0.049757720 -0.43177428 -0.23672402
## 98 repkarenbass -0.741376355 0.028225149 -0.79669765 -0.68605506
## 99 senatorharkin -0.687621948 0.025185975 -0.73698646 -0.63825744
## 100 repandyharrismd 1.054354483 0.024020001 1.00727528 1.10143369
##
## Estimated feature scores: showing first 30 beta-hats for features
##
## today will w house great bill
## -0.03491423 0.07314492 -0.22382823 0.24866013 -0.05227409 0.01502278
## thanks act new now bit.ly day
## 0.03734939 -0.13553328 -0.06716741 -0.09353060 0.50125026 -0.07769135
## us jobs work health time support
## -0.18287372 0.01663018 -0.24379526 -0.01749536 -0.12278660 -0.15145292
## watch live senate just need thank
## 0.11606432 0.27016978 0.01819367 0.18916267 -0.16116420 -0.10678310
## congress help budget president vote americans
## -0.07537493 -0.19394215 0.53347016 0.29168371 -0.08675822 -0.07394686
# let's look at the most discriminant words (note same notation as in slides)
sw <- data.frame(beta=wf@beta, word=wf@features)
sw <- sw[order(sw$beta),]
head(sw, n=20)
## beta word
## 653 -4.5400662 #climatechange
## 792 -3.2939157 #actonclimate
## 2772 -2.2112347 degrees
## 1924 -1.2130178 pollution
## 1921 -0.8781342 carbon
## 2897 -0.8734829 @lcvoters
## 837 -0.7865659 climate
## 3007 -0.7479897 normal
## 3349 -0.7363159 emissions
## 3059 -0.7151820 @repgaramendi
## 2605 -0.6882437 @janschakowsky
## 1916 -0.6483134 @epa
## 1822 -0.6469373 impacts
## 2221 -0.6452384 @keithellison
## 1381 -0.6293128 #womensucceed
## 268 -0.6214409 ben
## 1278 -0.6046273 #sandy
## 1143 -0.6025320 #lgbt
## 2103 -0.6006890 @repbarbaralee
## 352 -0.5967864 #vawa
tail(sw, n=20)
## beta word
## 119 1.526579 #4jobs
## 2724 1.530607 documents
## 1944 1.569117 #trainwreck
## 369 1.581810 #benghazi
## 1889 1.583890 #fairnessforall
## 3532 1.600301 comply
## 704 1.602333 j.mp
## 447 1.629601 @gopconference
## 2306 1.631309 contempt
## 2685 1.680262 bureaucrats
## 805 1.698821 @gopwhip
## 2450 1.704245 house-passed
## 1050 1.754599 @waysandmeansgop
## 238 1.904688 #irs
## 1180 2.104537 @republicanstudy
## 1043 2.546412 @reptomprice
## 1540 2.582810 emails
## 1500 2.980649 lois
## 382 3.281559 @darrellissa
## 373 4.739343 @gopoversight
# and now we can compare the estimated positions with the ideal points...
plot(wf@theta, cong$idealPoint)
cor(wf@theta, cong$idealPoint)
## [1] 0.8697747
cor(wf@theta[cong$party=="R"], cong$idealPoint[cong$party=="R"])
## [1] 0.4302438
cor(wf@theta[cong$party=="D"], cong$idealPoint[cong$party=="D"])
## [1] 0.3097333
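The overall correlation is high, but note that it is driven mostly by the gap between the two parties: within each party, the correlations are much weaker. A small sketch (not in the original code, assuming the "D"/"R" labels in cong$party used above) makes this visible by coloring the scatterplot by party:
# scatterplot of wordfish estimates vs. ideal points, colored by party
# (illustration only)
plot(wf@theta, cong$idealPoint, pch=19,
     col=ifelse(cong$party == "R", "red3", "dodgerblue"),
     xlab="Wordfish estimates", ylab="Ideal points")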