While supervised learning is used when we know the categories we want to produce, unsupervised learning (including topic modeling) is used when we do not know the categories. In topic modeling, documents are not assumed to belong to a single topic or category; instead, each document is treated as a mixture of several topics, and the topic proportions vary from document to document.
The workhorse function for the topic model is LDA, which stands for Latent Dirichlet Allocation, the technical name for this particular kind of model.
We will now use a dataset that contains the lead paragraph of around 5,000 articles about the economy published in the New York Times between 1980 and 2014. As before, we will preprocess the text using the standard set of techniques.
The number of topics in a topic model is somewhat arbitrary, so you need to experiment with different values of K to see whether the results become more interpretable (one way to compare candidate values is sketched after the estimation output below). We start here with 30 topics.
library(topicmodels)
library(quanteda)
# reading data and converting to DFM
nyt <- read.csv("../data/nytimes.csv", stringsAsFactors = FALSE)
nyt_corpus <- corpus(nyt, text_field = "lead_paragraph")
toks <- tokens(nyt_corpus,
               remove_punct = TRUE, remove_numbers = TRUE)
nyt_dfm <- dfm(toks)
nyt_dfm <- dfm_remove(nyt_dfm, stopwords("english"))
cdfm <- dfm_trim(nyt_dfm, min_docfreq = 2)
# estimate LDA with K topics
K <- 30
lda <- LDA(cdfm, k = K, method = "Gibbs",
           control = list(verbose = 25L, seed = 123,
                          burnin = 100, iter = 500))
## K = 30; V = 11143; M = 5000
## Sampling 600 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Iteration 525 ...
## Iteration 550 ...
## Iteration 575 ...
## Iteration 600 ...
## Gibbs sampling completed!
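Before moving on, a note on choosing K. One common way to compare candidate values is to hold out some documents and compute their perplexity under models estimated with different numbers of topics. The sketch below is an addition, not part of the original code: the 80/20 split, the candidate values of K, and the short chains are illustrative assumptions, and lower held-out perplexity only suggests, not guarantees, a more useful model.
# sketch: compare a few values of K by held-out perplexity (lower is better)
set.seed(123)
train_idx <- sample(seq_len(ndoc(cdfm)), floor(0.8 * ndoc(cdfm)))
dfm_train <- cdfm[train_idx, ]
dfm_test <- convert(cdfm[-train_idx, ], to = "topicmodels")
candidate_K <- c(10, 20, 30, 40)
perp <- sapply(candidate_K, function(k) {
  fit <- LDA(dfm_train, k = k, method = "Gibbs",
             control = list(seed = 123, burnin = 100, iter = 500))
  perplexity(fit, newdata = dfm_test)
})
plot(candidate_K, perp, type = "b",
     xlab = "Number of topics", ylab = "Held-out perplexity")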
We can use get_terms to extract the top n terms from each topic, and get_topics to obtain the most likely topic for each document. This will help us interpret the results of the model.
terms <- get_terms(lda, 15)
terms[,1]
## [1] "economic" "government" "policy" "can" "public"
## [6] "washington" "take" "important" "must" "problems"
## [11] "current" "power" "issues" "much" "put"
topics <- get_topics(lda, 1)
head(topics)
## text1 text2 text3 text4 text5 text6
## 27 5 15 3 25 21
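Because get_topics returns the single most likely topic for each document, tabulating its output gives a quick sense of how prevalent each topic is in the corpus. This small snippet is an addition and its output is not shown in the original:
# share of documents assigned to each topic, largest first
topic_shares <- sort(prop.table(table(topics)), decreasing = TRUE)
round(head(topic_shares, 5), 3)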
Let’s take a closer look at some of these topics. To help us interpret the output, we can look at the words associated with each topic and read a random sample of documents that are highly associated with it.
# Topic 2
paste(terms[,2], collapse=", ")
## [1] "tax, budget, spending, cuts, cut, taxes, fiscal, deficit, plan, deficits, military, congress, administration, federal, security"
sample(nyt$lead_paragraph[topics==2], 1)
## [1] "Republican negotiators from Congress and the White House agreed today to seek $1.35 trillion in tax cuts over the next 11 years as part of a compromise budget, less than what President Bush proposed but still by far the largest tax reduction in the last 20 years. The negotiators remained deadlocked today over spending, the other component of the budget plan. Senators want to permit much more to be spent on domestic programs next year than the president and leaders in the House are willing to accept."
# Topic 3
paste(terms[,3], collapse=", ")
## [1] "jobs, unemployment, workers, labor, job, work, americans, employment, statistics, people, lost, added, benefits, force, millions"
sample(nyt$lead_paragraph[topics==3], 1)
## [1] "SCHAUMBURG, Ill. -- After hemorrhaging jobs for months, the economy is finally starting to add them. Yet the unemployment rate is not really budging because of people like Regina Myles. Ms. Myles, 51, has been out of work for three years. After a grueling job search yielded 150 interviews but no offers, she simply stopped looking last fall. Then this spring, with a $3,000 government-funded grant to help pay for a training course at a local beauty school in this Chicago suburb, she began applying for jobs online and in stores again."
# Topic 4
paste(terms[,4], collapse=", ")
## [1] "percent, rate, quarter, first, year, economy, second, annual, growth, today, third, government, grew, productivity, expected"
sample(nyt$lead_paragraph[topics==4], 1)
## [1] "Businesses and households went on a shopping spree this spring, causing warehouse stockpiles to shrink and setting the stage for faster economic growth during the rest of the year. The spending helped extend a solid if unspectacular expansion in the second quarter, when the economy grew at an annual rate of 3.4 percent, the Commerce Department reported yesterday, down from 3.8 percent in the first quarter. Companies took advantage of the weaker dollar to ship more goods abroad, while imports reversed course and fell slightly. It was the first time since 1991 that exports rose and imports dropped in the same quarter."
# Topic 14
paste(terms[,14], collapse=", ")
## [1] "economic, growth, economy, recovery, economists, said, recession, strong, recent, high, signs, still, forecast, next, evidence"
sample(nyt$lead_paragraph[topics==14], 1)
## [1] "The Asian crisis may alarm Alan Greenspan and assorted other economists, but to the American people the economic outlook could not be rosier. That is what they tell the pollsters, who put consumer confidence at virtually the highest level in 30 years. More important, consumers act on that belief. Some economists thought the second-quarter numbers would show a decline in the American economy. But instead, the estimates issued by the Government last Friday showed continued growth, albeit at a reduced level. That was because Americans -- both individuals and businesses -- are buying at a very high, and ultimately unsustainable, pace. Adjusted for inflation, purchases by Americans are up 5.6 percent over the past year. Such rapid growth would be expected if the economy were coming out of a recession and pent-up demand were bursting forth. But this is the eighth year of economic expansion."
You will find that some topics often do not make much sense. They just capture the remaining clusters of words, and often correspond to stopwords. For example:
# Topic 12
paste(terms[,12], collapse=", ")
## [1] "people, can, get, one, going, just, right, way, think, now, us, things, know, something, question"
sample(nyt$lead_paragraph[topics==12], 1)
## [1] "ON a traditional farm, when winter comes and there's no need for planting, fertilizing or harvesting, it's time for infrastructure projects. Farmers fix their barns, build fences or dig wells -- important tasks that could be done in any season if there weren't more pressing jobs to do. If the winter is unusually long and cold, planting time is delayed and additional projects are undertaken. It's all very simple and sensible: the idea is not to let people sit around idle, and to use down time to get important things done."
# Topic 19
paste(terms[,19], collapse=", ")
## [1] "two, years, time, even, now, first, weeks, less, likely, past, next, become, long, days, say"
sample(nyt$lead_paragraph[topics==19], 1)
## [1] "To the Editor: Doris Lessing ends \"Language and the Lunatic Fringe\" (Op-Ed, June 26) with a warning that those in former Communist countries now search, \"perhaps not even knowing it,\" for another dogma. They need look no further than the United States. While lavishing praise upon ourselves for having prevailed over Communism, we too have become enslaved by catch phrases and empty slogans -- and, not surprisingly, few of us realize it. I offer a few examples of some of the dead language of orthodox capitalism:"
When documents have timestamps, looking at the evolution of certain topics over time can also help us interpret their meaning. Let’s look, for example, at Topic 25, which appears to be related to the stock market.
# Topic 25
paste(terms[,25], collapse=", ")
## [1] "market, stock, yesterday, investors, stocks, markets, average, interest, points, rates, day, bond, dow, treasury, trading"
sample(nyt$lead_paragraph[topics==25], 1)
## [1] "The dollar ended lower against other leading currencies yesterday, hurt by growing pessimism that today's employment report for April would reveal more unsettling weakness in the American economy. Money brokers said there was some dollar buying early in the day. But later on, many traders reduced their dollar holdings as they became concerned that the Labor Department's employment report would show anemic job creation. Such weakness would put more pressure on the Federal Reserve to keep interest rates low, thereby making dollar-dominated investments less attractive."
# add predicted topic to dataset
nyt$pred_topic <- topics
nyt$year <- substr(nyt$datetime, 1, 4) # extract year
# frequency table with articles about stock market, per year
tab <- table(nyt$year[nyt$pred_topic==25])
plot(tab)
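One caveat (an addition, not in the original code): the raw counts above also depend on how many articles from each year ended up in the sample. Plotting the share of each year's articles whose most likely topic is the stock-market topic controls for that:
# proportion of each year's articles whose most likely topic is topic 25
share <- aggregate(nyt$pred_topic == 25, by = list(year = nyt$year), FUN = mean)
plot(as.numeric(share$year), share$x, type = "l",
     xlab = "Year", ylab = "Share of articles")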
But we can actually do better than this. LDA is a probabilistic model, which means that for each document it computes a full distribution over topics. In other words, each document is considered to be about a mixture of topics.
This information is included in the matrix gamma in the LDA object (theta in the notation we used in the slides). For example, article 1 is 21% about topic 27, 7% about topic 29, around 4% about topics 2, 3, 13, and 19, and less than that about each of the rest.
round(lda@gamma[1,], 2)
## [1] 0.02 0.04 0.04 0.02 0.03 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.04 0.02 0.02
## [16] 0.02 0.02 0.03 0.04 0.02 0.03 0.02 0.02 0.02 0.03 0.03 0.21 0.03 0.07 0.02
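The printed vector is in topic order, so to see which topics have the highest probabilities for a given document we can sort the corresponding row of gamma. This short snippet is an addition for illustration:
# three most likely topics for the first document, with their probabilities
top3 <- order(lda@gamma[1, ], decreasing = TRUE)[1:3]
round(setNames(lda@gamma[1, top3], top3), 2)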
So we can take the information in this matrix and aggregate it to compute, for each year, the average probability that an article is about a particular topic. Let’s now choose Topic 23, which appears to be related to the financial crisis.
# Topic 23
paste(terms[,23], collapse=", ")
## [1] "financial, money, banks, crisis, credit, markets, bank, debt, nation's, system, loans, funds, supply, fund, savings"
# add probability to df
nyt$prob_topic <- lda@gamma[,23]
# now aggregate at the year level
agg <- aggregate(nyt$prob_topic, by=list(year=nyt$year), FUN=mean)
# and plot it
plot(agg$year, agg$x, type="l", xlab="Year", ylab="Avg. prob. of article about topic 23",
main="Estimated proportion of articles about the financial crisis")