While supervised learning is used when we know the categories we want to produce, unsupervised learning (including topic modeling) is used when we do not know the categories. In topic modeling, documents are not assumed to belong to a single topic or category; instead, each document is treated as a mixture of several topics, and the topic proportions vary from document to document.
The workhorse function for estimating the topic model is LDA, which stands for Latent Dirichlet Allocation, the technical name for this particular kind of model.
We will now use a dataset that contains the lead paragraph of around 5,000 articles about the economy published in the New York Times between 1980 and 2014. As before, we will preprocess the text using the standard set of techniques.
The number of topics in a topic model is somewhat arbitrary, so you need to experiment with different numbers of topics to see which yields the most interpretable results. We start here with 30 topics; a rough way to compare different choices is sketched after the model is estimated below.
library(topicmodels)
# reading data and preparing corpus object
nyt <- read.csv("../data/nytimes.csv", stringsAsFactors = FALSE)
library(quanteda)
nytcorpus <- corpus(nyt$lead_paragraph)
nytdfm <- dfm(nytcorpus, remove=stopwords("english"), verbose=TRUE,
remove_punct=TRUE, remove_numbers=TRUE)
cdfm <- dfm_trim(nytdfm, min_docfreq = 2)
# estimate LDA with K topics
K <- 30
lda <- LDA(cdfm, k = K, method = "Gibbs",
control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))
## K = 30; V = 11109; M = 5000
## Sampling 600 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Iteration 325 ...
## Iteration 350 ...
## Iteration 375 ...
## Iteration 400 ...
## Iteration 425 ...
## Iteration 450 ...
## Iteration 475 ...
## Iteration 500 ...
## Iteration 525 ...
## Iteration 550 ...
## Iteration 575 ...
## Iteration 600 ...
## Gibbs sampling completed!
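As noted above, the choice of 30 topics is somewhat arbitrary. One rough way to compare different values of K is to refit the model for several values and compare the resulting log-likelihoods. This is only a sketch: it is slow, since each model is estimated from scratch, and the log-likelihood tends to increase with K, so it should always be combined with substantive inspection of the topics.
# rough comparison of different numbers of topics (slow: each model is refit from scratch)
Ks <- c(10, 20, 30, 50)
fits <- lapply(Ks, function(k) LDA(cdfm, k = k, method = "Gibbs",
    control = list(seed = 123, burnin = 100, iter = 500)))
# log-likelihood of each fitted model; higher is better, but it tends to favor larger K
sapply(fits, logLik)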
We can use get_terms to extract the top n terms from the topic model, and get_topics to obtain the most likely topics for each document. This will help us interpret the results of the model.
terms <- get_terms(lda, 15)
terms[,1]
## [1] "mr" "obama" "president" "republican"
## [5] "democrats" "republicans" "bill" "house"
## [9] "senate" "senator" "campaign" "political"
## [13] "democratic" "presidential" "election"
topics <- get_topics(lda, 1)
head(topics)
## text1 text2 text3 text4 text5 text6
## 10 26 7 30 2 23
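Before going topic by topic, a quick overview can be obtained by collapsing the top words of each topic into a single string; a small sketch using the terms matrix extracted above:
# one comma-separated label per topic, built from its top 15 words
topic_labels <- apply(terms, 2, paste, collapse = ", ")
head(topic_labels, 3)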
Let’s take a closer look at some of these topics. To help us interpret the output, we can look at the words associated with each topic and take a random sample of documents highly associated with each topic.
# Topic 2
paste(terms[,2], collapse=", ")
## [1] "market, stock, yesterday, investors, markets, stocks, average, points, bond, dow, high, trading, treasury, day, bonds"
sample(nyt$lead_paragraph[topics==2], 1)
## [1] "JITTERY investors set aside concerns about inflation and put their faith in profits this week, sending stock indexes higher for a couple of days as FedEx, Morgan Stanley and other companies posted better-than-expected earnings before the markets slumped later in the week."
# Topic 3
paste(terms[,3], collapse=", ")
## [1] "prices, inflation, price, oil, higher, consumer, energy, rising, index, costs, rise, labor, food, government, producer"
sample(nyt$lead_paragraph[topics==3], 1)
## [1] "The Labor Department reported today that prices paid to American producers rose in September by the biggest amount in eight months, but economists dismissed the jump as the result of only temporary factors and quirks, not accelerated inflation. The Producer Price Index for finished goods rose three-tenths of 1 percent, but economists said that it mainly reflected short-term surges in the cost of vegetables and other food as well as the effect of seasonal adjustment."
# Topic 7
paste(terms[,7], collapse=", ")
## [1] "federal, rates, reserve, interest, fed, policy, chairman, greenspan, committee, rate, cut, meeting, central, alan, monetary"
sample(nyt$lead_paragraph[topics==7], 1)
## [1] "Despite concerns over increased wage pressures, Federal Reserve officials voted 10 to 0 to keep interest rates unchanged at the end of the third quarter, according to minutes issued today. The Fed also kept rates unchanged at this week's policy meeting, minutes of which will not be released until after the next meeting of the Federal Open Market Committee in December."
# Topic 12
paste(terms[,12], collapse=", ")
## [1] "business, companies, company, american, industry, investment, businesses, corporate, general, executives, small, industries, corporation, technology, plans"
sample(nyt$lead_paragraph[topics==12], 1)
## [1] "The top leaders of steel unions and companies are breathing easier now that local union presidents have ratified a new 41-month contract. The beleaguered companies gained significant concessions. The unions obtained a face-saving promise that saved monies would go to modernize facilities. Both sides avoided a strike in August that might have cost them billions in lost sales."
You will find that some topics often do not make much sense. They just capture residual clusters of words, and often consist of generic, stopword-like terms. For example:
# Topic 4
paste(terms[,4], collapse=", ")
## [1] "economy, much, news, good, back, war, enough, might, times, bad, view, even, least, latest, turn"
sample(nyt$lead_paragraph[topics==4], 1)
## [1] "The economy seems to be stabilizing, and this has prompted a shift in the public mood. Raw fear has given way to anxiety that the recovery will be feeble and drab. Companies are hoarding cash. Banks aren't lending to small businesses. Private research spending is drifting downward. People are asking anxious questions about America's future. Will it take years before the animal spirits revive? Can the economy rebalance so that it relies less on consumption and debt and more on innovation and export? Have we entered a period of relative decline?"
# Topic 14
paste(terms[,14], collapse=", ")
## [1] "many, now, one, long, s, country, great, nation, end, america, every, decade, american, depression, problems"
sample(nyt$lead_paragraph[topics==14], 1)
## [1] "The economy is suffering the worst recession since 1937. Washington has practiced overkill in the war against inflation. Yet policymakers are paralyzed by memories of recent inflation peaks and the fear of future ones. They are unwilling to ease credit or increase consumer spending power. America is hurting, yet the only policy the Reagan Administration can find is more pain. That policy is wrong; worse, it is unnecessary. Any initiative against recession risks unleashing inflation. But that risk is not as immediate as is commonly believed. The recession has created so much excess capacity that stimulating consumer and business demand could not have much impact on inflation for years to come. That's all the more reason to act, now."
In the case of data with timestamps, looking at the evolution of certain topics over time can also help us interpret their meaning. Let's look, for example, at Topic 2, which appears to be related to the stock market.
# Topic 2
paste(terms[,2], collapse=", ")
## [1] "market, stock, yesterday, investors, markets, stocks, average, points, bond, dow, high, trading, treasury, day, bonds"
sample(nyt$lead_paragraph[topics==2], 1)
## [1] "Stocks fell yesterday, snapping the Dow Jones industrial average's two-week winning streak, after higher oil prices and a weaker dollar produced concerns that holiday sales and economic growth might falter. Home Depot, the home improvement retailer, led a decline among merchants as their busiest period began. Johnson & Johnson, the maker of health products from Band-Aids to schizophrenia pills, and Citigroup, the biggest United States bank, led a retreat in the Dow industrials."
# add predicted topic to dataset
nyt$pred_topic <- topics
nyt$year <- substr(nyt$datetime, 1, 4) # extract year
# frequency table with articles about stock market, per year
tab <- table(nyt$year[nyt$pred_topic==2])
plot(tab)
But we can actually do better than this. LDA is a probabilistic model, which means that for each document, it actually computes a distribution over topics. In other words, each document is considered to be about a mixture of topics.
This information is included in the matrix gamma in the LDA object (theta in the notation we used in the slides). For example, article 1 is 9% about topic 10, 7% about topic 14, 5% about topic 26, and less than 5% about each of the remaining topics.
round(lda@gamma[1,], 2)
## [1] 0.04 0.02 0.02 0.04 0.02 0.02 0.02 0.04 0.02 0.09 0.02 0.02 0.03 0.07
## [15] 0.02 0.02 0.03 0.04 0.04 0.04 0.03 0.02 0.03 0.02 0.04 0.05 0.03 0.04
## [29] 0.02 0.03
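To see this more directly, we could sort that row of gamma and list the most likely topics for the article; a minimal sketch:
# topics for article 1, ordered from most to least likely
doc1 <- lda@gamma[1,]
order(doc1, decreasing=TRUE)[1:3] # topic numbers
round(sort(doc1, decreasing=TRUE)[1:3], 2) # their probabilities
sum(doc1) # each row of gamma is a probability distribution, so it sums to 1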
So we can take the information in this matrix and aggregate it to compute the average probability that an article published in a given year is about a particular topic. Let's now choose Topic 15, which appears to be related to the financial crisis.
# Topic 15
paste(terms[,15], collapse=", ")
## [1] "financial, money, banks, credit, crisis, billion, bank, debt, nation's, markets, loans, supply, system, funds, announced"
# add probability to df
nyt$prob_topic <- lda@gamma[,15]
# now aggregate at the year level
agg <- aggregate(nyt$prob_topic, by=list(year=nyt$year), FUN=mean)
# and plot it
plot(agg$year, agg$x, type="l", xlab="Year", ylab="Avg. prob. of article about topic 15",
main="Estimated proportion of articles about the financial crisis")