To understand how APIs work, we’ll take the New York Times API as an example. This API allows users to search articles by keyword and date range, and returns counts of articles and a short description of each article (but not the full text). You can find the documentation here. Get a new API token and paste it here:
apikey <- 'MFmx9sy8TZb36dnVOXtAtv7rXnE9GWGm'
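If you would rather not paste the key into the script itself, one common alternative is to read it from an environment variable. The sketch below assumes a variable called NYT_API_KEY, which is a name chosen here for illustration and not something the NYT API requires; get_api_key is likewise a hypothetical helper.

```r
# Hypothetical helper: read the API key from an environment variable.
# The variable name NYT_API_KEY is an assumption made for this example;
# you could set it in ~/.Renviron with a line such as NYT_API_KEY=your-key-here
get_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    # fail early with a clear message instead of sending an empty key
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}

# Usage (commented out so the block runs even when the variable is unset):
# apikey <- get_api_key()
```

This keeps the key out of any file you share, at the cost of one extra setup step on each machine.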
The first step is to identify the base URL and the parameters that we can use to query the API. Now we can do a first API call using the httr package. (You can use my API key for now; let’s hope we don’t hit the rate limit!)
base_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
library(httr)
r <- GET(base_url, query=list(q="inequality","api-key"=apikey))
r
## Response [http://api.nytimes.com/svc/search/v2/articlesearch.json?q=inequality&api-key=MFmx9sy8TZb36dnVOXtAtv7rXnE9GWGm]
## Date: 2022-09-10 15:44
## Status: 200
## Content-Type: application/json
## Size: 222 kB
From the output of r, we can see that the query was successful (Status: 200), the content is in JSON format, and its size is 222 kB.
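The first line of the response shows the full URL that was requested: the base URL, a question mark, and then each parameter as a name=value pair joined by &. To make that mapping explicit, here is a minimal base R sketch that builds such a URL by hand. The function build_query_url is hypothetical and only illustrates the idea; httr does its own (more careful) encoding when you pass query=.

```r
# Hypothetical helper: turn a base URL and a named list of parameters
# into a full request URL. Illustration only, not httr's implementation.
build_query_url <- function(base_url, query) {
  # percent-encode each value so that spaces and symbols are URL-safe
  values <- vapply(
    query,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  # join as name=value pairs separated by "&", after a "?"
  paste0(base_url, "?", paste0(names(query), "=", values, collapse = "&"))
}

# "YOUR-KEY" is a placeholder, not a working key
build_query_url(
  "http://api.nytimes.com/svc/search/v2/articlesearch.json",
  list(q = "income inequality", "api-key" = "YOUR-KEY")
)
```

Note how the space in the search term is encoded as %20. In practice you would let GET() handle this, but seeing the pieces helps when reading API documentation.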
To extract the text returned by this API call, you can use content. You can write it to a file to take a look at it.
content(r, 'text')
writeLines(content(r, 'text'), con="nyt.json") # a file path lets writeLines open and close the file itself
## No encoding supplied: defaulting to UTF-8.
We can save the output into an object in R to learn more about its structure.
json <- content(r, 'parsed')
class(json); names(json) # list with 3 elements
## [1] "list"
## [1] "status" "copyright" "response"
json$status # this should be "OK"
## [1] "OK"
names(json$response) # the actual data
## [1] "docs" "meta"
json$response$meta # metadata
## $hits
## [1] 24957
##
## $offset
## [1] 0
##
## $time
## [1] 22
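The parsed response is just a nested R list, so the usual list tools apply. The sketch below uses a hand-built stand-in with the same element names as above, so it runs without calling the API. The metadata values are copied from the output shown here; docs is left empty and the copyright text is a placeholder, since their contents are not shown.

```r
# Hand-built stand-in for the parsed response (not real API output):
# same element names as above, metadata values copied from the printed output
json_mock <- list(
  status = "OK",
  copyright = "placeholder",   # the real text is not shown in this tutorial
  response = list(
    docs = list(),             # left empty on purpose
    meta = list(hits = 24957, offset = 0, time = 22)
  )
)

# str() gives a compact overview of a nested list
str(json_mock, max.level = 2)

# two equivalent ways to reach the hit count
hits  <- json_mock$response$meta$hits
hits2 <- json_mock[[c("response", "meta", "hits")]]  # recursive [[ ]] indexing
```

The same two access patterns work on the real json object returned by content(r, 'parsed').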
If we check the documentation, we find that we can subset by date with the begin_date and end_date parameters. Let’s see how this works…
r <- GET(base_url, query=list(q="inequality",
                              "api-key"=apikey,
                              "begin_date"=20200101,
                              "end_date"=20201231))
json <- content(r, 'parsed')
json$response$meta
## $hits
## [1] 1821
##
## $offset
## [1] 0
##
## $time
## [1] 21
Between these two dates, there were 1821 articles in the NYT mentioning “inequality”.
Now imagine we want to look at the evolution of mentions of this word over time. Following the best coding practices we introduced earlier, we want to write a function that will take a word and a set of dates as arguments and return the counts of articles.
This would be a first draft of that function:
nyt_count <- function(q, date1, date2){
  r <- GET(base_url, query=list(q=q,
                                "api-key"=apikey,
                                "begin_date"=date1,
                                "end_date"=date2))
  json <- content(r, "parsed")
  return(json$response$meta$hits)
}
nyt_count(q="inequality", date1=20200101, date2=20201231)
## [1] 1821
Ok, so this seems to work. But we want to run this function multiple times, so let’s write another function that helps us do that.
nyt_years_count <- function(q, yearinit, yearend){
  # sequence of years to loop over
  years <- seq(yearinit, yearend)
  counts <- rep(NA, length(years))
  # loop over periods
  for (i in 1:length(years)){
    # information message to track progress
    message(years[i])
    # retrieve count
    counts[i] <- nyt_count(q=q, date1=paste0(years[i], "0101"),
                           date2=paste0(years[i], "1231"))
  }
  return(counts)
}
# and let's see what happens...
nyt_years_count(q="inequality", yearinit=1980, yearend=2020)
Oops! What happened? Why the error? We’re querying the API too fast. Let’s modify the function to add a while loop that will wait a couple of seconds in case there’s an error:
nyt_count <- function(q, date1, date2){
  r <- GET(base_url, query=list(q=q,
                                "api-key"=apikey,
                                "begin_date"=date1,
                                "end_date"=date2))
  json <- content(r, "parsed")
  ## if the request did not succeed (any status other than 200)
  while (r$status_code!=200){
    Sys.sleep(2) # wait a couple of seconds
    # try again:
    r <- GET(base_url, query=list(q=q,
                                  "api-key"=apikey,
                                  "begin_date"=date1,
                                  "end_date"=date2))
    json <- content(r, "parsed")
  }
  return(json$response$meta$hits)
}
And let’s see if this does the trick…
counts <- nyt_years_count(q="inequality", yearinit=1980, yearend=2020)
## 1980
## 1981
## 1982
## 1983
## 1984
## 1985
## 1986
## 1987
## 1988
## 1989
## 1990
## 1991
## 1992
## 1993
## 1994
## 1995
## 1996
## 1997
## 1998
## 1999
## 2000
## 2001
## 2002
## 2003
## 2004
## 2005
## 2006
## 2007
## 2008
## 2009
## 2010
## 2011
## 2012
## 2013
## 2014
## 2015
## 2016
## 2017
## 2018
## 2019
## 2020
plot(1980:2020, counts, type="l", main="Mentions of 'inequality' in the NYT, by year",
     xlab="Year", ylab="Article count")
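One caveat about the while loop in nyt_count: if the API keeps returning an error (for example because the key is invalid), the loop never ends. A possible refinement is to cap the number of attempts and wait a little longer after each failure. The sketch below is one way to do that; with_retry is a hypothetical helper, and it only assumes that the request function returns an object with a status_code element, as httr responses do. A fake request is used so the example runs without network access.

```r
# Hypothetical helper: call request_fun() until it returns status 200,
# giving up after max_tries attempts and waiting longer after each failure
with_retry <- function(request_fun, max_tries = 5, wait = 2) {
  for (attempt in seq_len(max_tries)) {
    r <- request_fun()
    if (r$status_code == 200) return(r)
    # no point sleeping after the final attempt
    if (attempt < max_tries) Sys.sleep(wait * attempt)
  }
  stop("Request still failing after ", max_tries,
       " attempts (last status: ", r$status_code, ")", call. = FALSE)
}

# Fake request for illustration: fails twice with status 429, then succeeds
calls <- 0
fake_request <- function() {
  calls <<- calls + 1
  list(status_code = if (calls < 3) 429 else 200)
}

res <- with_retry(fake_request, wait = 0.01)
res$status_code
```

With httr, the request function could be something like function() GET(base_url, query=list(...)). httr also provides a RETRY() function that may cover this case; check its documentation before relying on it.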
Let’s try to generalize the function even more so that it works with any date interval, not just years:
nyt_dates_count <- function(q, init, end, by){
  # sequence of dates to loop over
  dates <- seq(from=init, to=end, by=by)
  dates <- format(dates, "%Y%m%d") # changing format to match NYT API format
  counts <- rep(NA, length(dates)-1)
  # loop over periods
  for (i in 1:(length(dates)-1)){ ## note the -1 here
    # information message to track progress
    message(dates[i])
    # retrieve count
    counts[i] <- nyt_count(q=q, date1=dates[i],
                           date2=dates[i+1])
  }
  # improving this as well so that it returns a data frame
  df <- data.frame(date = as.Date(dates[-length(dates)], format="%Y%m%d"), count = counts)
  return(df)
}
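It can help to look at the date windows on their own, without calling the API. In nyt_dates_count each window ends on the same date on which the next one begins. If the API treats both begin_date and end_date as inclusive (an assumption that is not verified here), articles published on that shared date could be counted in two windows. The sketch below shows one possible way to build windows that do not share a boundary, by ending each window the day before the next one starts.

```r
# Start dates of each period, plus one extra date to close the last window
starts <- seq(from = as.Date("2015-01-01"), to = as.Date("2015-04-01"), by = "month")

# Each window runs from its start date to the day before the next start date
windows <- data.frame(
  begin_date = format(starts[-length(starts)], "%Y%m%d"),
  end_date   = format(starts[-1] - 1, "%Y%m%d"),
  stringsAsFactors = FALSE
)
windows
```

For counts at the month level the difference is likely to be small, so whether this adjustment is worth making depends on how precise the counts need to be.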
And now we can count articles at the month level…
counts <- nyt_dates_count(q="trump", init = as.Date("2015/01/01"), end = as.Date("2022/08/31"), by="month")
## 20150101
## 20150201
## 20150301
## 20150401
## 20150501
## 20150601
## 20150701
## 20150801
## 20150901
## 20151001
## 20151101
## 20151201
## 20160101
## 20160201
## 20160301
## 20160401
## 20160501
## 20160601
## 20160701
## 20160801
## 20160901
## 20161001
## 20161101
## 20161201
## 20170101
## 20170201
## 20170301
## 20170401
## 20170501
## 20170601
## 20170701
## 20170801
## 20170901
## 20171001
## 20171101
## 20171201
## 20180101
## 20180201
## 20180301
## 20180401
## 20180501
## 20180601
## 20180701
## 20180801
## 20180901
## 20181001
## 20181101
## 20181201
## 20190101
## 20190201
## 20190301
## 20190401
## 20190501
## 20190601
## 20190701
## 20190801
## 20190901
## 20191001
## 20191101
## 20191201
## 20200101
## 20200201
## 20200301
## 20200401
## 20200501
## 20200601
## 20200701
## 20200801
## 20200901
## 20201001
## 20201101
## 20201201
## 20210101
## 20210201
## 20210301
## 20210401
## 20210501
## 20210601
## 20210701
## 20210801
## 20210901
## 20211001
## 20211101
## 20211201
## 20220101
## 20220201
## 20220301
## 20220401
## 20220501
## 20220601
## 20220701
plot(counts$date, counts$count, type="l", main="Mentions of 'Trump' in the NYT, by month",
xlab="Month", ylab="Article count")