We will start with basic string manipulation in R.
Our running example will be a random sample of 10,000 tweets mentioning the names of the candidates in the 2014 European Parliament (EP) elections in the UK. We'll save the text of these tweets as a vector called `text`.
tweets <- read.csv("data/EP-elections-tweets.csv", stringsAsFactors=F)
head(tweets)
## text
## 1 @NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.
## 2 @Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?
## 3 @TrevorWAllman @Nigel_Farage The clock is ticking ....
## 4 @llexanderaamb got the badges.
## 5 @Peebi @AnujaPrashar @Angel4theNorth thanks
## 6 Well said @Nigel_Farage , poor Paxo had to change the subject. #Gettingstuffed
## screen_name id polite
## 1 martinwedge 4.730558e+17 impolite
## 2 WillGav 4.696599e+17 polite
## 3 CathyWood55 4.676486e+17 polite
## 4 CStephenssnp 4.701751e+17 polite
## 5 sanchia4europe 4.693720e+17 polite
## 6 EnglandsAce 4.685071e+17 polite
text <- tweets$text
R stores strings in character vectors. `length` gets the number of elements in the vector, while `nchar` returns the number of characters in each string.
length(text)
## [1] 10000
text[1]
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."
nchar(text[1])
## [1] 132
Note that we can work with multiple strings at once.
nchar(text[1:10])
## [1] 132 102 54 30 43 79 140 137 94 139
sum(nchar(text[1:10]))
## [1] 950
max(nchar(text[1:10]))
## [1] 140
We can merge different strings into one using `paste`:
paste(text[1], text[2], sep='--')
## [1] "@NSinclaireMEP Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry.--@Steven_Woolfe hi Steven, would you be free to join @LBC on the phone for 5 minutes after 3.30 at all?"
Character vectors can be compared using the `==` and `%in%` operators:
tweets$screen_name[1]=="martinwedge"
## [1] TRUE
"DavidCoburnUKip" %in% tweets$screen_name
## [1] TRUE
As we will see later, it is often convenient to convert all words to lowercase or uppercase.
tolower(text[1])
## [1] "@nsinclairemep knew that lib dems getting into bed with tories would end like this. they might never get another bite of the cherry."
toupper(text[1])
## [1] "@NSINCLAIREMEP KNEW THAT LIB DEMS GETTING INTO BED WITH TORIES WOULD END LIKE THIS. THEY MIGHT NEVER GET ANOTHER BITE OF THE CHERRY."
We can grab substrings with `substr`. The first argument is the string, the second is the beginning index (starting from 1), and the third is the final index.
substr(text[1], 1, 2)
## [1] "@N"
substr(text[1], 1, 10)
## [1] "@NSinclair"
This is useful when working with date strings as well:
dates <- c("2015/01/01", "2014/12/01")
substr(dates, 1, 4) # years
## [1] "2015" "2014"
substr(dates, 6, 7) # months
## [1] "01" "12"
We can split up strings by a separator using `strsplit`. If we choose space as the separator, this is in most cases equivalent to splitting into words.
strsplit(text[1], " ")
## [[1]]
## [1] "@NSinclaireMEP" "Knew" "that" "Lib"
## [5] "Dems" "getting" "into" "bed"
## [9] "with" "Tories" "would" "end"
## [13] "like" "this." "They" "might"
## [17] "never" "get" "another" "bite"
## [21] "of" "the" "cherry."
Let's dig into the data a little bit more. Given the construction of the dataset, we can expect that there will be many tweets mentioning the names of the candidates, such as @Nigel_Farage. We can use the `grep` command to identify these. `grep` returns the indices of the elements where the pattern occurs.
grep('@Nigel_Farage', text[1:10])
## [1] 3 6 9 10
`grepl` returns `TRUE` or `FALSE`, indicating whether each element of the character vector contains that particular pattern.
grepl('@Nigel_Farage', text[1:10])
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
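This makes `grepl` convenient for subsetting, since the logical vector it returns can be used directly as an index. For instance (illustrative code, output omitted):
head(text[grepl('@Nigel_Farage', text)], n=2) # first two tweets mentioning the handle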
Going back to the full dataset, we can use the results of `grep` to get particular rows. First, check how many tweets mention the handle "@Nigel_Farage".
nrow(tweets)
## [1] 10000
grep('@Nigel_Farage', tweets$text[1:10])
## [1] 3 6 9 10
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512
It is important to note that matching is case-sensitive. You can use the `ignore.case` argument to make the match case-insensitive.
nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage', tweets$text))
## [1] 1512
length(grep('@Nigel_Farage', tweets$text, ignore.case = TRUE))
## [1] 1535
Another useful tool for working with text data is regular expressions. Regular expressions let us develop complicated rules for both matching strings and extracting elements from them.
For example, we could look at tweets that mention at least one of two handles, using the operator "|" (equivalent to "OR"):
nrow(tweets)
## [1] 10000
length(grep('@Nigel_Farage|@UKIP', tweets$text, ignore.case=TRUE))
## [1] 1739
We can also use question marks to indicate optional characters.
nrow(tweets)
## [1] 10000
length(grep('MEP?', tweets$text, ignore.case=TRUE))
## [1] 4461
The question mark makes the preceding character optional, so this pattern will match both "ME" and "MEP", and therefore also "MEPs", etc.
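A toy example of the optional character, not in the original output:
grepl('colou?r', c('color', 'colour')) # both TRUE: the 'u' is optional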
Other common regular expression patterns are:
- `.` matches any character; `^` and `$` match the beginning and end of a string, respectively.
- `{3}`, `*`, and `+` indicate that the previous expression is matched exactly 3 times, 0 or more times, and 1 or more times, respectively.
- `[0-9]`, `[a-zA-Z]`, and `[:alnum:]` match any digit, any letter, and any digit or letter, respectively.
- Special characters such as `.`, `\`, `(`, or `)` must be preceded by a backslash to be matched literally.
- See `?regex` for more details.
For example, how many tweets are direct replies to @Nigel_Farage? How many tweets are retweets? How many tweets mention any username?
length(grep('^@Nigel_Farage', tweets$text, ignore.case=TRUE))
## [1] 376
length(grep('^RT @', tweets$text, ignore.case=TRUE))
## [1] 47
length(grep('@[A-Za-z0-9]+ ', tweets$text, ignore.case=TRUE))
## [1] 7834
Another function that we will use is `gsub`, which replaces a pattern (or a regular expression) with another string:
gsub('@[0-9_A-Za-z]+', 'USERNAME', text[1])
## [1] "USERNAME Knew that Lib Dems getting into bed with Tories would end like this. They might never get another bite of the cherry."
To extract a pattern, and not just replace it, use parentheses and choose the option `repl="\\1"`:
gsub('.*@([0-9_A-Za-z]+) .*', text[1], repl="\\1")
## [1] "NSinclaireMEP"
You can make this a bit more complex using `gregexpr`, which will extract the location of the matches, and then `regmatches`, which will extract the matched substrings:
handles <- gregexpr('@([0-9_A-Za-z]+)', text)
handles <- regmatches(text, handles)
handles <- unlist(handles)
head(sort(table(handles), decreasing=TRUE), n=25)
## handles
## @Nigel_Farage @nickgriffinmep @UKIP @DavidCoburnUKip
## 1514 375 357 356
## @JaniceUKIP @RogerHelmerMEP @DanHannanMEP @ClaudeMoraesMEP
## 241 234 228 123
## @SebDance @marcuschown @Lucy4MEP @IvanaBartoletti
## 119 110 102 98
## @Michael_Heaver @TasminaSheikh @maryhoneyball @GreenJeanMEP
## 98 89 88 82
## @TheGreenParty @Tim_Aker @JimAllister @MEPStandingUp4U
## 78 72 71 66
## @GlenisWillmott @sanchia4europe @SLATUKIP @davidmartinmep
## 62 62 61 57
## @KamaljeetJandu
## 54
# now with hashtags...
hashtags <- regmatches(text, gregexpr("#(\\d|\\w)+",text))
hashtags <- unlist(hashtags)
head(sort(table(hashtags), decreasing=TRUE), n=25)
## hashtags
## #UKIP #EP2014 #VoteGreen2014
## 261 146 75
## #EU #bbcqt #ukip
## 57 49 44
## #labourdoorstep #europeanelections #votegreen2014
## 40 20 20
## #VoteLab14 #WhyImVotingUkip #ep2014
## 17 17 16
## #Eurovision #indyref #Vote2014
## 15 15 15
## #bbcsp #Labour #votelabour
## 14 14 14
## #TUSC #votegreen #voteUKIP
## 13 13 13
## #London #VoteLabour #youth
## 12 12 12
## #SNP
## 11
Now let's try to identify which tweets are related to UKIP and extract them. How would we do it? First, let's add a new column to the data frame that has value `TRUE` for tweets that mention this keyword and `FALSE` otherwise. Then, we can keep the rows with value `TRUE`.
tweets$ukip <- grepl('ukip|farage', tweets$text, ignore.case=TRUE)
table(tweets$ukip)
##
## FALSE TRUE
## 6968 3032
ukip.tweets <- tweets[tweets$ukip==TRUE, ]
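As a quick sanity check (not in the original output), the number of rows in the subset should match the `TRUE` count in the table above:
nrow(ukip.tweets) # should be 3032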
As we discussed earlier, before we can do any type of automated text analysis, the text needs to go through several "preprocessing" steps before it can be passed to a statistical model. We'll use the `quanteda` package here.
The basic unit of work for the `quanteda` package is called a `corpus`, which represents a collection of text documents with some associated metadata. Documents are the subunits of a corpus. You can use `summary` to get some information about your corpus.
library(quanteda)
## quanteda version 0.9.9.65
## Using 3 of 4 cores for parallel computing
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
twcorpus <- corpus(tweets$text)
summary(twcorpus)
## Corpus consisting of 10000 documents, showing 100 documents.
##
## Text Types Tokens Sentences
## text1 24 25 2
## text2 22 22 1
## text3 7 10 1
## text4 5 5 1
## text5 4 4 1
## text6 13 13 2
## text7 25 26 3
## text8 21 23 1
## text9 16 18 2
## text10 26 29 1
## text11 22 24 3
## text12 7 8 1
## text13 23 24 1
## text14 17 18 1
## text15 26 27 3
## text16 24 26 2
## text17 20 29 1
## text18 15 16 1
## text19 20 22 2
## text20 20 20 2
## text21 27 31 2
## text22 21 23 1
## text23 10 10 1
## text24 20 22 2
## text25 13 13 1
## text26 12 12 1
## text27 9 11 1
## text28 3 3 1
## text29 16 27 1
## text30 10 10 2
## text31 4 4 1
## text32 22 24 6
## text33 3 3 1
## text34 9 9 1
## text35 12 14 1
## text36 17 18 2
## text37 14 14 2
## text38 20 25 1
## text39 17 18 1
## text40 11 12 2
## text41 2 2 1
## text42 23 28 2
## text43 17 17 1
## text44 7 7 1
## text45 24 27 1
## text46 14 15 1
## text47 22 24 2
## text48 19 21 1
## text49 16 17 1
## text50 15 15 1
## text51 16 18 2
## text52 5 5 1
## text53 9 9 1
## text54 15 19 1
## text55 16 21 2
## text56 13 14 1
## text57 18 20 1
## text58 23 31 3
## text59 9 11 1
## text60 3 3 1
## text61 13 15 1
## text62 20 23 4
## text63 4 4 1
## text64 15 16 2
## text65 21 21 1
## text66 10 12 1
## text67 12 13 1
## text68 24 25 3
## text69 10 10 1
## text70 12 12 1
## text71 12 13 1
## text72 19 23 1
## text73 9 9 1
## text74 12 12 1
## text75 14 18 1
## text76 16 16 1
## text77 15 15 1
## text78 22 24 1
## text79 16 19 1
## text80 12 15 1
## text81 26 27 2
## text82 15 16 2
## text83 9 12 1
## text84 22 25 1
## text85 15 15 1
## text86 10 10 1
## text87 15 18 1
## text88 20 25 2
## text89 27 32 1
## text90 15 16 1
## text91 20 23 1
## text92 19 23 2
## text93 16 22 2
## text94 25 29 2
## text95 20 22 1
## text96 22 24 4
## text97 13 17 2
## text98 16 18 2
## text99 25 28 3
## text100 23 26 1
##
## Source: /Users/pablobarbera/git/big-data-upf/* on x86_64 by pablobarbera
## Created: Wed Jun 28 21:33:43 2017
## Notes:
A useful feature of corpus objects is keywords in context, which returns all the appearances of a word (or combination of words) in its immediate context.
kwic(twcorpus, "brexit", window=10)
##
## [text7905, 7] @2cvdolly1 I'll make the case for | Brexit |
##
## to the best of my ability. I genuinely believe
kwic(twcorpus, "miliband", window=10)
##
## [text1072, 2]
## [text2859, 6]
## [text3757, 4]
## [text4300, 3]
## [text4863, 2]
## [text5150, 6]
## [text5251, 4]
## [text5294, 4]
## [text5376, 10]
## [text6042, 1]
## [text6089, 4]
## [text6291, 2]
## [text7458, 12]
## [text7988, 12]
## [text8075, 28]
## [text9746, 2]
##
## Ed
## RT"@steverichards14: It's
## Watch out Ed
## Here's what
## Ed
## @petercoles44@DavidCoburnUKip@Ed_Miliband To me
## Watch out Ed
## Watch out Ed
## @PrzSkwirczynski@State_Control@WinstonMcK@croydon_oldtown@suzanneshine@hopenothate@SLATUKIP@SquareBiz_T@bieneosa
##
## Watch out Ed
## Ed
## @UKIP_Bolton@Nigel_Farage UKIP ain't racist Mr's Cameron/ Clegg/
## @UKIP_Bolton@Nigel_Farage UKIP ain't racist Mr's Cameron/ Clegg/
## imagine what he can do to cameron& amp;
## Ed
##
## | Miliband | , please cap gym prices http:// t.co
## | Miliband | , not Farage, who's breaking with tradition and upsetting
## | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
## | Miliband | had to say about standing up to UKIP- http
## | Miliband | , please cap gym prices http:// t.co
## | Miliband | looks a shifty character with normal people
## | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
## | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
## | Miliband | apologise!
## | Miliband | saying he wants to hear where#UKIP stands on key
## | Miliband | : Nigel Farage and Ukip targets Labour( via@daily_express
## | Miliband | is heading for disaster as Labour MPs say party leaders
## | Miliband | Sticks& amp; Stones Two VOTES ukip http:
## | Miliband | Sticks& amp; Stones Two VOTES ukip http:
## | miliband |
## | Miliband | is heading for disaster as Labour MPs say party leaders
kwic(twcorpus, "eu referendum", window=10)
##
## [text5316, 18:19]
## [text5756, 18:19]
## [text6906, 12:13]
## [text9038, 12:13]
##
## , why wait for 2017 for an in/ out |
## , why wait for 2017 for an in/ out |
## @Nigel_Farage What happened to#Cameron's cast iron guarantee of an |
## @Nigel_Farage What happened to#Cameron's cast iron guarantee of an |
##
## EU referendum | - If EU renegotiation impossible?
## EU referendum | - If EU renegotiation impossible?
## EU referendum | before the last election?#bbcsp
## EU referendum | before the last election?#bbcsp
We can then convert a corpus into a document-feature matrix using the `dfm` function.
twdfm <- dfm(twcorpus, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 10,000 documents, 16,513 features
## ... created a 10,000 x 16,513 sparse dfm
## ... complete.
## Elapsed time: 0.001 seconds.
twdfm
## Document-feature matrix of: 10,000 documents, 16,513 features (99.9% sparse).
`dfm` has many useful options. Let's actually use it to stem the text, extract n-grams, remove punctuation, keep Twitter features…
?dfm
twdfm <- dfm(twcorpus, tolower=TRUE, stem=TRUE, remove_punct = TRUE, ngrams=1:3, verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 10,000 documents, 154,722 features
## ... stemming features (English)
## , trimmed 5111 feature variants
## ... created a 10,000 x 149,611 sparse dfm
## ... complete.
## Elapsed time: 5.82 seconds.
twdfm
## Document-feature matrix of: 10,000 documents, 149,611 features (100% sparse).
Note that here we use n-grams – this will extract all combinations of one, two, and three words (e.g. it will consider "human", "rights", and "human rights" as separate tokens in the matrix).
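To see what these n-gram features look like, we can inspect the feature names directly (illustrative code, output omitted; assumes a version of quanteda where `featnames()` is available):
head(featnames(twdfm), n=20)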
Stemming relies on the `SnowballC` package's implementation of the Porter stemmer:
tokenize(tweets$text[1])
## tokenizedTexts from 1 document.
## Component 1 :
## [1] "@NSinclaireMEP" "Knew" "that" "Lib"
## [5] "Dems" "getting" "into" "bed"
## [9] "with" "Tories" "would" "end"
## [13] "like" "this" "." "They"
## [17] "might" "never" "get" "another"
## [21] "bite" "of" "the" "cherry"
## [25] "."
tokens_wordstem(tokenize(tweets$text[1]))
## tokenizedTexts from 1 document.
## Component 1 :
## [1] "@NSinclaireMEP" "Knew" "that" "Lib"
## [5] "Dem" "get" "into" "bed"
## [9] "with" "Tori" "would" "end"
## [13] "like" "thi" "." "Thei"
## [17] "might" "never" "get" "anoth"
## [21] "bite" "of" "the" "cherri"
## [25] "."
In a large corpus like this, many features often appear in only one or two documents. In some cases it's a good idea to remove those features, to speed up the analysis or because they're not relevant. We can trim the dfm with `dfm_trim`:
twdfm <- dfm_trim(twdfm, min_docfreq=3, verbose=TRUE)
## Removing features occurring:
## - in fewer than 3 document: 132,761
## Total features removed: 132,761 (88.7%).
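Printing the trimmed dfm again verifies the new dimensions (output omitted; it should report 149,611 minus the 132,761 removed, i.e. 16,850 features):
twdfm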
It’s often a good idea to take a look at a wordcloud of the most frequent features to see if there’s anything weird.
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
What is going on? We probably want to remove words and symbols that are not of interest to our analysis, such as "http" here. Words that are common connectors in a given language (e.g. "a", "the", "is") and carry little meaning on their own are called stopwords, and are usually removed as well. We can see the most frequent features using `topfeatures`:
topfeatures(twdfm, 25)
## the to a you of
## 3731 3090 2259 2136 1950
## in t.co http http_t.co and
## 1950 1863 1744 1744 1718
## for i @nigel_farag is it
## 1706 1616 1535 1475 1452
## on that be thank not
## 1277 1099 919 849 828
## are have with ukip vote
## 800 790 719 690 684
We can remove the stopwords when we create the `dfm` object:
twdfm <- dfm(twcorpus, remove_punct = TRUE, remove=c(
stopwords("english"), "t.co", "https", "rt", "amp", "http", "t.c", "can", "u"), verbose=TRUE)
## Creating a dfm from a corpus ...
## ... tokenizing texts
## ... lowercasing
## ... found 10,000 documents, 16,634 features
## ...
## dfm_select removed 177 features and 0 documents, padding 0s for 0 features and 0 documents.
## ... created a 10,000 x 16,457 sparse dfm
## ... complete.
## Elapsed time: 0.107 seconds.
textplot_wordcloud(twdfm, rot.per=0, scale=c(3.5, .75), max.words=100)
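To confirm that the stopwords are gone, we can inspect the most frequent features again (output omitted):
topfeatures(twdfm, 25)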
One nice feature of quanteda is that we can easily add metadata to the corpus object.
docvars(twcorpus) <- data.frame(screen_name=tweets$screen_name, polite=tweets$polite)
summary(twcorpus)
## Corpus consisting of 10000 documents, showing 100 documents.
##
## Text Types Tokens Sentences screen_name polite
## text1 24 25 2 martinwedge impolite
## text2 22 22 1 WillGav polite
## text3 7 10 1 CathyWood55 polite
## text4 5 5 1 CStephenssnp polite
## text5 4 4 1 sanchia4europe polite
## text6 13 13 2 EnglandsAce polite
## text7 25 26 3 MikeGreenUKIP polite
## text8 21 23 1 Anothergreen polite
## text9 16 18 2 kell901 polite
## text10 26 29 1 BranimiraMachev polite
## text11 22 24 3 NorseFired polite
## text12 7 8 1 CharlesTannock polite
## text13 23 24 1 GoodallGiles polite
## text14 17 18 1 francisdolarhy2 polite
## text15 26 27 3 CuinnUiNeill polite
## text16 24 26 2 HenryMcMorrow polite
## text17 20 29 1 DavidCoburnUKip polite
## text18 15 16 1 ajcdeane polite
## text19 20 22 2 jackbuckby polite
## text20 20 20 2 kvmarthur polite
## text21 27 31 2 YOURvoiceParty polite
## text22 21 23 1 101flyboy polite
## text23 10 10 1 CharlesTannock polite
## text24 20 22 2 DuncanRights polite
## text25 13 13 1 skepticalvoter polite
## text26 12 12 1 DavidCoburnUKip polite
## text27 9 11 1 scrapperduncan polite
## text28 3 3 1 zander469 polite
## text29 16 27 1 DavidWickham3 polite
## text30 10 10 2 Green_Caroline polite
## text31 4 4 1 GucciAirbag_ polite
## text32 22 24 6 Comrade58 impolite
## text33 3 3 1 DugaldMacMillan polite
## text34 9 9 1 Shyman33 polite
## text35 12 14 1 jackbuckby polite
## text36 17 18 2 cymroynewrop polite
## text37 14 14 2 danielrhamilton polite
## text38 20 25 1 Green_DannyB polite
## text39 17 18 1 GoodallGiles polite
## text40 11 12 2 PascaleLamb polite
## text41 2 2 1 helena_pigott polite
## text42 23 28 2 EnzaFerreri polite
## text43 17 17 1 NSinclaireMEP polite
## text44 7 7 1 NSinclaireMEP polite
## text45 24 27 1 garrodt polite
## text46 14 15 1 DavidCoburnUKip polite
## text47 22 24 2 dannyyoung35 polite
## text48 19 21 1 DavidCoburnUKip polite
## text49 16 17 1 GreggatQuest polite
## text50 15 15 1 Wise64 polite
## text51 16 18 2 FionaRadic polite
## text52 5 5 1 CStephenssnp polite
## text53 9 9 1 Kevinmorosky polite
## text54 15 19 1 sanchia4europe polite
## text55 16 21 2 ScrumpyNed polite
## text56 13 14 1 JosephMcShane polite
## text57 18 20 1 SarahLudfordMEP polite
## text58 23 31 3 GinaDowding polite
## text59 9 11 1 DavidWickham3 polite
## text60 3 3 1 katrinamurray71 polite
## text61 13 15 1 IainMcGill polite
## text62 20 23 4 SchaduwStaten polite
## text63 4 4 1 Rory_Palmer polite
## text64 15 16 2 PercyBlakeney63 polite
## text65 21 21 1 DanHannanMEP polite
## text66 10 12 1 jennyknight2014 polite
## text67 12 13 1 Steven_Woolfe polite
## text68 24 25 3 JamesJimCarver polite
## text69 10 10 1 DavidCoburnUKip polite
## text70 12 12 1 waddesdonbaz polite
## text71 12 13 1 Cumpedelibero polite
## text72 19 23 1 Green_DannyB polite
## text73 9 9 1 F1andyD polite
## text74 12 12 1 graham_pointer polite
## text75 14 18 1 veganfishcake polite
## text76 16 16 1 peterlfoster polite
## text77 15 15 1 DavidCoburnUKip polite
## text78 22 24 1 londonstatto polite
## text79 16 19 1 TurfShifter polite
## text80 12 15 1 suzanneshine polite
## text81 26 27 2 GoodallGiles polite
## text82 15 16 2 Mauginog polite
## text83 9 12 1 rivermagic123 polite
## text84 22 25 1 SHKMEP polite
## text85 15 15 1 GrillingKippers polite
## text86 10 10 1 Zoidybear polite
## text87 15 18 1 CulliganPA polite
## text88 20 25 2 globalrichard polite
## text89 27 32 1 davenellist polite
## text90 15 16 1 AlynSmithMEP polite
## text91 20 23 1 suzanneshine polite
## text92 19 23 2 ssilverwavess polite
## text93 16 22 2 GlobalYawning polite
## text94 25 29 2 CulliganPA polite
## text95 20 22 1 CllrChrisPain polite
## text96 22 24 4 dennisterrey impolite
## text97 13 17 2 Anothergreen polite
## text98 16 18 2 FionaRadic polite
## text99 25 28 3 cristian7897 polite
## text100 23 26 1 jennyknight2014 polite
##
## Source: /Users/pablobarbera/git/big-data-upf/* on x86_64 by pablobarbera
## Created: Wed Jun 28 21:33:43 2017
## Notes:
We can then use this metadata to subset the dataset:
polite.tweets <- corpus_subset(twcorpus, polite=="impolite")
And then extract the text:
mytexts <- texts(polite.tweets)
We’ll come back later to this dataset.
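As a quick sanity check (code not in the original), the number of documents in the subset should match the count of impolite tweets in the metadata:
ndoc(polite.tweets)
table(docvars(twcorpus)$polite)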
There are different ways to read text into R and create a `corpus` object with `quanteda`. We have already seen the most common way, importing the text from a csv file and then adding the metadata, but the companion `readtext` package has a function to help with this:
library(readtext)
tweets <- readtext(file='data/EP-elections-tweets.csv')
twcorpus <- corpus(tweets)
This function will also work with text spread across multiple files. To do this, we call `readtext` with the 'glob' operator '*' to indicate that we want to load multiple files:
myCorpus <- readtext(file='data/inaugural/*.txt')
inaugCorpus <- corpus(myCorpus)
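Assuming the inaugural speech files are available at that path, we can then inspect the new corpus as before (output omitted; the `n` argument limits how many documents are shown):
summary(inaugCorpus, n=5)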