First, let's detect the encoding of your system:
Sys.getlocale(category = "LC_CTYPE")
## [1] "en_US.UTF-8"
Let’s now consider some text in German that contains a non-ASCII character.
# some text in German
de <- "Einbahnstraße"
# all good!
message(de)
## Einbahnstraße
de
## [1] "Einbahnstraße"
This worked because the file we are writing in is saved in UTF-8, which matches the locale above, so the string displays correctly without any conversion. Note, however, that R reports its encoding as "unknown", which means the string is assumed to be in the native (locale) encoding:
Encoding(de)
## [1] "unknown"
But what if the file is not saved in UTF-8, so that when we save and re-open it the character appears as an escape sequence like the one below? As long as we set the right encoding, we can switch back and forth.
de <- "Einbahnstra\u00dfe"
Encoding(de)
## [1] "UTF-8"
message(de)
## Einbahnstraße
# this is the wrong encoding
Encoding(de) <- "latin1"
message(de)
## EinbahnstraÃe
# now back to the right encoding
Encoding(de) <- "UTF-8"
message(de)
## Einbahnstraße
We can also use the stringi package to unescape the Unicode sequence:
library(stringi)
stri_unescape_unicode("Einbahnstra\u00dfe")
## [1] "Einbahnstraße"
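stringi can also go the other way: stri_escape_unicode() turns non-ASCII characters into \uXXXX escape sequences, which is useful when you need an ASCII-safe representation of a string (a quick sketch):

```r
library(stringi)
# escape non-ASCII characters into \uXXXX sequences
stri_escape_unicode("Einbahnstraße")
## [1] "Einbahnstra\\u00dfe"
# unescaping recovers the original string
stri_unescape_unicode("Einbahnstra\\u00dfe")
## [1] "Einbahnstraße"
```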
If you want to translate a string from one encoding scheme to another in a single line of code, you can use iconv:
de <- "Einbahnstra\xdfe"
iconv(de, from="windows-1252", to="UTF-8")
## [1] "Einbahnstraße"
de <- "Einbahnstra\u00dfe"
iconv(de, from="UTF-8", to="latin1")
## [1] "Einbahnstraße"
You’re probably wondering now: how do we know the encoding of some text we want to analyze? Good question! It turns out to be a hard problem, but we can use the guess_encoding function in the rvest package (which relies on stri_enc_detect in the stringi package) to try to figure that out…
library(rvest)
## Loading required package: xml2
de <- "Einbahnstra\xdfe"
stri_enc_detect(de)
## [[1]]
## Encoding Language Confidence
## 1 ISO-8859-1 de 0.21
## 2 UTF-16BE 0.10
## 3 UTF-16LE 0.10
## 4 Shift_JIS ja 0.10
## 5 GB18030 zh 0.10
## 6 Big5 zh 0.10
guess_encoding(de)
## encoding language confidence
## 1 ISO-8859-1 de 0.21
## 2 UTF-16BE 0.10
## 3 UTF-16LE 0.10
## 4 Shift_JIS ja 0.10
## 5 GB18030 zh 0.10
## 6 Big5 zh 0.10
iconv(de, from="ISO-8859-1", to="UTF-8")
## [1] "Einbahnstraße"
de <- "Einbahnstra\u00dfe"
stri_enc_detect(de)
## [[1]]
## Encoding Language Confidence
## 1 UTF-8 0.8
## 2 windows-1252 de 0.2
## 3 UTF-16BE 0.1
## 4 UTF-16LE 0.1
## 5 Shift_JIS ja 0.1
## 6 GB18030 zh 0.1
## 7 Big5 zh 0.1
guess_encoding(de)
## encoding language confidence
## 1 UTF-8 0.8
## 2 windows-1252 de 0.2
## 3 UTF-16BE 0.1
## 4 UTF-16LE 0.1
## 5 Shift_JIS ja 0.1
## 6 GB18030 zh 0.1
## 7 Big5 zh 0.1
message(de) # no need for translation!
## Einbahnstraße
The same applies to websites… (although you can also check the <meta> tag for clues).
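For instance, you can parse the page with the xml2 package and read the charset attribute of the <meta> tag. A minimal sketch, using an inline HTML snippet rather than a live site:

```r
library(xml2)
# a minimal page that declares its encoding, as most sites do
html <- '<html><head><meta charset="utf-8"></head><body></body></html>'
doc <- read_html(html)
# extract the declared charset from the <meta> tag
xml_attr(xml_find_first(doc, "//meta[@charset]"), "charset")
## [1] "utf-8"
```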
url <- "http://www.presidency.ucsb.edu/ws/index.php?pid=96348"
guess_encoding(url)
## encoding language confidence
## 1 ISO-8859-1 en 0.19
## 2 UTF-8 0.15
## 3 ISO-8859-9 tr 0.12
## 4 UTF-16BE 0.10
## 5 UTF-16LE 0.10
## 6 Shift_JIS ja 0.10
## 7 GB18030 zh 0.10
## 8 EUC-JP ja 0.10
## 9 EUC-KR ko 0.10
## 10 Big5 zh 0.10
## 11 ISO-8859-2 cs 0.06
## 12 IBM420_rtl ar 0.05
url <- "http://www.spiegel.de"
guess_encoding(url)
## encoding language confidence
## 1 ISO-8859-1 es 0.45
## 2 ISO-8859-2 ro 0.30
## 3 ISO-8859-9 tr 0.30
## 4 UTF-8 0.15
## 5 UTF-16BE 0.10
## 6 UTF-16LE 0.10
## 7 Shift_JIS ja 0.10
## 8 GB18030 zh 0.10
## 9 EUC-JP ja 0.10
## 10 EUC-KR ko 0.10
## 11 Big5 zh 0.10
url <- "http://www.elpais.es"
guess_encoding(url)
## encoding language confidence
## 1 ISO-8859-1 es 0.63
## 2 ISO-8859-2 hu 0.31
## 3 UTF-8 0.15
## 4 UTF-16BE 0.10
## 5 UTF-16LE 0.10
## 6 Shift_JIS ja 0.10
## 7 GB18030 zh 0.10
## 8 EUC-JP ja 0.10
## 9 EUC-KR ko 0.10
## 10 Big5 zh 0.10
Unicode text can take different, and somewhat complicated, forms when you scrape it from the web. Here we’ll see some of the most common ones and how to avoid errors when parsing text scraped from the web. We’ll be using the stringi package for some of the code here.
# what if it looks like this? (Unicode characters as hex byte codes)
# see: http://www.fileformat.info/info/unicode/char/00df/index.htm
de <- "Einbahnstra<c3><9f>e"
# this will not work:
guess_encoding(de)
## encoding language confidence
## 1 ISO-8859-1 it 0.50
## 2 ISO-8859-2 ro 0.16
## 3 UTF-8 0.15
## 4 UTF-16BE 0.10
## 5 UTF-16LE 0.10
## 6 Shift_JIS ja 0.10
## 7 GB18030 zh 0.10
## 8 EUC-JP ja 0.10
## 9 EUC-KR ko 0.10
## 10 Big5 zh 0.10
iconv(de, from="ISO-8859-1", to="UTF-8")
## [1] "Einbahnstra<c3><9f>e"
stri_unescape_unicode(de)
## [1] "Einbahnstra<c3><9f>e"
# one solution from stack overflow:
# https://stackoverflow.com/questions/25468716/convert-byte-encoding-to-unicode
m <- gregexpr("<[0-9a-f]{2}>", de)
codes <- regmatches(de,m)
chars <- lapply(codes, function(x) {
rawToChar(as.raw(strtoi(paste0("0x",substr(x,2,3)))), multiple=T)
})
regmatches(de,m) <- chars
de
## [1] "Einbahnstraße"
# what is happening here? We're replacing:
codes
## [[1]]
## [1] "<c3>" "<9f>"
# with:
chars
## [[1]]
## [1] "\xc3" "\x9f"
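For reuse, the recipe above can be wrapped into a small helper function (the name fix_byte_codes is ours, not from any package):

```r
# replace <xx> byte codes (e.g. "<c3><9f>") with the raw bytes they stand for
fix_byte_codes <- function(string){
  m <- gregexpr("<[0-9a-f]{2}>", string)
  codes <- regmatches(string, m)
  chars <- lapply(codes, function(x) {
    rawToChar(as.raw(strtoi(paste0("0x", substr(x, 2, 3)))), multiple = TRUE)
  })
  regmatches(string, m) <- chars
  string
}
fix_byte_codes("Einbahnstra<c3><9f>e")
## [1] "Einbahnstraße"
```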
# switching to a different language...
# what if it looks like this?
example <- c(
"SAD DA POMOGNU RJE<U+0160>AVANJE POLITI<U+010C>KE KRIZE",
"PROBLEME GURAJU POD TEPIH",
"ODAO PRIZNANJE DR<U+017D>AVI")
# different representation of Unicode characters, e.g.:
# http://www.fileformat.info/info/unicode/char/0160/index.htm
# this will not work either:
guess_encoding(example)
## encoding language confidence
## 1 ISO-8859-2 cs 0.42
## 2 ISO-8859-1 pt 0.27
## 3 UTF-8 0.15
## 4 IBM424_ltr he 0.11
## 5 UTF-16BE 0.10
## 6 UTF-16LE 0.10
## 7 Shift_JIS ja 0.10
## 8 GB18030 zh 0.10
## 9 EUC-JP ja 0.10
## 10 EUC-KR ko 0.10
## 11 Big5 zh 0.10
## 12 ISO-8859-9 tr 0.09
## 13 IBM424_rtl he 0.08
iconv(example, from="ISO-8859-2", to="UTF-8")
## [1] "SAD DA POMOGNU RJE<U+0160>AVANJE POLITI<U+010C>KE KRIZE"
## [2] "PROBLEME GURAJU POD TEPIH"
## [3] "ODAO PRIZNANJE DR<U+017D>AVI"
# Things get even more complicated...
# One solution here:
# https://stackoverflow.com/questions/28248457/gsub-in-r-with-unicode-replacement-give-different-results-under-windows-compared
# we're basically going to convert the <U+XXXX> escapes into the actual
# Unicode characters that R will be able to parse
trueunicode.hack <- function(string){
m <- gregexpr("<U\\+[0-9A-F]{4}>", string)
if(-1==m[[1]][1])
return(string)
codes <- unlist(regmatches(string, m))
replacements <- codes
N <- length(codes)
for(i in 1:N){
replacements[i] <- intToUtf8(strtoi(paste0("0x", substring(codes[i], 4, 7))))
}
# if the string doesn't start with a Unicode escape, then copy its initial part
# up to the first occurrence of an escape
if(1!=m[[1]][1]){
y <- substring(string, 1, m[[1]][1]-1)
y <- paste0(y, replacements[1])
}else{
y <- replacements[1]
}
# if there is more than one Unicode escape in the string
if(1<N){
for(i in 2:N){
s <- gsub("<U\\+[0-9A-F]{4}>", replacements[i],
substring(string, m[[1]][i-1]+8, m[[1]][i]+7))
Encoding(s) <- "UTF-8"
y <- paste0(y, s)
}
}
# get the trailing contents, if any
if( nchar(string)>(m[[1]][N]+8) )
y <- paste0( y, substring(string, m[[1]][N]+8, nchar(string)) )
y
}
trueunicode.hack(example[1])
## [1] "SAD DA POMOGNU RJEŠAVANJE POLITIČKE KRIZE"
trueunicode.hack(example[2])
## [1] "PROBLEME GURAJU POD TEPIH"
trueunicode.hack(example[3])
## [1] "ODAO PRIZNANJE DRŽAVI"
# and here's how we would convert back and forth...
# same text in Croatian
example <- "SAD DA POMOGNU RJEŠAVANJE POLITIČKE KRIZE"
Encoding(example) # "unknown", i.e. assumed to be in the native locale encoding
## [1] "unknown"
# convert to ASCII and delete non-ASCII characters
iconv(example, "UTF-8", "ASCII", sub="")
## [1] "SAD DA POMOGNU RJEAVANJE POLITIKE KRIZE"
# convert to latin1, substituting unconvertible characters with byte codes
(lat <- iconv(example, "UTF-8", "latin1", sub="byte"))
## [1] "SAD DA POMOGNU RJE<c5><a0>AVANJE POLITI<c4><8c>KE KRIZE"
m <- gregexpr("<[0-9a-f]{2}>", lat)
codes <- regmatches(lat,m)
chars <- lapply(codes, function(x) {
rawToChar(as.raw(strtoi(paste0("0x",substr(x,2,3)))), multiple=T)
})
regmatches(lat,m) <- chars
lat
## [1] "SAD DA POMOGNU RJEŠAVANJE POLITIČKE KRIZE"
And one final example…
example <- "\U0001F602 \U0001F64C \U0001F602" # emoji (Unicode characters beyond the Basic Multilingual Plane)
message(example)
## 😂 🙌 😂
# you can search for the unicode representations of all these characters online
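In fact, base R can report a character's code point directly, so you don't have to look it up online:

```r
# numeric code point of the emoji
utf8ToInt("\U0001F602")
## [1] 128514
# formatted in the usual U+ notation
sprintf("U+%04X", utf8ToInt("\U0001F602"))
## [1] "U+1F602"
```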
And now we can use this to search for these characters on Twitter!
load("~/my_oauth")
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (LSE)
## ## www.tweetscores.com
## ##
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ndjson
message("\U0001F926")
searchTweets(q="\U0001F926",
filename="~/data/unicode-tweets.json",
n=1000,
oauth=my_oauth)
tweets <- parseTweets("~/data/unicode-tweets.json")
## 1000 tweets have been parsed.
message(sample(tweets$text, 5))
## RT @vwayano: 起きたら (もはや住んでいる時間帯違うww) #生きろ 9/12 発売のメール来てた 😭😭 しかもアニバーサリーの週に合わせてるし 💜💖💛💚
##
## 何よりクリアファイルとポスターが付いてるだと⁉️買わなきゃじゃん🤦♀️ しかもスペシャルBOXもある⁉️…とりあえず、今月のミュと来月に向けてジョギング再開しよう…🤦♀️
## 最近、暴飲暴食が過ぎてたからな〜بمجرد ماقالت له "اوبا" طلعت حركاته الي مدري من وينها 🤦♂️ كوميدية هي سون للحين نادره😂 https://t.co/caZfRp2u5u@boobblegums Im so changing my emoji 🤦♂️ i am loyal to you@34mk10 @mountain45_ でも知念姫がいらっしゃったらやまだくんこっち見向きもしないぜ…それはそれでいいか。ちびーずの戯れを微笑ましく見学させていただこう(趣旨が変わる)
## 待って真正面は無理死んでまう……🤦♀️
# reading into R
tweets <- streamR::parseTweets("~/data/japanese-tweets.json", simplify=TRUE)
## 176 tweets have been parsed.
library(quanteda)
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
tw <- corpus(tweets$text)
twdfm <- dfm_select(dfm(tw, remove_punct = TRUE, verbose=TRUE, remove_url=TRUE),
min_nchar=2)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 176 documents, 1,321 features
## ... created a 176 x 1,321 sparse dfm
## ... complete.
## Elapsed time: 0.08 seconds.
topfeatures(twdfm, n=25)
## rt って から ない さん
## 86 35 26 22 21
## です する した フォロー プレゼント
## 18 17 17 15 15
## ます 抽選 ローリー ショコラ 締切
## 15 14 14 14 12
## @mayla_classic amp 応募 完了 この
## 11 11 11 11 11
## ちょっと これ 可愛い くだ さい
## 11 10 10 10 10
What doesn’t work:
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=100)
But this should now work:
pdf("wordcloud.pdf", family="Japan1")
textplot_wordcloud(twdfm, rotation=0, min_size=.75, max_size=3, max_words=100)
dev.off()
## png
## 2
How to choose the font family? See ?postscriptFonts.
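For example, you can list the families registered for the pdf() device; the CID-keyed families suitable for East Asian text include "Japan1" (used above), as well as "Korea1", "CNS1", and "GB1" variants:

```r
# font families registered for the pdf() device
fam <- names(pdfFonts())
head(fam)
# CID-keyed families for Japanese, Korean, and Chinese text
grep("Japan|Korea|CNS|GB", fam, value = TRUE)
```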