Basics of character encoding in R

Let's start by detecting the encoding that your system uses by default:

Sys.getlocale(category = "LC_CTYPE")
## [1] "en_US.UTF-8"
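
If your system reports a different locale, you can change it for the current session; a minimal sketch (the exact locale names available depend on your operating system):

# inspect all locale categories at once
Sys.getlocale()
# switch character handling to a UTF-8 locale (the name is OS-dependent)
Sys.setlocale(category = "LC_CTYPE", locale = "en_US.UTF-8")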

Let’s now consider some text in German that contains a non-ASCII character.

# some text in German
de <- "Einbahnstraße"
# all good!
message(de)
## Einbahnstraße
de
## [1] "Einbahnstraße"

This worked because the file we are writing in is saved with UTF-8 encoding, so R automatically recognizes the encoding of the string.

Encoding(de)
## [1] "unknown"
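
Here "unknown" is not a problem: it just means the string is assumed to be in the native encoding of the system (UTF-8 on this machine). We can also mark the encoding explicitly; a quick sketch using base R:

# declare the string as UTF-8 explicitly
de <- enc2utf8(de)
Encoding(de) # should now report "UTF-8"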

But what if the file is not saved in UTF-8, so that when we save the script and re-open it the string shows up as an escape sequence like the one below? As long as we declare the right encoding, we can switch back and forth.

de <- "Einbahnstra\u00dfe"
Encoding(de)
## [1] "UTF-8"
message(de)
## Einbahnstraße
# this is the wrong encoding
Encoding(de) <- "latin1"
message(de)
## EinbahnstraÃe
# now back to the right encoding
Encoding(de) <- "UTF-8"
message(de)
## Einbahnstraße

We can also use the stringi package to fix this: stri_unescape_unicode() converts literal escape sequences (here, the characters backslash-u00df in the string) into the actual character.

library(stringi)
stri_unescape_unicode("Einbahnstra\\u00dfe")
## [1] "Einbahnstraße"
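
stringi can also do the reverse: stri_escape_unicode() turns non-ASCII characters back into escape sequences, which is handy for debugging:

stri_escape_unicode("Einbahnstraße") # returns the literal text Einbahnstra\u00dfe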

If you want to translate a string from one encoding scheme to another in a single line of code, you can use iconv:

de <- "Einbahnstra\xdfe"
iconv(de, from="windows-1252", to="UTF-8")
## [1] "Einbahnstraße"
de <- "Einbahnstra\u00dfe"
iconv(de, from="UTF-8", to="latin1")
## [1] "Einbahnstraße"
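
When a character cannot be represented in the target encoding, iconv returns NA by default; the sub argument substitutes the non-convertible bytes instead. A quick sketch:

# ß has no ASCII equivalent, so this returns NA
iconv("Einbahnstra\u00dfe", from="UTF-8", to="ASCII")
# sub="byte" replaces the offending bytes with their hex codes
iconv("Einbahnstra\u00dfe", from="UTF-8", to="ASCII", sub="byte")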

You’re probably wondering by now: how do we know the encoding of some text we want to analyze? Good question! It turns out to be a hard problem, but we can use the guess_encoding function in the rvest package (which relies on stri_enc_detect from the stringi package) to try to figure it out…

library(rvest)
## Loading required package: xml2
de <- "Einbahnstra\xdfe"
stri_enc_detect(de)
## [[1]]
##     Encoding Language Confidence
## 1 ISO-8859-1       de       0.21
## 2   UTF-16BE                0.10
## 3   UTF-16LE                0.10
## 4  Shift_JIS       ja       0.10
## 5    GB18030       zh       0.10
## 6       Big5       zh       0.10
guess_encoding(de)
##     encoding language confidence
## 1 ISO-8859-1       de       0.21
## 2   UTF-16BE                0.10
## 3   UTF-16LE                0.10
## 4  Shift_JIS       ja       0.10
## 5    GB18030       zh       0.10
## 6       Big5       zh       0.10
iconv(de, from="ISO-8859-1", to="UTF-8")
## [1] "Einbahnstraße"
de <- "Einbahnstra\u00dfe"
stri_enc_detect(de)
## [[1]]
##       Encoding Language Confidence
## 1        UTF-8                 0.8
## 2 windows-1252       de        0.2
## 3     UTF-16BE                 0.1
## 4     UTF-16LE                 0.1
## 5    Shift_JIS       ja        0.1
## 6      GB18030       zh        0.1
## 7         Big5       zh        0.1
guess_encoding(de)
##       encoding language confidence
## 1        UTF-8                 0.8
## 2 windows-1252       de        0.2
## 3     UTF-16BE                 0.1
## 4     UTF-16LE                 0.1
## 5    Shift_JIS       ja        0.1
## 6      GB18030       zh        0.1
## 7         Big5       zh        0.1
message(de) # no need for translation!
## Einbahnstraße

The same applies to websites… (Although you can also check the <meta> tag for clues.)

url <- "http://www.presidency.ucsb.edu/ws/index.php?pid=96348"
guess_encoding(url)
##      encoding language confidence
## 1  ISO-8859-1       en       0.19
## 2       UTF-8                0.15
## 3  ISO-8859-9       tr       0.12
## 4    UTF-16BE                0.10
## 5    UTF-16LE                0.10
## 6   Shift_JIS       ja       0.10
## 7     GB18030       zh       0.10
## 8      EUC-JP       ja       0.10
## 9      EUC-KR       ko       0.10
## 10       Big5       zh       0.10
## 11 ISO-8859-2       cs       0.06
## 12 IBM420_rtl       ar       0.05
url <- "http://www.spiegel.de"
guess_encoding(url)
##      encoding language confidence
## 1  ISO-8859-1       es       0.45
## 2  ISO-8859-2       ro       0.30
## 3  ISO-8859-9       tr       0.30
## 4       UTF-8                0.15
## 5    UTF-16BE                0.10
## 6    UTF-16LE                0.10
## 7   Shift_JIS       ja       0.10
## 8     GB18030       zh       0.10
## 9      EUC-JP       ja       0.10
## 10     EUC-KR       ko       0.10
## 11       Big5       zh       0.10
url <- "http://www.elpais.es"
guess_encoding(url)
##      encoding language confidence
## 1  ISO-8859-1       es       0.63
## 2  ISO-8859-2       hu       0.31
## 3       UTF-8                0.15
## 4    UTF-16BE                0.10
## 5    UTF-16LE                0.10
## 6   Shift_JIS       ja       0.10
## 7     GB18030       zh       0.10
## 8      EUC-JP       ja       0.10
## 9      EUC-KR       ko       0.10
## 10       Big5       zh       0.10
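
Note that in the calls above, guess_encoding only examines the characters of the URL string itself. To assess the encoding of the page contents, we would first download the HTML; a sketch (assuming the page is reachable):

# fetch the raw HTML and run the detector on it
html <- readLines(url, warn=FALSE)
guess_encoding(paste(html, collapse="\n"))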

Dealing with Unicode headaches

Unicode text can take different, and somewhat complicated, forms when you scrape it from the web. Here we’ll look at some of the most common ones and how to avoid errors when parsing text scraped from the web. We’ll be using the stringi package for some of the code here.

# what if it looks like this? (Unicode characters as escaped hex byte codes)
# see: http://www.fileformat.info/info/unicode/char/00df/index.htm
de <- "Einbahnstra<c3><9f>e"
# this will not work:
guess_encoding(de)
##      encoding language confidence
## 1  ISO-8859-1       it       0.50
## 2  ISO-8859-2       ro       0.16
## 3       UTF-8                0.15
## 4    UTF-16BE                0.10
## 5    UTF-16LE                0.10
## 6   Shift_JIS       ja       0.10
## 7     GB18030       zh       0.10
## 8      EUC-JP       ja       0.10
## 9      EUC-KR       ko       0.10
## 10       Big5       zh       0.10
iconv(de, from="ISO-8859-1", to="UTF-8")
## [1] "Einbahnstra<c3><9f>e"
stri_unescape_unicode(de)
## [1] "Einbahnstra<c3><9f>e"
# one solution from stack overflow:
# https://stackoverflow.com/questions/25468716/convert-byte-encoding-to-unicode
m <- gregexpr("<[0-9a-f]{2}>", de)
codes <- regmatches(de,m)
chars <- lapply(codes, function(x) {
    rawToChar(as.raw(strtoi(paste0("0x",substr(x,2,3)))), multiple=T)
})
regmatches(de,m) <- chars
de
## [1] "Einbahnstraße"
# what is happening here? We're replacing:
codes
## [[1]]
## [1] "<c3>" "<9f>"
# with:
chars
## [[1]]
## [1] "\xc3" "\x9f"
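# the same steps can be wrapped into a small helper function (a
# hypothetical convenience wrapper, not part of any package):
unescape_bytes <- function(x){
    m <- gregexpr("<[0-9a-f]{2}>", x)
    codes <- regmatches(x, m)
    chars <- lapply(codes, function(z) {
        rawToChar(as.raw(strtoi(paste0("0x", substr(z, 2, 3)))), multiple=TRUE)
    })
    regmatches(x, m) <- chars
    x
}
unescape_bytes("Einbahnstra<c3><9f>e") # "Einbahnstraße" again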
# switching to a different language...
# what if it looks like this?
example <- c(
  "SAD DA POMOGNU RJE<U+0160>AVANJE POLITI<U+010C>KE KRIZE", 
  "PROBLEME GURAJU POD TEPIH", 
  "ODAO PRIZNANJE DR<U+017D>AVI")
# different representation of Unicode characters, e.g.:
# http://www.fileformat.info/info/unicode/char/0160/index.htm
# this will not work either:
guess_encoding(example)
##      encoding language confidence
## 1  ISO-8859-2       cs       0.42
## 2  ISO-8859-1       pt       0.27
## 3       UTF-8                0.15
## 4  IBM424_ltr       he       0.11
## 5    UTF-16BE                0.10
## 6    UTF-16LE                0.10
## 7   Shift_JIS       ja       0.10
## 8     GB18030       zh       0.10
## 9      EUC-JP       ja       0.10
## 10     EUC-KR       ko       0.10
## 11       Big5       zh       0.10
## 12 ISO-8859-9       tr       0.09
## 13 IBM424_rtl       he       0.08
iconv(example, from="ISO-8859-2", to="UTF-8")
## [1] "SAD DA POMOGNU RJE<U+0160>AVANJE POLITI<U+010C>KE KRIZE"
## [2] "PROBLEME GURAJU POD TEPIH"                              
## [3] "ODAO PRIZNANJE DR<U+017D>AVI"
# Things get even more complicated...
# One solution here:
# https://stackoverflow.com/questions/28248457/gsub-in-r-with-unicode-replacement-give-different-results-under-windows-compared
# we're basically going to convert the <U+xxxx> tags into regular Unicode
# characters that R will be able to parse

trueunicode.hack <- function(string){
    m <- gregexpr("<U\\+[0-9A-F]{4}>", string)
    if(-1==m[[1]][1])
        return(string)

    codes <- unlist(regmatches(string, m))
    replacements <- codes
    N <- length(codes)
    for(i in 1:N){
        replacements[i] <- intToUtf8(strtoi(paste0("0x", substring(codes[i], 4, 7))))
    }

    # if the string doesn't start with a Unicode tag, then copy its initial
    # part up to the first occurrence of a tag
    if(1!=m[[1]][1]){
        y <- substring(string, 1, m[[1]][1]-1)
        y <- paste0(y, replacements[1])
    }else{
        y <- replacements[1]
    }

    # if there is more than one Unicode tag in the string
    if(1<N){
        for(i in 2:N){
            s <- gsub("<U\\+[0-9A-F]{4}>", replacements[i], 
                      substring(string, m[[1]][i-1]+8, m[[1]][i]+7))
            Encoding(s) <- "UTF-8"
            y <- paste0(y, s)
        }
    }

    # get the trailing contents, if any
    if( nchar(string)>(m[[1]][N]+8) )
        y <- paste0( y, substring(string, m[[1]][N]+8, nchar(string)) )
    y
}

trueunicode.hack(example[1])
## [1] "SAD DA POMOGNU RJEŠAVANJE POLITIČKE KRIZE"
trueunicode.hack(example[2])
## [1] "PROBLEME GURAJU POD TEPIH"
trueunicode.hack(example[3])
## [1] "ODAO PRIZNANJE DRŽAVI"
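# trueunicode.hack() works on one string at a time; to clean the whole
# vector we can apply it element-wise:
sapply(example, trueunicode.hack, USE.NAMES=FALSE)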
# and here's how we would convert back and forth...
# same text in Croatian
example <- "SAD DA POMOGNU RJEŠAVANJE POLITIČKE KRIZE"
Encoding(example) # "unknown": the string is assumed to be in the native locale
## [1] "unknown"
# convert to ASCII and delete non-ASCII characters
iconv(example, "UTF-8", "ASCII", sub="")
## [1] "SAD DA POMOGNU RJEAVANJE POLITIKE KRIZE"
# convert to latin1 and substitute to byte characters
(lat <- iconv(example, "UTF-8", "latin1", sub="byte"))
## [1] "SAD DA POMOGNU RJE<c5><a0>AVANJE POLITI<c4><8c>KE KRIZE"
m <- gregexpr("<[0-9a-f]{2}>", lat)
codes <- regmatches(lat,m)
chars <- lapply(codes, function(x) {
    rawToChar(as.raw(strtoi(paste0("0x",substr(x,2,3)))), multiple=T)
})
regmatches(lat,m) <- chars
lat
## [1] "SAD DA POMOGNU RJEŠAVANJE POLITIČKE KRIZE"
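
Instead of deleting the non-ASCII characters, iconv can also try to transliterate them to their closest ASCII equivalents via the //TRANSLIT extension; note that the result depends on the platform's iconv implementation:

# Š should become S and Č should become C on most platforms
iconv(example, "UTF-8", "ASCII//TRANSLIT")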

And one final example…

example <- "\U0001F602 \U0001F64C \U0001F602" # extended Unicode characters (emoji)
message(example)
## 😂 🙌 😂
# you can search for the unicode representations of all these characters online
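# each emoji is a single code point beyond the Basic Multilingual Plane;
# utf8ToInt() recovers its numeric value and sprintf() formats it in the
# usual U+ notation:
utf8ToInt("\U0001F602")                   # 128514
sprintf("U+%X", utf8ToInt("\U0001F602")) # "U+1F602"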

And now we can use this to search for these characters on Twitter!

load("~/my_oauth")
library(tweetscores)
## Loading required package: R2WinBUGS
## Loading required package: coda
## Loading required package: boot
## ##
## ## tweetscores: tools for the analysis of Twitter data
## ## Pablo Barbera (LSE)
## ## www.tweetscores.com
## ##
library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
## Loading required package: ndjson
message("\U0001F926")
searchTweets(q="\U0001F926",
  filename="~/data/unicode-tweets.json",
  n=1000,
  oauth=my_oauth)
tweets <- parseTweets("~/data/unicode-tweets.json")
## 1000 tweets have been parsed.
message(sample(tweets$text, 5)) # note: message() pastes the five tweets together
## RT @vwayano: 起きたら (もはや住んでいる時間帯違うww) #生きろ 9/12 発売のメール来てた 😭😭 しかもアニバーサリーの週に合わせてるし 💜💖💛💚
## 
## 何よりクリアファイルとポスターが付いてるだと⁉️買わなきゃじゃん🤦‍♀️ しかもスペシャルBOXもある⁉️…とりあえず、今月のミュと来月に向けてジョギング再開しよう…🤦‍♀️
## 最近、暴飲暴食が過ぎてたからな〜بمجرد ماقالت له "اوبا" طلعت حركاته الي مدري من وينها 🤦‍♂️ كوميدية هي سون للحين نادره😂 https://t.co/caZfRp2u5u@boobblegums Im so changing  my emoji 🤦‍♂️ i am loyal to you@34mk10 @mountain45_ でも知念姫がいらっしゃったらやまだくんこっち見向きもしないぜ…それはそれでいいか。ちびーずの戯れを微笑ましく見学させていただこう(趣旨が変わる)
## 待って真正面は無理死んでまう……🤦‍♀️
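
Since the tweets are just UTF-8 text, we can also count how many of them contain the emoji with a fixed-string match; a quick sketch:

sum(grepl("\U0001F926", tweets$text, fixed=TRUE))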

Wordclouds with Japanese, Korean, and Chinese characters

# reading into R
tweets <- streamR::parseTweets("~/data/japanese-tweets.json", simplify=TRUE)
## 176 tweets have been parsed.
library(quanteda)
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
tw <- corpus(tweets$text)
twdfm <- dfm_select(dfm(tw, remove_punct = TRUE, verbose=TRUE, remove_url=TRUE),
                    min_nchar=2)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 176 documents, 1,321 features
##    ... created a 176 x 1,321 sparse dfm
##    ... complete. 
## Elapsed time: 0.08 seconds.
topfeatures(twdfm, n=25)
##             rt           って           から           ない           さん 
##             86             35             26             22             21 
##           です           する           した       フォロー     プレゼント 
##             18             17             17             15             15 
##           ます           抽選       ローリー       ショコラ           締切 
##             15             14             14             14             12 
## @mayla_classic            amp           応募           完了           この 
##             11             11             11             11             11 
##       ちょっと           これ         可愛い           くだ           さい 
##             11             10             10             10             10

What doesn’t work: plotting to the default graphics device, whose default font cannot render the Japanese characters.

textplot_wordcloud(twdfm, rot.per=0, scale=c(3, .75), max.words=100)
## Warning: scale is deprecated; use min_size and max_size instead
## Warning: max.words is deprecated; use max_words instead
## Warning: rot.per is deprecated; use rotation instead

But plotting to a PDF device with a Japanese font family should work:

pdf("wordcloud.pdf", family="Japan1")
textplot_wordcloud(twdfm, rot.per=0, scale=c(3, .75), max.words=100)
## Warning: scale is deprecated; use min_size and max_size instead
## Warning: max.words is deprecated; use max_words instead
## Warning: rot.per is deprecated; use rotation instead
dev.off()
## png 
##   2
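
As the warnings indicate, those argument names were deprecated; with a newer version of quanteda the equivalent call would use the renamed arguments (a sketch):

pdf("wordcloud.pdf", family="Japan1")
textplot_wordcloud(twdfm, rotation=0, min_size=0.75, max_size=3, max_words=100)
dev.off()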

How do you choose the font family? See ?postscriptFonts for the CJK font families available on the pdf and postscript devices.
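
For example, names(pdfFonts()) lists the font families currently registered for the pdf device, which should include CJK families such as Japan1, Korea1, GB1, and CNS1:

names(pdfFonts())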