In this guided coding session we will use a small dataset to illustrate how to identify latent communities in networks. The dataset corresponds to the Twitter ego network of POIR: each node is a Twitter account that the USC POIR account follows, and each edge indicates that one of those accounts in turn follows another. (See the end of this script for the code I used to put together this network.) Edges are thus directed.
The first step is to read the list of edges and nodes in this network:
edges <- read.csv("../data/poir-edges.csv", stringsAsFactors=FALSE)
head(edges)
## Source Target
## 1 102803242 1.110664e+18
## 2 102803242 1.148658e+08
## 3 102803242 1.170012e+09
## 4 102803242 1.204460e+07
## 5 102803242 1.225174e+08
## 6 102803242 1.265032e+09
nodes <- read.csv("../data/poir-nodes.csv", stringsAsFactors=FALSE)
head(nodes)
## Id Label name
## 1 1.028032e+08 USCCareerCenter USC Career Center
## 2 1.032744e+18 Dr_Wanda_Austin Wanda Austin
## 3 1.059724e+09 JoeySaraceno Joey Saraceno
## 4 1.091604e+18 EdwardG74882741 Edward Gonzalez
## 5 1.110664e+18 PresidentFolt Carol Folt
## 6 1.124483e+08 BJPolS British Jnl Poli Sci
## description
## 1 Providing exceptional career services to all members of the #USC Trojan Family.
## 2 Archive of tweets from Dr. Wanda M. Austin, sent during her time as Interim President of @USC. This account is no longer active.
## 3 @Meta UX Researcher | @USC PhD l @UCLA Alum | he/him 🏳️🌈
## 4 USC POIR PhD Candidate. Student of political philosophy, comparative politics, and international relations. Future professor of international relations.
## 5 President of the University of Southern California
## 6 British Journal of Political Science from Cambridge University Press.
## followers_count statuses_count friends_count created_at
## 1 8988 8319 1225 Thu Jan 07 22:00:57 +0000 2010
## 2 1146 225 92 Thu Aug 23 21:40:06 +0000 2018
## 3 222 648 243 Fri Jan 04 06:40:15 +0000 2013
## 4 193 896 248 Sat Feb 02 07:47:02 +0000 2019
## 5 12383 1872 137 Tue Mar 26 22:03:10 +0000 2019
## 6 18810 7332 1498 Mon Feb 08 14:52:14 +0000 2010
## location
## 1 Los Angeles, CA
## 2 Los Angeles, CA
## 3 Los Angeles, CA
## 4 California, USA
## 5 Los Angeles, CA
## 6 Cambridge
Now, how do we create the igraph object? We can use the graph_from_data_frame function, which takes two arguments: d, the data frame with the edge list in the first two columns; and vertices, a data frame with the node data, where the first column contains the node labels. (Note that igraph calls the nodes "vertices", but it's exactly the same thing.) Even though the follow relationships are directed, here we will build an undirected version of the network, which is what we will use for the community detection methods below.
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g
## IGRAPH f239715 UN-- 219 7750 --
## + attr: name (v/c), Label (v/c), description (v/c), followers_count
## | (v/n), statuses_count (v/n), friends_count (v/n), created_at (v/c),
## | location (v/c)
## + edges from f239715 (vertex names):
## [1] USC Career Center--Carol Folt
## [2] USC Career Center--USC Gould Law
## [3] USC Career Center--CESR
## [4] USC Career Center--Arnold
## [5] USC Career Center--USC Dornsife
## [6] USC Career Center--USC Sociology
## + ... omitted several edges
What does it mean?
- U means undirected
- N means named graph
- 219 is the number of nodes
- 7750 is the number of edges
- name (v/c) means name is a node attribute and it's a character
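The same information can also be extracted programmatically with a few standard igraph accessors (a quick sketch):
vcount(g)        # number of nodes
ecount(g)        # number of edges
is_directed(g)   # FALSE, since we built the graph with directed=FALSE
head(V(g)$name)  # the 'name' node attribute, stored as a character vector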
Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.
The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. In this case, the network has a single component (also called a giant component) containing all 219 nodes, which means that you can get from any node to any other node.
comp <- components(g)
comp$no # number of components
## [1] 1
max(comp$csize) # size of largest component
## [1] 219
Most networks have a single giant connected component that includes most or all nodes, and most studies of networks actually focus on that giant component, which is what we will do here. One important reason is feasibility: for example, the shortest path between nodes in different components of a network is Inf!
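To see what that means concretely, here is a tiny hypothetical two-component graph (not our data); the distance between nodes in different components is infinite:
toy <- graph_from_literal(A--B, C--D)  # two separate components
distances(toy)["A", "C"]               # Inf: no path between the components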
giant <- decompose(g, mode="strong")[[1]]
giant
## IGRAPH 1ea64df UN-- 219 7750 --
## + attr: name (v/c), Label (v/c), description (v/c), followers_count
## | (v/n), statuses_count (v/n), friends_count (v/n), created_at (v/c),
## | location (v/c)
## + edges from 1ea64df (vertex names):
## [1] USC Career Center--Carol Folt
## [2] USC Career Center--USC Gould Law
## [3] USC Career Center--CESR
## [4] USC Career Center--Arnold
## [5] USC Career Center--USC Dornsife
## [6] USC Career Center--USC Sociology
## + ... omitted several edges
In directed networks, components can be weakly connected (the nodes are connected if we ignore the direction of the edges) or strongly connected (there is a directed path from every node to every other node in the component). In undirected networks like this one, the two notions coincide.
weakly <- decompose(g, mode="weak")[[1]]
weakly
## IGRAPH e4e4f88 UN-- 219 7750 --
## + attr: name (v/c), Label (v/c), description (v/c), followers_count
## | (v/n), statuses_count (v/n), friends_count (v/n), created_at (v/c),
## | location (v/c)
## + edges from e4e4f88 (vertex names):
## [1] USC Career Center--Carol Folt
## [2] USC Career Center--USC Gould Law
## [3] USC Career Center--CESR
## [4] USC Career Center--Arnold
## [5] USC Career Center--USC Dornsife
## [6] USC Career Center--USC Sociology
## + ... omitted several edges
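Because our graph is undirected, the weak and strong decompositions are identical here. A minimal sketch with a hypothetical directed graph shows how they can differ:
toy <- graph_from_literal(A -+ B, B -+ C, C -+ A, C -+ D)  # directed toy graph
components(toy, mode="weak")$no    # 1: all nodes connected if we ignore direction
components(toy, mode="strong")$no  # 2: D cannot reach the other nodes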
And now we keep only the giant component:
g <- decompose(g, mode="strong")[[1]]
nodes <- nodes[nodes$Label %in% V(g)$Label,]
Again, because this network has a single component, the result is identical to the original network. In most other applications you will end up dropping part of the network at this step.
Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.
There are a few different algorithms, each following a different logic.
The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values (longer random walks will lead to fewer communities). The goal of this algorithm is to identify the partition that maximizes a modularity score.
cluster_walktrap(g) # default is 4 steps
## IGRAPH clustering walktrap, groups: 3, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=2)
## IGRAPH clustering walktrap, groups: 5, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=5)
## IGRAPH clustering walktrap, groups: 2, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=10)
## IGRAPH clustering walktrap, groups: 2, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
Other methods are:
cluster_infomap(g)
cluster_edge_betweenness(g)
cluster_label_prop(g)
cluster_louvain(g)
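Since each algorithm can return a different partition, a quick (if rough) way to compare them is to look at how many communities each one finds and at the modularity of each partition. A sketch (some of these methods are stochastic, so results may vary slightly across runs):
algs <- list(
  infomap     = cluster_infomap(g),
  betweenness = cluster_edge_betweenness(g),
  label_prop  = cluster_label_prop(g),
  louvain     = cluster_louvain(g)
)
sapply(algs, length)      # number of communities found by each algorithm
sapply(algs, modularity)  # modularity of each partition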
The choice of one algorithm or another may depend on substantive or practical reasons, as always. For now, let's pick the Louvain algorithm, which here identifies three clusters.
comm <- cluster_louvain(g)
nodes$cluster <- membership(comm)
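How large is each community? (Louvain is stochastic, so the exact sizes may vary slightly across runs.)
sizes(comm)           # number of nodes in each community
table(nodes$cluster)  # same information, computed from the node data frame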
Who’s in each cluster?
head(nodes$Label[nodes$cluster==1], n=10)
## [1] "USCCareerCenter" "Dr_Wanda_Austin" "PresidentFolt" "USCGouldLaw"
## [5] "CESRUSC" "USCDornsife" "DornsifePhDAcad" "USCSociology"
## [9] "Trojan_Knights" "USCAlumni"
head(nodes$Label[nodes$cluster==2], n=10)
## [1] "JoeySaraceno" "EdwardG74882741" "BJPolS" "AJPS_Editor"
## [5] "RaqCenteno" "monkeycageblog" "shallow__state" "jaj7d"
## [9] "EPSRjournal" "InternatlTheory"
head(nodes$Label[nodes$cluster==3], n=10)
## [1] "Schwarzenegger" "GovInnovations" "ACLU" "SenatorBoxer"
## [5] "USATODAY" "NCSLorg" "jeffreyfields" "latimes"
## [9] "maristpoll" "thehill"
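We can also inspect the partition visually; igraph's plot method for community objects shades each group (a minimal sketch, with labels suppressed to keep the plot readable):
plot(comm, g, vertex.label=NA, vertex.size=4)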
An even better way to understand the content of each cluster is to combine what we’ve learnt about text analysis so far. Let’s see what that reveals:
library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
# most frequent features
for (i in 1:3){
  message("Cluster ", i)
  dfm <- dfm(tokens(nodes$description[nodes$cluster==i],
                    remove_punct=TRUE)) %>%
    dfm_remove(stopwords("english"))
  print(topfeatures(dfm, n=25))
}
## Cluster 1
## usc official university southern school research twitter
## 51 20 19 17 14 13 13
## california account @usc public department us |
## 12 11 10 10 9 9 9
## news education follow global office relations center
## 9 8 8 6 6 6 6
## student tweets students study
## 6 5 5 5
## Cluster 2
## | political science politics phd
## 76 52 36 26 20
## @uscpoir @usc professor international assistant
## 20 16 16 14 12
## journal prof association policy usc
## 11 11 11 10 9
## relations social research candidate public
## 9 9 9 8 8
## scientist american ph.d study studies
## 8 7 7 7 7
## Cluster 3
## | news california account now u.s state
## 13 6 4 4 4 3 3
## writer usc home poll public opinion political
## 3 3 3 3 3 3 3
## follow analysis former governor sign twitter government
## 3 3 2 2 2 2 2
## aclu senator nation official
## 2 2 2 2
# most distinctive features
poir <- dfm(tokens(corpus(nodes[,c("description", "cluster")], text_field="description")))
for (i in 1:3){
  print(
    head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                          measure="lr"), n=20)
  )
}
## feature G2 p n_target n_reference
## 1 usc 42.889479 5.792133e-11 51 12
## 2 school 25.042146 5.609077e-07 14 0
## 3 southern 19.732475 8.907478e-06 17 2
## 4 official 18.371244 1.817808e-05 20 4
## 5 university 14.687150 1.269086e-04 19 5
## 6 ! 14.164808 1.674737e-04 17 4
## 7 the 14.045728 1.784187e-04 93 76
## 8 twitter 13.567173 2.301758e-04 13 2
## 9 of 11.950661 5.462801e-04 85 71
## 10 to 9.333696 2.249781e-03 27 15
## 11 department 8.311490 3.939506e-03 9 1
## 12 we 6.980804 8.238858e-03 10 3
## 13 account 6.469026 1.097706e-02 11 4
## 14 alumni 5.311052 2.119055e-02 5 0
## 15 california's 5.311052 2.119055e-02 5 0
## 16 students 5.311052 2.119055e-02 5 0
## 17 through 5.311052 2.119055e-02 5 0
## 18 california 4.843307 2.775362e-02 12 6
## 19 education 4.782101 2.875702e-02 8 2
## 20 for 4.746743 2.935380e-02 31 25
## feature G2 p n_target n_reference
## 1 political 61.310055 4.884981e-15 52 4
## 2 science 51.017426 9.154899e-13 36 1
## 3 | 46.224946 1.054257e-11 76 22
## 4 @uscpoir 32.720134 1.064290e-08 20 0
## 5 politics 26.762931 2.300075e-07 26 3
## 6 phd 25.839549 3.710081e-07 20 1
## 7 ; 15.697021 7.434116e-05 18 3
## 8 assistant 13.717803 2.124312e-04 12 1
## 9 journal 13.664190 2.185833e-04 11 0
## 10 prof 13.664190 2.185833e-04 11 0
## 11 international 13.149650 2.875729e-04 14 2
## 12 association 12.248265 4.656910e-04 11 1
## 13 ( 11.595165 6.612349e-04 20 6
## 14 ) 11.595165 6.612349e-04 20 6
## 15 professor 10.791488 1.019679e-03 16 4
## 16 candidate 9.085837 2.575965e-03 8 0
## 17 • 9.085837 2.575965e-03 8 0
## 18 , 6.554083 1.046438e-02 111 99
## 19 fellow 6.116844 1.338997e-02 6 0
## 20 🏳🌈 6.116844 1.338997e-02 6 0
## feature G2 p n_target n_reference
## 1 our 7.961307 0.004778789 7 7
## 2 home 7.027937 0.008024774 3 0
## 3 poll 7.027937 0.008024774 3 0
## 4 now 6.050306 0.013903889 4 2
## 5 for 5.481894 0.019214432 15 41
## 6 opinion 4.757318 0.029173980 3 1
## 7 state 4.757318 0.029173980 3 1
## 8 writer 4.757318 0.029173980 3 1
## 9 news 4.381017 0.036341309 6 9
## 10 is 4.015542 0.045082732 9 20
## 11 aclu 3.651438 0.056020812 2 0
## 12 breaking 3.651438 0.056020812 2 0
## 13 fmr 3.651438 0.056020812 2 0
## 14 living 3.651438 0.056020812 2 0
## 15 nation 3.651438 0.056020812 2 0
## 16 senator 3.651438 0.056020812 2 0
## 17 tell 3.651438 0.056020812 2 0
## 18 voters 3.651438 0.056020812 2 0
## 19 u.s 3.436947 0.063753387 3 2
## 20 i 3.127068 0.077002104 4 5
# location
poir <- dfm(tokens(corpus(nodes[,c("location", "cluster")], text_field="location")))
for (i in 1:3){
  print(
    head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                          measure="lr"), n=20)
  )
}
## feature G2 p n_target n_reference
## 1 angeles 25.409998118 4.635096e-07 77 32
## 2 los 25.409998118 4.635096e-07 77 32
## 3 ca 10.475571816 1.209633e-03 57 31
## 4 usc 6.460045506 1.103268e-02 7 0
## 5 - 0.575741058 4.479865e-01 2 0
## 6 / 0.001237177 9.719414e-01 1 1
## 7 beach 0.001237177 9.719414e-01 1 1
## 8 southern 0.001237177 9.719414e-01 1 1
## 9 241 0.000618370 9.801610e-01 1 0
## 10 300 0.000618370 9.801610e-01 1 0
## 11 314 0.000618370 9.801610e-01 1 0
## 12 3520 0.000618370 9.801610e-01 1 0
## 13 90089 0.000618370 9.801610e-01 1 0
## 14 alumni 0.000618370 9.801610e-01 1 0
## 15 arm 0.000618370 9.801610e-01 1 0
## 16 ca-washington 0.000618370 9.801610e-01 1 0
## 17 ciudad 0.000618370 9.801610e-01 1 0
## 18 de 0.000618370 9.801610e-01 1 0
## 19 dml 0.000618370 9.801610e-01 1 0
## 20 hall 0.000618370 9.801610e-01 1 0
## feature G2 p n_target n_reference
## 1 usa 3.96043396 0.04658169 5 1
## 2 california 2.06961838 0.15025912 8 7
## 3 new 1.46837773 0.22560183 5 3
## 4 alto 1.31198943 0.25203477 2 0
## 5 ct 1.31198943 0.25203477 2 0
## 6 is 1.31198943 0.25203477 2 0
## 7 london 1.31198943 0.25203477 2 0
## 8 palo 1.31198943 0.25203477 2 0
## 9 the 1.31198943 0.25203477 2 0
## 10 uk 1.31198943 0.25203477 2 0
## 11 york 0.63706547 0.42477543 4 3
## 12 states 0.27135246 0.60242597 2 1
## 13 united 0.27135246 0.60242597 2 1
## 14 va 0.27135246 0.60242597 2 1
## 15 ny 0.24391922 0.62138967 4 4
## 16 / 0.18062487 0.67083678 1 1
## 17 southern 0.18062487 0.67083678 1 1
## 18 + 0.09017494 0.76395487 1 0
## 19 ; 0.09017494 0.76395487 1 0
## 20 a 0.09017494 0.76395487 1 0
## feature G2 p n_target n_reference
## 1 washington 9.5883408 0.001958168 6 3
## 2 dc 5.7306605 0.016671211 5 4
## 3 . 5.6545002 0.017410635 4 2
## 4 d.c 4.4560731 0.034777359 3 1
## 5 ny 3.5163309 0.060766833 4 4
## 6 city 3.4535114 0.063117544 2 0
## 7 global 3.4535114 0.063117544 2 0
## 8 • 3.4535114 0.063117544 2 0
## 9 la 3.1575041 0.075578652 3 2
## 10 | 1.8867540 0.169568496 2 1
## 11 york 1.6468182 0.199392543 3 4
## 12 new 1.1760688 0.278157758 3 5
## 13 50 0.6263598 0.428693707 1 0
## 14 bologna 0.6263598 0.428693707 1 0
## 15 cdmx 0.6263598 0.428693707 1 0
## 16 charleston 0.6263598 0.428693707 1 0
## 17 co 0.6263598 0.428693707 1 0
## 18 connecticut 0.6263598 0.428693707 1 0
## 19 denver 0.6263598 0.428693707 1 0
## 20 detroit 0.6263598 0.428693707 1 0
In case you’re curious, here’s the code I used to collect the data:
library(tweetscores)
options(stringsAsFactors=F)
oauth_folder = "~/Dropbox/credentials/twitter/"
accounts <- getFriends("uscpoir", oauth=oauth_folder)
# creating folder where data will be stored (if it does not exist)
try(dir.create("../data/friends"))
# first check if there are any lists of friends already downloaded to the data folder
accounts.done <- gsub(".rdata", "", list.files("../data/friends"))
accounts.left <- accounts[accounts %in% accounts.done == FALSE]
accounts.left <- accounts.left[!is.na(accounts.left)]
# loop over the rest of accounts, downloading friend lists from API
while (length(accounts.left) > 0){
  # sample one account at random to get its friends
  new.user <- sample(accounts.left, 1)
  #new.user <- accounts.left[1]
  cat(new.user, "---", length(accounts.left), " accounts left!\n")
  # download friends (with some exception handling...)
  error <- tryCatch(friends <- getFriends(user_id=new.user,
      oauth=oauth_folder, sleep=0.5, verbose=FALSE), error=function(e) e)
  if (inherits(error, 'error')) {
    cat("Error! On to the next one...")
    accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
    next
  }
  # save to file and remove from the list of "accounts.left"
  file.name <- paste0("../data/friends/", new.user, ".rdata")
  save(friends, file=file.name)
  accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
}
# keeping only those that we were able to download
accounts <- gsub(".rdata", "", list.files("../data/friends"))
# reading and creating edge list for network
edges <- list()
for (i in 1:length(accounts)){
  file.name <- paste0("../data/friends/", accounts[i], ".rdata")
  load(file.name)
  if (length(friends)==0){ next }
  chosen <- accounts[accounts %in% friends]
  if (length(chosen)==0){ next }
  edges[[i]] <- data.frame(
    source = accounts[i], target = chosen)
}
edges <- do.call(rbind, edges)
# creating list of nodes (unique Twitter accounts in edge list)
nodes <- data.frame(id_str=unique(c(edges$source, edges$target)))
# adding user meta data
users <- getUsersBatch(ids=nodes$id_str, oauth=oauth_folder)
nodes <- merge(nodes, users)
# create network object and export to .csv file
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g
names(nodes)[1:2] <- c("Id", "Label")
names(edges)[1:2] <- c("Source", "Target")
write.csv(nodes, file="../data/poir-nodes.csv", row.names=FALSE)
write.csv(edges, file="../data/poir-edges.csv", row.names=FALSE)