In this guided coding session we will use a small dataset to illustrate how to identify latent communities in networks. The dataset corresponds to the Twitter ego network of POIR: each node is a Twitter account that the USC POIR account follows, and each edge indicates that one of those accounts in turn follows another. (See the end of this script for the code I used to put together this network.) Edges are thus directed.
The first step is to read the list of edges and nodes in this network:
edges <- read.csv("../data/poir-edges.csv", stringsAsFactors=FALSE)
head(edges)
## Source Target
## 1 102803242 1.110664e+18
## 2 102803242 1.148658e+08
## 3 102803242 1.170012e+09
## 4 102803242 1.204460e+07
## 5 102803242 1.225174e+08
## 6 102803242 1.265032e+09
nodes <- read.csv("../data/poir-nodes.csv", stringsAsFactors=FALSE)
head(nodes)
## Id Label name
## 1 1.028032e+08 USCCareerCenter USC Career Center
## 2 1.032744e+18 Dr_Wanda_Austin Wanda Austin
## 3 1.059724e+09 JoeySaraceno Joey Saraceno
## 4 1.091604e+18 EdwardG74882741 Edward Gonzalez
## 5 1.110664e+18 PresidentFolt Carol Folt
## 6 1.124483e+08 BJPolS British Jnl Poli Sci
## description
## 1 Providing exceptional career services to all members of the #USC Trojan Family.
## 2 Archive of tweets from Dr. Wanda M. Austin, sent during her time as Interim President of @USC. This account is no longer active.
## 3 @Meta UX Researcher | @USC PhD l @UCLA Alum | he/him 🏳️🌈
## 4 USC POIR PhD Candidate. Student of political philosophy, comparative politics, and international relations. Future professor of international relations.
## 5 President of the University of Southern California
## 6 British Journal of Political Science from Cambridge University Press.
## followers_count statuses_count friends_count created_at
## 1 8988 8319 1225 Thu Jan 07 22:00:57 +0000 2010
## 2 1146 225 92 Thu Aug 23 21:40:06 +0000 2018
## 3 222 648 243 Fri Jan 04 06:40:15 +0000 2013
## 4 193 896 248 Sat Feb 02 07:47:02 +0000 2019
## 5 12383 1872 137 Tue Mar 26 22:03:10 +0000 2019
## 6 18810 7332 1498 Mon Feb 08 14:52:14 +0000 2010
## location
## 1 Los Angeles, CA
## 2 Los Angeles, CA
## 3 Los Angeles, CA
## 4 California, USA
## 5 Los Angeles, CA
## 6 Cambridge
Now, how do we create the igraph object? We can use the graph_from_data_frame function, which takes two arguments: d, the data frame with the edge list in the first two columns; and vertices, a data frame with the node data, where the first column contains the node labels. (Note that igraph calls the nodes "vertices", but it's exactly the same thing.) Even though the follow relationships are directed, here we will build an undirected version of the network, which is what we will use for the community detection methods below.
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g
## IGRAPH f239715 UN-- 219 7750 --
## + attr: name (v/c), Label (v/c), description (v/c), followers_count
## | (v/n), statuses_count (v/n), friends_count (v/n), created_at (v/c),
## | location (v/c)
## + edges from f239715 (vertex names):
## [1] USC Career Center--Carol Folt
## [2] USC Career Center--USC Gould Law
## [3] USC Career Center--CESR
## [4] USC Career Center--Arnold
## [5] USC Career Center--USC Dornsife
## [6] USC Career Center--USC Sociology
## + ... omitted several edges
What does it mean?
- U means undirected
- N means named graph
- 219 is the number of nodes
- 7750 is the number of edges
- name (v/c) means name is a node attribute and it's a character
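The same information can also be extracted programmatically with a few standard igraph accessors (a quick sketch):
vcount(g)        # number of nodes
ecount(g)        # number of edges
is_directed(g)   # FALSE, since we built the graph with directed=FALSE
head(V(g)$name)  # the 'name' node attribute, stored as a character vector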
Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.
The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. In this case, the network has a single component (also called a giant component) containing all 219 nodes, which means that you can get from any node to any other node.
comp <- components(g)
comp$no # number of components
## [1] 1
max(comp$csize) # size of largest component
## [1] 219
Most networks have a single giant connected component that includes most or all nodes, and most studies of networks actually focus on that giant component, which is what we will do here. One important reason is feasibility: for example, the shortest path between nodes in different components of a network is Inf!
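To see what that means concretely, here is a tiny hypothetical two-component graph (not our data); the distance between nodes in different components is infinite:
toy <- graph_from_literal(A--B, C--D)  # two separate components
distances(toy)["A", "C"]               # Inf: no path between the components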
giant <- decompose(g, mode="strong")[[1]]
giant
## IGRAPH 1ea64df UN-- 219 7750 --
## + attr: name (v/c), Label (v/c), description (v/c), followers_count
## | (v/n), statuses_count (v/n), friends_count (v/n), created_at (v/c),
## | location (v/c)
## + edges from 1ea64df (vertex names):
## [1] USC Career Center--Carol Folt
## [2] USC Career Center--USC Gould Law
## [3] USC Career Center--CESR
## [4] USC Career Center--Arnold
## [5] USC Career Center--USC Dornsife
## [6] USC Career Center--USC Sociology
## + ... omitted several edges
In directed networks, components can be weakly connected (the nodes are connected if we ignore the direction of the edges) or strongly connected (there is a directed path from every node to every other node in the component). In undirected networks like this one, the two notions coincide.
weakly <- decompose(g, mode="weak")[[1]]
weakly
## IGRAPH e4e4f88 UN-- 219 7750 --
## + attr: name (v/c), Label (v/c), description (v/c), followers_count
## | (v/n), statuses_count (v/n), friends_count (v/n), created_at (v/c),
## | location (v/c)
## + edges from e4e4f88 (vertex names):
## [1] USC Career Center--Carol Folt
## [2] USC Career Center--USC Gould Law
## [3] USC Career Center--CESR
## [4] USC Career Center--Arnold
## [5] USC Career Center--USC Dornsife
## [6] USC Career Center--USC Sociology
## + ... omitted several edges
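Because our graph is undirected, the weak and strong decompositions are identical here. A minimal sketch with a hypothetical directed graph shows how they can differ:
toy <- graph_from_literal(A -+ B, B -+ C, C -+ A, C -+ D)  # directed toy graph
components(toy, mode="weak")$no    # 1: all nodes connected if we ignore direction
components(toy, mode="strong")$no  # 2: D cannot reach the other nodes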
And now we keep only the giant component:
g <- decompose(g, mode="strong")[[1]]
nodes <- nodes[nodes$Label %in% V(g)$Label,]
Again, because this network has a single component, the result is identical to the original network. In most other applications you will end up dropping part of the network at this step.
Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.
There are a few different algorithms, each following a different logic.
The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values (longer random walks will lead to fewer communities). The goal of this algorithm is to identify the partition that maximizes a modularity score.
cluster_walktrap(g) # default is 4 steps
## IGRAPH clustering walktrap, groups: 3, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=2)
## IGRAPH clustering walktrap, groups: 5, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=5)
## IGRAPH clustering walktrap, groups: 2, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=10)
## IGRAPH clustering walktrap, groups: 2, mod: 0.36
## + groups:
## $`1`
## [1] "Joey Saraceno"
## [2] "Edward Gonzalez"
## [3] "British Jnl Poli Sci"
## [4] "AJPS"
## [5] "Raquel Centeno"
## [6] "The Monkey Cage (TMC)"
## [7] "Dave Ebner, Ph.D."
## [8] "Jeff Jenkins"
## [9] "European Political Science Review"
## + ... omitted several groups/vertices
Other methods are:
cluster_infomap(g)
cluster_edge_betweenness(g)
cluster_label_prop(g)
cluster_louvain(g)
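Since each algorithm can return a different partition, a quick (if rough) way to compare them is to look at how many communities each one finds and at the modularity of each partition. A sketch (some of these methods are stochastic, so results may vary slightly across runs):
algs <- list(
  infomap     = cluster_infomap(g),
  betweenness = cluster_edge_betweenness(g),
  label_prop  = cluster_label_prop(g),
  louvain     = cluster_louvain(g)
)
sapply(algs, length)      # number of communities found by each algorithm
sapply(algs, modularity)  # modularity of each partition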
The choice of one algorithm or another may depend on substantive or practical reasons, as always. For now, let's pick the Louvain algorithm, which here identifies three clusters.
comm <- cluster_louvain(g)
nodes$cluster <- membership(comm)
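How large is each community? (Louvain is stochastic, so the exact sizes may vary slightly across runs.)
sizes(comm)           # number of nodes in each community
table(nodes$cluster)  # same information, computed from the node data frame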
Who’s in each cluster?
head(nodes$Label[nodes$cluster==1], n=10)
## [1] "USCCareerCenter" "Dr_Wanda_Austin" "PresidentFolt" "USCGouldLaw"
## [5] "CESRUSC" "USCDornsife" "DornsifePhDAcad" "USCSociology"
## [9] "Trojan_Knights" "USCAlumni"
head(nodes$Label[nodes$cluster==2], n=10)
## [1] "JoeySaraceno" "EdwardG74882741" "BJPolS" "AJPS_Editor"
## [5] "RaqCenteno" "monkeycageblog" "shallow__state" "jaj7d"
## [9] "EPSRjournal" "InternatlTheory"
head(nodes$Label[nodes$cluster==3], n=10)
## [1] "Schwarzenegger" "GovInnovations" "ACLU" "SenatorBoxer"
## [5] "USATODAY" "NCSLorg" "jeffreyfields" "latimes"
## [9] "maristpoll" "thehill"
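We can also inspect the partition visually; igraph's plot method for community objects shades each group (a minimal sketch, with labels suppressed to keep the plot readable):
plot(comm, g, vertex.label=NA, vertex.size=4)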
An even better way to understand the content of each cluster is to combine what we’ve learnt about text analysis so far. Let’s see what that reveals:
library(quanteda)
## Package version: 3.2.3
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
# most frequent features
for (i in 1:3){
  message("Cluster ", i)
  dfm <- dfm(tokens(nodes$description[nodes$cluster==i],
                    remove_punct=TRUE)) %>%
    dfm_remove(stopwords("english"))
  print(topfeatures(dfm, n=25))
}
## Cluster 1
## usc official university southern school research twitter
## 51 20 19 17 14 13 13
## california account @usc public department us |
## 12 11 10 10 9 9 9
## news education follow global office relations center
## 9 8 8 6 6 6 6
## student tweets students study
## 6 5 5 5
## Cluster 2
## | political science politics phd
## 76 52 36 26 20
## @uscpoir @usc professor international assistant
## 20 16 16 14 12
## journal prof association policy usc
## 11 11 11 10 9
## relations social research candidate public
## 9 9 9 8 8
## scientist american ph.d study studies
## 8 7 7 7 7
## Cluster 3
## | news california account now u.s state
## 13 6 4 4 4 3 3
## writer usc home poll public opinion political
## 3 3 3 3 3 3 3
## follow analysis former governor sign twitter government
## 3 3 2 2 2 2 2
## aclu senator nation official
## 2 2 2 2
# most distinctive features
poir <- dfm(tokens(corpus(nodes[,c("description", "cluster")], text_field="description")))
for (i in 1:3){
  print(
    head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                          measure="lr"), n=20)
  )
}
## feature G2 p n_target n_reference
## 1 usc 42.889479 5.792133e-11 51 12
## 2 school 25.042146 5.609077e-07 14 0
## 3 southern 19.732475 8.907478e-06 17 2
## 4 official 18.371244 1.817808e-05 20 4
## 5 university 14.687150 1.269086e-04 19 5
## 6 ! 14.164808 1.674737e-04 17 4
## 7 the 14.045728 1.784187e-04 93 76
## 8 twitter 13.567173 2.301758e-04 13 2
## 9 of 11.950661 5.462801e-04 85 71
## 10 to 9.333696 2.249781e-03 27 15
## 11 department 8.311490 3.939506e-03 9 1
## 12 we 6.980804 8.238858e-03 10 3
## 13 account 6.469026 1.097706e-02 11 4
## 14 alumni 5.311052 2.119055e-02 5 0
## 15 california's 5.311052 2.119055e-02 5 0
## 16 students 5.311052 2.119055e-02 5 0
## 17 through 5.311052 2.119055e-02 5 0
## 18 california 4.843307 2.775362e-02 12 6
## 19 education 4.782101 2.875702e-02 8 2
## 20 for 4.746743 2.935380e-02 31 25
## feature G2 p n_target n_reference
## 1 political 61.310055 4.884981e-15 52 4
## 2 science 51.017426 9.154899e-13 36 1
## 3 | 46.224946 1.054257e-11 76 22
## 4 @uscpoir 32.720134 1.064290e-08 20 0
## 5 politics 26.762931 2.300075e-07 26 3
## 6 phd 25.839549 3.710081e-07 20 1
## 7 ; 15.697021 7.434116e-05 18 3
## 8 assistant 13.717803 2.124312e-04 12 1
## 9 journal 13.664190 2.185833e-04 11 0
## 10 prof 13.664190 2.185833e-04 11 0
## 11 international 13.149650 2.875729e-04 14 2
## 12 association 12.248265 4.656910e-04 11 1
## 13 ( 11.595165 6.612349e-04 20 6
## 14 ) 11.595165 6.612349e-04 20 6
## 15 professor 10.791488 1.019679e-03 16 4
## 16 candidate 9.085837 2.575965e-03 8 0
## 17 • 9.085837 2.575965e-03 8 0
## 18 , 6.554083 1.046438e-02 111 99
## 19 fellow 6.116844 1.338997e-02 6 0
## 20 🏳🌈 6.116844 1.338997e-02 6 0
## feature G2 p n_target n_reference
## 1 our 7.961307 0.004778789 7 7
## 2 home 7.027937 0.008024774 3 0
## 3 poll 7.027937 0.008024774 3 0
## 4 now 6.050306 0.013903889 4 2
## 5 for 5.481894 0.019214432 15 41
## 6 opinion 4.757318 0.029173980 3 1
## 7 state 4.757318 0.029173980 3 1
## 8 writer 4.757318 0.029173980 3 1
## 9 news 4.381017 0.036341309 6 9
## 10 is 4.015542 0.045082732 9 20
## 11 aclu 3.651438 0.056020812 2 0
## 12 breaking 3.651438 0.056020812 2 0
## 13 fmr 3.651438 0.056020812 2 0
## 14 living 3.651438 0.056020812 2 0
## 15 nation 3.651438 0.056020812 2 0
## 16 senator 3.651438 0.056020812 2 0
## 17 tell 3.651438 0.056020812 2 0
## 18 voters 3.651438 0.056020812 2 0
## 19 u.s 3.436947 0.063753387 3 2
## 20 i 3.127068 0.077002104 4 5
# location
poir <- dfm(tokens(corpus(nodes[,c("location", "cluster")], text_field="location")))
for (i in 1:3){
  print(
    head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                          measure="lr"), n=20)
  )
}
## feature G2 p n_target n_reference
## 1 angeles 25.409998118 4.635096e-07 77 32
## 2 los 25.409998118 4.635096e-07 77 32
## 3 ca 10.475571816 1.209633e-03 57 31
## 4 usc 6.460045506 1.103268e-02 7 0
## 5 - 0.575741058 4.479865e-01 2 0
## 6 / 0.001237177 9.719414e-01 1 1
## 7 beach 0.001237177 9.719414e-01 1 1
## 8 southern 0.001237177 9.719414e-01 1 1
## 9 241 0.000618370 9.801610e-01 1 0
## 10 300 0.000618370 9.801610e-01 1 0
## 11 314 0.000618370 9.801610e-01 1 0
## 12 3520 0.000618370 9.801610e-01 1 0
## 13 90089 0.000618370 9.801610e-01 1 0
## 14 alumni 0.000618370 9.801610e-01 1 0
## 15 arm 0.000618370 9.801610e-01 1 0
## 16 ca-washington 0.000618370 9.801610e-01 1 0
## 17 ciudad 0.000618370 9.801610e-01 1 0
## 18 de 0.000618370 9.801610e-01 1 0
## 19 dml 0.000618370 9.801610e-01 1 0
## 20 hall 0.000618370 9.801610e-01 1 0
## feature G2 p n_target n_reference
## 1 usa 3.96043396 0.04658169 5 1
## 2 california 2.06961838 0.15025912 8 7
## 3 new 1.46837773 0.22560183 5 3
## 4 alto 1.31198943 0.25203477 2 0
## 5 ct 1.31198943 0.25203477 2 0
## 6 is 1.31198943 0.25203477 2 0
## 7 london 1.31198943 0.25203477 2 0
## 8 palo 1.31198943 0.25203477 2 0
## 9 the 1.31198943 0.25203477 2 0
## 10 uk 1.31198943 0.25203477 2 0
## 11 york 0.63706547 0.42477543 4 3
## 12 states 0.27135246 0.60242597 2 1
## 13 united 0.27135246 0.60242597 2 1
## 14 va 0.27135246 0.60242597 2 1
## 15 ny 0.24391922 0.62138967 4 4
## 16 / 0.18062487 0.67083678 1 1
## 17 southern 0.18062487 0.67083678 1 1
## 18 + 0.09017494 0.76395487 1 0
## 19 ; 0.09017494 0.76395487 1 0
## 20 a 0.09017494 0.76395487 1 0
## feature G2 p n_target n_reference
## 1 washington 9.5883408 0.001958168 6 3
## 2 dc 5.7306605 0.016671211 5 4
## 3 . 5.6545002 0.017410635 4 2
## 4 d.c 4.4560731 0.034777359 3 1
## 5 ny 3.5163309 0.060766833 4 4
## 6 city 3.4535114 0.063117544 2 0
## 7 global 3.4535114 0.063117544 2 0
## 8 • 3.4535114 0.063117544 2 0
## 9 la 3.1575041 0.075578652 3 2
## 10 | 1.8867540 0.169568496 2 1
## 11 york 1.6468182 0.199392543 3 4
## 12 new 1.1760688 0.278157758 3 5
## 13 50 0.6263598 0.428693707 1 0
## 14 bologna 0.6263598 0.428693707 1 0
## 15 cdmx 0.6263598 0.428693707 1 0
## 16 charleston 0.6263598 0.428693707 1 0
## 17 co 0.6263598 0.428693707 1 0
## 18 connecticut 0.6263598 0.428693707 1 0
## 19 denver 0.6263598 0.428693707 1 0
## 20 detroit 0.6263598 0.428693707 1 0
In case you’re curious, here’s the code I used to collect the data:
library(tweetscores)
options(stringsAsFactors=F)
oauth_folder = "~/Dropbox/credentials/twitter/"
accounts <- getFriends("uscpoir", oauth=oauth_folder)
# creating folder where data will be stored (if it does not exist)
try(dir.create("../data/friends"))
# first check if there are any lists of friends already downloaded to the data folder
accounts.done <- gsub(".rdata", "", list.files("../data/friends"))
accounts.left <- accounts[accounts %in% accounts.done == FALSE]
accounts.left <- accounts.left[!is.na(accounts.left)]
# loop over the rest of accounts, downloading friend lists from API
while (length(accounts.left) > 0){
  # sample one account at random to get its friends
  new.user <- sample(accounts.left, 1)
  #new.user <- accounts.left[1]
  cat(new.user, "---", length(accounts.left), " accounts left!\n")
  # download friends (with some exception handling...)
  error <- tryCatch(friends <- getFriends(user_id=new.user,
      oauth=oauth_folder, sleep=0.5, verbose=FALSE), error=function(e) e)
  if (inherits(error, 'error')) {
    cat("Error! On to the next one...")
    accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
    next
  }
  # save to file and remove from the list of "accounts.left"
  file.name <- paste0("../data/friends/", new.user, ".rdata")
  save(friends, file=file.name)
  accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
}
# keeping only those that we were able to download
accounts <- gsub(".rdata", "", list.files("../data/friends"))
# reading and creating edge list for network
edges <- list()
for (i in 1:length(accounts)){
  file.name <- paste0("../data/friends/", accounts[i], ".rdata")
  load(file.name)
  if (length(friends)==0){ next }
  chosen <- accounts[accounts %in% friends]
  if (length(chosen)==0){ next }
  edges[[i]] <- data.frame(
    source = accounts[i], target = chosen)
}
edges <- do.call(rbind, edges)
# creating list of nodes (unique Twitter accounts in edge list)
nodes <- data.frame(id_str=unique(c(edges$source, edges$target)))
# adding user meta data
users <- getUsersBatch(ids=nodes$id_str, oauth=oauth_folder)
nodes <- merge(nodes, users)
# create network object and export to .csv file
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g
names(nodes)[1:2] <- c("Id", "Label")
names(edges)[1:2] <- c("Source", "Target")
write.csv(nodes, file="../data/poir-nodes.csv", row.names=FALSE)
write.csv(edges, file="../data/poir-edges.csv", row.names=FALSE)