Community detection

Importing network data into R

In this training session we will use a small dataset to illustrate how to identify latent communities in networks. The dataset is the Twitter ego network of ECPR: each node is a Twitter account that ECPR follows, and each edge indicates that one of those accounts follows another. (See the end of this script for the code I used to put together this network.) Follow relationships have a direction, so the edges are directed.

The first step is to read the list of edges and nodes in this network:

edges <- read.csv("~/data/ecpr-edges.csv", stringsAsFactors=FALSE)
head(edges)

##       Source     Target
## 1 1001408503  102062058
## 2 1001408503  106836014
## 3 1001408503 1080956450
## 4 1001408503  108631068
## 5 1001408503  112729477
## 6 1001408503 1241258612

nodes <- read.csv("~/data/ecpr-nodes.csv", stringsAsFactors=FALSE)
head(nodes)

##           Id           Label                 name
## 1 1001408503       UoMPolicy    Policy@Manchester
## 2  100367386     IndianaUniv   Indiana University
## 3 1011441108 Sciences_Po_Aix      Sciences Po Aix
## 4  102062058   PrfAndrwRssll       Andrew Russell
## 5 1021697672      PoLIS_Bath         PoLIS - Bath
## 6 1022158776 UniEssexLibrary Uni of Essex Library
##                                                                                                                                                        description
## 1 Influencers and shapers of public policy, based at @officialuom. Follow for robust insight, expertise and highlights from (arguably) the UK's biggest thinktank.
## 2                Established in 1820, Indiana University has 7 campuses: @IUBloomington, @IUEast, @IUKokomo, @IUNorthwest, @IUPUI, @IUSouthBend, and @IUSoutheast.
## 3                                                            Sciences Po Aix - compte officiel de l'Institut d'Etudes Politiques d'Aix-en-Provence\n#SciencesPoAix
## 4 Formerly Poliblogmanc. Professor of Politics, University of Manchester (soon Liverpool) Pols, parlm, elections, Coventry City & England cricket. seldom succinct
## 5           The Department of Politics, Languages and International Studies at the @UniofBath, aimed at bringing news to the wider audience. RTs not endorsements.
## 6                                   We provide help, support and access to print and online resources for all students and researchers at the University of Essex.
##   followers_count statuses_count friends_count
## 1           10027           8157          4377
## 2           66533          12817           392
## 3            2114            774            64
## 4            3988           5068          2705
## 5            1114           1841           125
## 6            1534           2973           156
##                       created_at                       location lang
## 1 Mon Dec 10 10:38:42 +0000 2012                                  en
## 2 Wed Dec 30 01:19:44 +0000 2009                        Indiana   en
## 3 Fri Dec 14 16:13:19 +0000 2012                Aix-en-Provence   fr
## 4 Tue Jan 05 13:34:02 +0000 2010           Manchester/Liverpool   en
## 5 Wed Dec 19 09:19:17 +0000 2012             Bath, Somerset, UK   en
## 6 Wed Dec 19 14:15:54 +0000 2012 Colchester, Loughton, Southend   en
##                    time_zone status.id_str              status.created_at
## 1                 Casablanca  8.952762e+17 Wed Aug 09 13:31:01 +0000 2017
## 2 Eastern Time (US & Canada)  8.953914e+17 Wed Aug 09 21:08:37 +0000 2017
## 3                       <NA>  8.843515e+17 Mon Jul 10 10:00:08 +0000 2017
## 4                     London  8.950706e+17 Tue Aug 08 23:54:06 +0000 2017
## 5                 Casablanca  8.910120e+17 Fri Jul 28 19:06:19 +0000 2017
## 6                     London  8.953190e+17 Wed Aug 09 16:20:54 +0000 2017
##                                                                                                                                     status.text
## 1  RT @UoMNews: Watch @profbuchan discuss the north/south divide with @lucianaberger and @mattfrei on yesterday's @Channel4News https://t.co/f…
## 2  RT @IUNewsroom: .@IUMedSchool's William J. Wright Scholarship is helping prepare future cancer researchers: https://t.co/jeXD8lq0Mh https:/…
## 3 Cher.e.s étudiant.e.s, \n\nSciences Po Aix vous souhaite de bonnes vacances! \n\n L'I.E.P fermera ses portes le 21... https://t.co/2UWRzLT9Q5
## 4        Glen Campbell so many great renditions (esp of Jimmy Webb songs) but this 2008 Green Day cover remains special https://t.co/6FOLx7GpeC
## 5  RT @UniofBath: Pakistan Supreme Court disqualifies Prime Minister Nawaz Sharif - comments from @PoLIS_Bath 's @WaliAslam for @CNBC  https:/…
## 6  RT @CathyJ62: Great progress with the refurbishment of our Library Reading Room -it's going to be a fantastic space for our students ! http…

For example, the first row of the edge list tells us that the user with ID 1001408503 follows the user with ID 102062058.

How do we convert these two datasets into a network object in R? There are multiple packages for working with networks, but the most popular is igraph because it’s very flexible and easy to use, and in my experience it’s much faster and scales well to very large networks. Other packages that you may want to explore are sna and network.

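Now, how do we create the igraph object? We can use the graph_from_data_frame function. Its main arguments are d, the data frame with the edge list in the first two columns; vertices, a data frame with node data and the node identifier in the first column; and directed, which indicates whether edge direction is kept. (Note that igraph calls the nodes vertices, but it’s exactly the same thing.) Although follow relationships are directed, here we set directed=FALSE, so every edge is treated as a tie without direction.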

library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g

## IGRAPH 74a317a UN-- 902 13606 -- 
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges from 74a317a (vertex names):
## [1] Policy@Manchester--Andrew Russell      
## [2] Policy@Manchester--laura sudulich      
## [3] Policy@Manchester--Jean-Paul Vargas    
## [4] Policy@Manchester--ECFR                
## + ... omitted several edges

What does it mean?

- U means undirected
- N means named graph
- 902 is the number of nodes
- 13606 is the number of edges
- name (v/c) means name is a node attribute and it’s a character
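The same information can be queried directly instead of being read off the printed summary. The following is a minimal sketch on a toy graph; the data frames toy_edges and toy_nodes are hypothetical and only mimic the structure of the ECPR files.

```r
library(igraph)

# Toy edge list and node table (hypothetical data, not the ECPR network)
toy_edges <- data.frame(Source = c("a", "a", "b"),
                        Target = c("b", "c", "c"),
                        stringsAsFactors = FALSE)
toy_nodes <- data.frame(Id = c("a", "b", "c", "d"),
                        followers_count = c(10, 20, 30, 40),
                        stringsAsFactors = FALSE)

# Node "d" has no edges but is kept because it appears in the node table
toy <- graph_from_data_frame(d = toy_edges, vertices = toy_nodes, directed = FALSE)

vcount(toy)             # number of nodes
ecount(toy)             # number of edges
is_directed(toy)        # FALSE, matching the "U" in the summary
vertex_attr_names(toy)  # "name" plus the extra columns of the node table
V(toy)$followers_count  # values of one node attribute
```

Running the same functions on g should return the numbers shown in the summary above.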

Network communities

Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.

The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, directly or through other nodes, but not to the rest of the network. For example, this network has only one component: every node can be reached from every other node.

str(components(g))

## List of 3
##  $ membership: Named num [1:902] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:902] "Policy@Manchester" "Indiana University" "Sciences Po Aix" "Andrew Russell" ...
##  $ csize     : num 902
##  $ no        : int 1

Most networks have a single giant connected component that includes most nodes. Most studies of networks actually focus on the giant component (e.g. the shortest path between two nodes that sit in different components is Inf!).
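Because our network has a single component, the point about infinite distances is easier to see on a toy example. The sketch below uses a hypothetical graph with two components and shows one possible way to keep only the largest one.

```r
library(igraph)

# Hypothetical toy graph: a triangle (a, b, c) plus a separate pair (d, e)
toy <- graph_from_literal(a - b, b - c, a - c, d - e)

comp <- components(toy)
comp$no          # number of components
comp$csize       # size of each component
comp$membership  # component id of each node

# No path links the two components, so the distance is infinite
distances(toy, v = "a", to = "d")

# Keep only the nodes that belong to the largest component
in_giant <- which(comp$membership == which.max(comp$csize))
giant_toy <- induced_subgraph(toy, in_giant)
vcount(giant_toy)
```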

# decompose() returns a list with one graph per component (only one here)
giant <- decompose(g, mode="strong")
giant

## [[1]]
## IGRAPH ab5f203 UN-- 902 13606 -- 
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges from ab5f203 (vertex names):
## [1] Policy@Manchester--Andrew Russell      
## [2] Policy@Manchester--laura sudulich      
## [3] Policy@Manchester--Jean-Paul Vargas    
## [4] Policy@Manchester--ECFR                
## + ... omitted several edges

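In directed networks, components can be weakly connected (every pair of nodes is linked once edge direction is ignored) or strongly connected (every node can reach every other node by following edges in their direction). In undirected networks the two definitions coincide, which is why both modes return the same single component for our undirected graph.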

weakly <- decompose(g, mode="weak")
weakly

## [[1]]
## IGRAPH 0471281 UN-- 902 13606 -- 
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges from 0471281 (vertex names):
## [1] Policy@Manchester--Andrew Russell      
## [2] Policy@Manchester--laura sudulich      
## [3] Policy@Manchester--Jean-Paul Vargas    
## [4] Policy@Manchester--ECFR                
## + ... omitted several edges
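The difference between weak and strong components only shows up in a directed graph. As an illustration, here is a small sketch with a hypothetical directed toy graph (not the ECPR data):

```r
library(igraph)

# Hypothetical directed graph: a -> b -> c -> a form a cycle, and c -> d
toy_dir <- graph_from_literal(a -+ b, b -+ c, c -+ a, c -+ d)

# Ignoring direction, all four nodes are linked: one weak component
components(toy_dir, mode = "weak")$no

# Following direction, d cannot reach any other node, so it forms its
# own strong component next to the cycle {a, b, c}
components(toy_dir, mode = "strong")$no
components(toy_dir, mode = "strong")$membership
```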

Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.

There are a few different algorithms, each following a different logic.

The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default (the steps argument), but you may want to experiment with different values; in the examples below, longer random walks lead to fewer communities. The algorithm merges nodes step by step based on these walks and then reports the partition with the highest modularity score.

cluster_walktrap(g)

## IGRAPH clustering walktrap, groups: 161, mod: 0.16
## + groups:
##   $`1`
##   [1] "Uni Research Rokkan" "ISF"                
##   
##   $`2`
##     [1] "Réseau DEL"           "Peter Ucen"          
##     [3] "CSES"                 "laura sudulich"      
##     [5] "Dr Philipp Köker"     "Aleks Szczerbiak"    
##     [7] "Luis Ramiro"          "JCER"                
##     [9] "Kenneth Benoit"       "Mona Lena Krook"     
##    [11] "AJPS"                 "UIC-GENDER"          
##   + ... omitted several groups/vertices

cluster_walktrap(g, steps=10)

## IGRAPH clustering walktrap, groups: 130, mod: 0.11
## + groups:
##   $`1`
##   [1] "Penn State" "Notre Dame"
##   
##   $`2`
##     [1] "Policy@Manchester"    "Andrew Russell"      
##     [3] "PoLIS - Bath"         "RowmanLit Internat"  
##     [5] "ECPR_SGOC"            "Milja Saari"         
##     [7] "Jane Green"           "Mona Lena Krook"     
##     [9] "UIC-GENDER"           "Rachel E. Johnson"   
##    [11] "Chris Brown"          "Kingston Politics"   
##   + ... omitted several groups/vertices

cluster_walktrap(g, steps=20)

## IGRAPH clustering walktrap, groups: 76, mod: 0.098
## + groups:
##   $`1`
##    [1] "Frank Underwood"      "JCER"                 "Samuel Brazys"       
##    [4] "EU Democracy"         "AK"                   "Andreas Busch"       
##    [7] "Ronny Patz"           "Carsten Q. Schneider" "Bastian Becker"      
##   [10] "Johns Hopkins | SAIS" "peter slominski"      "DOGOPO"              
##   [13] "Hilde vMeegdenburg"   "Régis Dandoy"         "Karolina Króliczek"  
##   [16] "Daniel Chasquetti"    "Carolina Plescia"     "Politics & IR @ Kent"
##   [19] "Stanford CDDRL"       "UCD Politics"         "(((Tove H. Malloy)))"
##   [22] "Politics UVA"         "MPSA"                 "Brian Fabo"          
##   [25] "Alia Papageorgiou"    "ESRC"                 "Political Science"   
##   + ... omitted several groups/vertices

cluster_walktrap(g, steps=30)

## IGRAPH clustering walktrap, groups: 9, mod: 0.099
## + groups:
##   $`1`
##   [1] "Tallinn University" "TTÜ"                "Vilnius University"
##   
##   $`2`
##   [1] "Uni Research Rokkan"  "Universitetet Bergen" "Nord universitet"    
##   [4] "Mittuniversitetet"    "UiT"                  "Linnéuniversitetet"  
##   [7] "ISF"                 
##   
##   $`3`
##    [1] "Humboldt-Universität" "Universität Wien"     "Universität Tübingen"
##   + ... omitted several groups/vertices
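Rather than printing each result, the effect of the walk length can be summarised in a small table. The sketch below does this on a hypothetical toy graph (two cliques joined by a bridge); the values in steps_to_try are arbitrary choices for illustration.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques joined by a single bridge edge
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(1, 6))

# Number of communities and modularity for several walk lengths
steps_to_try <- c(2, 4, 10)
summary_by_steps <- t(sapply(steps_to_try, function(s) {
  wt <- cluster_walktrap(toy, steps = s)
  c(steps = s, groups = length(wt), modularity = modularity(wt))
}))
summary_by_steps
```

Replacing toy with g and extending steps_to_try should give the group counts and modularity scores printed above in a single table.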

Other methods are:

- The infomap method attempts to map the flow of information in a network and to find the clusters in which information tends to remain for longer periods. It is similar to walktrap, but instead of maximizing modularity it minimizes the so-called “map equation”.
- The edge-betweenness method iteratively removes edges with high betweenness, with the idea that they are likely to connect different parts of the network. Here betweenness (gatekeeping potential) applies to edges rather than nodes, but the intuition is the same.
- The label propagation method starts by giving each node a unique label, and then updates each node’s label to the one held by the majority of its neighbors. This is repeated until every node has the most common label among its neighbors.
- The Louvain algorithm initially assigns each node to its own community; nodes are then sequentially moved to the community that increases modularity (if any), so that communities are merged; this merging process continues until modularity cannot increase or only one community remains.

cluster_infomap(g)
cluster_edge_betweenness(g)
cluster_label_prop(g)
cluster_louvain(g)
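One way to compare these methods is to look at how many communities each one finds and at the modularity of the resulting partition. The sketch below does so on a hypothetical toy graph and, as a check on what modularity measures, also computes it by hand for the Louvain partition. The seed value is arbitrary; it is there because some of these algorithms start from random states.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques joined by one bridge edge
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(1, 6))

set.seed(42)
results <- list(
  infomap     = cluster_infomap(toy),
  betweenness = cluster_edge_betweenness(toy),
  label_prop  = cluster_label_prop(toy),
  louvain     = cluster_louvain(toy)
)

# Number of communities and modularity of each partition
comparison <- data.frame(
  groups     = sapply(results, length),
  modularity = sapply(results, function(x) {
    modularity(toy, as.integer(membership(x)))
  })
)
comparison

# Modularity by hand for the Louvain partition (undirected, unweighted):
# Q = 1/(2m) * sum_ij [A_ij - k_i * k_j / (2m)] * 1(c_i == c_j)
memb <- as.integer(membership(results$louvain))
A <- as_adjacency_matrix(toy, sparse = FALSE)  # adjacency matrix
k <- degree(toy)                               # node degrees
m <- ecount(toy)                               # number of edges
Q <- sum((A - outer(k, k) / (2 * m)) * outer(memb, memb, "==")) / (2 * m)
all.equal(Q, modularity(toy, memb))
```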

The choice of one algorithm or another may depend on substantive or practical reasons, as always. For now, let’s pick the Louvain algorithm.

comm <- cluster_louvain(g)
nodes$cluster <- membership(comm)

head(nodes$Label[nodes$cluster==1], n=10)

##  [1] "bearaboi"     "MilSaari"     "mlkrook"      "UICGENDER"   
##  [5] "dfarrell_ucd" "DrREJohnson"  "sbrazys_ucd"  "taniaverge"  
##  [9] "LawGovDCU"    "theresareidy"

head(nodes$Label[nodes$cluster==2], n=10)

##  [1] "ManuMoschella"   "JCERJournal"     "EUlondonrep"    
##  [4] "Aston_ACE"       "EUDOEUI"         "LSEEuroppblog"  
##  [7] "Daniela_Vintila" "ECPRKnowledge"   "GeorgeKyris"    
## [10] "Erik_Jones_SAIS"

head(nodes$Label[nodes$cluster==3], n=10)

##  [1] "RowmanInternat"  "ecfr"            "JrnlofRS"       
##  [4] "EdinburghUP"     "InternatlTheory" "santinoregilme" 
##  [7] "diisdk"          "fa_burkhardt"    "CEJISS"         
## [10] "ISA_IPSsection"

head(nodes$Label[nodes$cluster==4], n=10)

##  [1] "UoMPolicy"      "PrfAndrwRssll"  "PoLIS_Bath"     "ECPR_SGOC"     
##  [5] "CalumWWhite"    "BCeliktemur"    "chrisjbrown1"   "KUPolitics"    
##  [9] "George_Osborne" "leancar2010"

head(nodes$Label[nodes$cluster==5], n=10)

##  [1] "IndianaUniv"    "ugent"          "BU_Tweets"      "UniTampere"    
##  [5] "LeidenSocial"   "zeppelin"       "UV_EG"          "HumboldtUni"   
##  [9] "FES_Sociologia" "TallinnUni"

head(nodes$Label[nodes$cluster==6], n=10)

##  [1] "Sciences_Po_Aix" "UniEssexLibrary" "reseauDEL"      
##  [4] "peter_ucen"      "Welpita"         "NewBehemot"     
##  [7] "csestweets"      "laurasud"        "Frank_Underwood"
## [10] "UHouston"
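Two small additions can be handy at this point: checking how large each community is, and attaching the cluster to the node table by name rather than by position. The sketch below uses a hypothetical toy graph and a node table (toy_nodes) that is deliberately stored in a different order than the graph.

```r
library(igraph)

# Hypothetical toy graph: a 4-node clique and a 3-node clique joined by one edge
toy <- disjoint_union(make_full_graph(4), make_full_graph(3))
toy <- add_edges(toy, c(1, 5))
V(toy)$name <- letters[1:7]

# Node table in reverse order, to show why matching by name is safer
toy_nodes <- data.frame(Label = rev(letters[1:7]), stringsAsFactors = FALSE)

set.seed(123)  # arbitrary seed; Louvain can depend on node ordering
comm <- cluster_louvain(toy)

sizes(comm)  # number of nodes in each community

# Plain named vector of cluster ids, then look each label up by name
memb <- setNames(as.integer(membership(comm)), V(toy)$name)
toy_nodes$cluster <- unname(memb[toy_nodes$Label])
toy_nodes
```

In the main example the rows of nodes are in the same order as the vertices of g, so assigning membership(comm) directly works there.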

table(nodes$lang, nodes$cluster)

##        
##           1   2   3   4   5   6
##   ca      1   0   1   0   0   1
##   cs      0   0   1   0   0   1
##   de      1   2   1   0  39  24
##   en     63  83 109 191  72 157
##   en-gb   5   4   5   8   0   7
##   en-GB   0   0   0   1   0   0
##   es      2   2   1   1   8  11
##   fi      1   1   1   0   3   0
##   fr      0   6   1   2  10  21
##   it      0   4   1   1   9   8
##   ja      0   0   0   0   1   0
##   nl      0   2   0   0   6   4
##   no      0   0   0   0   3   0
##   pl      1   0   1   0   0   0
##   ru      0   0   0   1   0   0
##   sv      0   2   0   0   2   7
##   tr      0   0   0   0   1   0
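Raw counts are hard to compare because the clusters differ in size. A possible refinement is to convert the table to within-cluster proportions with prop.table. The sketch below uses a small hypothetical data frame in place of nodes.

```r
# Hypothetical toy data standing in for nodes$lang and nodes$cluster
toy <- data.frame(lang    = c("en", "en", "de", "en", "de", "fr"),
                  cluster = c(1, 1, 1, 2, 2, 2),
                  stringsAsFactors = FALSE)

tab <- table(toy$lang, toy$cluster)

# Share of each language within each cluster (each column sums to 1)
shares <- prop.table(tab, margin = 2)
round(shares, 2)
```

With the real data, the equivalent call would be prop.table(table(nodes$lang, nodes$cluster), margin=2).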

library(quanteda)

## Package version: 1.3.0

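for (i in 1:6){
  message("Cluster ", i)
  # document-feature matrix of the account descriptions in cluster i
  # (dfm() is applied directly to the text, as in quanteda 1.3.0; more
  # recent quanteda releases expect a tokens object instead)
  desc_dfm <- dfm(nodes$description[nodes$cluster==i],
                  remove_punct=TRUE, remove=stopwords("english"))
  # 25 most frequent words in the cluster
  print(topfeatures(desc_dfm, n=25))
}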

## Cluster 1

##      politics    university        gender     political      research 
##            39            27            22            21            14 
##     professor        policy       college        public      lecturer 
##            13            12            11            10             9 
##        dublin       ireland           ucd         views       science 
##             8             8             8             7             7 
##     scientist        social       studies    interested      personal 
##             7             7             6             6             6 
## international        school         women            de          news 
##             6             6             5             5             5

## Cluster 2

##      european      politics            eu     political      research 
##            41            27            21            21            19 
##    university international        europe      lecturer       studies 
##            16            16            15            13            13 
##     relations     professor        senior        fellow        centre 
##            12            10            10            10            10 
##         group          t.co         https        policy            uk 
##             9             9             8             7             7 
##    researcher      director          news        tweets   endorsement 
##             7             7             6             6             6

## Cluster 3

## international      politics    university     political     relations 
##            51            26            26            26            23 
##      research       studies      security        global       science 
##            21            19            16            14            14 
##     institute        school     professor      academic      european 
##            12            11             9             8             8 
##       journal         books      journals           law        tweets 
##             8             8             8             8             8 
##        centre        social   endorsement     publisher    department 
##             8             8             8             7             7

## Cluster 4

##      politics    university     political      research        policy 
##            87            63            51            49            36 
## international        social        public          t.co            uk 
##            32            31            29            26            25 
##     professor    department     relations         https       science 
##            22            22            18            17            17 
##          news       studies       twitter      official      teaching 
##            15            14            13            12            12 
##         study        school        tweets      lecturer     democracy 
##            12            12            12            12            10

## Cluster 5

##   university         t.co           de          der     official 
##           61           43           35           34           26 
##      twitter          die        https      account     research 
##           23           23           23           22           20 
##         http          und    impressum  universität         hier 
##           20           16           15           13           13 
##     twittert         news    education       tweets     sciences 
##           12           11           10            9            9 
## pressestelle       follow           us           la          one 
##            8            7            7            7            7

## Cluster 6

##     political      politics       science    university            de 
##           109            64            62            51            44 
##      research     professor      european        social          t.co 
##            43            38            26            26            24 
##     scientist      sciences         https international            en 
##            23            20            20            15            13 
##        public   comparative      lecturer    department       journal 
##            13            10            10            10            10 
##   association        policy       parties     democracy        tweets 
##            10            10            10             9             9

# description: words that are distinctive of each cluster (keyness)
ecpr <- dfm(corpus(nodes[,c("description", "cluster")], text_field="description"))
for (i in 1:6){
    print(
      head(textstat_keyness(ecpr, target=docvars(ecpr)$cluster==i,
                      measure="lr"), n=20)
    )
}

##        feature        G2            p n_target n_reference
## 1       gender 79.973495 0.000000e+00       22           5
## 2          ucd 33.366405 7.633122e-09        8           0
## 3       dublin 29.173693 6.617159e-08        8           1
## 4      college 23.421610 1.301087e-06       11          11
## 5      ireland 21.936510 2.818209e-06        8           4
## 6     politics 14.716246 1.249649e-04       39         204
## 7        women 11.857638 5.742533e-04        5           3
## 8         cork  9.717155 1.825559e-03        3           0
## 9        comms  9.717155 1.825559e-03        3           0
## 10   diversity  9.717155 1.825559e-03        3           0
## 11    personal  8.407011 3.737766e-03        6          10
## 12    equality  7.305312 6.875107e-03        3           1
## 13           .  7.021204 8.055005e-03      114         944
## 14         and  6.916349 8.541116e-03       60         445
## 15         own  5.539356 1.859342e-02        7          21
## 16        aims  5.237577 2.210404e-02        2           0
## 17       spire  5.237577 2.210404e-02        2           0
## 18        like  5.237577 2.210404e-02        2           0
## 19 co-convenor  5.237577 2.210404e-02        2           0
## 20      @ecpg3  5.237577 2.210404e-02        2           0
##                 feature        G2            p n_target n_reference
## 1              european 59.541174 1.199041e-14       41          46
## 2                    eu 39.956838 2.596371e-10       21          14
## 3                europe 19.773486 8.718372e-06       15          17
## 4                  jean 19.040835 1.279508e-05        6           0
## 5                monnet 19.040835 1.279508e-05        6           0
## 6                     * 13.107422 2.941282e-04        6           2
## 7          contemporary 11.341675 7.578695e-04        4           0
## 8                 union  9.856546 1.692285e-03        5           2
## 9               #brexit  8.528382 3.496505e-03        4           1
## 10                    |  7.883977 4.987468e-03       35         132
## 11 \U0001f1ea\U0001f1fa  7.590090 5.869002e-03        3           0
## 12                group  7.119265 7.625995e-03        9          16
## 13               senior  6.899476 8.622101e-03       10          20
## 14               fellow  6.899476 8.622101e-03       10          20
## 15                    &  6.877885 8.726875e-03       39         158
## 16           researcher  6.739806 9.428485e-03        7          10
## 17             lecturer  6.587495 1.026976e-02       13          35
## 18                 also  5.970778 1.454486e-02        6           8
## 19               mostly  5.495393 1.906664e-02        4           3
## 20             bringing  5.282255 2.154383e-02        3           1
##          feature        G2            p n_target n_reference
## 1  international 55.936535 7.482903e-14       51          76
## 2       security 33.480501 7.198203e-09       16           8
## 3      relations 21.203333 4.130457e-06       23          40
## 4         global 20.651675 5.508898e-06       14          13
## 5       journals 19.566128 9.717689e-06        8           2
## 6            and 16.029352 6.236805e-05      100         405
## 7          books 15.512428 8.196459e-05        8           4
## 8      publisher 14.138254 1.698543e-04        7           3
## 9        studies 12.264741 4.615973e-04       19          43
## 10     institute 10.270088 1.352043e-03       12          20
## 11         assoc  8.362391 3.830662e-03        4           1
## 12  #humanrights  7.467106 6.283648e-03        3           0
## 13      produces  7.467106 6.283648e-03        3           0
## 14     princeton  7.467106 6.283648e-03        3           0
## 15       ashgate  7.467106 6.283648e-03        3           0
## 16     community  7.388067 6.565800e-03        6           6
## 17       content  6.609703 1.014246e-02        4           2
## 18            ir  6.516109 1.069016e-02        6           7
## 19      academic  6.044574 1.394910e-02        8          14
## 20           rts  5.757633 1.641717e-02        6           8
##       feature        G2            p n_target n_reference
## 1          uk 32.781367 1.031287e-08       25          11
## 2      policy 21.552715 3.442349e-06       36          37
## 3    politics 16.528694 4.791940e-05       87         156
## 4         the 16.111781 5.971142e-05      154         324
## 5      public 15.651740 7.614251e-05       29          32
## 6  department 13.647051 2.205875e-04       22          22
## 7          of 12.773203 3.516202e-04      170         382
## 8     british 11.574309 6.686929e-04        9           3
## 9    learning 11.574309 6.686929e-04        9           3
## 10      based 11.554935 6.756967e-04        7           1
## 11       uk's 11.554935 6.756967e-04        7           1
## 12      study 10.362344 1.286118e-03       12           9
## 13   practice  9.163474 2.468934e-03        6           1
## 14     social  7.907810 4.922183e-03       31          50
## 15        lse  7.434884 6.397150e-03        7           3
## 16     health  7.193229 7.317917e-03        4           0
## 17  sheffield  7.193229 7.317917e-03        4           0
## 18   guardian  6.851345 8.857459e-03        5           1
## 19          &  5.394479 2.020056e-02       62         135
## 20        and  5.059599 2.448990e-02      144         361
##         feature       G2            p n_target n_reference
## 1           der 89.17961 0.000000e+00       34           6
## 2             / 68.35905 1.110223e-16      144         291
## 3           die 64.52825 9.992007e-16       23           2
## 4             : 58.81399 1.731948e-14       90         149
## 5     impressum 47.23190 6.306622e-12       15           0
## 6          hier 35.35295 2.750520e-09       13           1
## 7           und 32.87162 9.845004e-09       16           5
## 8      twittert 32.03921 1.510919e-08       12           1
## 9          t.co 29.73745 4.946982e-08       43          68
## 10     official 27.88464 1.287676e-07       26          28
## 11         http 27.44067 1.619885e-07       20          16
## 12      account 27.36154 1.687544e-07       22          20
## 13  universität 26.35213 2.845048e-07       13           4
## 14           de 23.77448 1.083088e-06       35          56
## 15 pressestelle 23.03272 1.592675e-06        8           0
## 16      twitter 19.45266 1.031243e-05       23          31
## 17          van 12.77862 3.506042e-04        6           1
## 18           og 12.77862 3.506042e-04        6           1
## 19        visit 10.48764 1.201758e-03        6           2
## 20   university 10.08810 1.492296e-03       61         183
##       feature        G2            p n_target n_reference
## 1   political 59.942867 9.769963e-15      109         124
## 2     science 55.408434 9.792167e-14       62          47
## 3          de 26.274353 2.961961e-07       44          47
## 4   scientist 17.724138 2.553656e-05       23          20
## 5           ; 14.145313 1.692181e-04       23          24
## 6         del 14.059777 1.770908e-04        9           2
## 7   professor 14.006482 1.821815e-04       38          54
## 8          en 12.569733 3.920445e-04       13           9
## 9    populism 11.822844 5.850843e-04        7           1
## 10         po 11.720511 6.181498e-04        8           2
## 11 politiques 11.720511 6.181498e-04        8           2
## 12  resources  9.987500 1.576065e-03        5           0
## 13        via  9.987500 1.576065e-03        5           0
## 14     compte  9.448742 2.112940e-03        7           2
## 15   officiel  9.448742 2.112940e-03        7           2
## 16    methods  9.448742 2.112940e-03        7           2
## 17          |  9.372787 2.202304e-03       57         110
## 18    parties  9.278330 2.318807e-03       10           6
## 19          y  8.709622 3.165348e-03        9           5
## 20        für  8.169209 4.260749e-03        8           4
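To make the G2 and p columns less of a black box, here is a minimal sketch of a likelihood-ratio statistic for a single word, computed from a 2x2 table of counts. The function g2_stat and the totals passed to it are hypothetical, and this is the uncorrected textbook version: quanteda may apply a correction or a sign, so its values need not match exactly. The p-values in the tables above appear consistent with a chi-squared distribution with one degree of freedom.

```r
# Uncorrected G2 for one word.
# n_target / n_reference: occurrences of the word in each group
# total_target / total_reference: total number of tokens in each group
g2_stat <- function(n_target, n_reference, total_target, total_reference) {
  # rows: this word vs. all other words; columns: target vs. reference
  observed <- matrix(c(n_target, total_target - n_target,
                       n_reference, total_reference - n_reference),
                     nrow = 2)
  # counts expected if the word were equally frequent in both groups
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Same relative frequency in both groups: no evidence of keyness
g2_stat(10, 20, 100, 200)

# Word over-represented in the target group (hypothetical totals)
g2_stat(22, 5, 1000, 5000)

# p-value for a G2 value taken from the first table above
pchisq(5.237577, df = 1, lower.tail = FALSE)
```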

# location: same keyness analysis, now on the location field
ecpr <- dfm(corpus(nodes[,c("location", "cluster")], text_field="location"))
for (i in 1:6){
    print(
      head(textstat_keyness(ecpr, target=docvars(ecpr)$cluster==i,
                      measure="lr"), n=20)
    )
}

##      feature        G2            p n_target n_reference
## 1     dublin 90.588312 0.000000e+00       20           1
## 2    ireland 73.787290 0.000000e+00       19           4
## 3       cork 16.411391 5.097795e-05        5           1
## 4   limerick 10.079762 1.499067e-03        3           0
## 5     boston  3.699074 5.444268e-02        2           1
## 6       west  2.077731 1.494622e-01        2           3
## 7          (  1.241030 2.652726e-01        2           5
## 8          )  1.241030 2.652726e-01        2           5
## 9       city  1.241030 2.652726e-01        2           5
## 10     espoo  1.230502 2.673094e-01        1           0
## 11  maynooth  1.230502 2.673094e-01        1           0
## 12     meath  1.230502 2.673094e-01        1           0
## 13    galway  1.230502 2.673094e-01        1           0
## 14    denton  1.230502 2.673094e-01        1           0
## 15        tx  1.230502 2.673094e-01        1           0
## 16       but  1.230502 2.673094e-01        1           0
## 17    really  1.230502 2.673094e-01        1           0
## 18      brno  1.230502 2.673094e-01        1           0
## 19     devon  1.230502 2.673094e-01        1           0
## 20 reykjavik  1.230502 2.673094e-01        1           0
##                   feature         G2            p n_target n_reference
## 1                florence 13.0729262 0.0002995947        4           0
## 2                 belgium  7.4600819 0.0063082138        6           9
## 3                brussels  7.4600819 0.0063082138        6           9
## 4                  europe  4.4205606 0.0355083681        4           6
## 5                   italy  4.0037992 0.0453978242        3           3
## 6                       /  3.5739007 0.0586942838        6          17
## 7                 cardiff  2.9993830 0.0832962350        2           1
## 8               guildford  2.9993830 0.0832962350        2           1
## 9                  exeter  2.9993830 0.0832962350        2           1
## 10               montréal  2.0813998 0.1491033865        2           2
## 11                belfast  1.4810209 0.2236148686        2           3
## 12                 united  1.3522456 0.2448859146        4          14
## 13                kingdom  1.3522456 0.2448859146        4          14
## 14                bristol  1.0575189 0.3037817407        2           4
## 15               helsinki  1.0575189 0.3037817407        2           4
## 16              leicester  1.0575189 0.3037817407        2           4
## 17                  aston  0.9980128 0.3177918364        1           0
## 18 london-brussels-madrid  0.9980128 0.3177918364        1           0
## 19                  where  0.9980128 0.3177918364        1           0
## 20                      ?  0.9980128 0.3177918364        1           0
##       feature        G2            p n_target n_reference
## 1         new 13.543522 0.0002330946       10           8
## 2        york 12.545728 0.0003971132        7           3
## 3          dc 11.468851 0.0007077242        6           2
## 4    belgrade  6.795569 0.0091384391        3           0
## 5      serbia  6.795569 0.0091384391        3           0
## 6   princeton  6.795569 0.0091384391        3           0
## 7    brisbane  4.541437 0.0330838034        3           1
## 8  washington  4.105570 0.0427421934        5           6
## 9   cambridge  3.612956 0.0573311366        4           4
## 10    andrews  3.513256 0.0608797126        2           0
## 11  flensburg  3.513256 0.0608797126        2           0
## 12     prague  3.513256 0.0608797126        2           0
## 13         or  3.513256 0.0608797126        2           0
## 14  australia  2.884235 0.0894504723        4           5
## 15     sussex  2.354207 0.1249453891        3           3
## 16       bath  1.939760 0.1636945960        2           1
## 17         ny  1.939760 0.1636945960        2           1
## 18     poland  1.939760 0.1636945960        2           1
## 19    germany  1.332369 0.2483841932        6          16
## 20     sydney  1.238489 0.2657625180        3           5
##        feature        G2            p n_target n_reference
## 1       london 64.153933 1.110223e-15       58          39
## 2           uk 31.460480 2.035410e-08       42          41
## 3      england 10.022856 1.546096e-03       13          12
## 4           of  6.116546 1.339223e-02       12          15
## 5    sheffield  4.915797 2.661218e-02        3           0
## 6        essex  4.915797 2.661218e-02        3           0
## 7   university  4.333797 3.736281e-02       15          25
## 8    edinburgh  4.112160 4.257597e-02        5           3
## 9            &  3.457223 6.297600e-02        4           2
## 10       leeds  2.847126 9.153742e-02        3           1
## 11 southampton  2.424110 1.194811e-01        2           0
## 12         lse  2.424110 1.194811e-01        2           0
## 13   yorkshire  2.424110 1.194811e-01        2           0
## 14    kingston  2.424110 1.194811e-01        2           0
## 15         uni  2.424110 1.194811e-01        2           0
## 16   newcastle  2.424110 1.194811e-01        2           0
## 17        upon  2.424110 1.194811e-01        2           0
## 18  manchester  2.423527 1.195256e-01        4           3
## 19       wales  2.423527 1.195256e-01        4           3
## 20   liverpool  1.728938 1.885466e-01        3           2
##        feature        G2            p n_target n_reference
## 1       canada 12.650881 0.0003753887        5           0
## 2      germany  7.915026 0.0049025864       10          12
## 3      denmark  6.204399 0.0127433207        3           0
## 4      finland  6.081782 0.0136582962        5           3
## 5       sweden  6.081782 0.0136582962        5           3
## 6  netherlands  4.048528 0.0442098740        5           5
## 7      tallinn  3.168844 0.0750555585        2           0
## 8       bergen  3.168844 0.0750555585        2           0
## 9       leuven  3.168844 0.0750555585        2           0
## 10     estonia  3.168844 0.0750555585        2           0
## 11      italia  3.168844 0.0750555585        2           0
## 12   frankfurt  3.168844 0.0750555585        2           0
## 13          am  3.168844 0.0750555585        2           0
## 14        main  3.168844 0.0750555585        2           0
## 15       turku  3.168844 0.0750555585        2           0
## 16      bremen  3.168844 0.0750555585        2           0
## 17     belgium  3.123207 0.0771847161        6           9
## 18           ,  2.598223 0.1069836157       77         300
## 19      norway  1.908793 0.1670973478        3           3
## 20      leiden  1.643903 0.1997907487        2           1
##        feature        G2            p n_target n_reference
## 1       france 18.064806 2.135118e-05       10           1
## 2     budapest 12.879401 3.322185e-04        9           2
## 3      hungary 11.842917 5.788107e-04        6           0
## 4    amsterdam  8.560537 3.435280e-03        7           2
## 5           de  6.810082 9.064461e-03        4           0
## 6        paris  5.981963 1.445290e-02        8           5
## 7   österreich  4.395070 3.604299e-02        3           0
## 8       zurich  4.395070 3.604299e-02        3           0
## 9  switzerland  4.306463 3.796781e-02        4           1
## 10  nottingham  3.879438 4.888091e-02        6           4
## 11 deutschland  2.397835 1.215033e-01        3           1
## 12  strasbourg  2.397835 1.215033e-01        3           1
## 13      mexico  2.126311 1.447889e-01        2           0
## 14       texas  2.126311 1.447889e-01        2           0
## 15       chile  2.126311 1.447889e-01        2           0
## 16      munich  2.126311 1.447889e-01        2           0
## 17    göteborg  2.126311 1.447889e-01        2           0
## 18    michigan  2.126311 1.447889e-01        2           0
## 19     bamberg  2.126311 1.447889e-01        2           0
## 20         san  2.126311 1.447889e-01        2           0
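
The G2 column in the keyness tables above is a likelihood-ratio statistic computed from a 2x2 table: how often a word appears in the target group versus the reference group, relative to all the other words in each group. The sketch below computes the plain, unsigned version in base R. The function name `g2_stat` and the counts are made up for illustration. The numbers will not necessarily match the output of `textstat_keyness()`, which attaches a sign to G2 and, depending on the version and its `correction` setting, may apply a small-sample correction.

```r
# G2 for a 2x2 table (feature vs. all other tokens, target vs. reference)
# g2_stat is a hypothetical helper, not a quanteda function
g2_stat <- function(n_target, n_reference, total_target, total_reference) {
    observed <- matrix(c(n_target, total_target - n_target,
                         n_reference, total_reference - n_reference),
                       nrow=2, byrow=TRUE)
    # expected counts if the word were equally frequent in both groups
    expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
    # cells with zero observations contribute nothing to the sum
    terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
    2 * sum(terms)
}

# made-up counts: the word appears 10 times out of 100 tokens in the target
# group and 5 times out of 200 tokens in the reference group
g2 <- g2_stat(10, 5, 100, 200)
g2
# p-value from a chi-squared distribution with 1 degree of freedom
pchisq(g2, df=1, lower.tail=FALSE)
```

When a word is equally frequent in both groups, the observed and expected counts coincide and the statistic is zero; the larger the imbalance, the larger G2.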

The final way in which we can think about network communities is in terms of hierarchy or structure. We’ll discuss one such method, k-core decomposition.

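K-core decomposition allows us to identify the core and the periphery of the network. A k-core is a maximal subgraph in which every node has degree at least k, counting only ties to other nodes in that subgraph. The coreness of a node is the largest k for which it belongs to a k-core.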
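
To make the definition concrete, here is a minimal sketch on a made-up toy graph: a triangle with one extra node hanging off it. The node names are purely illustrative and are not part of the ECPR data.

```r
library(igraph)

# toy undirected graph: triangle A-B-C plus a pendant node D attached to C
toy <- graph_from_literal(A-B, B-C, C-A, C-D)

# coreness: largest k such that the node belongs to a k-core
k <- coreness(toy)
k

# extracting the innermost core as its own graph
core <- induced_subgraph(toy, which(k == max(k)))
V(core)$name
```

The three nodes in the triangle each keep two ties within the triangle, so they form the 2-core; D has only one tie and stays in the 1-core. For a directed graph such as ours, `coreness()` counts both incoming and outgoing ties by default (`mode="all"`); `mode="in"` or `mode="out"` restricts it to one direction.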

str(coreness(g))

##  Named num [1:902] 37 2 6 37 30 4 16 17 10 10 ...
##  - attr(*, "names")= chr [1:902] "Policy@Manchester" "Indiana University" "Sciences Po Aix" "Andrew Russell" ...

head(which(coreness(g)==37), n=10) # what is the core of the network?

##    Policy@Manchester       Andrew Russell       laura sudulich 
##                    1                    4                   19 
##                 ECFR     Dr Philipp Köker     Aleks Szczerbiak 
##                   27                   28                   29 
##  British Jnl Pol Sci           Jane Green       Kenneth Benoit 
##                   36                   37                   38 
## AstonCentreForEurope 
##                   42

head(which(coreness(g)==1), n=10) # what is the periphery of the network?

## zeppelin universität  UniversityofHouston       HelenaStensöta 
##                   18                   22                  136 
##     Yoav Shemer Kunz          FPN Beograd         Uni Freiburg 
##                  139                  150                  153 
## Università di Genova                  BYU                  MUP 
##                  184                  226                  242 
##  AZ State University 
##                  246

# looking at what predicts being in the core
nodes$k <- coreness(g)
# number of followers?
plot(nodes$k, log(nodes$followers_count))

cor(nodes$k, log(nodes$followers_count))

## [1] 0.2041074

# text?
ecpr <- dfm(corpus(nodes[,c("description", "k")], text_field="description"))
head(textstat_keyness(ecpr, target=docvars(ecpr)$k==37,
                      measure="lr"), n=20)

##           feature        G2            p n_target n_reference
## 1          public 29.709580 5.018620e-08       22          39
## 2       elections 19.921045 8.070703e-06        9           6
## 3          policy 12.860401 3.356087e-04       18          55
## 4               , 12.271812 4.598515e-04      131         840
## 5        politics 10.749216 1.043235e-03       41         202
## 6         science  9.931242 1.624969e-03       22          87
## 7         @ucigpa  8.887638 2.871079e-03        3           0
## 8       academics  8.459894 3.630630e-03        4           2
## 9       democracy  7.630002 5.740524e-03        8          17
## 10        opinion  7.117897 7.631817e-03        4           3
## 11 representation  7.117897 7.631817e-03        4           3
## 12      electoral  6.068687 1.375992e-02        4           4
## 13      professor  5.966177 1.458286e-02       17          75
## 14        insight  4.746932 2.935059e-02        2           0
## 15   @uompolitics  4.746932 2.935059e-02        2           0
## 16           read  4.746932 2.935059e-02        2           0
## 17          lse's  4.746932 2.935059e-02        2           0
## 18       #postdoc  4.746932 2.935059e-02        2           0
## 19           less  4.746932 2.935059e-02        2           0
## 20       @ukandeu  4.746932 2.935059e-02        2           0

head(textstat_keyness(ecpr, target=docvars(ecpr)$k==1,
                      measure="lr"), n=20)

##           feature        G2           p n_target n_reference
## 1     universitet 10.178138 0.001421156        4           3
## 2       sociology  9.308714 0.002280665        5           8
## 3          health  8.778659 0.003047752        3           1
## 4            http  8.471805 0.003606933        8          28
## 5             för  7.268232 0.007018484        3           2
## 6         houston  6.140725 0.013210317        2           0
## 7     statsvetare  6.140725 0.013210317        2           0
## 8         forskar  6.140725 0.013210317        2           0
## 9          örebro  6.140725 0.013210317        2           0
## 10       ciències  6.140725 0.013210317        2           0
## 11        socials  6.140725 0.013210317        2           0
## 12            any  6.140725 0.013210317        2           0
## 13              -  5.839708 0.015668430       12          79
## 14              ;  5.464824 0.019403009        8          39
## 15            och  5.332214 0.020934773        3           4
## 16 administration  4.647986 0.031089997        3           5
## 17        politik  4.359828 0.036796011        2           1
## 18       twittrar  4.359828 0.036796011        2           1
## 19             00  4.359828 0.036796011        2           1
## 20            vid  4.359828 0.036796011        2           1

If you want to learn more about this technique, we recently published a paper in PLOS ONE where we use it to study large-scale Twitter networks in the context of protest events.

In case you’re curious, here’s the code I used to collect the data:

library(netdemR)
options(stringsAsFactors=FALSE)
oauth_folder <- "~/Dropbox/credentials/twitter"

accounts <- getFriends("ecpr", oauth_folder=oauth_folder)

# creating the data folder (if it does not exist)
dir.create("data", showWarnings=FALSE)

# first check if there are any lists of friends already downloaded to the 'data' folder
accounts.done <- gsub(".rdata", "", list.files("data"), fixed=TRUE)
accounts.left <- accounts[!accounts %in% accounts.done]
accounts.left <- accounts.left[!is.na(accounts.left)]

# loop over the rest of accounts, downloading friend lists from API
while (length(accounts.left) > 0){

    # sample randomly one account to get friends
    new.user <- sample(accounts.left, 1)
    #new.user <- accounts.left[1]
    cat(new.user, "---", length(accounts.left), " accounts left!\n")    
    
    # download friends (with some exception handling...)
    error <- tryCatch(friends <- getFriends(user_id=new.user,
        oauth_folder=oauth_folder, sleep=0.5, verbose=FALSE), error=function(e) e)
    if (inherits(error, 'error')) {
        cat("Error! On to the next one...\n")
        accounts.left <- accounts.left[accounts.left != new.user]
        next
    }
    
    # save to file and remove from the list of accounts left
    file.name <- paste0("data/", new.user, ".rdata")
    save(friends, file=file.name)
    accounts.left <- accounts.left[accounts.left != new.user]

}

# keeping only those accounts whose friend lists were downloaded
accounts <- gsub(".rdata", "", list.files("data"), fixed=TRUE)

# reading the friend lists and creating the edge list
edges <- list()
for (i in seq_along(accounts)){
    file.name <- paste0("data/", accounts[i], ".rdata")
    load(file.name) # loads the object 'friends' saved above
    if (length(friends)==0){ next }
    # keep only friends that are also part of the ego network
    chosen <- accounts[accounts %in% friends]
    if (length(chosen)==0){ next }
    edges[[i]] <- data.frame(
        source = accounts[i], target = chosen)
}

edges <- do.call(rbind, edges)
nodes <- data.frame(id_str=unique(c(edges$source, edges$target)))

# adding user data
users <- getUsersBatch(ids=nodes$id_str, oauth_folder=oauth_folder)
nodes <- merge(nodes, users)

library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g

names(nodes)[1:2] <- c("Id", "Label")
names(edges)[1:2] <- c("Source", "Target")
write.csv(nodes, file="ecpr-nodes.csv", row.names=FALSE)
write.csv(edges, file="ecpr-edges.csv", row.names=FALSE)
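
If you want to try the edge-building step without access to the Twitter API, here is a minimal sketch that runs on made-up friend lists. The account IDs and the object names `friend_lists` and `toy_edges` are hypothetical; the logic mirrors the loop above, keeping only ties between accounts that are themselves part of the ego network.

```r
# hypothetical friend lists: names are the accounts in the ego network,
# values are the accounts each of them follows (made-up IDs)
friend_lists <- list(
    "101" = c("102", "103", "999"),
    "102" = c("101"),
    "103" = character(0)
)
accounts <- names(friend_lists)

# one data frame of edges per account, dropping friends outside the network
edge_list <- lapply(accounts, function(a) {
    chosen <- intersect(friend_lists[[a]], accounts)
    if (length(chosen) == 0) return(NULL)
    data.frame(source = a, target = chosen, stringsAsFactors = FALSE)
})

# rbind skips the NULL entries for accounts without any ties
toy_edges <- do.call(rbind, edge_list)
toy_edges
```

Account "999" is followed by "101" but is not part of the ego network, so it never appears in the edge list.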

Community detection

Pablo Barbera

August 8, 2018
