In this training session we will use a small dataset to illustrate how to identify latent communities in networks. The dataset corresponds to the Twitter ego network of the ECPR: each node is a Twitter account that the ECPR follows, and the edges indicate which of those accounts in turn follow one another. (See the end of this script for the code I used to put this network together.) Edges are thus directed.
The first step is to read the list of edges and nodes in this network:
edges <- read.csv("~/data/ecpr-edges.csv", stringsAsFactors=FALSE)
head(edges)
## Source Target
## 1 1001408503 102062058
## 2 1001408503 106836014
## 3 1001408503 1080956450
## 4 1001408503 108631068
## 5 1001408503 112729477
## 6 1001408503 1241258612
nodes <- read.csv("~/data/ecpr-nodes.csv", stringsAsFactors=FALSE)
head(nodes)
## Id Label name
## 1 1001408503 UoMPolicy Policy@Manchester
## 2 100367386 IndianaUniv Indiana University
## 3 1011441108 Sciences_Po_Aix Sciences Po Aix
## 4 102062058 PrfAndrwRssll Andrew Russell
## 5 1021697672 PoLIS_Bath PoLIS - Bath
## 6 1022158776 UniEssexLibrary Uni of Essex Library
## description
## 1 Influencers and shapers of public policy, based at @officialuom. Follow for robust insight, expertise and highlights from (arguably) the UK's biggest thinktank.
## 2 Established in 1820, Indiana University has 7 campuses: @IUBloomington, @IUEast, @IUKokomo, @IUNorthwest, @IUPUI, @IUSouthBend, and @IUSoutheast.
## 3 Sciences Po Aix - compte officiel de l'Institut d'Etudes Politiques d'Aix-en-Provence\n#SciencesPoAix
## 4 Formerly Poliblogmanc. Professor of Politics, University of Manchester (soon Liverpool) Pols, parlm, elections, Coventry City & England cricket. seldom succinct
## 5 The Department of Politics, Languages and International Studies at the @UniofBath, aimed at bringing news to the wider audience. RTs not endorsements.
## 6 We provide help, support and access to print and online resources for all students and researchers at the University of Essex.
## followers_count statuses_count friends_count
## 1 10027 8157 4377
## 2 66533 12817 392
## 3 2114 774 64
## 4 3988 5068 2705
## 5 1114 1841 125
## 6 1534 2973 156
## created_at location lang
## 1 Mon Dec 10 10:38:42 +0000 2012 en
## 2 Wed Dec 30 01:19:44 +0000 2009 Indiana en
## 3 Fri Dec 14 16:13:19 +0000 2012 Aix-en-Provence fr
## 4 Tue Jan 05 13:34:02 +0000 2010 Manchester/Liverpool en
## 5 Wed Dec 19 09:19:17 +0000 2012 Bath, Somerset, UK en
## 6 Wed Dec 19 14:15:54 +0000 2012 Colchester, Loughton, Southend en
## time_zone status.id_str status.created_at
## 1 Casablanca 8.952762e+17 Wed Aug 09 13:31:01 +0000 2017
## 2 Eastern Time (US & Canada) 8.953914e+17 Wed Aug 09 21:08:37 +0000 2017
## 3 <NA> 8.843515e+17 Mon Jul 10 10:00:08 +0000 2017
## 4 London 8.950706e+17 Tue Aug 08 23:54:06 +0000 2017
## 5 Casablanca 8.910120e+17 Fri Jul 28 19:06:19 +0000 2017
## 6 London 8.953190e+17 Wed Aug 09 16:20:54 +0000 2017
## status.text
## 1 RT @UoMNews: Watch @profbuchan discuss the north/south divide with @lucianaberger and @mattfrei on yesterday's @Channel4News https://t.co/f…
## 2 RT @IUNewsroom: .@IUMedSchool's William J. Wright Scholarship is helping prepare future cancer researchers: https://t.co/jeXD8lq0Mh https:/…
## 3 Cher.e.s étudiant.e.s, \n\nSciences Po Aix vous souhaite de bonnes vacances! \n\n L'I.E.P fermera ses portes le 21... https://t.co/2UWRzLT9Q5
## 4 Glen Campbell so many great renditions (esp of Jimmy Webb songs) but this 2008 Green Day cover remains special https://t.co/6FOLx7GpeC
## 5 RT @UniofBath: Pakistan Supreme Court disqualifies Prime Minister Nawaz Sharif - comments from @PoLIS_Bath 's @WaliAslam for @CNBC https:/…
## 6 RT @CathyJ62: Great progress with the refurbishment of our Library Reading Room -it's going to be a fantastic space for our students ! http…
For example, we learn that the user with ID 1001408503 follows the user with ID 102062058.
How do we convert these two datasets into a network object in R? There are multiple packages for working with networks, but the most popular is igraph, because it is flexible and easy to use and, in my experience, much faster and better at scaling to very large networks. Other packages that you may want to explore are sna and network.
Now, how do we create the igraph object? We can use the graph_from_data_frame function, which takes two arguments: d, a data frame with the edge list in its first two columns; and vertices, a data frame of node data whose first column contains the node labels. (Note that igraph calls the nodes vertices, but it's exactly the same thing.) Also note that, even though the follower relationships are directed, we build an undirected graph here; some of the community detection methods used below (e.g. the Louvain algorithm) only work with undirected graphs.
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g
## IGRAPH 74a317a UN-- 902 13606 --
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges from 74a317a (vertex names):
## [1] Policy@Manchester--Andrew Russell
## [2] Policy@Manchester--laura sudulich
## [3] Policy@Manchester--Jean-Paul Vargas
## [4] Policy@Manchester--ECFR
## + ... omitted several edges
What does this mean?
- U means the graph is undirected
- N means it is a named graph
- 902 is the number of nodes
- 13606 is the number of edges
- name (v/c) means that name is a node attribute and that it is a character vector
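All of these quantities can also be queried directly from the igraph object. A quick sketch:
vcount(g)               # number of nodes
ecount(g)               # number of edges
vertex_attr_names(g)    # node attributes available
head(V(g)$name)         # 'name' is the attribute igraph uses to label vertices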
Networks often contain clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let's cover some of the existing methods to identify these communities.
The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. For example, this network has only one component: every node can be reached from every other node through some path.
str(components(g))
## List of 3
## $ membership: Named num [1:902] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "names")= chr [1:902] "Policy@Manchester" "Indiana University" "Sciences Po Aix" "Andrew Russell" ...
## $ csize : num 902
## $ no : int 1
Most networks have a single giant connected component that includes most of the nodes. Most studies of networks actually focus on this giant component (for example, the shortest path between two nodes that sit in different components is infinite!).
giant <- decompose(g, mode="strong")
giant
## [[1]]
## IGRAPH ab5f203 UN-- 902 13606 --
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges from ab5f203 (vertex names):
## [1] Policy@Manchester--Andrew Russell
## [2] Policy@Manchester--laura sudulich
## [3] Policy@Manchester--Jean-Paul Vargas
## [4] Policy@Manchester--ECFR
## + ... omitted several edges
In directed networks, components can be weakly connected (every node can reach every other node once we ignore the direction of the edges) or strongly connected (there is a directed path between every pair of nodes in the component). In an undirected graph like the one we created here the distinction does not apply, so both calls return the same single component.
weakly <- decompose(g, mode="weak")
weakly
## [[1]]
## IGRAPH 0471281 UN-- 902 13606 --
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges from 0471281 (vertex names):
## [1] Policy@Manchester--Andrew Russell
## [2] Policy@Manchester--laura sudulich
## [3] Policy@Manchester--Jean-Paul Vargas
## [4] Policy@Manchester--ECFR
## + ... omitted several edges
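Since this network has a single component, decompose() returns a list with just one graph. In a network with several components, one way to keep only the giant component is sketched below, using components() and induced_subgraph():
comp <- components(g)
giant.nodes <- which(comp$membership == which.max(comp$csize))
g.giant <- induced_subgraph(g, giant.nodes)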
Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.
There are a few different algorithms, each following a different logic.
The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values (longer random walks will lead to fewer communities). The goal of this algorithm is to identify the partition that maximizes a modularity score.
cluster_walktrap(g)
## IGRAPH clustering walktrap, groups: 161, mod: 0.16
## + groups:
## $`1`
## [1] "Uni Research Rokkan" "ISF"
##
## $`2`
## [1] "Réseau DEL" "Peter Ucen"
## [3] "CSES" "laura sudulich"
## [5] "Dr Philipp Köker" "Aleks Szczerbiak"
## [7] "Luis Ramiro" "JCER"
## [9] "Kenneth Benoit" "Mona Lena Krook"
## [11] "AJPS" "UIC-GENDER"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=10)
## IGRAPH clustering walktrap, groups: 130, mod: 0.11
## + groups:
## $`1`
## [1] "Penn State" "Notre Dame"
##
## $`2`
## [1] "Policy@Manchester" "Andrew Russell"
## [3] "PoLIS - Bath" "RowmanLit Internat"
## [5] "ECPR_SGOC" "Milja Saari"
## [7] "Jane Green" "Mona Lena Krook"
## [9] "UIC-GENDER" "Rachel E. Johnson"
## [11] "Chris Brown" "Kingston Politics"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=20)
## IGRAPH clustering walktrap, groups: 76, mod: 0.098
## + groups:
## $`1`
## [1] "Frank Underwood" "JCER" "Samuel Brazys"
## [4] "EU Democracy" "AK" "Andreas Busch"
## [7] "Ronny Patz" "Carsten Q. Schneider" "Bastian Becker"
## [10] "Johns Hopkins | SAIS" "peter slominski" "DOGOPO"
## [13] "Hilde vMeegdenburg" "Régis Dandoy" "Karolina Króliczek"
## [16] "Daniel Chasquetti" "Carolina Plescia" "Politics & IR @ Kent"
## [19] "Stanford CDDRL" "UCD Politics" "(((Tove H. Malloy)))"
## [22] "Politics UVA" "MPSA" "Brian Fabo"
## [25] "Alia Papageorgiou" "ESRC" "Political Science"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=30)
## IGRAPH clustering walktrap, groups: 9, mod: 0.099
## + groups:
## $`1`
## [1] "Tallinn University" "TTÜ" "Vilnius University"
##
## $`2`
## [1] "Uni Research Rokkan" "Universitetet Bergen" "Nord universitet"
## [4] "Mittuniversitetet" "UiT" "Linnéuniversitetet"
## [7] "ISF"
##
## $`3`
## [1] "Humboldt-Universität" "Universität Wien" "Universität Tübingen"
## + ... omitted several groups/vertices
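The objects returned by these clustering functions can be inspected with a few helper functions. A quick sketch:
cw <- cluster_walktrap(g)
modularity(cw)        # modularity score of the chosen partition
length(cw)            # number of communities
sizes(cw)             # number of nodes in each community
head(membership(cw))  # community assigned to each node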
Other methods are:
cluster_infomap(g)
cluster_edge_betweenness(g)
cluster_label_prop(g)
cluster_louvain(g)
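If in doubt, one practical check is to run several of these algorithms and compare the modularity and the number of communities of the partitions they return. A minimal sketch (I leave out cluster_edge_betweenness, which is slow on a network of this size; also note that some of these methods are stochastic, so exact results will vary):
algos <- list(walktrap   = cluster_walktrap(g),
              infomap    = cluster_infomap(g),
              label_prop = cluster_label_prop(g),
              louvain    = cluster_louvain(g))
sapply(algos, modularity)   # higher values indicate a more modular partition
sapply(algos, length)       # number of communities found by each method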
As always, the choice of algorithm may depend on substantive or practical considerations. For now, let's pick the Louvain algorithm.
comm <- cluster_louvain(g)
nodes$cluster <- membership(comm)
head(nodes$Label[nodes$cluster==1], n=10)
## [1] "bearaboi" "MilSaari" "mlkrook" "UICGENDER"
## [5] "dfarrell_ucd" "DrREJohnson" "sbrazys_ucd" "taniaverge"
## [9] "LawGovDCU" "theresareidy"
head(nodes$Label[nodes$cluster==2], n=10)
## [1] "ManuMoschella" "JCERJournal" "EUlondonrep"
## [4] "Aston_ACE" "EUDOEUI" "LSEEuroppblog"
## [7] "Daniela_Vintila" "ECPRKnowledge" "GeorgeKyris"
## [10] "Erik_Jones_SAIS"
head(nodes$Label[nodes$cluster==3], n=10)
## [1] "RowmanInternat" "ecfr" "JrnlofRS"
## [4] "EdinburghUP" "InternatlTheory" "santinoregilme"
## [7] "diisdk" "fa_burkhardt" "CEJISS"
## [10] "ISA_IPSsection"
head(nodes$Label[nodes$cluster==4], n=10)
## [1] "UoMPolicy" "PrfAndrwRssll" "PoLIS_Bath" "ECPR_SGOC"
## [5] "CalumWWhite" "BCeliktemur" "chrisjbrown1" "KUPolitics"
## [9] "George_Osborne" "leancar2010"
head(nodes$Label[nodes$cluster==5], n=10)
## [1] "IndianaUniv" "ugent" "BU_Tweets" "UniTampere"
## [5] "LeidenSocial" "zeppelin" "UV_EG" "HumboldtUni"
## [9] "FES_Sociologia" "TallinnUni"
head(nodes$Label[nodes$cluster==6], n=10)
## [1] "Sciences_Po_Aix" "UniEssexLibrary" "reseauDEL"
## [4] "peter_ucen" "Welpita" "NewBehemot"
## [7] "csestweets" "laurasud" "Frank_Underwood"
## [10] "UHouston"
We can also cross-tabulate attributes of the accounts with their cluster membership. For example, the interface language already reveals some geographic structure:
table(nodes$lang, nodes$cluster)
##
## 1 2 3 4 5 6
## ca 1 0 1 0 0 1
## cs 0 0 1 0 0 1
## de 1 2 1 0 39 24
## en 63 83 109 191 72 157
## en-gb 5 4 5 8 0 7
## en-GB 0 0 0 1 0 0
## es 2 2 1 1 8 11
## fi 1 1 1 0 3 0
## fr 0 6 1 2 10 21
## it 0 4 1 1 9 8
## ja 0 0 0 0 1 0
## nl 0 2 0 0 6 4
## no 0 0 0 0 3 0
## pl 1 0 1 0 0 0
## ru 0 0 0 1 0 0
## sv 0 2 0 0 2 7
## tr 0 0 0 0 1 0
To dig a bit deeper, we can use the quanteda package to look at the most frequent words in the Twitter bios of the accounts in each cluster:
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
## Package version: 1.3.0
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
for (i in 1:6){
  message("Cluster ", i)
  dfm <- dfm(nodes$description[nodes$cluster==i],
             remove_punct=TRUE, remove=stopwords("english"))
  print(topfeatures(dfm, n=25))
}
## Cluster 1
## politics university gender political research
## 39 27 22 21 14
## professor policy college public lecturer
## 13 12 11 10 9
## dublin ireland ucd views science
## 8 8 8 7 7
## scientist social studies interested personal
## 7 7 6 6 6
## international school women de news
## 6 6 5 5 5
## Cluster 2
## european politics eu political research
## 41 27 21 21 19
## university international europe lecturer studies
## 16 16 15 13 13
## relations professor senior fellow centre
## 12 10 10 10 10
## group t.co https policy uk
## 9 9 8 7 7
## researcher director news tweets endorsement
## 7 7 6 6 6
## Cluster 3
## international politics university political relations
## 51 26 26 26 23
## research studies security global science
## 21 19 16 14 14
## institute school professor academic european
## 12 11 9 8 8
## journal books journals law tweets
## 8 8 8 8 8
## centre social endorsement publisher department
## 8 8 8 7 7
## Cluster 4
## politics university political research policy
## 87 63 51 49 36
## international social public t.co uk
## 32 31 29 26 25
## professor department relations https science
## 22 22 18 17 17
## news studies twitter official teaching
## 15 14 13 12 12
## study school tweets lecturer democracy
## 12 12 12 12 10
## Cluster 5
## university t.co de der official
## 61 43 35 34 26
## twitter die https account research
## 23 23 23 22 20
## http und impressum universität hier
## 20 16 15 13 13
## twittert news education tweets sciences
## 12 11 10 9 9
## pressestelle follow us la one
## 8 7 7 7 7
## Cluster 6
## political politics science university de
## 109 64 62 51 44
## research professor european social t.co
## 43 38 26 26 24
## scientist sciences https international en
## 23 20 20 15 13
## public comparative lecturer department journal
## 13 10 10 10 10
## association policy parties democracy tweets
## 10 10 10 9 9
Another way to characterize the clusters is to find the words that are most distinctive of each one using textstat_keyness (here with a likelihood-ratio test), first on the account descriptions and then on the self-reported locations.
# description
ecpr <- dfm(corpus(nodes[,c("description", "cluster")], text_field="description"))
for (i in 1:6){
  print(
    head(textstat_keyness(ecpr, target=docvars(ecpr)$cluster==i,
                          measure="lr"), n=20)
  )
}
## feature G2 p n_target n_reference
## 1 gender 79.973495 0.000000e+00 22 5
## 2 ucd 33.366405 7.633122e-09 8 0
## 3 dublin 29.173693 6.617159e-08 8 1
## 4 college 23.421610 1.301087e-06 11 11
## 5 ireland 21.936510 2.818209e-06 8 4
## 6 politics 14.716246 1.249649e-04 39 204
## 7 women 11.857638 5.742533e-04 5 3
## 8 cork 9.717155 1.825559e-03 3 0
## 9 comms 9.717155 1.825559e-03 3 0
## 10 diversity 9.717155 1.825559e-03 3 0
## 11 personal 8.407011 3.737766e-03 6 10
## 12 equality 7.305312 6.875107e-03 3 1
## 13 . 7.021204 8.055005e-03 114 944
## 14 and 6.916349 8.541116e-03 60 445
## 15 own 5.539356 1.859342e-02 7 21
## 16 aims 5.237577 2.210404e-02 2 0
## 17 spire 5.237577 2.210404e-02 2 0
## 18 like 5.237577 2.210404e-02 2 0
## 19 co-convenor 5.237577 2.210404e-02 2 0
## 20 @ecpg3 5.237577 2.210404e-02 2 0
## feature G2 p n_target n_reference
## 1 european 59.541174 1.199041e-14 41 46
## 2 eu 39.956838 2.596371e-10 21 14
## 3 europe 19.773486 8.718372e-06 15 17
## 4 jean 19.040835 1.279508e-05 6 0
## 5 monnet 19.040835 1.279508e-05 6 0
## 6 * 13.107422 2.941282e-04 6 2
## 7 contemporary 11.341675 7.578695e-04 4 0
## 8 union 9.856546 1.692285e-03 5 2
## 9 #brexit 8.528382 3.496505e-03 4 1
## 10 | 7.883977 4.987468e-03 35 132
## 11 \U0001f1ea\U0001f1fa 7.590090 5.869002e-03 3 0
## 12 group 7.119265 7.625995e-03 9 16
## 13 senior 6.899476 8.622101e-03 10 20
## 14 fellow 6.899476 8.622101e-03 10 20
## 15 & 6.877885 8.726875e-03 39 158
## 16 researcher 6.739806 9.428485e-03 7 10
## 17 lecturer 6.587495 1.026976e-02 13 35
## 18 also 5.970778 1.454486e-02 6 8
## 19 mostly 5.495393 1.906664e-02 4 3
## 20 bringing 5.282255 2.154383e-02 3 1
## feature G2 p n_target n_reference
## 1 international 55.936535 7.482903e-14 51 76
## 2 security 33.480501 7.198203e-09 16 8
## 3 relations 21.203333 4.130457e-06 23 40
## 4 global 20.651675 5.508898e-06 14 13
## 5 journals 19.566128 9.717689e-06 8 2
## 6 and 16.029352 6.236805e-05 100 405
## 7 books 15.512428 8.196459e-05 8 4
## 8 publisher 14.138254 1.698543e-04 7 3
## 9 studies 12.264741 4.615973e-04 19 43
## 10 institute 10.270088 1.352043e-03 12 20
## 11 assoc 8.362391 3.830662e-03 4 1
## 12 #humanrights 7.467106 6.283648e-03 3 0
## 13 produces 7.467106 6.283648e-03 3 0
## 14 princeton 7.467106 6.283648e-03 3 0
## 15 ashgate 7.467106 6.283648e-03 3 0
## 16 community 7.388067 6.565800e-03 6 6
## 17 content 6.609703 1.014246e-02 4 2
## 18 ir 6.516109 1.069016e-02 6 7
## 19 academic 6.044574 1.394910e-02 8 14
## 20 rts 5.757633 1.641717e-02 6 8
## feature G2 p n_target n_reference
## 1 uk 32.781367 1.031287e-08 25 11
## 2 policy 21.552715 3.442349e-06 36 37
## 3 politics 16.528694 4.791940e-05 87 156
## 4 the 16.111781 5.971142e-05 154 324
## 5 public 15.651740 7.614251e-05 29 32
## 6 department 13.647051 2.205875e-04 22 22
## 7 of 12.773203 3.516202e-04 170 382
## 8 british 11.574309 6.686929e-04 9 3
## 9 learning 11.574309 6.686929e-04 9 3
## 10 based 11.554935 6.756967e-04 7 1
## 11 uk's 11.554935 6.756967e-04 7 1
## 12 study 10.362344 1.286118e-03 12 9
## 13 practice 9.163474 2.468934e-03 6 1
## 14 social 7.907810 4.922183e-03 31 50
## 15 lse 7.434884 6.397150e-03 7 3
## 16 health 7.193229 7.317917e-03 4 0
## 17 sheffield 7.193229 7.317917e-03 4 0
## 18 guardian 6.851345 8.857459e-03 5 1
## 19 & 5.394479 2.020056e-02 62 135
## 20 and 5.059599 2.448990e-02 144 361
## feature G2 p n_target n_reference
## 1 der 89.17961 0.000000e+00 34 6
## 2 / 68.35905 1.110223e-16 144 291
## 3 die 64.52825 9.992007e-16 23 2
## 4 : 58.81399 1.731948e-14 90 149
## 5 impressum 47.23190 6.306622e-12 15 0
## 6 hier 35.35295 2.750520e-09 13 1
## 7 und 32.87162 9.845004e-09 16 5
## 8 twittert 32.03921 1.510919e-08 12 1
## 9 t.co 29.73745 4.946982e-08 43 68
## 10 official 27.88464 1.287676e-07 26 28
## 11 http 27.44067 1.619885e-07 20 16
## 12 account 27.36154 1.687544e-07 22 20
## 13 universität 26.35213 2.845048e-07 13 4
## 14 de 23.77448 1.083088e-06 35 56
## 15 pressestelle 23.03272 1.592675e-06 8 0
## 16 twitter 19.45266 1.031243e-05 23 31
## 17 van 12.77862 3.506042e-04 6 1
## 18 og 12.77862 3.506042e-04 6 1
## 19 visit 10.48764 1.201758e-03 6 2
## 20 university 10.08810 1.492296e-03 61 183
## feature G2 p n_target n_reference
## 1 political 59.942867 9.769963e-15 109 124
## 2 science 55.408434 9.792167e-14 62 47
## 3 de 26.274353 2.961961e-07 44 47
## 4 scientist 17.724138 2.553656e-05 23 20
## 5 ; 14.145313 1.692181e-04 23 24
## 6 del 14.059777 1.770908e-04 9 2
## 7 professor 14.006482 1.821815e-04 38 54
## 8 en 12.569733 3.920445e-04 13 9
## 9 populism 11.822844 5.850843e-04 7 1
## 10 po 11.720511 6.181498e-04 8 2
## 11 politiques 11.720511 6.181498e-04 8 2
## 12 resources 9.987500 1.576065e-03 5 0
## 13 via 9.987500 1.576065e-03 5 0
## 14 compte 9.448742 2.112940e-03 7 2
## 15 officiel 9.448742 2.112940e-03 7 2
## 16 methods 9.448742 2.112940e-03 7 2
## 17 | 9.372787 2.202304e-03 57 110
## 18 parties 9.278330 2.318807e-03 10 6
## 19 y 8.709622 3.165348e-03 9 5
## 20 für 8.169209 4.260749e-03 8 4
# location
ecpr <- dfm(corpus(nodes[,c("location", "cluster")], text_field="location"))
for (i in 1:6){
  print(
    head(textstat_keyness(ecpr, target=docvars(ecpr)$cluster==i,
                          measure="lr"), n=20)
  )
}
## feature G2 p n_target n_reference
## 1 dublin 90.588312 0.000000e+00 20 1
## 2 ireland 73.787290 0.000000e+00 19 4
## 3 cork 16.411391 5.097795e-05 5 1
## 4 limerick 10.079762 1.499067e-03 3 0
## 5 boston 3.699074 5.444268e-02 2 1
## 6 west 2.077731 1.494622e-01 2 3
## 7 ( 1.241030 2.652726e-01 2 5
## 8 ) 1.241030 2.652726e-01 2 5
## 9 city 1.241030 2.652726e-01 2 5
## 10 espoo 1.230502 2.673094e-01 1 0
## 11 maynooth 1.230502 2.673094e-01 1 0
## 12 meath 1.230502 2.673094e-01 1 0
## 13 galway 1.230502 2.673094e-01 1 0
## 14 denton 1.230502 2.673094e-01 1 0
## 15 tx 1.230502 2.673094e-01 1 0
## 16 but 1.230502 2.673094e-01 1 0
## 17 really 1.230502 2.673094e-01 1 0
## 18 brno 1.230502 2.673094e-01 1 0
## 19 devon 1.230502 2.673094e-01 1 0
## 20 reykjavik 1.230502 2.673094e-01 1 0
## feature G2 p n_target n_reference
## 1 florence 13.0729262 0.0002995947 4 0
## 2 belgium 7.4600819 0.0063082138 6 9
## 3 brussels 7.4600819 0.0063082138 6 9
## 4 europe 4.4205606 0.0355083681 4 6
## 5 italy 4.0037992 0.0453978242 3 3
## 6 / 3.5739007 0.0586942838 6 17
## 7 cardiff 2.9993830 0.0832962350 2 1
## 8 guildford 2.9993830 0.0832962350 2 1
## 9 exeter 2.9993830 0.0832962350 2 1
## 10 montréal 2.0813998 0.1491033865 2 2
## 11 belfast 1.4810209 0.2236148686 2 3
## 12 united 1.3522456 0.2448859146 4 14
## 13 kingdom 1.3522456 0.2448859146 4 14
## 14 bristol 1.0575189 0.3037817407 2 4
## 15 helsinki 1.0575189 0.3037817407 2 4
## 16 leicester 1.0575189 0.3037817407 2 4
## 17 aston 0.9980128 0.3177918364 1 0
## 18 london-brussels-madrid 0.9980128 0.3177918364 1 0
## 19 where 0.9980128 0.3177918364 1 0
## 20 ? 0.9980128 0.3177918364 1 0
## feature G2 p n_target n_reference
## 1 new 13.543522 0.0002330946 10 8
## 2 york 12.545728 0.0003971132 7 3
## 3 dc 11.468851 0.0007077242 6 2
## 4 belgrade 6.795569 0.0091384391 3 0
## 5 serbia 6.795569 0.0091384391 3 0
## 6 princeton 6.795569 0.0091384391 3 0
## 7 brisbane 4.541437 0.0330838034 3 1
## 8 washington 4.105570 0.0427421934 5 6
## 9 cambridge 3.612956 0.0573311366 4 4
## 10 andrews 3.513256 0.0608797126 2 0
## 11 flensburg 3.513256 0.0608797126 2 0
## 12 prague 3.513256 0.0608797126 2 0
## 13 or 3.513256 0.0608797126 2 0
## 14 australia 2.884235 0.0894504723 4 5
## 15 sussex 2.354207 0.1249453891 3 3
## 16 bath 1.939760 0.1636945960 2 1
## 17 ny 1.939760 0.1636945960 2 1
## 18 poland 1.939760 0.1636945960 2 1
## 19 germany 1.332369 0.2483841932 6 16
## 20 sydney 1.238489 0.2657625180 3 5
## feature G2 p n_target n_reference
## 1 london 64.153933 1.110223e-15 58 39
## 2 uk 31.460480 2.035410e-08 42 41
## 3 england 10.022856 1.546096e-03 13 12
## 4 of 6.116546 1.339223e-02 12 15
## 5 sheffield 4.915797 2.661218e-02 3 0
## 6 essex 4.915797 2.661218e-02 3 0
## 7 university 4.333797 3.736281e-02 15 25
## 8 edinburgh 4.112160 4.257597e-02 5 3
## 9 & 3.457223 6.297600e-02 4 2
## 10 leeds 2.847126 9.153742e-02 3 1
## 11 southampton 2.424110 1.194811e-01 2 0
## 12 lse 2.424110 1.194811e-01 2 0
## 13 yorkshire 2.424110 1.194811e-01 2 0
## 14 kingston 2.424110 1.194811e-01 2 0
## 15 uni 2.424110 1.194811e-01 2 0
## 16 newcastle 2.424110 1.194811e-01 2 0
## 17 upon 2.424110 1.194811e-01 2 0
## 18 manchester 2.423527 1.195256e-01 4 3
## 19 wales 2.423527 1.195256e-01 4 3
## 20 liverpool 1.728938 1.885466e-01 3 2
## feature G2 p n_target n_reference
## 1 canada 12.650881 0.0003753887 5 0
## 2 germany 7.915026 0.0049025864 10 12
## 3 denmark 6.204399 0.0127433207 3 0
## 4 finland 6.081782 0.0136582962 5 3
## 5 sweden 6.081782 0.0136582962 5 3
## 6 netherlands 4.048528 0.0442098740 5 5
## 7 tallinn 3.168844 0.0750555585 2 0
## 8 bergen 3.168844 0.0750555585 2 0
## 9 leuven 3.168844 0.0750555585 2 0
## 10 estonia 3.168844 0.0750555585 2 0
## 11 italia 3.168844 0.0750555585 2 0
## 12 frankfurt 3.168844 0.0750555585 2 0
## 13 am 3.168844 0.0750555585 2 0
## 14 main 3.168844 0.0750555585 2 0
## 15 turku 3.168844 0.0750555585 2 0
## 16 bremen 3.168844 0.0750555585 2 0
## 17 belgium 3.123207 0.0771847161 6 9
## 18 , 2.598223 0.1069836157 77 300
## 19 norway 1.908793 0.1670973478 3 3
## 20 leiden 1.643903 0.1997907487 2 1
## feature G2 p n_target n_reference
## 1 france 18.064806 2.135118e-05 10 1
## 2 budapest 12.879401 3.322185e-04 9 2
## 3 hungary 11.842917 5.788107e-04 6 0
## 4 amsterdam 8.560537 3.435280e-03 7 2
## 5 de 6.810082 9.064461e-03 4 0
## 6 paris 5.981963 1.445290e-02 8 5
## 7 österreich 4.395070 3.604299e-02 3 0
## 8 zurich 4.395070 3.604299e-02 3 0
## 9 switzerland 4.306463 3.796781e-02 4 1
## 10 nottingham 3.879438 4.888091e-02 6 4
## 11 deutschland 2.397835 1.215033e-01 3 1
## 12 strasbourg 2.397835 1.215033e-01 3 1
## 13 mexico 2.126311 1.447889e-01 2 0
## 14 texas 2.126311 1.447889e-01 2 0
## 15 chile 2.126311 1.447889e-01 2 0
## 16 munich 2.126311 1.447889e-01 2 0
## 17 göteborg 2.126311 1.447889e-01 2 0
## 18 michigan 2.126311 1.447889e-01 2 0
## 19 bamberg 2.126311 1.447889e-01 2 0
## 20 san 2.126311 1.447889e-01 2 0
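As a final check on the Louvain partition, we can pull out a single community and plot it as a subgraph (plotting all 902 nodes at once is rarely readable). A sketch, using community 5, which the results above suggest is dominated by universities tweeting in German and Nordic languages:
g5 <- induced_subgraph(g, which(membership(comm) == 5))
plot(g5, vertex.size=3, vertex.label=NA)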
The final way in which we can think about network communities is in terms of hierarchy or structure. We'll discuss one such method here.
K-core decomposition allows us to identify the core and the periphery of the network. A k-core is a maximal subgraph in which every node has degree at least k.
str(coreness(g))
## Named num [1:902] 37 2 6 37 30 4 16 17 10 10 ...
## - attr(*, "names")= chr [1:902] "Policy@Manchester" "Indiana University" "Sciences Po Aix" "Andrew Russell" ...
head(which(coreness(g)==37), n=10) # what is the core of the network?
## Policy@Manchester Andrew Russell laura sudulich
## 1 4 19
## ECFR Dr Philipp Köker Aleks Szczerbiak
## 27 28 29
## British Jnl Pol Sci Jane Green Kenneth Benoit
## 36 37 38
## AstonCentreForEurope
## 42
head(which(coreness(g)==1), n=10) # what is the periphery of the network?
## zeppelin universität UniversityofHouston HelenaStensöta
## 18 22 136
## Yoav Shemer Kunz FPN Beograd Uni Freiburg
## 139 150 153
## Università di Genova BYU MUP
## 184 226 242
## AZ State University
## 246
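If we want to work with the innermost core directly, we can extract it as its own graph. A sketch using induced_subgraph() and the maximum coreness value:
core <- induced_subgraph(g, which(coreness(g) == max(coreness(g))))
core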
# looking at what predicts being in the core
nodes$k <- coreness(g)
# number of followers?
plot(nodes$k, log(nodes$followers_count))
cor(nodes$k, log(nodes$followers_count))
## [1] 0.2041074
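# a slightly more formal check (just a sketch): regress coreness on followers;
# adding 1 before taking logs avoids -Inf for accounts with zero followers
summary(lm(k ~ log(followers_count + 1), data=nodes))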
# text?
ecpr <- dfm(corpus(nodes[,c("description", "k")], text_field="description"))
head(textstat_keyness(ecpr, target=docvars(ecpr)$k==37,
                      measure="lr"), n=20)
## feature G2 p n_target n_reference
## 1 public 29.709580 5.018620e-08 22 39
## 2 elections 19.921045 8.070703e-06 9 6
## 3 policy 12.860401 3.356087e-04 18 55
## 4 , 12.271812 4.598515e-04 131 840
## 5 politics 10.749216 1.043235e-03 41 202
## 6 science 9.931242 1.624969e-03 22 87
## 7 @ucigpa 8.887638 2.871079e-03 3 0
## 8 academics 8.459894 3.630630e-03 4 2
## 9 democracy 7.630002 5.740524e-03 8 17
## 10 opinion 7.117897 7.631817e-03 4 3
## 11 representation 7.117897 7.631817e-03 4 3
## 12 electoral 6.068687 1.375992e-02 4 4
## 13 professor 5.966177 1.458286e-02 17 75
## 14 insight 4.746932 2.935059e-02 2 0
## 15 @uompolitics 4.746932 2.935059e-02 2 0
## 16 read 4.746932 2.935059e-02 2 0
## 17 lse's 4.746932 2.935059e-02 2 0
## 18 #postdoc 4.746932 2.935059e-02 2 0
## 19 less 4.746932 2.935059e-02 2 0
## 20 @ukandeu 4.746932 2.935059e-02 2 0
head(textstat_keyness(ecpr, target=docvars(ecpr)$k==1,
                      measure="lr"), n=20)
## feature G2 p n_target n_reference
## 1 universitet 10.178138 0.001421156 4 3
## 2 sociology 9.308714 0.002280665 5 8
## 3 health 8.778659 0.003047752 3 1
## 4 http 8.471805 0.003606933 8 28
## 5 för 7.268232 0.007018484 3 2
## 6 houston 6.140725 0.013210317 2 0
## 7 statsvetare 6.140725 0.013210317 2 0
## 8 forskar 6.140725 0.013210317 2 0
## 9 örebro 6.140725 0.013210317 2 0
## 10 ciències 6.140725 0.013210317 2 0
## 11 socials 6.140725 0.013210317 2 0
## 12 any 6.140725 0.013210317 2 0
## 13 - 5.839708 0.015668430 12 79
## 14 ; 5.464824 0.019403009 8 39
## 15 och 5.332214 0.020934773 3 4
## 16 administration 4.647986 0.031089997 3 5
## 17 politik 4.359828 0.036796011 2 1
## 18 twittrar 4.359828 0.036796011 2 1
## 19 00 4.359828 0.036796011 2 1
## 20 vid 4.359828 0.036796011 2 1
If you want to learn more about this technique, we recently published a paper in PLOS ONE where we use it to study large-scale Twitter networks in the context of protest events.
In case you’re curious, here’s the code I used to collect the data:
library(netdemR)
options(stringsAsFactors=F)
oauth_folder <- "~/Dropbox/credentials/twitter"
accounts <- getFriends("ecpr", oauth_folder=oauth_folder)
# create the 'data' folder (if it does not exist yet)
try(dir.create("data"))
# first, check whether any friends lists have already been downloaded to the 'data' folder
accounts.done <- gsub(".rdata", "", list.files("data"))
accounts.left <- accounts[accounts %in% accounts.done == FALSE]
accounts.left <- accounts.left[!is.na(accounts.left)]
# loop over the remaining accounts, downloading friends lists from the API
while (length(accounts.left) > 0){
  # sample one account at random to get its friends
  new.user <- sample(accounts.left, 1)
  #new.user <- accounts.left[1]
  cat(new.user, "---", length(accounts.left), " accounts left!\n")
  # download friends list (with some exception handling...)
  error <- tryCatch(friends <- getFriends(user_id=new.user,
    oauth_folder=oauth_folder, sleep=0.5, verbose=FALSE), error=function(e) e)
  if (inherits(error, 'error')) {
    cat("Error! On to the next one...")
    accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
    next
  }
  # save to file and remove from the list of accounts left
  file.name <- paste0("data/", new.user, ".rdata")
  save(friends, file=file.name)
  accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
}
# keep only the accounts whose friends lists we were able to download
accounts <- gsub(".rdata", "", list.files("data"))
# reading the friends lists and building the edge list
edges <- list()
for (i in 1:length(accounts)){
  file.name <- paste0("data/", accounts[i], ".rdata")
  load(file.name)
  if (length(friends)==0){ next }
  chosen <- accounts[accounts %in% friends]
  if (length(chosen)==0){ next }
  edges[[i]] <- data.frame(
    source = accounts[i], target = chosen)
}
edges <- do.call(rbind, edges)
nodes <- data.frame(id_str=unique(c(edges$source, edges$target)))
# adding user data
users <- getUsersBatch(ids=nodes$id_str, oauth_folder=oauth_folder)
nodes <- merge(nodes, users)
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g
names(nodes)[1:2] <- c("Id", "Label")
names(edges)[1:2] <- c("Source", "Target")
write.csv(nodes, file="ecpr-nodes.csv", row.names=FALSE)
write.csv(edges, file="ecpr-edges.csv", row.names=FALSE)