Community detection

Importing network data into R

In this guided coding session we will use a small dataset to illustrate how to identify latent communities in networks. The dataset is the Twitter ego network of USC POIR: each node is a Twitter account that the USC POIR account follows, and each edge indicates that one of those accounts follows another. (See the end of this script for the code used to put this network together.) Edges are thus directed.

The first step is to read the list of edges and nodes in this network:

edges <- read.csv("../data/poir-edges.csv", stringsAsFactors=FALSE)
head(edges)

##      Source     Target
## 1 112448318  116630713
## 2 112448318  119679411
## 3 112448318  119682506
## 4 112448318  135469780
## 5 112448318 1852171094
## 6 112448318  186894716

nodes <- read.csv("../data/poir-nodes.csv", stringsAsFactors=FALSE)
head(nodes)

##           Id          Label                name
## 1  112448318         BJPolS British Jnl Pol Sci
## 2 1137637033    AJPS_Editor                AJPS
## 3  114865774    USCGouldLaw       USC Gould Law
## 4  116630713 monkeycageblog         Monkey Cage
## 5 1180479770          jaj7d        Jeff Jenkins
## 6  119679411    EPSRjournal Euro Pol Sci Review
##                                                                                                                                  description
## 1                                                                      British Journal of Political Science from Cambridge University Press.
## 2                                                                                                      American Journal of Political Science
## 3                                     A top-20 law school, USC Gould offers students a world-class education and unparalleled opportunities.
## 4 H.L. Mencken said: "Democracy is the art of running the circus from the monkey cage." We do political science and politics. Tweets by bot.
## 5                                           Provost Professor of Public Policy, Political Science, and Law. @USC\nDirector, @BedrosianCenter
## 6                                                European Political Science Review, the new journal from ECPR and Cambridge University Press
##   followers_count statuses_count friends_count
## 1            8720           3563           379
## 2            7014            506            64
## 3            5638           6688          2267
## 4           39323          18731           411
## 5              70             10            79
## 6            7402           2638           189
##                       created_at        location lang
## 1 Mon Feb 08 14:52:14 +0000 2010       Cambridge   en
## 2 Thu Jan 31 18:42:30 +0000 2013                   en
## 3 Tue Feb 16 21:24:55 +0000 2010 Los Angeles, CA   en
## 4 Tue Feb 23 03:53:00 +0000 2010                   en
## 5 Thu Feb 14 22:26:28 +0000 2013 Los Angeles, CA   en
## 6 Thu Mar 04 09:44:16 +0000 2010                   en
##                    time_zone status.id_str              status.created_at
## 1                     London  9.253148e+17 Tue Oct 31 10:53:28 +0000 2017
## 2 Eastern Time (US & Canada)  9.254536e+17 Tue Oct 31 20:05:05 +0000 2017
## 3 Pacific Time (US & Canada)  9.254832e+17 Tue Oct 31 22:02:45 +0000 2017
## 4 Central Time (US & Canada)  9.253175e+17 Tue Oct 31 11:04:23 +0000 2017
## 5 Eastern Time (US & Canada)  9.174209e+17 Mon Oct 09 16:05:55 +0000 2017
## 6                     London  9.253079e+17 Tue Oct 31 10:26:10 +0000 2017
##                                                                                                                                        status.text
## 1           #FirstView - The Measurement of Real-Time Perceptions of Financial Stress: Implications for Political Science… https://t.co/v7bp8LgcsV
## 2     Everything to Everyone Electoral Consequences of Broad-Appeal Strategy in Europe https://t.co/ZyoXP7oAJF  via @AJPS_Editor #AJPSVirtualIssue
## 3 RT @vanessablum: Love this Twitter debate btwn @isamuel (@Harvard_Law @FirstMondaysFM) &amp; @OrinKerr (@gwlaw @USCGouldLaw). Scholarly joustin…
## 4                           Chief Justice Roberts and other judges have a hard time with statistics. That’s a real problem https://t.co/gTNpPurF63
## 5                                             A little bit about my PIPE initiative at USC.  I'm really excited about it.  https://t.co/9t3kV2NkAr
## 6        MT @KingsDMES: .@ferdinandeibl &amp; @lynge_mangueira - the effects of democratization on political budget cycles https://t.co/0aqPHM0ckw

For example, we learn that the user with ID 112448318 follows the user with ID 116630713.

We will now convert these two datasets into a network object in R using igraph.

library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g

## IGRAPH DN-- 135 2654 -- 
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges (vertex names):
## [1] British Jnl Pol Sci->Monkey Cage         
## [2] British Jnl Pol Sci->Euro Pol Sci Review 
## [3] British Jnl Pol Sci->International Theory
## [4] British Jnl Pol Sci->EPSA                
## + ... omitted several edges

What does it mean?
- D means directed
- N means named graph
- 135 is the number of nodes
- 2654 is the number of edges
- name (v/c) means name is a node attribute and it’s a character
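
The same information can also be queried directly instead of being read off the summary line. The sketch below uses a small made-up edge list and node table (`toy_edges`, `toy_nodes` and their contents are hypothetical, not the POIR data) to show which igraph functions correspond to each part of the printed header.

```r
library(igraph)

# Hypothetical toy data: three accounts and three follow relationships
toy_edges <- data.frame(Source = c("a", "a", "b"),
                        Target = c("b", "c", "c"),
                        stringsAsFactors = FALSE)
toy_nodes <- data.frame(Id = c("a", "b", "c"),
                        Label = c("Alice", "Bob", "Carol"),
                        stringsAsFactors = FALSE)
toy <- graph_from_data_frame(d = toy_edges, vertices = toy_nodes, directed = TRUE)

is_directed(toy)        # TRUE corresponds to the "D" in the header
vcount(toy)             # number of nodes
ecount(toy)             # number of edges
vertex_attr_names(toy)  # node attributes; the first column of the node table becomes "name"
```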

Network communities

Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.

The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. For example, this network has only one component: every node can be reached from every other node when edge direction is ignored (components() looks for weakly connected components by default).

components(g)

## $membership
##  British Jnl Pol Sci                 AJPS        USC Gould Law 
##                    1                    1                    1 
##          Monkey Cage         Jeff Jenkins  Euro Pol Sci Review 
##                    1                    1                    1 
## International Theory Innovation @ Harvard         USC Dornsife 
##                    1                    1                    1 
##        USC Sociology                 SPSA                 ACLU 
##                    1                    1                    1 
##                 EPSA  USC History Dept. ✌️            Ray Kwong 
##                    1                    1                    1 
##          Rod Albuyeh   Sen. Barbara Boxer   USC Religious Life 
##                    1                    1                    1 
##         Jay Maharjan            USA TODAY         USC Research 
##                    1                    1                    1 
##         USC Marshall                 NCSL      Jason Giannaros 
##                    1                    1                    1 
##    Los Angeles Times        Eric Garcetti               AnnLab 
##                    1                    1                    1 
##          Marist Poll                 MPSA         Daily Trojan 
##                    1                    1                    1 
##   USC Annenberg CCLP   Political Analysis Jordan Carr Peterson 
##                    1                    1                    1 
##          USC Rossier             USC EASC          Jerry Brown 
##                    1                    1                    1 
##                  USC USC PoliticalScience   Norman Lear Center 
##                    1                    1                    1 
##            Tim Scott           Kyuri Park        USC Libraries 
##                    1                    1                    1 
##      FiveThirtyEight             CSII USC          Adam Badawy 
##                    1                    1                    1 
##      Christian Grose        Taylor Dalton      Washington Post 
##                    1                    1                    1 
## USC Shoah Foundation     Youssef Chouhoud       Fanny Cisneros 
##                    1                    1                    1 
##         CUP Politics         Robert Shrum                  PSA 
##                    1                    1                    1 
## USC Visions & Voices        USC Annenberg    USC Wrigley Inst. 
##                    1                    1                    1 
##   USC Viterbi School                  PSQ    American_Politics 
##                    1                    1                    1 
##  Journal of Politics       M Drake Reitan        Anne van Wijk 
##                    1                    1                    1 
##                 JEPS             USC EALC          Whitney Hua 
##                    1                    1                    1 
##  Wall Street Journal        sara sadhwani     USC Dornsife CFR 
##                    1                    1                    1 
##      USC Dermatology          Nola Haynes Tyler Bonanno-Curley 
##                    1                    1                    1 
##              USC SIR                  PGI         Adam Feldman 
##                    1                    1                    1 
##            Megan Eme               RISIST             PERE USC 
##                    1                    1                    1 
##  USC Unruh Institute  CA.gov (California)                 APSA 
##                    1                    1                    1 
##    Nicolás Albertoni                LAFLA               USCKSI 
##                    1                    1                    1 
##          joshua timm     The Harris Poll®          AP Politics 
##                    1                    1                    1 
##     Long Beach Mayor        Meredith Shaw        Pablo Barberá 
##                    1                    1                    1 
## Sen Dianne Feinstein        USC Economics      USC Social Work 
##                    1                    1                    1 
##   Fulbright Programs Graduate Student Gov   Polymathic Academy 
##                    1                    1                    1 
##       bryn rosenfeld USC Bedrosian Center The Associated Press 
##                    1                    1                    1 
##   USC Cinematic Arts USC Computer Science               Gallup 
##                    1                    1                    1 
##             USC CRCC             Ronan Fu            Dave Kang 
##                    1                    1                    1 
##       Fels Institute USC Public Diplomacy             Pongkwan 
##                    1                    1                    1 
##      Joey Huddleston    USC Annenberg PhD Keck Medicine of USC 
##                    1                    1                    1 
##            Kyle Rapp        Sangay Mishra       Political Data 
##                    1                    1                    1 
##                  ISA  USC Graduate School         Mark Paradis 
##                    1                    1                    1 
##     Evgeniia Iakhnis                  CNN              USC CIS 
##                    1                    1                    1 
##      Quinnipiac Poll  Erin Baggott Carter   The New York Times 
##                    1                    1                    1 
##         Brett Carter    NetDem Lab at USC            Abby Wood 
##                    1                    1                    1 
##              theWPSA             SPEC Lab    Stefanie Neumeier 
##                    1                    1                    1 
##        Andy Sinclair         Brian Knafou     USC Price School 
##                    1                    1                    1 
##    Kelebogile Zvobgo Victoria Chonn Ching         PSRM journal 
##                    1                    1                    1 
## 
## $csize
## [1] 135
## 
## $no
## [1] 1
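
Because the POIR network has a single component, it may help to see what `components()` reports on a graph that is actually disconnected. The following is a minimal sketch on a made-up five-node directed graph (the node names are hypothetical); it also contrasts weak components, which ignore edge direction, with strong components, which require a directed path in both directions.

```r
library(igraph)

# Hypothetical toy graph: a -> b -> c form one group, d -> e another
toy <- graph_from_literal(a -+ b, b -+ c, d -+ e)

weak <- components(toy, mode = "weak")      # edge direction ignored
strong <- components(toy, mode = "strong")  # paths needed in both directions

weak$no       # number of weakly connected components
weak$csize    # size of each weak component
strong$no     # number of strongly connected components

# Nodes in different components are not connected by any path
distances(toy, v = "a", to = "d")
```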

Most networks have a single giant connected component that includes most nodes. Most studies of networks actually focus on the giant component (e.g. the shortest path between two nodes in different components is Inf!). decompose() returns a list with one graph per component, so here the list has a single element.

giant <- decompose(g)
giant

## [[1]]
## IGRAPH DN-- 135 2654 -- 
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges (vertex names):
## [1] British Jnl Pol Sci->Monkey Cage         
## [2] British Jnl Pol Sci->Euro Pol Sci Review 
## [3] British Jnl Pol Sci->International Theory
## [4] British Jnl Pol Sci->EPSA                
## + ... omitted several edges
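
When a network has more than one component, the largest one has to be picked out explicitly. One possible way to do so, sketched here on a hypothetical toy graph, is to combine the `csize` and `membership` elements returned by `components()` with `induced_subgraph()`.

```r
library(igraph)

# Hypothetical toy graph with a three-node and a two-node component
toy <- graph_from_literal(a -+ b, b -+ c, c -+ a, d -+ e)

comp <- components(toy)            # weak components by default
biggest <- which.max(comp$csize)   # index of the largest component

# Keep only the nodes that belong to the largest component
giant_toy <- induced_subgraph(toy, which(comp$membership == biggest))

vcount(giant_toy)
V(giant_toy)$name
```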

Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.

There are a few different algorithms, each following a different logic.

The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values (in the examples below, longer walks yield fewer communities). The algorithm merges nodes step by step into a hierarchy of communities and then reports the partition in that hierarchy with the highest modularity score.

cluster_walktrap(g)

## IGRAPH clustering walktrap, groups: 4, mod: 0.3
## + groups:
##   $`1`
##    [1] "Innovation @ Harvard" "USC Dornsife"         "ACLU"                
##    [4] "Rod Albuyeh"          "Sen. Barbara Boxer"   "USA TODAY"           
##    [7] "NCSL"                 "Jason Giannaros"      "Los Angeles Times"   
##   [10] "Marist Poll"          "Jordan Carr Peterson" "Jerry Brown"         
##   [13] "USC PoliticalScience" "Tim Scott"            "Kyuri Park"          
##   [16] "FiveThirtyEight"      "Adam Badawy"          "Christian Grose"     
##   [19] "Taylor Dalton"        "Washington Post"      "Youssef Chouhoud"    
##   [22] "Fanny Cisneros"       "Robert Shrum"         "American_Politics"   
##   [25] "M Drake Reitan"       "Anne van Wijk"        "Whitney Hua"         
##   + ... omitted several groups/vertices

cluster_walktrap(g, steps=10)

## IGRAPH clustering walktrap, groups: 3, mod: 0.34
## + groups:
##   $`1`
##    [1] "Innovation @ Harvard" "ACLU"                 "Rod Albuyeh"         
##    [4] "Sen. Barbara Boxer"   "USA TODAY"            "NCSL"                
##    [7] "Jason Giannaros"      "Los Angeles Times"    "Marist Poll"         
##   [10] "Jordan Carr Peterson" "Tim Scott"            "Kyuri Park"          
##   [13] "FiveThirtyEight"      "Adam Badawy"          "Christian Grose"     
##   [16] "Taylor Dalton"        "Washington Post"      "Youssef Chouhoud"    
##   [19] "Fanny Cisneros"       "Robert Shrum"         "Anne van Wijk"       
##   [22] "Whitney Hua"          "Wall Street Journal"  "sara sadhwani"       
##   [25] "Nola Haynes"          "Tyler Bonanno-Curley" "USC SIR"             
##   + ... omitted several groups/vertices

cluster_walktrap(g, steps=20)

## IGRAPH clustering walktrap, groups: 3, mod: 0.3
## + groups:
##   $`1`
##    [1] "British Jnl Pol Sci"  "AJPS"                 "Monkey Cage"         
##    [4] "Jeff Jenkins"         "Euro Pol Sci Review"  "International Theory"
##    [7] "SPSA"                 "EPSA"                 "Rod Albuyeh"         
##   [10] "NCSL"                 "Jason Giannaros"      "Marist Poll"         
##   [13] "MPSA"                 "Political Analysis"   "Jordan Carr Peterson"
##   [16] "Tim Scott"            "Kyuri Park"           "FiveThirtyEight"     
##   [19] "Adam Badawy"          "Christian Grose"      "Youssef Chouhoud"    
##   [22] "CUP Politics"         "PSA"                  "PSQ"                 
##   [25] "American_Politics"    "Journal of Politics"  "Anne van Wijk"       
##   + ... omitted several groups/vertices

cluster_walktrap(g, steps=30)

## IGRAPH clustering walktrap, groups: 3, mod: 0.3
## + groups:
##   $`1`
##    [1] "British Jnl Pol Sci"  "AJPS"                 "Monkey Cage"         
##    [4] "Jeff Jenkins"         "Euro Pol Sci Review"  "International Theory"
##    [7] "SPSA"                 "EPSA"                 "Rod Albuyeh"         
##   [10] "NCSL"                 "Jason Giannaros"      "Marist Poll"         
##   [13] "MPSA"                 "Political Analysis"   "Jordan Carr Peterson"
##   [16] "Tim Scott"            "Kyuri Park"           "FiveThirtyEight"     
##   [19] "Adam Badawy"          "Christian Grose"      "Youssef Chouhoud"    
##   [22] "CUP Politics"         "PSA"                  "PSQ"                 
##   [25] "American_Politics"    "Journal of Politics"  "Anne van Wijk"       
##   + ... omitted several groups/vertices
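
To see what the returned object contains, it can help to run walktrap on a graph where the answer is obvious. The sketch below builds a hypothetical toy graph made of two fully connected groups of five nodes joined by a single edge, and then extracts the number of communities, the membership vector and the modularity score.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques linked by one bridge edge
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(1, 6))

wt <- cluster_walktrap(toy)   # default random walk length: steps = 4
length(wt)                    # number of communities
membership(wt)                # community of each node
modularity(wt)                # modularity of the reported partition

# The walk length can be changed with the steps argument
wt_long <- cluster_walktrap(toy, steps = 10)
length(wt_long)
```

On a graph this clearly structured the two cliques should be recovered as the two communities; on real data, such as the POIR network, the result is more sensitive to the choice of `steps`.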

Other methods are:

- The infomap method attempts to map the flow of information in a network and to find the clusters in which information tends to remain for longer periods. It is similar to walktrap, but instead of maximizing modularity it minimizes the so-called “map equation”.
- The edge-betweenness method iteratively removes edges with high betweenness, with the idea that they are likely to connect different parts of the network. Here betweenness (gatekeeping potential) applies to edges rather than nodes, but the intuition is the same.
- The label propagation method starts by giving each node a unique label, then updates each node’s label to the one held by the majority of its neighbors, and repeats this until every node has the most common label among its neighbors.
- The Louvain algorithm initially assigns each node to its own community; nodes are then sequentially moved to the community that increases modularity the most (if any), so that communities are merged; this merging process continues until modularity cannot increase or only one community remains.
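
Several of these methods refer to modularity, which compares the share of edges that fall inside communities with the share expected if edges were placed at random while keeping node degrees fixed. As a rough sketch, the code below computes the standard undirected version of that score by hand for a hypothetical toy graph and partition (`toy` and `memb` are made up for illustration) and checks it against igraph’s `modularity()`.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques linked by one bridge edge
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(1, 6))
memb <- rep(1:2, each = 5)          # assumed partition: one community per clique

A <- as_adjacency_matrix(toy, sparse = FALSE)  # adjacency matrix
k <- degree(toy)                               # node degrees
m <- ecount(toy)                               # number of edges
same <- outer(memb, memb, "==")                # TRUE if two nodes share a community

# Q = 1/(2m) * sum over pairs in the same community of (A_ij - k_i * k_j / (2m))
Q_hand <- sum((A - outer(k, k) / (2 * m)) * same) / (2 * m)

Q_hand
modularity(toy, memb)
```

Values close to 0 indicate a partition no better than random, which is why the label propagation result further down (a single group) has a modularity of 0.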

cluster_infomap(g)

## IGRAPH clustering infomap, groups: 4, mod: 0.35
## + groups:
##   $`1`
##    [1] "USC Gould Law"        "USC Dornsife"         "USC Sociology"       
##    [4] "USC History Dept. ✌️"  "USC Religious Life"   "Jay Maharjan"        
##    [7] "USC Research"         "USC Marshall"         "Eric Garcetti"       
##   [10] "AnnLab"               "Daily Trojan"         "USC Annenberg CCLP"  
##   [13] "USC Rossier"          "USC EASC"             "USC"                 
##   [16] "USC PoliticalScience" "Norman Lear Center"   "USC Libraries"       
##   [19] "CSII USC"             "USC Shoah Foundation" "Robert Shrum"        
##   [22] "USC Visions & Voices" "USC Annenberg"        "USC Wrigley Inst."   
##   [25] "USC Viterbi School"   "M Drake Reitan"       "USC EALC"            
##   + ... omitted several groups/vertices

cluster_edge_betweenness(g)

## IGRAPH clustering edge betweenness, groups: 91, mod: 0.033
## + groups:
##   $`1`
##    [1] "British Jnl Pol Sci"  "AJPS"                 "Monkey Cage"         
##    [4] "Euro Pol Sci Review"  "ACLU"                 "USA TODAY"           
##    [7] "Los Angeles Times"    "MPSA"                 "Kyuri Park"          
##   [10] "FiveThirtyEight"      "Adam Badawy"          "Washington Post"     
##   [13] "CUP Politics"         "M Drake Reitan"       "Anne van Wijk"       
##   [16] "Whitney Hua"          "Wall Street Journal"  "sara sadhwani"       
##   [19] "Nola Haynes"          "Adam Feldman"         "Megan Eme"           
##   [22] "RISIST"               "APSA"                 "Meredith Shaw"       
##   [25] "The Associated Press" "Ronan Fu"             "Dave Kang"           
##   + ... omitted several groups/vertices

cluster_label_prop(g)

## IGRAPH clustering label propagation, groups: 1, mod: 0
## + groups:
##   $`1`
##     [1] "British Jnl Pol Sci"  "AJPS"                
##     [3] "USC Gould Law"        "Monkey Cage"         
##     [5] "Jeff Jenkins"         "Euro Pol Sci Review" 
##     [7] "International Theory" "Innovation @ Harvard"
##     [9] "USC Dornsife"         "USC Sociology"       
##    [11] "SPSA"                 "ACLU"                
##    [13] "EPSA"                 "USC History Dept. ✌️" 
##    [15] "Ray Kwong"            "Rod Albuyeh"         
##    [17] "Sen. Barbara Boxer"   "USC Religious Life"  
##   + ... omitted several groups/vertices

cluster_louvain(as.undirected(g))

## IGRAPH clustering multi level, groups: 4, mod: 0.33
## + groups:
##   $`1`
##    [1] "British Jnl Pol Sci"  "AJPS"                 "Monkey Cage"         
##    [4] "Jeff Jenkins"         "Euro Pol Sci Review"  "International Theory"
##    [7] "SPSA"                 "EPSA"                 "MPSA"                
##   [10] "Political Analysis"   "Jordan Carr Peterson" "FiveThirtyEight"     
##   [13] "CUP Politics"         "PSA"                  "PSQ"                 
##   [16] "American_Politics"    "Journal of Politics"  "JEPS"                
##   [19] "sara sadhwani"        "PGI"                  "APSA"                
##   [22] "Sangay Mishra"        "Political Data"       "ISA"                 
##   [25] "Abby Wood"            "theWPSA"              "Stefanie Neumeier"   
##   + ... omitted several groups/vertices
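
One practical way to judge how much different algorithms agree is to compare their modularity scores and the overlap between their partitions. This is only a sketch on a hypothetical toy graph (two cliques joined by one edge), using `modularity()`, `length()` and `compare()` on the objects that the `cluster_*` functions return.

```r
library(igraph)

# Hypothetical toy graph: two 5-node cliques linked by one bridge edge
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(1, 6))

results <- list(walktrap = cluster_walktrap(toy),
                edge_betweenness = cluster_edge_betweenness(toy),
                louvain = cluster_louvain(toy))

sapply(results, length)       # number of communities per method
sapply(results, modularity)   # modularity per method

# Normalized mutual information: 1 means two partitions are identical
compare(results$walktrap, results$louvain, method = "nmi")
```

On the POIR network the methods disagree far more than on this toy graph, so the same comparison is more informative there.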

The choice of one algorithm or another may depend on substantive or practical reasons, as always. For now, let’s pick the Louvain algorithm (igraph’s implementation only works on undirected graphs, hence as.undirected()).

comm <- cluster_louvain(as.undirected(g))
nodes$cluster <- membership(comm)

nodes$Label[nodes$cluster==1]

##  [1] "BJPolS"          "AJPS_Editor"     "monkeycageblog" 
##  [4] "jaj7d"           "EPSRjournal"     "InternatlTheory"
##  [7] "SPSAnews"        "europsa"         "MPSAnet"        
## [10] "polanalysis"     "JordanCarrP"     "FiveThirtyEight"
## [13] "CUP_PoliSci"     "PolStudiesAssoc" "PSQ_CSPC"       
## [16] "PSA_APG"         "The_JOP"         "JEPS_ed"        
## [19] "sarasadhwani"    "PGI_WPSA"        "APSAtweets"     
## [22] "SangayMishra"    "Political_Data"  "isanet"         
## [25] "yesthatabbywood" "theWPSA"         "SteffiNeumeier" 
## [28] "jandrewsinclair" "PSRMJournal"

nodes$Label[nodes$cluster==2]

##  [1] "ACLU"           "raykwong"       "RodAlbuyeh"     "jasongiannaros"
##  [5] "kyuripark1"     "adambbadawy"    "christiangrose" "taylordalton"  
##  [9] "_abuelbanat"    "BobShrum"       "mdrakereitan"   "annevwijk"     
## [13] "whitney_hua"    "USC_SIR"        "AdamSFeldman"   "N_Albertoni"   
## [17] "verbal_gaffe"   "changmishaw"    "p_barbera"      "USC_Econ"      
## [21] "brynrosenfeld"  "ronantfu"       "daveckang"      "joeyhuddleston"
## [25] "KyleSRapp"      "markpa84"       "geniia_iakhnis" "UscCis"        
## [29] "baggottcarter"  "brett_l_carter" "NetDem_USC"     "SPECLabUSC"    
## [33] "Bknafou"        "kelly_zvobgo"   "V_Chonn"

nodes$Label[nodes$cluster==3]

##  [1] "USCGouldLaw"     "USCDornsife"     "USC_Soci"       
##  [4] "USCHistory"      "USCRELIGIOUSLIF" "4entrepreneur"  
##  [7] "USC_Research"    "USCMarshall"     "ericgarcetti"   
## [10] "annenberglab"    "dailytrojan"     "USC_CCLP"       
## [13] "USCRossier"      "USCeasc"         "JerryBrownGov"  
## [16] "USC"             "USCPOSC"         "LearCenter"     
## [19] "USCLibraries"    "CSII_USC"        "USCShoahFdn"    
## [22] "VisionsnVoices"  "USCAnnenberg"    "USCWrigleyInst" 
## [25] "USCViterbi"      "USC_EALC"        "USC_CFR"        
## [28] "USCDermatology"  "AngeMarieH"      "PERE_USC"       
## [31] "UnruhInstitute"  "CAgovernment"    "USCKSI"         
## [34] "uscsocialwork"   "FulbrightPrgrm"  "USCGSG"         
## [37] "USCPolymathy"    "BedrosianCenter" "USCCinema"      
## [40] "CSatUSC"         "usccrcc"         "PublicDiplomacy"
## [43] "USCAnnenbergPhD" "KeckMedUSC"      "USCGradSchool"  
## [46] "USCPrice"

nodes$Label[nodes$cluster==4]

##  [1] "GovInnovations"  "SenatorBoxer"    "USATODAY"       
##  [4] "NCSLorg"         "latimes"         "maristpoll"     
##  [7] "SenatorTimScott" "washingtonpost"  "fancis30"       
## [10] "WSJ"             "nolahtheveil"    "curleyt13"      
## [13] "megan_eme"       "LegalAidLA"      "HarrisPoll"     
## [16] "AP_Politics"     "LongBeachMayor"  "SenFeinstein"   
## [19] "AP"              "Gallup"          "PennFels"       
## [22] "pongkwans"       "CNN"             "QuinnipiacPoll" 
## [25] "nytimes"

# cross-tabulate whether the location mentions Los Angeles (rows) by cluster (columns)
table(grepl("los angeles", nodes$location, ignore.case=TRUE), 
      nodes$cluster)

##        
##          1  2  3  4
##   FALSE 25 11 10 20
##   TRUE   4 24 36  5
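
Raw counts like these are easier to compare across clusters of different sizes when turned into column proportions. A minimal sketch with a made-up node table (the `toy` data frame and its values are hypothetical) could look like this:

```r
# Hypothetical node table: location strings and cluster labels
toy <- data.frame(
  location = c("Los Angeles, CA", "London", "los angeles", "",
               "New York", "Los Angeles"),
  cluster = c(1, 2, 1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# TRUE if the location field mentions Los Angeles, regardless of case
in_la <- grepl("los angeles", toy$location, ignore.case = TRUE)
counts <- table(in_la, toy$cluster)
counts

# Share of each cluster (column) that is located in Los Angeles
round(prop.table(counts, margin = 2), 2)
```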

library(quanteda)

## Warning: package 'quanteda' was built under R version 3.4.2

## quanteda version 0.99.9

## Using 3 of 4 threads for parallel computing

## 
## Attaching package: 'quanteda'

## The following objects are masked from 'package:igraph':
## 
##     %>%, similarity

## The following object is masked from 'package:utils':
## 
##     View

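for (i in 1:4){
  message("Cluster ", i)
  # document-feature matrix of the profile descriptions in cluster i
  cluster_dfm <- dfm(nodes$description[nodes$cluster==i],
             remove_punct=TRUE, remove=stopwords("english"))
  # 25 most frequent words in this cluster
  print(topfeatures(cluster_dfm, n=25))
}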

## Cluster 1

##     political       science      politics       journal   association 
##            25            21            11             9             8 
## international    university      research           law      american 
##             7             6             6             5             4 
##     published     cambridge         press     professor      european 
##             4             3             3             3             3 
##           phd       studies         study        public        policy 
##             3             3             3             2             2 
##   endorsement     candidate       twitter     relations  experimental 
##             2             2             2             2             2

## Cluster 2

##     political       science           phd           usc       student 
##            13            10             9             8             7 
##     relations      southern    california international        social 
##             7             7             7             7             6 
##     candidate    university          @usc      politics     professor 
##             6             6             5             5             5 
##          ph.d         media     assistant         https          t.co 
##             4             4             4             3             3 
##      director      research  organization            us     institute 
##             3             3             2             2             2

## Cluster 3

##        usc     school    account university   official     public 
##         21         10          9          8          8          8 
##   southern california         us   research  education     follow 
##          7          7          7          7          6          6 
##    twitter     center department       news      study   students 
##          6          6          5          5          5          4 
##     across     social        rts      media       t.co     policy 
##          4          4          4          4          4          4 
##     offers 
##          3

## Cluster 4

##       news       t.co      https     public   breaking government 
##          9          9          6          4          4          3 
##        u.s    senator   official      state       http       poll 
##          3          3          3          3          3          3 
##    opinion     follow    harvard innovation california     latest 
##          3          3          2          2          2          2 
##    stories    twitter   national        est        los    angeles 
##          2          2          2          2          2          2 
##      world 
##          2

# description
poir <- dfm(corpus(nodes[,c("description", "cluster")], text_field="description"))
for (i in 1:4){
    print(
      head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                      measure="lr"), n=20)
    )
}

##                      G2            p n_target n_reference
## political     37.818263 7.765143e-10       25          14
## science       31.279728 2.234002e-08       21          12
## journal       24.331179 8.111543e-07        9           0
## association   21.149616 4.247864e-06        8           0
## politics      10.955381 9.333220e-04       11           9
## published      8.695019 3.190807e-03        4           0
## law            6.732350 9.467980e-03        5           2
## american       6.029723 1.406694e-02        4           1
## cambridge      5.718189 1.679004e-02        3           0
## european       5.718189 1.679004e-02        3           0
## that           5.718189 1.679004e-02        3           0
## international  5.246374 2.199255e-02        7           7
## press          3.558967 5.922460e-02        3           1
## we             3.558967 5.922460e-02        3           1
## "              2.887638 8.926168e-02        2           0
## experimental   2.887638 8.926168e-02        2           0
## western        2.887638 8.926168e-02        2           0
## methods        2.887638 8.926168e-02        2           0
## the            2.427361 1.192335e-01       29          86
## research       1.986965 1.586586e-01        6          10
##                     G2           p n_target n_reference
## phd           7.346920 0.006717776        9           6
## ph.d          7.261400 0.007045235        4           0
## candidate     7.156528 0.007469165        6           2
## ,             6.902388 0.008608070       49          98
## ;             6.748438 0.009382978        8           5
## student       6.166356 0.013020257        7           4
## relations     6.166356 0.013020257        7           4
## at            6.065362 0.013785852       13          16
## (             5.669637 0.017261024        8           6
## )             5.669637 0.017261024        8           6
## |             4.854353 0.027576447       10          12
## social        4.422759 0.035462647        6           4
## in            3.880020 0.048863965       13          20
## professor     3.811790 0.050893033        5           3
## @usc          3.811790 0.050893033        5           3
## international 3.350260 0.067194398        7           7
## assistant     3.218875 0.072793641        4           2
## southern      2.706054 0.099968002        7           8
## director      2.670157 0.102245924        3           1
## ~             2.307880 0.128719464        2           0
##                      G2           p n_target n_reference
## for           12.820970 0.000342756       25          13
## usc           10.646869 0.001102574       21          11
## school         8.835417 0.002954402       10           2
## !              7.343656 0.006729981        9           2
## account        7.343656 0.006729981        9           2
## on             5.247741 0.021975284       13           8
## &              4.213050 0.040114157       19          16
## a              4.185726 0.040765761       10           6
## department     3.538785 0.059949361        5           1
## not            3.538785 0.059949361        5           1
## education      3.257616 0.071092401        6           2
## -              3.134481 0.076652793        8           4
## offers         2.699006 0.100410820        3           0
## annenberg      2.699006 0.100410820        3           0
## work           2.699006 0.100410820        3           0
## east           2.699006 0.100410820        3           0
## events         2.699006 0.100410820        3           0
## shaping        2.699006 0.100410820        3           0
## environmental  2.699006 0.100410820        3           0
## engineering    2.699006 0.100410820        3           0
##                   G2            p n_target n_reference
## /          31.231458 2.290245e-08       30          28
## most       11.798784 5.926942e-04        5           0
## news       11.644335 6.439834e-04        9           5
## breaking    8.733379 3.124370e-03        4           0
## from        8.181582 4.231783e-03        8           6
## t.co        8.109129 4.404288e-03        9           8
## :           6.642950 9.954894e-03       13          21
## senator     5.745174 1.653402e-02        3           0
## poll        5.745174 1.653402e-02        3           0
## https       4.355313 3.689367e-02        6           6
## government  3.583236 5.836536e-02        3           1
## u.s         3.583236 5.836536e-02        3           1
## opinion     3.583236 5.836536e-02        3           1
## harvard     2.903270 8.840004e-02        2           0
## stories     2.903270 8.840004e-02        2           0
## national    2.903270 8.840004e-02        2           0
## est         2.903270 8.840004e-02        2           0
## los         2.903270 8.840004e-02        2           0
## angeles     2.903270 8.840004e-02        2           0
## editors     2.903270 8.840004e-02        2           0

# location
poir <- dfm(corpus(nodes[,c("location", "cluster")], text_field="location"))
for (i in 1:4){
    print(
      head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                      measure="lr"), n=20)
    )
}

##                  G2          p n_target n_reference
## new        6.071688 0.01373657        4           2
## uk         3.631702 0.05668882        2           0
## university 3.631702 0.05668882        2           0
## united     3.631702 0.05668882        2           0
## ny         3.425055 0.06421409        3           2
## york       3.425055 0.06421409        3           2
## states     2.040867 0.15312233        2           1
## cambridge  0.675642 0.41109145        1           0
## atlanta    0.675642 0.41109145        1           0
## ga         0.675642 0.41109145        1           0
## europe     0.675642 0.41109145        1           0
## #mpsa18    0.675642 0.41109145        1           0
## april      0.675642 0.41109145        1           0
## 5-8        0.675642 0.41109145        1           0
## chicago    0.675642 0.41109145        1           0
## london     0.675642 0.41109145        1           0
## texas      0.675642 0.41109145        1           0
## a          0.675642 0.41109145        1           0
## &          0.675642 0.41109145        1           0
## m          0.675642 0.41109145        1           0
##                    G2         p n_target n_reference
## ca          2.2540208 0.1332677       20          33
## seoul       1.8338339 0.1756754        2           0
## korea       1.8338339 0.1756754        2           0
## los         1.3947083 0.2376116       24          45
## angeles     1.3947083 0.2376116       24          45
## of          0.5986740 0.4390844        2           1
## usa         0.3980593 0.5280932        1           1
## |           0.3980593 0.5280932        1           1
## the         0.3980593 0.5280932        1           1
## all         0.1985045 0.6559307        1           0
## 50          0.1985045 0.6559307        1           0
## pacific     0.1985045 0.6559307        1           0
## rim         0.1985045 0.6559307        1           0
## cairo       0.1985045 0.6559307        1           0
## utrecht     0.1985045 0.6559307        1           0
## netherlands 0.1985045 0.6559307        1           0
## washdc      0.1985045 0.6559307        1           0
## y           0.1985045 0.6559307        1           0
## uruguay     0.1985045 0.6559307        1           0
## republic    0.1985045 0.6559307        1           0
##                   G2          p n_target n_reference
## los        6.7652616 0.00929493       36          33
## angeles    6.7652616 0.00929493       36          33
## usc        5.9834390 0.01444082        5           0
## ca         3.0185465 0.08231721       26          27
## california 0.5370916 0.46364056        5           4
## -          0.1763771 0.67450537        2           1
## hazel      0.0585039 0.80887641        1           0
## stanley    0.0585039 0.80887641        1           0
## hall       0.0585039 0.80887641        1           0
## 314        0.0585039 0.80887641        1           0
## 3520       0.0585039 0.80887641        1           0
## trousdale  0.0585039 0.80887641        1           0
## pkwy       0.0585039 0.80887641        1           0
## la         0.0585039 0.80887641        1           0
## 90089      0.0585039 0.80887641        1           0
## las        0.0585039 0.80887641        1           0
## vegas      0.0585039 0.80887641        1           0
## sacramento 0.0585039 0.80887641        1           0
## dml        0.0585039 0.80887641        1           0
## 241        0.0585039 0.80887641        1           0
##                     G2           p n_target n_reference
## washington   9.4288031 0.002136037        5           1
## .            6.5673525 0.010386635        4           1
## d.c          3.9054281 0.048130363        3           1
## global       3.0998275 0.078300588        2           0
## dc           1.5833935 0.208272569        2           1
## harvard      0.5244819 0.468936081        1           0
## kennedy      0.5244819 0.468936081        1           0
## school       0.5244819 0.468936081        1           0
## today        0.5244819 0.468936081        1           0
## hq           0.5244819 0.468936081        1           0
## mclean       0.5244819 0.468936081        1           0
## va           0.5244819 0.468936081        1           0
## denver       0.5244819 0.468936081        1           0
## co           0.5244819 0.468936081        1           0
## poughkeepsie 0.5244819 0.468936081        1           0
## south        0.5244819 0.468936081        1           0
## carolina     0.5244819 0.468936081        1           0
## hollywood    0.5244819 0.468936081        1           0
## long         0.5244819 0.468936081        1           0
## beach        0.5244819 0.468936081        1           0

The final way in which we can think about network communities is in terms of hierarchy or structure. We’ll discuss one of these methods.

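K-core decomposition allows us to identify the core and the periphery of the network. A k-core is a maximal subgraph in which every node has degree at least k, counting only ties to other nodes in that subgraph. The coreness of a node is the largest k for which it belongs to a k-core.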
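
To make the definition concrete, here is a minimal sketch on a made-up toy graph. The graph and the names `toy`, `k` and `core` are illustrative assumptions and not part of the POIR data; the sketch only relies on igraph's `graph_from_literal()`, `coreness()` and `induced_subgraph()`.

```r
library(igraph)

# Toy undirected graph (hypothetical): a 4-clique (a, b, c, d)
# with a chain d - e - f attached to it
toy <- graph_from_literal(a-b, a-c, a-d, b-c, b-d, c-d, d-e, e-f)

# coreness: the largest k such that the node belongs to the k-core
k <- coreness(toy)
k

# the innermost core is the subgraph induced by the nodes with maximum coreness
core <- induced_subgraph(toy, which(k == max(k)))
V(core)$name
```

In this toy graph the clique members a, b, c and d have coreness 3. Node e has degree 2 but coreness 1, because once f is removed e is left with a single tie. For a directed graph such as g, `coreness()` by default counts both incoming and outgoing ties (`mode="all"`).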

coreness(g)

##  British Jnl Pol Sci                 AJPS        USC Gould Law 
##                   24                   24                   33 
##          Monkey Cage         Jeff Jenkins  Euro Pol Sci Review 
##                   24                   19                   20 
## International Theory Innovation @ Harvard         USC Dornsife 
##                   17                    6                   33 
##        USC Sociology                 SPSA                 ACLU 
##                   33                   24                   15 
##                 EPSA  USC History Dept. ✌️            Ray Kwong 
##                   19                   32                   17 
##          Rod Albuyeh   Sen. Barbara Boxer   USC Religious Life 
##                   10                   12                   33 
##         Jay Maharjan            USA TODAY         USC Research 
##                   18                   24                   33 
##         USC Marshall                 NCSL      Jason Giannaros 
##                   33                   17                    5 
##    Los Angeles Times        Eric Garcetti               AnnLab 
##                   29                   24                   31 
##          Marist Poll                 MPSA         Daily Trojan 
##                   12                   24                   33 
##   USC Annenberg CCLP   Political Analysis Jordan Carr Peterson 
##                   33                   24                   24 
##          USC Rossier             USC EASC          Jerry Brown 
##                   33                   33                   21 
##                  USC USC PoliticalScience   Norman Lear Center 
##                   33                   28                   33 
##            Tim Scott           Kyuri Park        USC Libraries 
##                    9                   24                   33 
##      FiveThirtyEight             CSII USC          Adam Badawy 
##                   24                   33                   24 
##      Christian Grose        Taylor Dalton      Washington Post 
##                   27                   24                   24 
## USC Shoah Foundation     Youssef Chouhoud       Fanny Cisneros 
##                   33                   24                   17 
##         CUP Politics         Robert Shrum                  PSA 
##                   24                   18                   24 
## USC Visions & Voices        USC Annenberg    USC Wrigley Inst. 
##                   33                   33                   31 
##   USC Viterbi School                  PSQ    American_Politics 
##                   33                   24                   24 
##  Journal of Politics       M Drake Reitan        Anne van Wijk 
##                   23                   15                    8 
##                 JEPS             USC EALC          Whitney Hua 
##                    8                   20                   20 
##  Wall Street Journal        sara sadhwani     USC Dornsife CFR 
##                   24                   24                   33 
##      USC Dermatology          Nola Haynes Tyler Bonanno-Curley 
##                   24                   18                   22 
##              USC SIR                  PGI         Adam Feldman 
##                   27                   24                   14 
##            Megan Eme               RISIST             PERE USC 
##                   16                   24                   33 
##  USC Unruh Institute  CA.gov (California)                 APSA 
##                   33                    8                   24 
##    Nicolás Albertoni                LAFLA               USCKSI 
##                   24                   12                   33 
##          joshua timm     The Harris Poll®          AP Politics 
##                    3                    7                   12 
##     Long Beach Mayor        Meredith Shaw        Pablo Barberá 
##                   13                   18                   24 
## Sen Dianne Feinstein        USC Economics      USC Social Work 
##                   17                    6                   33 
##   Fulbright Programs Graduate Student Gov   Polymathic Academy 
##                   14                   24                   28 
##       bryn rosenfeld USC Bedrosian Center The Associated Press 
##                    1                   33                   24 
##   USC Cinematic Arts USC Computer Science               Gallup 
##                   30                   29                    9 
##             USC CRCC             Ronan Fu            Dave Kang 
##                   33                   24                   24 
##       Fels Institute USC Public Diplomacy             Pongkwan 
##                    7                   33                   15 
##      Joey Huddleston    USC Annenberg PhD Keck Medicine of USC 
##                    9                   28                   33 
##            Kyle Rapp        Sangay Mishra       Political Data 
##                   24                   23                   10 
##                  ISA  USC Graduate School         Mark Paradis 
##                   24                   33                   22 
##     Evgeniia Iakhnis                  CNN              USC CIS 
##                   24                   24                   24 
##      Quinnipiac Poll  Erin Baggott Carter   The New York Times 
##                    6                   22                   24 
##         Brett Carter    NetDem Lab at USC            Abby Wood 
##                   20                   15                   24 
##              theWPSA             SPEC Lab    Stefanie Neumeier 
##                   24                   24                   23 
##        Andy Sinclair         Brian Knafou     USC Price School 
##                   24                   22                   33 
##    Kelebogile Zvobgo Victoria Chonn Ching         PSRM journal 
##                   24                   24                   20

which(coreness(g)==33) # what is the core of the network?

##        USC Gould Law         USC Dornsife        USC Sociology 
##                    3                    9                   10 
##   USC Religious Life         USC Research         USC Marshall 
##                   18                   21                   22 
##         Daily Trojan   USC Annenberg CCLP          USC Rossier 
##                   30                   31                   34 
##             USC EASC                  USC   Norman Lear Center 
##                   35                   37                   39 
##        USC Libraries             CSII USC USC Shoah Foundation 
##                   42                   44                   49 
## USC Visions & Voices        USC Annenberg   USC Viterbi School 
##                   55                   56                   58 
##     USC Dornsife CFR             PERE USC  USC Unruh Institute 
##                   69                   78                   79 
##               USCKSI      USC Social Work USC Bedrosian Center 
##                   84                   93                   98 
##             USC CRCC USC Public Diplomacy Keck Medicine of USC 
##                  103                  107                  111 
##  USC Graduate School     USC Price School 
##                  116                  132

which(coreness(g)==1) # what is the periphery of the network?

## bryn rosenfeld 
##             97

# looking at what predicts being in the core
nodes$k <- coreness(g)
# number of followers?
plot(nodes$k, log(nodes$followers_count))

cor(nodes$k, log(nodes$followers_count))

## [1] 0.09102953

# text?
poir <- dfm(corpus(nodes[,c("description", "k")], text_field="description"))
head(textstat_keyness(poir, target=docvars(poir)$k==33,
                      measure="lr"), n=20)

##                  G2           p n_target n_reference
## school     7.782301 0.005276054        8           4
## &          7.521373 0.006097118       16          19
## usc        5.703835 0.016927889       14          18
## center     5.425831 0.019840994        6           3
## shaping    4.613143 0.031727836        3           0
## for        4.194837 0.040547244       15          23
## a          3.775917 0.051995401        8           8
## education  3.668911 0.055436377        5           3
## the        2.835163 0.092221448       36          79
## you        2.584930 0.107885152        3           1
## change     2.584930 0.107885152        3           1
## journalism 2.584930 0.107885152        3           1
## to         2.413369 0.120303230        9          14
## dornsife   2.251393 0.133494144        2           0
## academic   2.251393 0.133494144        2           0
## humanities 2.251393 0.133494144        2           0
## sciences   2.251393 0.133494144        2           0
## with       2.251393 0.133494144        2           0
## fostering  2.251393 0.133494144        2           0
## instagram  2.251393 0.133494144        2           0

head(textstat_keyness(poir, target=docvars(poir)$k<5,
                      measure="lr"), n=20)

##                          G2          p n_target n_reference
## science        3.7072849954 0.05417545        2          31
## political      3.2598286147 0.07099655        2          37
## assistant      1.7796165999 0.18219641        1           5
## professor      1.5037237810 0.22009929        1           7
## student        1.2105717370 0.27121892        1          10
## southern       0.9397763156 0.33233537        1          14
## phd            0.9397763156 0.33233537        1          14
## of             0.8531906330 0.35565129        2         110
## california     0.8354276367 0.36070775        1          16
## university     0.7049193744 0.40113563        1          19
## at             0.4324209091 0.51080343        1          28
## usc            0.3675731776 0.54433008        1          31
## ,              0.0237610379 0.87749449        1         146
## and           -0.0001319675 0.99083434        0          87
## british       -0.0117164030 0.91380346        0           1
## top-20        -0.0117164030 0.91380346        0           1
## gould         -0.0117164030 0.91380346        0           1
## world-class   -0.0117164030 0.91380346        0           1
## unparalleled  -0.0117164030 0.91380346        0           1
## opportunities -0.0117164030 0.91380346        0           1

If you want to learn more about this technique, we recently published a paper in PLOS ONE where we use it to study large-scale Twitter networks in the context of protest events.

Finally, this is the code I used to put together the network:

library(netdemR)
options(stringsAsFactors=FALSE)
oauth_folder <- "~/Dropbox/credentials/twitter"

accounts <- getFriends("uscpoir", oauth_folder=oauth_folder)

# creating folder for the friend lists (if it does not exist)
dir.create("friends", showWarnings=FALSE)

# first check if there's any list of friends already downloaded to 'friends'
accounts.done <- gsub("\\.rdata$", "", list.files("friends"))
accounts.left <- accounts[accounts %in% accounts.done == FALSE]
accounts.left <- accounts.left[!is.na(accounts.left)]

# loop over the rest of accounts, downloading friend lists from API
while (length(accounts.left) > 0){

    # sample randomly one account to get friends
    new.user <- sample(accounts.left, 1)
    #new.user <- accounts.left[1]
    cat(new.user, "---", length(accounts.left), " accounts left!\n")    
    
    # download friends (with some exception handling...)
    error <- tryCatch(friends <- getFriends(user_id=new.user,
        oauth_folder=oauth_folder, sleep=0.5, verbose=FALSE), error=function(e) e)
    if (inherits(error, 'error')) {
        cat("Error! On to the next one...\n")
        accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
        next
    }
    
    # save to file and remove from lists of "accounts.left"
    file.name <- paste0("friends/", new.user, ".rdata")
    save(friends, file=file.name)
    accounts.left <- accounts.left[-which(accounts.left %in% new.user)]

}

# keeping only those accounts for which we downloaded a list of friends
accounts <- gsub("\\.rdata$", "", list.files("friends"))

# reading and creating network
edges <- list()
for (i in seq_along(accounts)){
    file.name <- paste0("friends/", accounts[i], ".rdata")
    load(file.name)
    if (length(friends)==0){ next }
    chosen <- accounts[accounts %in% friends]
    if (length(chosen)==0){ next }
    edges[[i]] <- data.frame(
        source = accounts[i], target = chosen)
}

edges <- do.call(rbind, edges)
nodes <- data.frame(id_str=unique(c(edges$source, edges$target)))

# adding user data
users <- getUsersBatch(ids=nodes$id_str, oauth_folder=oauth_folder)
nodes <- merge(nodes, users)

library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g

names(nodes)[1:2] <- c("Id", "Label")
names(edges)[1:2] <- c("Source", "Target")
write.csv(nodes, file="../data/poir-nodes.csv", row.names=FALSE)
write.csv(edges, file="../data/poir-edges.csv", row.names=FALSE)
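
The edge-construction loop above keeps, for each account, only those friends that are themselves in the set of accounts. A minimal sketch with made-up friend lists shows that logic without calling the Twitter API. All IDs and the names `friend_lists`, `toy_accounts` and `toy_edges` are hypothetical.

```r
# Hypothetical friend lists: names are account IDs, values are the IDs they follow
friend_lists <- list(
  "1" = c("2", "3", "99"),   # "99" is outside the ego network
  "2" = c("1"),
  "3" = character(0)         # follows nobody
)
toy_accounts <- names(friend_lists)

# For each account, keep only friends that are also in the set of accounts
toy_edges <- lapply(toy_accounts, function(a) {
  chosen <- intersect(friend_lists[[a]], toy_accounts)
  if (length(chosen) == 0) return(NULL)
  data.frame(source = a, target = chosen, stringsAsFactors = FALSE)
})

# rbind drops the NULL entries left by accounts without any retained friends
toy_edges <- do.call(rbind, toy_edges)
toy_edges
```

Accounts outside the set (here "99") never appear as targets, so the resulting edge list describes only ties among the accounts in the ego network.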

Pablo Barbera

October 31, 2017
