In this guided coding session we will use a small dataset to illustrate how to identify latent communities in networks. The dataset corresponds to the Twitter ego network of USC POIR: each node is a Twitter account that the USC POIR account follows, and each edge indicates that one of those accounts follows another. (See the end of this script for the code I used to put together this network.) Edges are thus directed.
The first step is to read the list of edges and nodes in this network:
edges <- read.csv("../data/poir-edges.csv", stringsAsFactors=FALSE)
head(edges)
## Source Target
## 1 112448318 116630713
## 2 112448318 119679411
## 3 112448318 119682506
## 4 112448318 135469780
## 5 112448318 1852171094
## 6 112448318 186894716
nodes <- read.csv("../data/poir-nodes.csv", stringsAsFactors=FALSE)
head(nodes)
## Id Label name
## 1 112448318 BJPolS British Jnl Pol Sci
## 2 1137637033 AJPS_Editor AJPS
## 3 114865774 USCGouldLaw USC Gould Law
## 4 116630713 monkeycageblog Monkey Cage
## 5 1180479770 jaj7d Jeff Jenkins
## 6 119679411 EPSRjournal Euro Pol Sci Review
## description
## 1 British Journal of Political Science from Cambridge University Press.
## 2 American Journal of Political Science
## 3 A top-20 law school, USC Gould offers students a world-class education and unparalleled opportunities.
## 4 H.L. Mencken said: "Democracy is the art of running the circus from the monkey cage." We do political science and politics. Tweets by bot.
## 5 Provost Professor of Public Policy, Political Science, and Law. @USC\nDirector, @BedrosianCenter
## 6 European Political Science Review, the new journal from ECPR and Cambridge University Press
## followers_count statuses_count friends_count
## 1 8720 3563 379
## 2 7014 506 64
## 3 5638 6688 2267
## 4 39323 18731 411
## 5 70 10 79
## 6 7402 2638 189
## created_at location lang
## 1 Mon Feb 08 14:52:14 +0000 2010 Cambridge en
## 2 Thu Jan 31 18:42:30 +0000 2013 en
## 3 Tue Feb 16 21:24:55 +0000 2010 Los Angeles, CA en
## 4 Tue Feb 23 03:53:00 +0000 2010 en
## 5 Thu Feb 14 22:26:28 +0000 2013 Los Angeles, CA en
## 6 Thu Mar 04 09:44:16 +0000 2010 en
## time_zone status.id_str status.created_at
## 1 London 9.253148e+17 Tue Oct 31 10:53:28 +0000 2017
## 2 Eastern Time (US & Canada) 9.254536e+17 Tue Oct 31 20:05:05 +0000 2017
## 3 Pacific Time (US & Canada) 9.254832e+17 Tue Oct 31 22:02:45 +0000 2017
## 4 Central Time (US & Canada) 9.253175e+17 Tue Oct 31 11:04:23 +0000 2017
## 5 Eastern Time (US & Canada) 9.174209e+17 Mon Oct 09 16:05:55 +0000 2017
## 6 London 9.253079e+17 Tue Oct 31 10:26:10 +0000 2017
## status.text
## 1 #FirstView - The Measurement of Real-Time Perceptions of Financial Stress: Implications for Political Science… https://t.co/v7bp8LgcsV
## 2 Everything to Everyone Electoral Consequences of Broad-Appeal Strategy in Europe https://t.co/ZyoXP7oAJF via @AJPS_Editor #AJPSVirtualIssue
## 3 RT @vanessablum: Love this Twitter debate btwn @isamuel (@Harvard_Law @FirstMondaysFM) & @OrinKerr (@gwlaw @USCGouldLaw). Scholarly joustin…
## 4 Chief Justice Roberts and other judges have a hard time with statistics. That’s a real problem https://t.co/gTNpPurF63
## 5 A little bit about my PIPE initiative at USC. I'm really excited about it. https://t.co/9t3kV2NkAr
## 6 MT @KingsDMES: .@ferdinandeibl & @lynge_mangueira - the effects of democratization on political budget cycles https://t.co/0aqPHM0ckw
For example, we learn that the user with ID 112448318 follows the user with ID 116630713.
We will now convert these two datasets into a network object in R using igraph.
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g
## IGRAPH DN-- 135 2654 --
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges (vertex names):
## [1] British Jnl Pol Sci->Monkey Cage
## [2] British Jnl Pol Sci->Euro Pol Sci Review
## [3] British Jnl Pol Sci->International Theory
## [4] British Jnl Pol Sci->EPSA
## + ... omitted several edges
What does it mean?
- D means directed
- N means named graph
- 135 is the number of nodes
- 2654 is the number of edges
- name (v/c) means name is a node attribute and it’s a character
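The same facts can be read off a graph with accessor functions rather than from the printed summary. The sketch below uses a hypothetical three-node toy graph (toy_edges and toy are illustrative names, not part of the POIR data), so it runs without the CSV files.

```r
library(igraph)

# Hypothetical toy edge list, using the same column names as poir-edges.csv
toy_edges <- data.frame(Source = c("a", "a", "b"),
                        Target = c("b", "c", "c"),
                        stringsAsFactors = FALSE)
toy <- graph_from_data_frame(d = toy_edges, directed = TRUE)

is_directed(toy)        # TRUE for a directed graph (the "D" flag)
is_named(toy)           # TRUE when vertices carry a name attribute (the "N" flag)
vcount(toy)             # number of nodes
ecount(toy)             # number of edges
vertex_attr_names(toy)  # node attributes stored in the graph
```

Running the same calls on g should reproduce the numbers in the first line of its printed summary.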
Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.
The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, directly or through other nodes, but not to the rest of the network. For example, this network has only one component: every node can be reached from every other node (by default, components() ignores edge direction and returns weakly connected components).
components(g)
## $membership
## British Jnl Pol Sci AJPS USC Gould Law
## 1 1 1
## Monkey Cage Jeff Jenkins Euro Pol Sci Review
## 1 1 1
## International Theory Innovation @ Harvard USC Dornsife
## 1 1 1
## USC Sociology SPSA ACLU
## 1 1 1
## EPSA USC History Dept. ✌️ Ray Kwong
## 1 1 1
## Rod Albuyeh Sen. Barbara Boxer USC Religious Life
## 1 1 1
## Jay Maharjan USA TODAY USC Research
## 1 1 1
## USC Marshall NCSL Jason Giannaros
## 1 1 1
## Los Angeles Times Eric Garcetti AnnLab
## 1 1 1
## Marist Poll MPSA Daily Trojan
## 1 1 1
## USC Annenberg CCLP Political Analysis Jordan Carr Peterson
## 1 1 1
## USC Rossier USC EASC Jerry Brown
## 1 1 1
## USC USC PoliticalScience Norman Lear Center
## 1 1 1
## Tim Scott Kyuri Park USC Libraries
## 1 1 1
## FiveThirtyEight CSII USC Adam Badawy
## 1 1 1
## Christian Grose Taylor Dalton Washington Post
## 1 1 1
## USC Shoah Foundation Youssef Chouhoud Fanny Cisneros
## 1 1 1
## CUP Politics Robert Shrum PSA
## 1 1 1
## USC Visions & Voices USC Annenberg USC Wrigley Inst.
## 1 1 1
## USC Viterbi School PSQ American_Politics
## 1 1 1
## Journal of Politics M Drake Reitan Anne van Wijk
## 1 1 1
## JEPS USC EALC Whitney Hua
## 1 1 1
## Wall Street Journal sara sadhwani USC Dornsife CFR
## 1 1 1
## USC Dermatology Nola Haynes Tyler Bonanno-Curley
## 1 1 1
## USC SIR PGI Adam Feldman
## 1 1 1
## Megan Eme RISIST PERE USC
## 1 1 1
## USC Unruh Institute CA.gov (California) APSA
## 1 1 1
## Nicolás Albertoni LAFLA USCKSI
## 1 1 1
## joshua timm The Harris Poll® AP Politics
## 1 1 1
## Long Beach Mayor Meredith Shaw Pablo Barberá
## 1 1 1
## Sen Dianne Feinstein USC Economics USC Social Work
## 1 1 1
## Fulbright Programs Graduate Student Gov Polymathic Academy
## 1 1 1
## bryn rosenfeld USC Bedrosian Center The Associated Press
## 1 1 1
## USC Cinematic Arts USC Computer Science Gallup
## 1 1 1
## USC CRCC Ronan Fu Dave Kang
## 1 1 1
## Fels Institute USC Public Diplomacy Pongkwan
## 1 1 1
## Joey Huddleston USC Annenberg PhD Keck Medicine of USC
## 1 1 1
## Kyle Rapp Sangay Mishra Political Data
## 1 1 1
## ISA USC Graduate School Mark Paradis
## 1 1 1
## Evgeniia Iakhnis CNN USC CIS
## 1 1 1
## Quinnipiac Poll Erin Baggott Carter The New York Times
## 1 1 1
## Brett Carter NetDem Lab at USC Abby Wood
## 1 1 1
## theWPSA SPEC Lab Stefanie Neumeier
## 1 1 1
## Andy Sinclair Brian Knafou USC Price School
## 1 1 1
## Kelebogile Zvobgo Victoria Chonn Ching PSRM journal
## 1 1 1
##
## $csize
## [1] 135
##
## $no
## [1] 1
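Because the POIR network has a single component, it cannot show what a split into several components looks like. The sketch below uses a hypothetical directed toy graph (not the POIR data) to show that, and to contrast weak components, which ignore edge direction, with strong components, which require a path in both directions.

```r
library(igraph)

# Hypothetical directed toy graph: a <-> b, b -> c, and a separate pair d -> e
toy <- graph_from_data_frame(
  data.frame(from = c("a", "b", "b", "d"),
             to   = c("b", "a", "c", "e")),
  directed = TRUE)

weak   <- components(toy, mode = "weak")    # ignores edge direction
strong <- components(toy, mode = "strong")  # requires paths in both directions

weak$no      # number of weakly connected components
strong$no    # number of strongly connected components
weak$csize   # size of each weak component
```

In this toy graph the weak partition has two components ({a, b, c} and {d, e}), while the strong partition has four, because only a and b can reach each other in both directions.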
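Most networks have a single giant connected component that includes most nodes. Most studies of networks actually focus on the giant component (e.g. the shortest path between two nodes in different components is Inf!). Note that decompose() returns a list with one graph per component; here the list has a single element because the network has only one component.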
giant <- decompose(g)
giant
## [[1]]
## IGRAPH DN-- 135 2654 --
## + attr: name (v/c), Label (v/c), description (v/c),
## | followers_count (v/n), statuses_count (v/n), friends_count
## | (v/n), created_at (v/c), location (v/c), lang (v/c), time_zone
## | (v/c), status.id_str (v/n), status.created_at (v/c), status.text
## | (v/c)
## + edges (vertex names):
## [1] British Jnl Pol Sci->Monkey Cage
## [2] British Jnl Pol Sci->Euro Pol Sci Review
## [3] British Jnl Pol Sci->International Theory
## [4] British Jnl Pol Sci->EPSA
## + ... omitted several edges
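When a network does have several components, the giant one has to be picked out of the list that decompose() returns. A minimal sketch on a hypothetical toy graph (parts, comp_sizes and giant_toy are illustrative names):

```r
library(igraph)

# Hypothetical undirected toy graph: a triangle (a, b, c) plus a separate pair (d, e)
toy <- graph_from_literal(a - b, b - c, c - a, d - e)

# decompose() returns a list of graphs, one per component
parts <- decompose(toy)
comp_sizes <- sapply(parts, vcount)

# keep the component with the most nodes
giant_toy <- parts[[which.max(comp_sizes)]]
vcount(giant_toy)

# nodes in different components are infinitely far apart
distances(toy, v = "a", to = "d")
```

Selecting by size rather than by position avoids assuming that the largest component comes first in the list.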
Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.
There are a few different algorithms, each following a different logic.
The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values (longer random walks tend to lead to fewer, larger communities). The algorithm merges nodes into a hierarchy of communities and, by default, reports the partition in that hierarchy with the highest modularity score.
cluster_walktrap(g)
## IGRAPH clustering walktrap, groups: 4, mod: 0.3
## + groups:
## $`1`
## [1] "Innovation @ Harvard" "USC Dornsife" "ACLU"
## [4] "Rod Albuyeh" "Sen. Barbara Boxer" "USA TODAY"
## [7] "NCSL" "Jason Giannaros" "Los Angeles Times"
## [10] "Marist Poll" "Jordan Carr Peterson" "Jerry Brown"
## [13] "USC PoliticalScience" "Tim Scott" "Kyuri Park"
## [16] "FiveThirtyEight" "Adam Badawy" "Christian Grose"
## [19] "Taylor Dalton" "Washington Post" "Youssef Chouhoud"
## [22] "Fanny Cisneros" "Robert Shrum" "American_Politics"
## [25] "M Drake Reitan" "Anne van Wijk" "Whitney Hua"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=10)
## IGRAPH clustering walktrap, groups: 3, mod: 0.34
## + groups:
## $`1`
## [1] "Innovation @ Harvard" "ACLU" "Rod Albuyeh"
## [4] "Sen. Barbara Boxer" "USA TODAY" "NCSL"
## [7] "Jason Giannaros" "Los Angeles Times" "Marist Poll"
## [10] "Jordan Carr Peterson" "Tim Scott" "Kyuri Park"
## [13] "FiveThirtyEight" "Adam Badawy" "Christian Grose"
## [16] "Taylor Dalton" "Washington Post" "Youssef Chouhoud"
## [19] "Fanny Cisneros" "Robert Shrum" "Anne van Wijk"
## [22] "Whitney Hua" "Wall Street Journal" "sara sadhwani"
## [25] "Nola Haynes" "Tyler Bonanno-Curley" "USC SIR"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=20)
## IGRAPH clustering walktrap, groups: 3, mod: 0.3
## + groups:
## $`1`
## [1] "British Jnl Pol Sci" "AJPS" "Monkey Cage"
## [4] "Jeff Jenkins" "Euro Pol Sci Review" "International Theory"
## [7] "SPSA" "EPSA" "Rod Albuyeh"
## [10] "NCSL" "Jason Giannaros" "Marist Poll"
## [13] "MPSA" "Political Analysis" "Jordan Carr Peterson"
## [16] "Tim Scott" "Kyuri Park" "FiveThirtyEight"
## [19] "Adam Badawy" "Christian Grose" "Youssef Chouhoud"
## [22] "CUP Politics" "PSA" "PSQ"
## [25] "American_Politics" "Journal of Politics" "Anne van Wijk"
## + ... omitted several groups/vertices
cluster_walktrap(g, steps=30)
## IGRAPH clustering walktrap, groups: 3, mod: 0.3
## + groups:
## $`1`
## [1] "British Jnl Pol Sci" "AJPS" "Monkey Cage"
## [4] "Jeff Jenkins" "Euro Pol Sci Review" "International Theory"
## [7] "SPSA" "EPSA" "Rod Albuyeh"
## [10] "NCSL" "Jason Giannaros" "Marist Poll"
## [13] "MPSA" "Political Analysis" "Jordan Carr Peterson"
## [16] "Tim Scott" "Kyuri Park" "FiveThirtyEight"
## [19] "Adam Badawy" "Christian Grose" "Youssef Chouhoud"
## [22] "CUP Politics" "PSA" "PSQ"
## [25] "American_Politics" "Journal of Politics" "Anne van Wijk"
## + ... omitted several groups/vertices
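Rather than scanning the printed output for each value of steps, the number of groups and the modularity can be collected in one table. The sketch below does this on Zachary's karate club network, which ships with igraph, as a stand-in for the POIR data; steps_to_try and results are illustrative names.

```r
library(igraph)

# Zachary's karate club network, a small benchmark graph bundled with igraph
karate <- make_graph("Zachary")

steps_to_try <- c(2, 4, 10, 20)
results <- data.frame(steps = steps_to_try, groups = NA, modularity = NA)

for (j in seq_along(steps_to_try)) {
  wt <- cluster_walktrap(karate, steps = steps_to_try[j])
  results$groups[j] <- length(wt)          # number of communities found
  results$modularity[j] <- modularity(wt)  # modularity of the reported partition
}
results
```

Replacing karate with g should give the same kind of summary for the network used in this session.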
Other methods are:
cluster_infomap(g)
## IGRAPH clustering infomap, groups: 4, mod: 0.35
## + groups:
## $`1`
## [1] "USC Gould Law" "USC Dornsife" "USC Sociology"
## [4] "USC History Dept. ✌️" "USC Religious Life" "Jay Maharjan"
## [7] "USC Research" "USC Marshall" "Eric Garcetti"
## [10] "AnnLab" "Daily Trojan" "USC Annenberg CCLP"
## [13] "USC Rossier" "USC EASC" "USC"
## [16] "USC PoliticalScience" "Norman Lear Center" "USC Libraries"
## [19] "CSII USC" "USC Shoah Foundation" "Robert Shrum"
## [22] "USC Visions & Voices" "USC Annenberg" "USC Wrigley Inst."
## [25] "USC Viterbi School" "M Drake Reitan" "USC EALC"
## + ... omitted several groups/vertices
cluster_edge_betweenness(g)
## IGRAPH clustering edge betweenness, groups: 91, mod: 0.033
## + groups:
## $`1`
## [1] "British Jnl Pol Sci" "AJPS" "Monkey Cage"
## [4] "Euro Pol Sci Review" "ACLU" "USA TODAY"
## [7] "Los Angeles Times" "MPSA" "Kyuri Park"
## [10] "FiveThirtyEight" "Adam Badawy" "Washington Post"
## [13] "CUP Politics" "M Drake Reitan" "Anne van Wijk"
## [16] "Whitney Hua" "Wall Street Journal" "sara sadhwani"
## [19] "Nola Haynes" "Adam Feldman" "Megan Eme"
## [22] "RISIST" "APSA" "Meredith Shaw"
## [25] "The Associated Press" "Ronan Fu" "Dave Kang"
## + ... omitted several groups/vertices
cluster_label_prop(g)
## IGRAPH clustering label propagation, groups: 1, mod: 0
## + groups:
## $`1`
## [1] "British Jnl Pol Sci" "AJPS"
## [3] "USC Gould Law" "Monkey Cage"
## [5] "Jeff Jenkins" "Euro Pol Sci Review"
## [7] "International Theory" "Innovation @ Harvard"
## [9] "USC Dornsife" "USC Sociology"
## [11] "SPSA" "ACLU"
## [13] "EPSA" "USC History Dept. ✌️"
## [15] "Ray Kwong" "Rod Albuyeh"
## [17] "Sen. Barbara Boxer" "USC Religious Life"
## + ... omitted several groups/vertices
cluster_louvain(as.undirected(g))
## IGRAPH clustering multi level, groups: 4, mod: 0.33
## + groups:
## $`1`
## [1] "British Jnl Pol Sci" "AJPS" "Monkey Cage"
## [4] "Jeff Jenkins" "Euro Pol Sci Review" "International Theory"
## [7] "SPSA" "EPSA" "MPSA"
## [10] "Political Analysis" "Jordan Carr Peterson" "FiveThirtyEight"
## [13] "CUP Politics" "PSA" "PSQ"
## [16] "American_Politics" "Journal of Politics" "JEPS"
## [19] "sara sadhwani" "PGI" "APSA"
## [22] "Sangay Mishra" "Political Data" "ISA"
## [25] "Abby Wood" "theWPSA" "Stefanie Neumeier"
## + ... omitted several groups/vertices
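One way to see how much two of these methods agree is to cross-tabulate their memberships and compute a similarity score between the partitions. The sketch below uses the karate club network bundled with igraph as a stand-in. The choice of normalized mutual information is an assumption made for illustration; compare() offers other measures as well.

```r
library(igraph)

set.seed(123)  # Louvain has a random component
karate <- make_graph("Zachary")

wt <- cluster_walktrap(karate)
lv <- cluster_louvain(karate)

# cross-tabulate the two membership vectors
table(walktrap = as.vector(membership(wt)),
      louvain  = as.vector(membership(lv)))

# normalized mutual information: 1 means identical partitions
nmi <- compare(wt, lv, method = "nmi")
nmi
```

Cluster numbers are arbitrary labels, so what matters in the table is whether each row concentrates in a single column, not whether the numbers match.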
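The choice of one algorithm or another may depend on substantive or practical reasons, as always. For now, let’s pick the Louvain algorithm. (igraph’s implementation only works on undirected graphs, which is why we convert g with as.undirected().)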
comm <- cluster_louvain(as.undirected(g))
nodes$cluster <- membership(comm)
nodes$Label[nodes$cluster==1]
## [1] "BJPolS" "AJPS_Editor" "monkeycageblog"
## [4] "jaj7d" "EPSRjournal" "InternatlTheory"
## [7] "SPSAnews" "europsa" "MPSAnet"
## [10] "polanalysis" "JordanCarrP" "FiveThirtyEight"
## [13] "CUP_PoliSci" "PolStudiesAssoc" "PSQ_CSPC"
## [16] "PSA_APG" "The_JOP" "JEPS_ed"
## [19] "sarasadhwani" "PGI_WPSA" "APSAtweets"
## [22] "SangayMishra" "Political_Data" "isanet"
## [25] "yesthatabbywood" "theWPSA" "SteffiNeumeier"
## [28] "jandrewsinclair" "PSRMJournal"
nodes$Label[nodes$cluster==2]
## [1] "ACLU" "raykwong" "RodAlbuyeh" "jasongiannaros"
## [5] "kyuripark1" "adambbadawy" "christiangrose" "taylordalton"
## [9] "_abuelbanat" "BobShrum" "mdrakereitan" "annevwijk"
## [13] "whitney_hua" "USC_SIR" "AdamSFeldman" "N_Albertoni"
## [17] "verbal_gaffe" "changmishaw" "p_barbera" "USC_Econ"
## [21] "brynrosenfeld" "ronantfu" "daveckang" "joeyhuddleston"
## [25] "KyleSRapp" "markpa84" "geniia_iakhnis" "UscCis"
## [29] "baggottcarter" "brett_l_carter" "NetDem_USC" "SPECLabUSC"
## [33] "Bknafou" "kelly_zvobgo" "V_Chonn"
nodes$Label[nodes$cluster==3]
## [1] "USCGouldLaw" "USCDornsife" "USC_Soci"
## [4] "USCHistory" "USCRELIGIOUSLIF" "4entrepreneur"
## [7] "USC_Research" "USCMarshall" "ericgarcetti"
## [10] "annenberglab" "dailytrojan" "USC_CCLP"
## [13] "USCRossier" "USCeasc" "JerryBrownGov"
## [16] "USC" "USCPOSC" "LearCenter"
## [19] "USCLibraries" "CSII_USC" "USCShoahFdn"
## [22] "VisionsnVoices" "USCAnnenberg" "USCWrigleyInst"
## [25] "USCViterbi" "USC_EALC" "USC_CFR"
## [28] "USCDermatology" "AngeMarieH" "PERE_USC"
## [31] "UnruhInstitute" "CAgovernment" "USCKSI"
## [34] "uscsocialwork" "FulbrightPrgrm" "USCGSG"
## [37] "USCPolymathy" "BedrosianCenter" "USCCinema"
## [40] "CSatUSC" "usccrcc" "PublicDiplomacy"
## [43] "USCAnnenbergPhD" "KeckMedUSC" "USCGradSchool"
## [46] "USCPrice"
nodes$Label[nodes$cluster==4]
## [1] "GovInnovations" "SenatorBoxer" "USATODAY"
## [4] "NCSLorg" "latimes" "maristpoll"
## [7] "SenatorTimScott" "washingtonpost" "fancis30"
## [10] "WSJ" "nolahtheveil" "curleyt13"
## [13] "megan_eme" "LegalAidLA" "HarrisPoll"
## [16] "AP_Politics" "LongBeachMayor" "SenFeinstein"
## [19] "AP" "Gallup" "PennFels"
## [22] "pongkwans" "CNN" "QuinnipiacPoll"
## [25] "nytimes"
table(grepl("los angeles", nodes$location, ignore.case=TRUE),
nodes$cluster)
##
## 1 2 3 4
## FALSE 25 11 10 20
## TRUE 4 24 36 5
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.2
## quanteda version 0.99.9
## Using 3 of 4 threads for parallel computing
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:igraph':
##
## %>%, similarity
## The following object is masked from 'package:utils':
##
## View
for (i in 1:4){
  message("Cluster ", i)
  dfm <- dfm(nodes$description[nodes$cluster==i],
             remove_punct=TRUE, remove=stopwords("english"))
  print(topfeatures(dfm, n=25))
}
## Cluster 1
## political science politics journal association
## 25 21 11 9 8
## international university research law american
## 7 6 6 5 4
## published cambridge press professor european
## 4 3 3 3 3
## phd studies study public policy
## 3 3 3 2 2
## endorsement candidate twitter relations experimental
## 2 2 2 2 2
## Cluster 2
## political science phd usc student
## 13 10 9 8 7
## relations southern california international social
## 7 7 7 7 6
## candidate university @usc politics professor
## 6 6 5 5 5
## ph.d media assistant https t.co
## 4 4 4 3 3
## director research organization us institute
## 3 3 2 2 2
## Cluster 3
## usc school account university official public
## 21 10 9 8 8 8
## southern california us research education follow
## 7 7 7 7 6 6
## twitter center department news study students
## 6 6 5 5 5 4
## across social rts media t.co policy
## 4 4 4 4 4 4
## offers
## 3
## Cluster 4
## news t.co https public breaking government
## 9 9 6 4 4 3
## u.s senator official state http poll
## 3 3 3 3 3 3
## opinion follow harvard innovation california latest
## 3 3 2 2 2 2
## stories twitter national est los angeles
## 2 2 2 2 2 2
## world
## 2
# description
poir <- dfm(corpus(nodes[,c("description", "cluster")], text_field="description"))
for (i in 1:4){
  print(
    head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                          measure="lr"), n=20)
  )
}
## G2 p n_target n_reference
## political 37.818263 7.765143e-10 25 14
## science 31.279728 2.234002e-08 21 12
## journal 24.331179 8.111543e-07 9 0
## association 21.149616 4.247864e-06 8 0
## politics 10.955381 9.333220e-04 11 9
## published 8.695019 3.190807e-03 4 0
## law 6.732350 9.467980e-03 5 2
## american 6.029723 1.406694e-02 4 1
## cambridge 5.718189 1.679004e-02 3 0
## european 5.718189 1.679004e-02 3 0
## that 5.718189 1.679004e-02 3 0
## international 5.246374 2.199255e-02 7 7
## press 3.558967 5.922460e-02 3 1
## we 3.558967 5.922460e-02 3 1
## " 2.887638 8.926168e-02 2 0
## experimental 2.887638 8.926168e-02 2 0
## western 2.887638 8.926168e-02 2 0
## methods 2.887638 8.926168e-02 2 0
## the 2.427361 1.192335e-01 29 86
## research 1.986965 1.586586e-01 6 10
## G2 p n_target n_reference
## phd 7.346920 0.006717776 9 6
## ph.d 7.261400 0.007045235 4 0
## candidate 7.156528 0.007469165 6 2
## , 6.902388 0.008608070 49 98
## ; 6.748438 0.009382978 8 5
## student 6.166356 0.013020257 7 4
## relations 6.166356 0.013020257 7 4
## at 6.065362 0.013785852 13 16
## ( 5.669637 0.017261024 8 6
## ) 5.669637 0.017261024 8 6
## | 4.854353 0.027576447 10 12
## social 4.422759 0.035462647 6 4
## in 3.880020 0.048863965 13 20
## professor 3.811790 0.050893033 5 3
## @usc 3.811790 0.050893033 5 3
## international 3.350260 0.067194398 7 7
## assistant 3.218875 0.072793641 4 2
## southern 2.706054 0.099968002 7 8
## director 2.670157 0.102245924 3 1
## ~ 2.307880 0.128719464 2 0
## G2 p n_target n_reference
## for 12.820970 0.000342756 25 13
## usc 10.646869 0.001102574 21 11
## school 8.835417 0.002954402 10 2
## ! 7.343656 0.006729981 9 2
## account 7.343656 0.006729981 9 2
## on 5.247741 0.021975284 13 8
## & 4.213050 0.040114157 19 16
## a 4.185726 0.040765761 10 6
## department 3.538785 0.059949361 5 1
## not 3.538785 0.059949361 5 1
## education 3.257616 0.071092401 6 2
## - 3.134481 0.076652793 8 4
## offers 2.699006 0.100410820 3 0
## annenberg 2.699006 0.100410820 3 0
## work 2.699006 0.100410820 3 0
## east 2.699006 0.100410820 3 0
## events 2.699006 0.100410820 3 0
## shaping 2.699006 0.100410820 3 0
## environmental 2.699006 0.100410820 3 0
## engineering 2.699006 0.100410820 3 0
## G2 p n_target n_reference
## / 31.231458 2.290245e-08 30 28
## most 11.798784 5.926942e-04 5 0
## news 11.644335 6.439834e-04 9 5
## breaking 8.733379 3.124370e-03 4 0
## from 8.181582 4.231783e-03 8 6
## t.co 8.109129 4.404288e-03 9 8
## : 6.642950 9.954894e-03 13 21
## senator 5.745174 1.653402e-02 3 0
## poll 5.745174 1.653402e-02 3 0
## https 4.355313 3.689367e-02 6 6
## government 3.583236 5.836536e-02 3 1
## u.s 3.583236 5.836536e-02 3 1
## opinion 3.583236 5.836536e-02 3 1
## harvard 2.903270 8.840004e-02 2 0
## stories 2.903270 8.840004e-02 2 0
## national 2.903270 8.840004e-02 2 0
## est 2.903270 8.840004e-02 2 0
## los 2.903270 8.840004e-02 2 0
## angeles 2.903270 8.840004e-02 2 0
## editors 2.903270 8.840004e-02 2 0
# location
poir <- dfm(corpus(nodes[,c("location", "cluster")], text_field="location"))
for (i in 1:4){
  print(
    head(textstat_keyness(poir, target=docvars(poir)$cluster==i,
                          measure="lr"), n=20)
  )
}
## G2 p n_target n_reference
## new 6.071688 0.01373657 4 2
## uk 3.631702 0.05668882 2 0
## university 3.631702 0.05668882 2 0
## united 3.631702 0.05668882 2 0
## ny 3.425055 0.06421409 3 2
## york 3.425055 0.06421409 3 2
## states 2.040867 0.15312233 2 1
## cambridge 0.675642 0.41109145 1 0
## atlanta 0.675642 0.41109145 1 0
## ga 0.675642 0.41109145 1 0
## europe 0.675642 0.41109145 1 0
## #mpsa18 0.675642 0.41109145 1 0
## april 0.675642 0.41109145 1 0
## 5-8 0.675642 0.41109145 1 0
## chicago 0.675642 0.41109145 1 0
## london 0.675642 0.41109145 1 0
## texas 0.675642 0.41109145 1 0
## a 0.675642 0.41109145 1 0
## & 0.675642 0.41109145 1 0
## m 0.675642 0.41109145 1 0
## G2 p n_target n_reference
## ca 2.2540208 0.1332677 20 33
## seoul 1.8338339 0.1756754 2 0
## korea 1.8338339 0.1756754 2 0
## los 1.3947083 0.2376116 24 45
## angeles 1.3947083 0.2376116 24 45
## of 0.5986740 0.4390844 2 1
## usa 0.3980593 0.5280932 1 1
## | 0.3980593 0.5280932 1 1
## the 0.3980593 0.5280932 1 1
## all 0.1985045 0.6559307 1 0
## 50 0.1985045 0.6559307 1 0
## pacific 0.1985045 0.6559307 1 0
## rim 0.1985045 0.6559307 1 0
## cairo 0.1985045 0.6559307 1 0
## utrecht 0.1985045 0.6559307 1 0
## netherlands 0.1985045 0.6559307 1 0
## washdc 0.1985045 0.6559307 1 0
## y 0.1985045 0.6559307 1 0
## uruguay 0.1985045 0.6559307 1 0
## republic 0.1985045 0.6559307 1 0
## G2 p n_target n_reference
## los 6.7652616 0.00929493 36 33
## angeles 6.7652616 0.00929493 36 33
## usc 5.9834390 0.01444082 5 0
## ca 3.0185465 0.08231721 26 27
## california 0.5370916 0.46364056 5 4
## - 0.1763771 0.67450537 2 1
## hazel 0.0585039 0.80887641 1 0
## stanley 0.0585039 0.80887641 1 0
## hall 0.0585039 0.80887641 1 0
## 314 0.0585039 0.80887641 1 0
## 3520 0.0585039 0.80887641 1 0
## trousdale 0.0585039 0.80887641 1 0
## pkwy 0.0585039 0.80887641 1 0
## la 0.0585039 0.80887641 1 0
## 90089 0.0585039 0.80887641 1 0
## las 0.0585039 0.80887641 1 0
## vegas 0.0585039 0.80887641 1 0
## sacramento 0.0585039 0.80887641 1 0
## dml 0.0585039 0.80887641 1 0
## 241 0.0585039 0.80887641 1 0
## G2 p n_target n_reference
## washington 9.4288031 0.002136037 5 1
## . 6.5673525 0.010386635 4 1
## d.c 3.9054281 0.048130363 3 1
## global 3.0998275 0.078300588 2 0
## dc 1.5833935 0.208272569 2 1
## harvard 0.5244819 0.468936081 1 0
## kennedy 0.5244819 0.468936081 1 0
## school 0.5244819 0.468936081 1 0
## today 0.5244819 0.468936081 1 0
## hq 0.5244819 0.468936081 1 0
## mclean 0.5244819 0.468936081 1 0
## va 0.5244819 0.468936081 1 0
## denver 0.5244819 0.468936081 1 0
## co 0.5244819 0.468936081 1 0
## poughkeepsie 0.5244819 0.468936081 1 0
## south 0.5244819 0.468936081 1 0
## carolina 0.5244819 0.468936081 1 0
## hollywood 0.5244819 0.468936081 1 0
## long 0.5244819 0.468936081 1 0
## beach 0.5244819 0.468936081 1 0
The final way in which we can think about network communities is in terms of hierarchy or structure. We’ll discuss one of these methods.
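K-core decomposition allows us to identify the core and the periphery of the network. A k-core is a maximal subgraph in which every node has degree at least k, counting only ties to other nodes in that subgraph. The coreness of a node is the largest k for which it belongs to a k-core.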
coreness(g)
## British Jnl Pol Sci AJPS USC Gould Law
## 24 24 33
## Monkey Cage Jeff Jenkins Euro Pol Sci Review
## 24 19 20
## International Theory Innovation @ Harvard USC Dornsife
## 17 6 33
## USC Sociology SPSA ACLU
## 33 24 15
## EPSA USC History Dept. ✌️ Ray Kwong
## 19 32 17
## Rod Albuyeh Sen. Barbara Boxer USC Religious Life
## 10 12 33
## Jay Maharjan USA TODAY USC Research
## 18 24 33
## USC Marshall NCSL Jason Giannaros
## 33 17 5
## Los Angeles Times Eric Garcetti AnnLab
## 29 24 31
## Marist Poll MPSA Daily Trojan
## 12 24 33
## USC Annenberg CCLP Political Analysis Jordan Carr Peterson
## 33 24 24
## USC Rossier USC EASC Jerry Brown
## 33 33 21
## USC USC PoliticalScience Norman Lear Center
## 33 28 33
## Tim Scott Kyuri Park USC Libraries
## 9 24 33
## FiveThirtyEight CSII USC Adam Badawy
## 24 33 24
## Christian Grose Taylor Dalton Washington Post
## 27 24 24
## USC Shoah Foundation Youssef Chouhoud Fanny Cisneros
## 33 24 17
## CUP Politics Robert Shrum PSA
## 24 18 24
## USC Visions & Voices USC Annenberg USC Wrigley Inst.
## 33 33 31
## USC Viterbi School PSQ American_Politics
## 33 24 24
## Journal of Politics M Drake Reitan Anne van Wijk
## 23 15 8
## JEPS USC EALC Whitney Hua
## 8 20 20
## Wall Street Journal sara sadhwani USC Dornsife CFR
## 24 24 33
## USC Dermatology Nola Haynes Tyler Bonanno-Curley
## 24 18 22
## USC SIR PGI Adam Feldman
## 27 24 14
## Megan Eme RISIST PERE USC
## 16 24 33
## USC Unruh Institute CA.gov (California) APSA
## 33 8 24
## Nicolás Albertoni LAFLA USCKSI
## 24 12 33
## joshua timm The Harris Poll® AP Politics
## 3 7 12
## Long Beach Mayor Meredith Shaw Pablo Barberá
## 13 18 24
## Sen Dianne Feinstein USC Economics USC Social Work
## 17 6 33
## Fulbright Programs Graduate Student Gov Polymathic Academy
## 14 24 28
## bryn rosenfeld USC Bedrosian Center The Associated Press
## 1 33 24
## USC Cinematic Arts USC Computer Science Gallup
## 30 29 9
## USC CRCC Ronan Fu Dave Kang
## 33 24 24
## Fels Institute USC Public Diplomacy Pongkwan
## 7 33 15
## Joey Huddleston USC Annenberg PhD Keck Medicine of USC
## 9 28 33
## Kyle Rapp Sangay Mishra Political Data
## 24 23 10
## ISA USC Graduate School Mark Paradis
## 24 33 22
## Evgeniia Iakhnis CNN USC CIS
## 24 24 24
## Quinnipiac Poll Erin Baggott Carter The New York Times
## 6 22 24
## Brett Carter NetDem Lab at USC Abby Wood
## 20 15 24
## theWPSA SPEC Lab Stefanie Neumeier
## 24 24 23
## Andy Sinclair Brian Knafou USC Price School
## 24 22 33
## Kelebogile Zvobgo Victoria Chonn Ching PSRM journal
## 24 24 20
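The coreness values are easier to interpret on a graph small enough to check by hand. The sketch below builds a hypothetical toy graph (not the POIR data): a clique of four nodes, one node tied to two clique members, and one pendant node.

```r
library(igraph)

# Hypothetical toy graph: A, B, C, D form a clique; E is tied to A and B; F hangs off E
toy <- graph_from_literal(A - B, A - C, A - D, B - C, B - D, C - D,
                          E - A, E - B, F - E)

k <- coreness(toy)
k

# the innermost core is the set of nodes with the highest coreness
core_toy <- induced_subgraph(toy, which(k == max(k)))
V(core_toy)$name
```

Here the clique members have coreness 3, E has coreness 2 (it has three ties, but only two remain once F is peeled away), and F has coreness 1. On a directed graph such as g, coreness() counts incoming and outgoing ties together by default; the mode argument restricts it to one direction.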
which(coreness(g)==33) # what is the core of the network?
## USC Gould Law USC Dornsife USC Sociology
## 3 9 10
## USC Religious Life USC Research USC Marshall
## 18 21 22
## Daily Trojan USC Annenberg CCLP USC Rossier
## 30 31 34
## USC EASC USC Norman Lear Center
## 35 37 39
## USC Libraries CSII USC USC Shoah Foundation
## 42 44 49
## USC Visions & Voices USC Annenberg USC Viterbi School
## 55 56 58
## USC Dornsife CFR PERE USC USC Unruh Institute
## 69 78 79
## USCKSI USC Social Work USC Bedrosian Center
## 84 93 98
## USC CRCC USC Public Diplomacy Keck Medicine of USC
## 103 107 111
## USC Graduate School USC Price School
## 116 132
which(coreness(g)==1) # what is the periphery of the network?
## bryn rosenfeld
## 97
# looking at what predicts being in the core
nodes$k <- coreness(g)
# number of followers?
plot(nodes$k, log(nodes$followers_count))
cor(nodes$k, log(nodes$followers_count))
## [1] 0.09102953
# text?
poir <- dfm(corpus(nodes[,c("description", "k")], text_field="description"))
head(textstat_keyness(poir, target=docvars(poir)$k==33,
measure="lr"), n=20)
## G2 p n_target n_reference
## school 7.782301 0.005276054 8 4
## & 7.521373 0.006097118 16 19
## usc 5.703835 0.016927889 14 18
## center 5.425831 0.019840994 6 3
## shaping 4.613143 0.031727836 3 0
## for 4.194837 0.040547244 15 23
## a 3.775917 0.051995401 8 8
## education 3.668911 0.055436377 5 3
## the 2.835163 0.092221448 36 79
## you 2.584930 0.107885152 3 1
## change 2.584930 0.107885152 3 1
## journalism 2.584930 0.107885152 3 1
## to 2.413369 0.120303230 9 14
## dornsife 2.251393 0.133494144 2 0
## academic 2.251393 0.133494144 2 0
## humanities 2.251393 0.133494144 2 0
## sciences 2.251393 0.133494144 2 0
## with 2.251393 0.133494144 2 0
## fostering 2.251393 0.133494144 2 0
## instagram 2.251393 0.133494144 2 0
head(textstat_keyness(poir, target=docvars(poir)$k<5,
measure="lr"), n=20)
## G2 p n_target n_reference
## science 3.7072849954 0.05417545 2 31
## political 3.2598286147 0.07099655 2 37
## assistant 1.7796165999 0.18219641 1 5
## professor 1.5037237810 0.22009929 1 7
## student 1.2105717370 0.27121892 1 10
## southern 0.9397763156 0.33233537 1 14
## phd 0.9397763156 0.33233537 1 14
## of 0.8531906330 0.35565129 2 110
## california 0.8354276367 0.36070775 1 16
## university 0.7049193744 0.40113563 1 19
## at 0.4324209091 0.51080343 1 28
## usc 0.3675731776 0.54433008 1 31
## , 0.0237610379 0.87749449 1 146
## and -0.0001319675 0.99083434 0 87
## british -0.0117164030 0.91380346 0 1
## top-20 -0.0117164030 0.91380346 0 1
## gould -0.0117164030 0.91380346 0 1
## world-class -0.0117164030 0.91380346 0 1
## unparalleled -0.0117164030 0.91380346 0 1
## opportunities -0.0117164030 0.91380346 0 1
If you want to learn more about this technique, we recently published a paper in PLOS ONE where we use it to study large-scale Twitter networks in the context of protest events.
# code used to collect the data and build the network
library(netdemR)
options(stringsAsFactors=FALSE)
oauth_folder = "~/Dropbox/credentials/twitter"
accounts <- getFriends("uscpoir", oauth_folder=oauth_folder)
# creating folder (if it does not exist)
dir.create("friends", showWarnings=FALSE)
# first check if there's any list of friends already downloaded to the 'friends' folder
accounts.done <- gsub(".rdata", "", list.files("friends"), fixed=TRUE)
accounts.left <- accounts[accounts %in% accounts.done == FALSE]
accounts.left <- accounts.left[!is.na(accounts.left)]
# loop over the rest of accounts, downloading friend lists from API
while (length(accounts.left) > 0){
  # sample randomly one account to get friends
  new.user <- sample(accounts.left, 1)
  #new.user <- accounts.left[1]
  cat(new.user, "---", length(accounts.left), " accounts left!\n")
  # download friends (with some exception handling...)
  error <- tryCatch(friends <- getFriends(user_id=new.user,
      oauth_folder=oauth_folder, sleep=0.5, verbose=FALSE), error=function(e) e)
  if (inherits(error, 'error')) {
    cat("Error! On to the next one...\n")
    accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
    next
  }
  # save to file and remove from list of "accounts.left"
  file.name <- paste0("friends/", new.user, ".rdata")
  save(friends, file=file.name)
  accounts.left <- accounts.left[-which(accounts.left %in% new.user)]
}
# keeping only those for which we have the name
accounts <- gsub(".rdata", "", list.files("friends"))
# reading and creating network
edges <- list()
for (i in seq_along(accounts)){
  file.name <- paste0("friends/", accounts[i], ".rdata")
  load(file.name)
  if (length(friends)==0){ next }
  # keep only friends that are themselves in the ego network
  chosen <- accounts[accounts %in% friends]
  if (length(chosen)==0){ next }
  edges[[i]] <- data.frame(
    source = accounts[i], target = chosen)
}
edges <- do.call(rbind, edges)
nodes <- data.frame(id_str=unique(c(edges$source, edges$target)))
# adding user data
users <- getUsersBatch(ids=nodes$id_str, oauth_folder=oauth_folder)
nodes <- merge(nodes, users)
library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
g
names(nodes)[1:2] <- c("Id", "Label")
names(edges)[1:2] <- c("Source", "Target")
write.csv(nodes, file="../data/poir-nodes.csv", row.names=FALSE)
write.csv(edges, file="../data/poir-edges.csv", row.names=FALSE)