Social network analysis with R: Descriptive analysis

Measuring node importance

What are the most important nodes in a network? What is the propensity of two nodes that are connected to be both connected to a third node? What are the different hidden communities in a network? These are some of the descriptive questions that we will adress now, following with the example of the Star Wars network.

Node properties

We’ll start with descriptive statistics at the node level. All of these are in some way measures of importance or centrality.

The most basic measure is degree, the number of adjacent edges to each node. It is often considered a measure of direct influence. In the Star Wars network, it will be the unique number of characters that each character is interacting with.

sort(degree(g))

##   GOLD FIVE      GREEDO       JABBA       CAMIE     RED TEN        OWEN 
##           0           1           1           2           2           3 
##       MOTTI      TARKIN        BERU DARTH VADER     DODONNA GOLD LEADER 
##           3           3           4           5           5           5 
##       WEDGE       R2-D2       BIGGS     OBI-WAN  RED LEADER   CHEWBACCA 
##           5           7           7           7           7           8 
##         HAN       C-3PO        LEIA        LUKE 
##           8          10          12          15

In directed graphs, there are three types of degree: indegree (incoming edges), outdegree (outgoing edges), and total degree. You can find these using mode="in" or mode="out" or mode="total".

Strength is a weighted measure of degree that takes into account the number of edges that go from one node to another. In this network, it will be the total number of interactions of each character with anybody else.

sort(strength(g))

##   GOLD FIVE      GREEDO       JABBA     RED TEN       CAMIE       MOTTI 
##           0           1           1           2           4           4 
##     DODONNA GOLD LEADER        OWEN        BERU       WEDGE      TARKIN 
##           5           5           8           9           9          10 
## DARTH VADER  RED LEADER       BIGGS     OBI-WAN       R2-D2        LEIA 
##          11          13          14          49          50          59 
##   CHEWBACCA       C-3PO         HAN        LUKE 
##          63          64          80         129

Closeness measures how many steps are required to access every other node from a given node. It’s a measure of how long information takes to arrive (who hears news first?). Higher values mean less centrality.

sort(closeness(g, normalized=TRUE))

##   GOLD FIVE      GREEDO       JABBA         HAN        OWEN       CAMIE 
##  0.04545455  0.11797753  0.11797753  0.13207547  0.17796610  0.18918919 
##      TARKIN     OBI-WAN       MOTTI       R2-D2        BERU       WEDGE 
##  0.20588235  0.21428571  0.21428571  0.22105263  0.22105263  0.22340426 
##     RED TEN   CHEWBACCA DARTH VADER        LUKE       C-3PO        LEIA 
##  0.22826087  0.23595506  0.23595506  0.24137931  0.24418605  0.25301205 
##     DODONNA GOLD LEADER       BIGGS  RED LEADER 
##  0.25609756  0.25609756  0.26250000  0.26250000

Betweenness measures brokerage or gatekeeping potential. It is (approximately) the number of shortest paths between nodes that pass through a particular node.

sort(betweenness(g))

##       CAMIE        OWEN     OBI-WAN       MOTTI      TARKIN      GREEDO 
##    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000 
##       JABBA       WEDGE   GOLD FIVE        BERU     RED TEN DARTH VADER 
##    0.000000    0.000000    0.000000    1.666667    2.200000   15.583333 
##   CHEWBACCA        LUKE       R2-D2 GOLD LEADER  RED LEADER       BIGGS 
##   15.916667   18.333333   22.750000   23.800000   31.416667   31.916667 
##       C-3PO         HAN     DODONNA        LEIA 
##   32.783333   37.000000   47.533333   59.950000

Eigenvector centrality is a measure of being well-connected connected to the well-connected. First eigenvector of the graph adjacency matrix. Only works with undirected networks.

sort(eigen_centrality(g)$vector)

##       MOTTI      TARKIN      GREEDO       JABBA     RED TEN GOLD LEADER 
## 0.009298135 0.011493177 0.011602599 0.011602606 0.015241788 0.017475028 
## DARTH VADER       CAMIE     DODONNA       WEDGE        OWEN  RED LEADER 
## 0.027009360 0.030744971 0.031723500 0.034374360 0.062695663 0.065141221 
##        BERU       BIGGS   GOLD FIVE       R2-D2     OBI-WAN        LEIA 
## 0.070823980 0.078921393 0.121485778 0.503053909 0.541729364 0.592624836 
##       C-3PO   CHEWBACCA         HAN        LUKE 
## 0.595864451 0.657663369 0.812242359 1.000000000

Page rank approximates probability that any message will arrive to a particular node. This algorithm was developed by Google founders, and originally applied to website links.

sort(page_rank(g)$vector)

##   GOLD FIVE       JABBA      GREEDO     RED TEN       CAMIE     DODONNA 
## 0.007092199 0.008310156 0.008310156 0.010573836 0.013792262 0.016185680 
##       MOTTI GOLD LEADER        OWEN        BERU       WEDGE      TARKIN 
## 0.016813964 0.017945853 0.018881975 0.020368818 0.026377242 0.034180007 
## DARTH VADER  RED LEADER       BIGGS     OBI-WAN       R2-D2        LEIA 
## 0.034576040 0.034578060 0.035070288 0.067378471 0.068538690 0.086027500 
##   CHEWBACCA       C-3PO         HAN        LUKE 
## 0.086390090 0.088708430 0.114631333 0.185268949

Authority score is another measure of centrality initially applied to the Web. A node has high authority when it is linked by many other nodes that are linking many other nodes.

sort(authority_score(g)$vector)

##    GOLD FIVE        MOTTI       TARKIN       GREEDO        JABBA 
## 1.273708e-17 9.118469e-03 1.133319e-02 1.154515e-02 1.154515e-02 
##      RED TEN  GOLD LEADER  DARTH VADER        CAMIE      DODONNA 
## 1.512880e-02 1.717615e-02 2.671707e-02 3.064953e-02 3.143121e-02 
##        WEDGE         OWEN   RED LEADER         BERU        BIGGS 
## 3.410098e-02 6.256707e-02 6.476889e-02 7.063977e-02 7.856101e-02 
##        R2-D2      OBI-WAN         LEIA        C-3PO    CHEWBACCA 
## 5.030995e-01 5.417666e-01 5.923767e-01 5.957835e-01 6.577603e-01 
##          HAN         LUKE 
## 8.125507e-01 1.000000e+00

Finally, not exactly a measure of centrality, but we can learn more about who each node is connected to by using the following functions: neighbors (for direct neighbors) and ego (for neighbors up to n neighbors away)

neighbors(g, v=which(V(g)$name=="DARTH VADER"))

## + 5/22 vertices, named:
## [1] CHEWBACCA LEIA      OBI-WAN   MOTTI     TARKIN

ego(g, order=2, nodes=which(V(g)$name=="DARTH VADER"))

## [[1]]
## + 14/22 vertices, named:
##  [1] DARTH VADER CHEWBACCA   LEIA        OBI-WAN     MOTTI      
##  [6] TARKIN      R2-D2       C-3PO       LUKE        HAN        
## [11] DODONNA     BIGGS       BERU        RED LEADER

Network properties

Let’s now try to describe what a network looks like as a whole. We can start with measures of the size of a network. diameter is the length of the longest path (in number of edges) between two nodes. We can use get_diameter to identify this path. mean_distance is the average number of edges between any two nodes in the network. We can find each of these paths between pairs of edges with distances.

diameter(g, directed=FALSE, weights=NA)

## [1] 3

get_diameter(g, directed=FALSE, weights=NA)

## + 4/22 vertices, named:
## [1] DARTH VADER CHEWBACCA   C-3PO       OWEN

mean_distance(g, directed=FALSE)

## [1] 1.909524

dist <- distances(g, weights=NA)
dist[1:5, 1:5]

##             R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER
## R2-D2           0         1     1    1           2
## CHEWBACCA       1         0     1    1           1
## C-3PO           1         1     0    1           2
## LUKE            1         1     1    0           2
## DARTH VADER     2         1     2    2           0

edge_density is the proportion of edges in the network over all possible edges that could exist.

edge_density(g)

## [1] 0.2597403

# 22*21 possible edges / 2 because it's undirected = 231 possible edges
# but only 60 exist
60/((22*21)/2)

## [1] 0.2597403

reciprocity measures the propensity of each edge to be a mutual edge; that is, the probability that if i is connected to j, j is also connected to i.

reciprocity(g)

## [1] 1

# Why is it 1?

transitivity, also known as clustering coefficient, measures that probability that adjacent nodes of a network are connected. In other words, if i is connected to j, and j is connected to k, what is the probability that i is also connected to k?

transitivity(g)

## [1] 0.5375303

Network communities

Networks often have different clusters or communities of nodes that are more densely connected to each other than to the rest of the network. Let’s cover some of the different existing methods to identify these communities.

The most straightforward way to partition a network is into connected components. Each component is a group of nodes that are connected to each other, but not to the rest of the nodes. For example, this network has two components.

components(g)

## $membership
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           1           1           1           1           1           1 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           1           1           1           1           1           1 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           1           1           1           1           1           1 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           1           1           1           2 
## 
## $csize
## [1] 21  1
## 
## $no
## [1] 2

par(mar=c(0,0,0,0)); plot(g)

Most networks have a single giant connected component that includes most nodes. Most studies of networks actually focus on the giant component (e.g. the shortest path between nodes in a network with two or more component is Inf!).

giant <- decompose(g)[[1]]

Components can be weakly connected (in undirected networks) or __strongly connected (in directed networks, where there is an edge that ends in every single node of that component).

Even within a giant component, there can be different subsets of the network that are more connected to each other than to the rest of the network. The goal of community detection algorithms is to identify these subsets.

There are a few different algorithms, each following a different logic.

The walktrap algorithm finds communities through a series of short random walks. The idea is that these random walks tend to stay within the same community. The length of these random walks is 4 edges by default, but you may want to experiment with different values. The goal of this algorithm is to identify the partition that maximizes a modularity score.

cluster_walktrap(giant)

## IGRAPH clustering walktrap, groups: 6, mod: 0.16
## + groups:
##   $`1`
##   [1] "CAMIE"       "BIGGS"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [6] "RED LEADER"  "RED TEN"    
##   
##   $`2`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
##   
##   $`3`
##   [1] "R2-D2"     "CHEWBACCA" "C-3PO"     "LUKE"      "LEIA"     
##   [6] "OBI-WAN"   "HAN"      
##   + ... omitted several groups/vertices

cluster_walktrap(giant, steps=10)

## IGRAPH clustering walktrap, groups: 3, mod: 0.15
## + groups:
##   $`1`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
##   
##   $`2`
##    [1] "R2-D2"     "CHEWBACCA" "C-3PO"     "LUKE"      "LEIA"     
##    [6] "BERU"      "OWEN"      "OBI-WAN"   "HAN"       "GREEDO"   
##   [11] "JABBA"    
##   
##   $`3`
##   [1] "CAMIE"       "BIGGS"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   + ... omitted several groups/vertices

Other methods are:

The fast and greedy method tries to directly optimize this modularity score.
The infomap method attempts to map the flow of information in a network, and the different clusters in which information may get remain for longer periods. Similar to walktrap, but not necessarily maximizing modularity, but rather the so-called “map equation”.
The edge-betweenness method iteratively removes edges with high betweenness, with the idea that they are likely to connect different parts of the network. Here betweenness (gatekeeping potential) applies to edges, but the intuition is the same.
The label propagation method labels each node with unique labels, and then updates these labels by choosing the label assigned to the majority of their neighbors, and repeat this iteratively until each node has the most common labels among its neighbors.

cluster_fast_greedy(giant)

## IGRAPH clustering fast greedy, groups: 4, mod: 0.17
## + groups:
##   $`1`
##   [1] "CHEWBACCA" "LUKE"      "LEIA"      "OBI-WAN"   "HAN"      
##   [6] "GREEDO"    "JABBA"    
##   
##   $`2`
##   [1] "R2-D2" "C-3PO" "BERU"  "OWEN" 
##   
##   $`3`
##   [1] "CAMIE"       "BIGGS"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [6] "RED LEADER"  "RED TEN"    
##   + ... omitted several groups/vertices

cluster_edge_betweenness(giant)

## IGRAPH clustering edge betweenness, groups: 2, mod: 0.033
## + groups:
##   $`1`
##    [1] "R2-D2"       "CHEWBACCA"   "DARTH VADER" "LEIA"       
##    [5] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"        
##    [9] "GREEDO"      "JABBA"      
##   
##   $`2`
##    [1] "C-3PO"       "LUKE"        "CAMIE"       "BIGGS"      
##    [5] "BERU"        "OWEN"        "DODONNA"     "GOLD LEADER"
##    [9] "WEDGE"       "RED LEADER"  "RED TEN"    
##

cluster_infomap(giant)

## IGRAPH clustering infomap, groups: 2, mod: 0.064
## + groups:
##   $`1`
##    [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"       
##    [5] "CAMIE"       "BIGGS"       "LEIA"        "BERU"       
##    [9] "OWEN"        "OBI-WAN"     "HAN"         "GREEDO"     
##   [13] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [17] "RED LEADER"  "RED TEN"    
##   
##   $`2`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
##

cluster_label_prop(giant)

## IGRAPH clustering label propagation, groups: 2, mod: 0.064
## + groups:
##   $`1`
##    [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"       
##    [5] "CAMIE"       "BIGGS"       "LEIA"        "BERU"       
##    [9] "OWEN"        "OBI-WAN"     "HAN"         "GREEDO"     
##   [13] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"      
##   [17] "RED LEADER"  "RED TEN"    
##   
##   $`2`
##   [1] "DARTH VADER" "MOTTI"       "TARKIN"     
##

My experience is that infomap tends to work better in most social science examples (websites, social media, classrooms, etc), but fastgreedy is faster.

igraph also makes it very easy to plot the resulting communities:

comm <- cluster_infomap(giant)
modularity(comm) # modularity score

## [1] 0.06420569

par(mar=c(0,0,0,0)); plot(comm, giant)

Alternatively, we can also add the membership to different communities as a color parameter in the igraph object.

V(giant)$color <- membership(comm)
par(mar=c(0,0,0,0)); plot(giant)

The final way in which we can think about network communities is in terms of hierarchy or structure. We’ll discuss one of these methods.

K-core decomposition allows us to identify the core and the periphery of the network. A k-core is a maximal subnet of a network such that all nodes have at least degree K.

coreness(g)

##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           6           6           6           6           3           2 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           5           6           3           3           6           3 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           3           6           1           1           5           5 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           5           5           2           0

which(coreness(g)==6) # what is the core of the network?

##     R2-D2 CHEWBACCA     C-3PO      LUKE      LEIA   OBI-WAN       HAN 
##         1         2         3         4         8        11        14

which(coreness(g)==1) # what is the periphery of the network?

## GREEDO  JABBA 
##     15     16

# Visualizing network structure
V(g)$coreness <- coreness(g)
par(mfrow=c(2, 3), mar=c(0.1,0.1,1,0.1))
set.seed(777); fr <- layout_with_fr(g)
for (k in 1:6){
  V(g)$color <- ifelse(V(g)$coreness>=k, "orange", "grey")
  plot(g, main=paste0(k, '-core shell'), layout=fr)
}

If you want to learn more about this technique, we recently published a paper in PLOS ONE where we use it to study large-scale Twitter networks in the context of protest events.

Social network analysis with R: Descriptive analysis

Pablo Barbera

June 27, 2017

Measuring node importance

Node properties

Network properties

Network communities