Importing network data into R

In this training session we will use a small network of character interactions in the movie Star Wars Episode IV. Each node is a character, and an edge connects two characters who appeared together in at least one scene. Edges are therefore undirected, and each one carries a weight: the number of scenes the two characters share.

The first step is to read the list of edges and nodes in this network:

edges <- read.csv("../data/star-wars-network-edges.csv")
head(edges)
##      source target weight
## 1     C-3PO  R2-D2     17
## 2      LUKE  R2-D2     13
## 3   OBI-WAN  R2-D2      6
## 4      LEIA  R2-D2      5
## 5       HAN  R2-D2      5
## 6 CHEWBACCA  R2-D2      3
nodes <- read.csv("../data/star-wars-network-nodes.csv")
head(nodes)
##          name id
## 1       R2-D2  0
## 2   CHEWBACCA  1
## 3       C-3PO  2
## 4        LUKE  3
## 5 DARTH VADER  4
## 6       CAMIE  5

For example, we learn that C-3PO and R2-D2 appeared in 17 scenes together.
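
Before building a graph it can be worth checking that the two tables agree, because graph_from_data_frame stops with an error if the edge list mentions a name that is missing from the node table. The following is only a minimal sketch: it uses a hand-typed subset of the rows printed above rather than the CSV files, and the names edges_toy, nodes_toy and missing_nodes are made up for illustration.

```r
# Toy subset of the rows printed above, typed in by hand (not read from the CSVs)
edges_toy <- data.frame(
  source = c("C-3PO", "LUKE", "OBI-WAN"),
  target = c("R2-D2", "R2-D2", "R2-D2"),
  weight = c(17, 13, 6),
  stringsAsFactors = FALSE
)
nodes_toy <- data.frame(
  name = c("R2-D2", "CHEWBACCA", "C-3PO", "LUKE"),
  id   = 0:3,
  stringsAsFactors = FALSE
)

# Every endpoint in the edge list should also appear in the node list
endpoints <- unique(c(edges_toy$source, edges_toy$target))
missing_nodes <- setdiff(endpoints, nodes_toy$name)
missing_nodes  # OBI-WAN was deliberately left out of nodes_toy
```

With the full files, the same check on edges and nodes should return an empty character vector.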

How do we convert these two datasets into a network object in R? There are multiple packages for working with networks, but the most popular is igraph because it is flexible and easy to use, and in my experience it is faster than the alternatives and scales well to very large networks. Other packages that you may want to explore are sna and network.

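Now, how do we create the igraph object? We can use the graph_from_data_frame function. Its two main arguments are d, a data frame with the edge list in the first two columns (any further columns, such as weight here, become edge attributes), and vertices, a data frame of node data with the node label in the first column. A third argument, directed, says whether the edges are directed; we set it to FALSE because appearing together in a scene has no direction. (Note that igraph calls the nodes vertices, but it’s exactly the same thing.)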

library(igraph)
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
g
## IGRAPH UNW- 22 60 -- 
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges (vertex names):
##  [1] R2-D2      --C-3PO       R2-D2      --LUKE       
##  [3] R2-D2      --OBI-WAN     R2-D2      --LEIA       
##  [5] R2-D2      --HAN         R2-D2      --CHEWBACCA  
##  [7] R2-D2      --DODONNA     CHEWBACCA  --OBI-WAN    
##  [9] CHEWBACCA  --C-3PO       CHEWBACCA  --LUKE       
## [11] CHEWBACCA  --HAN         CHEWBACCA  --LEIA       
## [13] CHEWBACCA  --DARTH VADER CHEWBACCA  --DODONNA    
## [15] LUKE       --CAMIE       CAMIE      --BIGGS      
## + ... omitted several edges

What does it mean?

- U means undirected
- N means named graph
- W means weighted graph
- 22 is the number of nodes
- 60 is the number of edges
- name (v/c) means name is a node attribute and it’s a character
- weight (e/n) means weight is an edge attribute and it’s numeric
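
The same facts can also be queried from code instead of being read off the printed summary. Here is a small sketch on a toy graph built from three of the edges shown earlier; g_toy and edges_toy are illustrative names, and the toy graph is not the full 22-node network.

```r
library(igraph)

# Toy graph, hand-typed from the first rows printed above (not the full data)
edges_toy <- data.frame(
  source = c("C-3PO", "LUKE", "OBI-WAN"),
  target = c("R2-D2", "R2-D2", "R2-D2"),
  weight = c(17, 13, 6),
  stringsAsFactors = FALSE
)
# Without a vertices argument, the node names are taken from the edge list
g_toy <- graph_from_data_frame(d = edges_toy, directed = FALSE)

is_directed(g_toy)  # FALSE, which corresponds to the U flag
is_named(g_toy)     # TRUE, which corresponds to the N flag
is_weighted(g_toy)  # TRUE, which corresponds to the W flag
vcount(g_toy)       # number of nodes: 4 in this toy graph
ecount(g_toy)       # number of edges: 3 in this toy graph
```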

This is how you access specific elements within the igraph object:

V(g) # nodes
## + 22/22 vertices, named:
##  [1] R2-D2       CHEWBACCA   C-3PO       LUKE        DARTH VADER
##  [6] CAMIE       BIGGS       LEIA        BERU        OWEN       
## [11] OBI-WAN     MOTTI       TARKIN      HAN         GREEDO     
## [16] JABBA       DODONNA     GOLD LEADER WEDGE       RED LEADER 
## [21] RED TEN     GOLD FIVE
V(g)$name # names of each node
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] "CAMIE"       "BIGGS"       "LEIA"        "BERU"        "OWEN"       
## [11] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"         "GREEDO"     
## [16] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
## [21] "RED TEN"     "GOLD FIVE"
vertex_attr(g) # all attributes of the nodes
## $name
##  [1] "R2-D2"       "CHEWBACCA"   "C-3PO"       "LUKE"        "DARTH VADER"
##  [6] "CAMIE"       "BIGGS"       "LEIA"        "BERU"        "OWEN"       
## [11] "OBI-WAN"     "MOTTI"       "TARKIN"      "HAN"         "GREEDO"     
## [16] "JABBA"       "DODONNA"     "GOLD LEADER" "WEDGE"       "RED LEADER" 
## [21] "RED TEN"     "GOLD FIVE"  
## 
## $id
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
E(g) # edges
## + 60/60 edges (vertex names):
##  [1] R2-D2      --C-3PO       R2-D2      --LUKE       
##  [3] R2-D2      --OBI-WAN     R2-D2      --LEIA       
##  [5] R2-D2      --HAN         R2-D2      --CHEWBACCA  
##  [7] R2-D2      --DODONNA     CHEWBACCA  --OBI-WAN    
##  [9] CHEWBACCA  --C-3PO       CHEWBACCA  --LUKE       
## [11] CHEWBACCA  --HAN         CHEWBACCA  --LEIA       
## [13] CHEWBACCA  --DARTH VADER CHEWBACCA  --DODONNA    
## [15] LUKE       --CAMIE       CAMIE      --BIGGS      
## [17] LUKE       --BIGGS       DARTH VADER--LEIA       
## [19] LUKE       --BERU        BERU       --OWEN       
## + ... omitted several edges
E(g)$weight # weights for each edge
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18
## [24]  2  6 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1
## [47]  1  2  1  1  3  3  1  1  3  1  2  1  1  1
edge_attr(g) # all attributes of the edges
## $weight
##  [1] 17 13  6  5  5  3  1  7  5 16 19 11  1  1  2  2  4  1  3  3  2  3 18
## [24]  2  6 17  1 19  6  1  2  1  7  9 26  1  1  6  1  1 13  1  1  1  1  1
## [47]  1  2  1  1  3  3  1  1  3  1  2  1  1  1
g[] # adjacency matrix
## 22 x 22 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 22 column names 'R2-D2', 'CHEWBACCA', 'C-3PO' ... ]]
##                                                               
## R2-D2        .  3 17 13 . . .  5 . .  6 . .  5 . . 1 . . . . .
## CHEWBACCA    3  .  5 16 1 . . 11 . .  7 . . 19 . . 1 . . . . .
## C-3PO       17  5  . 18 . . 1  6 2 2  6 . .  6 . . . . . 1 . .
## LUKE        13 16 18  . . 2 4 17 3 3 19 . . 26 . . 1 1 2 3 1 .
## DARTH VADER  .  1  .  . . . .  1 . .  1 1 7  . . . . . . . . .
## CAMIE        .  .  .  2 . . 2  . . .  . . .  . . . . . . . . .
## BIGGS        .  .  1  4 . 2 .  1 . .  . . .  . . . . 1 2 3 . .
## LEIA         5 11  6 17 1 . 1  . 1 .  1 1 1 13 . . . . . 1 . .
## BERU         .  .  2  3 . . .  1 . 3  . . .  . . . . . . . . .
## OWEN         .  .  2  3 . . .  . 3 .  . . .  . . . . . . . . .
## OBI-WAN      6  7  6 19 1 . .  1 . .  . . .  9 . . . . . . . .
## MOTTI        .  .  .  . 1 . .  1 . .  . . 2  . . . . . . . . .
## TARKIN       .  .  .  . 7 . .  1 . .  . 2 .  . . . . . . . . .
## HAN          5 19  6 26 . . . 13 . .  9 . .  . 1 1 . . . . . .
## GREEDO       .  .  .  . . . .  . . .  . . .  1 . . . . . . . .
## JABBA        .  .  .  . . . .  . . .  . . .  1 . . . . . . . .
## DODONNA      1  1  .  1 . . .  . . .  . . .  . . . . 1 1 . . .
## GOLD LEADER  .  .  .  1 . . 1  . . .  . . .  . . . 1 . 1 1 . .
## WEDGE        .  .  .  2 . . 2  . . .  . . .  . . . 1 1 . 3 . .
## RED LEADER   .  .  1  3 . . 3  1 . .  . . .  . . . . 1 3 . 1 .
## RED TEN      .  .  .  1 . . .  . . .  . . .  . . . . . . 1 . .
## GOLD FIVE    .  .  .  . . . .  . . .  . . .  . . . . . . . . .
g[1,] # first row of adjacency matrix
##       R2-D2   CHEWBACCA       C-3PO        LUKE DARTH VADER       CAMIE 
##           0           3          17          13           0           0 
##       BIGGS        LEIA        BERU        OWEN     OBI-WAN       MOTTI 
##           0           5           0           0           6           0 
##      TARKIN         HAN      GREEDO       JABBA     DODONNA GOLD LEADER 
##           0           5           0           0           1           0 
##       WEDGE  RED LEADER     RED TEN   GOLD FIVE 
##           0           0           0           0
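
Because the graph is named, elements can be looked up by character name as well as by position. The sketch below shows a few such lookups on a three-edge toy graph; g_toy, adj and heavy are illustrative names, and the threshold of 10 is an arbitrary choice.

```r
library(igraph)

# Toy graph, hand-typed from the first rows printed above (not the full data)
edges_toy <- data.frame(
  source = c("C-3PO", "LUKE", "OBI-WAN"),
  target = c("R2-D2", "R2-D2", "R2-D2"),
  weight = c(17, 13, 6),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(d = edges_toy, directed = FALSE)

# Dense weighted adjacency matrix, indexed by node name
adj <- as_adjacency_matrix(g_toy, attr = "weight", sparse = FALSE)
adj["C-3PO", "R2-D2"]           # weight of the C-3PO / R2-D2 edge

# Names of the nodes directly connected to R2-D2
neighbors(g_toy, "R2-D2")$name

# Edges whose weight exceeds 10
heavy <- E(g_toy)[weight > 10]
heavy
```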

Network visualization

How can we visualize this network? The plot() function works out of the box, but the default options are often not ideal:

par(mar=c(0,0,0,0))
plot(g)

Let’s see how we can improve this figure. To see all the available plotting options, you can check ?igraph.plotting. Let’s start by changing a few of them.

par(mar=c(0,0,0,0))
plot(g,
     vertex.color = "grey", # change color of nodes
     vertex.label.color = "black", # change color of labels
     vertex.label.cex = .75, # change size of labels to 75% of original size
     edge.curved = .25, # curve the edges slightly (0 would be straight)
     edge.color = "grey20") # change edge color to dark grey
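
Since the edges carry weights, one further option from ?igraph.plotting that may be worth trying is edge.width, which accepts one value per edge. The sketch below draws a three-edge toy graph with line widths derived from the weights. The rescaling to a maximum width of 5 is an arbitrary choice, and g_toy and edge_w are illustrative names.

```r
library(igraph)

# Toy graph, hand-typed from the first rows printed above (not the full data)
edges_toy <- data.frame(
  source = c("C-3PO", "LUKE", "OBI-WAN"),
  target = c("R2-D2", "R2-D2", "R2-D2"),
  weight = c(17, 13, 6),
  stringsAsFactors = FALSE
)
g_toy <- graph_from_data_frame(d = edges_toy, directed = FALSE)

# Rescale the weights so the heaviest edge gets a line width of 5
w <- E(g_toy)$weight
edge_w <- 1 + 4 * w / max(w)

par(mar = c(0, 0, 0, 0))
plot(g_toy,
     vertex.color = "grey",
     vertex.label.color = "black",
     vertex.label.cex = .75,
     edge.curved = .25,
     edge.color = "grey20",
     edge.width = edge_w) # thicker line = more shared scenes
```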