Network visualization (in R) with “netplot” and motif counting (in C++) with “barry”

SCI Seminar

George G. Vega Yon, Ph.D.

Division of Epidemiology

University of Utah

2023-04-07

Whoami

  • Research Assistant Professor of Epidemiology.
  • Ph.D. in Biostatistics from USC and M.Sc. in Economics from Caltech.
  • Methodologist working at the intersection between Statistical Computing and Complex Systems Modeling

    (recently: social network analysis, evolution of gene functions, ABM, and mechanistic machine learning)

  • About 15 years of working on networks, software dev, and data science.

You can download the slides from
ggv.cl/slides/sci2023

Network visualization with netplot

netplot In a nutshell

  • What: An R package for network visualization inspired by Gephi.

  • Why: Opinionated way to visualize graphs.1

  • Where: You can get the dev version on GitHub (USCCANA/netplot) or the stable version on CRAN.

netplot In a nutshell (cont.)

Current parameters in nplot (main function)

nplot(
  edgelist,
  layout,
  vertex.size             = 1,
  vertex.nsides           = 10,
  vertex.color            = grDevices::hcl.colors(1),
  vertex.size.range       = c(0.01, 0.03),
  vertex.frame.color      = ...,
  vertex.rot              = 0,
  vertex.frame.prop       = 0.2,
  vertex.label            = NULL,
  vertex.label.fontsize   = NULL,
  vertex.label.color      = "black",
  vertex.label.fontfamily = "HersheySans",
  vertex.label.fontface   = "bold",
  vertex.label.show       = 0.3,
  vertex.label.range      = c(5, 15)
)

netplot In a nutshell (cont.)

Current parameters in nplot (main function)

nplot(
  edgelist,
  layout,
  vertex.size             = 1,
  vertex.nsides           = 10,
  vertex.color            = grDevices::hcl.colors(1),
  vertex.size.range       = c(0.01, 0.03),
  vertex.frame.color      = ...,
  vertex.rot              = 0,
  vertex.frame.prop       = 0.2,
  vertex.label            = NULL,
  vertex.label.fontsize   = NULL,
  vertex.label.color      = "black",
  vertex.label.fontfamily = "HersheySans",
  vertex.label.fontface   = "bold",
  vertex.label.show       = 0.3,
  vertex.label.range      = c(5, 15),
  edge.width              = 1,
  edge.width.range        = c(1, 2),
  edge.arrow.size         = NULL,
  edge.color              = ~ego(alpha = 0.25, col = "gray") + alter,
  edge.curvature          = pi/3,
  edge.line.lty           = "solid",
  edge.line.breaks        = 5,
  sample.edges            = 1
)

netplot In a nutshell (cont.)

Current parameters in nplot (main function)

nplot(
  edgelist,
  layout,
  vertex.size             = 1,
  vertex.nsides           = 10,
  vertex.color            = grDevices::hcl.colors(1),
  vertex.size.range       = c(0.01, 0.03),
  vertex.frame.color      = ...,
  vertex.rot              = 0,
  vertex.frame.prop       = 0.2,
  vertex.label            = NULL,
  vertex.label.fontsize   = NULL,
  vertex.label.color      = "black",
  vertex.label.fontfamily = "HersheySans",
  vertex.label.fontface   = "bold",
  vertex.label.show       = 0.3,
  vertex.label.range      = c(5, 15),
  edge.width              = 1,
  edge.width.range        = c(1, 2),
  edge.arrow.size         = NULL,
  edge.color              = ~ego(alpha = 0.25, col = "gray") + alter,
  edge.curvature          = pi/3,
  edge.line.lty           = "solid",
  edge.line.breaks        = 5,
  sample.edges            = 1,
  bg.col                  = "transparent",
  skip.vertex             = FALSE,
  skip.edges              = FALSE,
  skip.arrows             = skip.edges,
  add                     = FALSE,
  zero.margins            = TRUE,
  ...
)

Main features

  • Visualization engine: The grid system1 based on “graphical objects” (grobs).
  • Highly customizable and barebones: “If it is available in grid, then it is in netplot”.
  • Default behavior maximizes plotting area.
  • Vertices and edges are scaled wrt to the plotting area.
  • Edge-color mixer can improve visuals.

Main features (cont.)

Example code+output

The personal friendship network of a faculty of a UK university, consisting of 81 vertices (individuals) and 817 directed and weighted connections. The school affiliation of each individual is stored as a vertex attribute. This dataset can serve as a testbed for community detection algorithms. (source: R package igraphdata)

How?

We start by loading data + colors…

library(netplot)
library(igraph)
data("UKfaculty", package="igraphdata")

# Vertex colors f(group)
vcols <- V(UKfaculty)$Group 
vcols <- palette.colors(
  n       = length(unique(vcols)),
  palette = "Alphabet"
)[vcols]

How?

We start by loading data + colors… and then plot!

library(netplot)
library(igraph)
data("UKfaculty", package="igraphdata")

# Vertex colors f(group)
vcols <- V(UKfaculty)$Group 
vcols <- palette.colors(
  n       = length(unique(vcols)),
  palette = "Alphabet"
)[vcols]

# Netplot call
set.seed(323)
nplot(
  UKfaculty,
  vertex.color = vcols
  )

Not a simple drawing

  • Graphical objects (Grobs) are the building blocks in grid

  • Benefit: Plots are not mere figures, are modifiable objects.

  • netplot objects are a collection of grobs:

List of 11
 $ .xlim        : num [1:2] -1 1
 $ .ylim        : num [1:2] -0.5 0.5
 $ .layout      : num [1:81, 1:2] 0.6661 0.0201 0.7327 0.5399 -0.4903 ...
 $ .edgelist    : num [1:817, 1:2] 57 76 12 43 28 58 7 40 5 48 ...
 $ .N           : int 81
 $ .M           : int 817
 $ name         : chr "graph.2"
 $ gp           : NULL
 $ vp           : NULL
 $ children     :List of 2
  ..$ background:List of 10
  .. ..- attr(*, "class")= chr [1:3] "rect" "grob" "gDesc"
  ..$ graph     :List of 5
  .. ..- attr(*, "class")= chr [1:3] "gTree" "grob" "gDesc"
  ..- attr(*, "class")= chr "gList"
 $ childrenOrder: chr [1:2] "background" "graph"
 - attr(*, "class")= chr [1:4] "netplot" "gTree" "grob" "gDesc"

Example 2: Playing with patterns

netplot supports advanced patterns. The figures feature radial gradients (vertices), lineal gradients, and repeated patterns (background).

Alternatives

Comparison of default igraph, sna, ggraph, and netplot default call. nplot fills completely the plotting area, and adjusts vertex size, edge width, and edge arrows’ size accordingly to the plotting area and plotting device.

Alternatives (patterns)

In the case of ggplot2 (and thus, ggraph)

  • Patterns in R’s grid system are not directly available.1

While ggplot2 uses grid underneath it’s grammar API, these features are generally not directly available in ggplot2.
– Thomas Lin Pedersen, author of ggraph (source: tidyverse.org)

  • But the gggrid package does:

The ‘ggplot2’ package does not yet have an interface for pattern fills, but the ‘gggrid’ package (Murrell, 2022) allows us to combine raw ‘grid’ output with the ‘ggplot2’ plot.
– Paul Murrel, author of grid (source: Vectorised Pattern Fills in R Graphics)

Counting motifs with barry

barry in a nutshell

Your go-to motif accountant1

  • What: A C++ header-only template library for motif counting (and more.)

  • Why: Implement Discrete Exponential Family Models DEFMs for phylogenetics and social networks analysis.

  • Where: You can get it on GitHub (USCBiostats/barry)

Main features

About 11 K lines of C++ code built for statistical modeling:

  • Header-only template library (i.e., single file, multiple types.)
  • Building-blocks are the arrays (dense or sparse) and change statistics.
  • Motif count using change statistics (we will return to that.)
  • Full and constrained enumeration of 0/1 arrays.
  • Probability function for Discrete Exponential-Family Models [DEFMs].
  • Memory and computationally efficient for pooled models.

DEFMs

  • Basic idea: Entries in an array are not IID
  • Highly-developed through ERGMs.
  • Not about individual entries, but about patterns.
  • Help reducing computational complexity and improving interpretation (# of parameters)

DEFMs as ERGMs

A key component for fitting these models are change statistics.

Features: Change statistics

Let’s look into the change statistics edgecount, triangles, and gender-homophily when we remove tie 33-69.

Features: Change statistics (cont.)

s() y- y+ change
Edgecount 816 817 1
Triangles 5366 5399 33
Group-homophily 664 665 1

Features: Memory and Comp. Efficiency

  • A fundamental feature of pooled models (multiple graphs/arrays).
  • A single model may feature thousands of networks.
  • But if all have the same number of nodes (and other features)… we only need to enumerate once.

Building DEFMs

Discrete Exponential-Family Models

  • Step 1: add counters + constraints1
flowchart LR
  start[Create the<br>Model] --> count[Add<br>Counters]
  count --> count_done{Done?}
  count_done --> |Yes| const[Add\nconstraints]
  count_done --> |No| count
  const --> const_done{Done?}
  const_done --> |Yes| init((End))
  const_done --> |No| const
  • Step 2: add the arrays
flowchart LR
  start[Hash the\narray] --> exists{Present?}
  exists --> |Yes| ptr[[Link\nPointer]]
  exists --> |No| addarray[[Compute\nsupport]]
  addarray --> save[[Save\nsupport]]
  ptr --> End((End))
  save --> End

Computing support

  • Computationally intensive.
  • Biggest model calculated featured a matrix with 30 entries, i.e. \(2^{30} = 1,073,741,824\) (just over a billion) and took about 5 minutes.
  • For a single array, this is the procedure:
flowchart LR
  start[[Static<br>constraints]] --> emptymat[Initialize<br>array]
  emptymat --> edges[Set edge<br>sequence]
  edges --> startc((Start<br>counting))
flowchart LR
  nextes((Start)) --> anyleft{Any edge<br>left}
  anyleft --> |Yes| adde[Add edge]
  anyleft --> |No| ende((End))
  adde --> change[[Run<br>counters]]
  change --> indyn{Within dyn<br>constraint?}
  indyn --> |Yes| save[Include] -->anyleft
  indyn --> |No| anyleft

Current implemented models

In principle, any dataset we can represent as 0/1 arrays can be modeled with barry. We have applications in social networks, phylogenetics, and health outcomes.

Current implemented models: DEFMs

  • DEFMs for multiple correlated outcomes (Markov Random Fields; on development with Drs. MJ Pugh and Tom Valente.)

Current implemented models: Social networks

Current implemented models: Phylogenetics

  • Modeling the evolution of gene functions in terms of transition between functional states (research grant submitted to National Human Genome Research Institute NHGRI).

Other projects

fmcmc | ergmito | aphylo | netdiffuseR | ABCoptim
slurmR | barry | rgexf | geese

Other projects (cont.)

Using rgexf (my most popular R package w/ 600K downloads)

Networks co-session network at INSNA’s Sunbelt 2022: Nodes are colored according to their roles: speaker, session chair, and session organizer.–(source: gvegayon/gallery)

Thanks!

gvegayon
@gvegayon
ggvy.cl

Presentation created with quarto::revealjs

A1. Change statistics formals

  • The change statistic is defined as a real-valued vector where the \(k\)-th entry equals the observed change when the \(ij\)-th tie is removed from the network. Formally:

    \[ \delta(y_{ij}: 0\to 1) = s(\mathbf{y})_{ij}^+ - s(\mathbf{y})_{ij}^- \]

    Where \(s(\cdot)\) is a function returning graph \(\mathbf{y}\)’s observed statistics, and \(s(\mathbf{y})_{ij}^+\) represents its value when \(y_{ij} = 1\).

A1. Change statistics formals (cont.)

Furthermore, conditioning on the rest of the network (or array,) a tie transition probability equates a logit:

\[\begin{equation} \mbox{logit}\left({\mathbb{P}\left(y_{ij} = 1|y_{-ij}\right) }\right) = {\theta}^\mathbf{t}\Delta\delta\left(y_{ij}:0\to 1\right), \end{equation}\]

\[\begin{equation} {\mathbb{P}\left(y_{ij} = 1|y_{-ij}\right) } = \frac{1}{1 + \mbox{exp}\left\{-{\theta}^\mathbf{t}\Delta\delta\left(y_{ij}:0\to 1\right)\right\}} \end{equation}\]