The WebGraph Framework I: Compression Techniques

Paolo Boldi, Sebastiano Vigna (2004):

WebGraph Framework:

Features of the links of a web graph:

The compression format

The final WebGraph compression format uses a mix of methods above.

Random access

Reference chains: to access random contents of the graph, we need to decompress all the information from the compression format and follow reference links. This might slow down the access process (worst-case: decompress all the info) so we need to put a limit on the lengths of the reference chains.

We also need to store an offset array of size $N$ if we want to access the graph in random order instead of sequentially.

Implementation details

Lazy iterator: the data is filtered as you request it (it doesn't go through all the list immediately).

Layered Label Propagation

Paolo Boldi, Marco Rosa, Massimo Santini (2011):

Most compression algorithms exploit properties such as similarity, locality, etc. hence the way nodes are ordered influence a lot the result.

Problem definition

Goal: find a node numbering $\pi$ minimising $P_A(G, \pi)$.

Such optimization problem is NP-Hard, meaning we can only expect heuristics.

Empirical insight: it is important for high compression ratio to use an ordering of vertex IDs such that the vertices from the same host are close to one another.

Note: coordinate-free algorithm means it achieves almost the same compression performance starting from any initial ordering.

Label Propagation Algorithms (LPA)

The algorithm executes in rounds, every node has a label corresponding to the cluster it belongs to (at the beginning every node has a different label).

At each round, each node updates its label according to some rule. The algorithm terminates as soon as no more updates.

Layered Label Propagation (LLP)

LLP is built on APM (Absolute Potts Model) which is built on LPA => LLP iteratively executes APM with various $\gamma$ values.

Implementation detail: the node order during rounds is random, hence there is no obstacle in updating several nodes in parallel.

Note: most of the time is spent on sampling values of $\gamma$ to produce base clusterings => you can reduce the numbers of $\gamma$ values to speed up the process.


Compressing Graphs and Indexes with Recursive Graph Bisection

Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, Alon Shalita (2016):

Previous work on compression techniques:

Inverted index (also known as postings file): database index storing a mapping from content (eg: word, content) to its location (eg: table, document). Allows fast full-text searches (use case: search engine indexing algorithm).

Problem definition

Graph reordering is a combinational optimization problem with a goal to find a layout (= order = numbering = arrangement) of the input graph to optimize a cost function: \(\pi\colon\mathbb V\to\mathbb {1 \ldots n}\)

Empiric insight: it is desirable that similar vertices are close in $\pi$.

Minimum linear arrangement (MLA): minimize \(\displaystyle\sum_{(u,v)\in E} |\pi(u) - \pi(v)|\)

Minimum logarithmic arrangement (MLogA): minimize \(\displaystyle\sum_{(u,v)\in E} \log |\pi(u) - \pi(v)|\)

In practice, adjacency list is stored using an encoding, so the gaps induced by consecutive neighbors are important for compression hence the MLogGapA problem.

Minimum logarithmic gap arrangement (MLogGapA): minimize \(\displaystyle\sum_{u\in V} f_{\pi}(u, out(u))\) where \(f_{\pi}(u, out(u)) = \displaystyle\sum_{i=1}^{k} \log | \pi(u_{i+1}) - \pi(u_{i}) |\)

MLA and MLogA were known to be NP-Hard, this paper proves that MLogGapA is also NP-Hard.

Model for graph/index compression

Bipartite minimum logarithmic arrangement (BiMLogA): generalize MLogA and MLogGapA, minimize \(\displaystyle\sum_{q \in Q}\sum_{i=1}^{k} \log (\pi(u_{i+1}) - \pi(u_{i}))\) where $k$ is the degree, $Q$ query vertices, $D$ data vertices and the graph \(G = (Q \cup D, E)\).

Note: BiMLogA and MLogGapA differs since the latter does not distinguish between query and data vertices.

Graph input model
Graph input model


Graph bisection: initially split $D$ into two sets of same size $V_1$ and $V_2$, and compute a cost of how "compression-friendly" the partition is. Next, swap vertices in $V_1$ and $V_2$ to improve the cost (=> compute the "move gain").


Main algorithm steps:

  1. Find graph bisection $G_1$ and $G_2$ of $G$
  2. Recursively find linear arrangements for $G_1$ and $G_2$
  3. Concatenate the resulting order

Time complexity: \(O(m\log n + n\log^2 n)\)

In the implementation, the algorithm can be parallelized/distributed:


Compressing Inverted Indexes with Recursive Graph Bisection: a reproducibility study

Joel Mackenzie, Antonio Mallia, Matthias Petri, J. Shane Culpepper, and Torsten Suel (2019)

Note: focuses on inverted indexes more than graph


Multiple ordering techniques:

BP ordering runs in \(O(m\log n + n\log^2 n)\) and yield excellent arrangements in practice by approximating the solution to the BiMLogA problem.

implementation/pseudo-code details

optimization details

Reproducibility notes: