During the Google Summer of Code 2019, I worked with Software Heritage: an
ambitious research project whose goal is to collect, preserve, and share
the whole publicly accessible Free/Open Source Software (FOSS) source
code.
The Software
Heritage data model is a big Merkle DAG made of
nodes like revisions, releases, directories, etc. It is a very big
graph, with ~12 B nodes and ~160 B edges, which makes it hard to fit in
memory using naive approaches.
Graph compression techniques have been successfully used to compress
the Web graph (which is slightly larger than the Software Heritage one)
and make it fit in memory. The goal of this GSoC is review existing
graph compression techniques and apply the most appropriate one to the
Software Heritage case, enabling in-memory processing of its Merkle
DAG.