Google Summer of Code 2019

During Google Summer of Code 2019, I worked with Software Heritage, an ambitious research project whose goal is to collect, preserve, and share all publicly available Free/Open Source Software (FOSS) source code.

My mentors were Stefano Zacchiroli and Antoine Pietri.

Subject

The Software Heritage data model is a Merkle DAG made of nodes such as revisions, releases, and directories. The graph is very large, with ~12 B nodes and ~160 B edges, which makes it hard to fit in memory using naive approaches.
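To make the Merkle DAG idea concrete, here is a minimal sketch in which each node's identifier is a hash of its own payload and of its children's identifiers. The `node_id` helper, the node kinds, and the hashing layout are assumptions made for illustration; they are not the actual Software Heritage identifier scheme.

```python
import hashlib


def node_id(kind, payload, children):
    """Hypothetical helper: hash a node's kind, payload, and child identifiers."""
    h = hashlib.sha1()
    h.update(kind.encode())
    h.update(b"\0")
    h.update(payload)
    # Sort children so the identifier does not depend on insertion order.
    for child in sorted(children):
        h.update(b"\0")
        h.update(child.encode())
    return h.hexdigest()


# Leaves: file contents have no children.
file_a = node_id("content", b"print('hello')\n", [])
file_b = node_id("content", b"# README\n", [])

# Inner nodes: a directory points to files, a revision points to a directory.
directory = node_id("directory", b"", [file_a, file_b])
revision = node_id("revision", b"initial commit", [directory])

# Changing one leaf changes the identifier of every ancestor.
file_b_v2 = node_id("content", b"# README v2\n", [])
directory_v2 = node_id("directory", b"", [file_a, file_b_v2])
revision_v2 = node_id("revision", b"initial commit", [directory_v2])
```

Because identical content always hashes to the same identifier, a file shared by many repositories appears as a single node, which is what turns the archive into one deduplicated graph rather than a forest of separate trees.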

Graph compression techniques have been successfully used to compress the Web graph (which is slightly larger than the Software Heritage one) and make it fit in memory. The goal of this GSoC project is to review existing graph compression techniques and apply the most appropriate one to the Software Heritage case, enabling in-memory processing of its Merkle DAG.
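As a toy illustration of why successor lists compress well, the sketch below uses gap encoding with variable-length integers, a common ingredient of graph compression schemes. It is not necessarily the technique retained for this project, and the function names are hypothetical. It assumes node identifiers are non-negative integers.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_successors(successors):
    """Store a sorted successor list as gaps between consecutive node ids."""
    out = bytearray()
    prev = 0
    for s in sorted(successors):
        out += encode_varint(s - prev)
        prev = s
    return bytes(out)


def decompress_successors(data):
    """Rebuild the sorted successor list from its gap-encoded form."""
    result = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            result.append(prev)
            value = 0
            shift = 0
    return result


successors = [1000000, 1000003, 1000004, 1000010]
compressed = compress_successors(successors)
restored = decompress_successors(compressed)
```

When neighbouring nodes have close identifiers, most gaps are small and fit in a single byte, so the four successors above take far less space than four fixed-width 64-bit integers.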

Figure: Software Heritage Merkle DAG (/img/gsoc2019/swh_data_model.svg)

What was done

Git repo: swh-graph (list of my commits)
Research on graph compression: evaluated the feasibility and compression rate of multiple techniques through experiments
Docker environment and scripts to automate the entire compression pipeline
REST API to query the graph
- Java server side: loads the compressed graph and runs the traversal algorithms behind the API endpoints; includes unit tests and Javadoc
- Python client side: integration with SWH infrastructure, integration tests
Automated benchmarking tools with statistical analysis
General documentation on the Docker environment, compression steps, graph query use cases, etc.
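To give an idea of the kind of traversal a graph query endpoint can run, here is a minimal breadth-first visit over an in-memory adjacency mapping. The `visit_nodes` function and the toy node names are hypothetical, and the actual server side is written in Java and operates on the compressed graph, so this Python sketch only illustrates the principle.

```python
from collections import deque


def visit_nodes(successors, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Nodes without an entry in the mapping are leaves.
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Toy graph: two revisions whose directories share one file.
toy_graph = {
    "rev1": ["dir1"],
    "dir1": ["file_a", "file_b"],
    "rev2": ["dir2"],
    "dir2": ["file_a"],
}

reachable = visit_nodes(toy_graph, "rev1")
```

Running the same visit on the transposed graph answers the reverse question, for example which revisions contain a given file, which is one reason to keep both directions of the graph loaded.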

Highlights:

Compression rates: 4.48 bits/edge (transposed graph) and 4.91 bits/edge (direct graph)
Memory requirements for loading both graphs: 200 GB of RAM
Total compression time: 1 week
Node successor lookup times: below 2 μs
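As a back-of-the-envelope check, the figures above can be related to the graph size given earlier (~160 B edges). The sketch below assumes a naive representation that stores each edge as one 64-bit node identifier (32-bit identifiers cannot address ~12 B nodes) and counts 1 GB as 10^9 bytes. The variable and function names are illustrative only.

```python
EDGES = 160e9          # approximate edge count from the Subject section
BITS_DIRECT = 4.91     # bits/edge, direct graph
BITS_TRANSPOSED = 4.48 # bits/edge, transposed graph
NAIVE_BITS = 64        # assumption: one 64-bit identifier per edge


def size_gb(edges, bits_per_edge):
    """Size in gigabytes (10^9 bytes) of an edge list at a given bit rate."""
    return edges * bits_per_edge / 8 / 1e9


direct_gb = size_gb(EDGES, BITS_DIRECT)          # about 98.2 GB
transposed_gb = size_gb(EDGES, BITS_TRANSPOSED)  # about 89.6 GB
both_gb = direct_gb + transposed_gb              # about 187.8 GB
naive_gb = size_gb(EDGES, NAIVE_BITS)            # about 1280 GB per direction

print(f"compressed, both directions: {both_gb:.1f} GB")
print(f"naive, one direction: {naive_gb:.0f} GB")
```

The two compressed graphs add up to roughly 188 GB of edge data, which is of the same order as the 200 GB of RAM reported, while the naive layout would need more than a terabyte for a single direction.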

Timeline

Program announced: November 13, 2018
Organization Application Period: January 15, 2019 - February 6, 2019
Organization Announced: February 26, 2019
Student Application Period: March 25, 2019 - April 9, 2019
Application Review Period: April 9, 2019 - May 6, 2019
Student Projects Announced: May 6, 2019
Community Bonding: May 6 - 27, 2019
Coding: May 27, 2019 - August 19, 2019
Evaluations: June 24 - 28, 2019 and July 22 - 26, 2019
Students Submit Code and Final Evaluations: August 19 - 26, 2019
Mentors Submit Final Evaluations: August 26, 2019 - September 2, 2019
Results Announced: September 3, 2019
