The goal was to become more familiar with SWH infrastructure, code review, and the community. I joined the IRC channel #swh-devel and the mailing list, and started working on getting the SWH development setup running. I first focused on the "Easy hacks" tasks (D1226, D1235) and documentation fixes (D1217, D1242), before studying in more depth the subject I chose: graph compression on the development history of software. This allowed me to experiment a bit with the SWH codebase, run the test environment, solve some issues, and talk with SWH staffers!
I contacted the mentor for the subject early and sent him my resume. He quickly got back to me and gave me papers to read on graph compression techniques:
They already had some early results, thanks to Boldi & Vigna (the authors of the WebGraph framework/paper), and I tried replicating them on my own laptop. Unfortunately, I only have 8GB of RAM, so Java programs compressing big files quickly get OOM-killed.
My mentor told me he could maybe give me access to a VM so I could run my experiments. In the meantime, I automated the setup and testing process with Docker and a bash script that retrieves the datasets and compresses them using WebGraph.
FROM maven:3.6.0-jdk-8

WORKDIR /app

# Download webgraph binary
RUN curl -O http://webgraph.di.unimi.it/webgraph-3.6.1-bin.tar.gz
RUN tar xvfz webgraph-3.6.1-bin.tar.gz
RUN cp webgraph-3.6.1/webgraph-3.6.1.jar .

# Download webgraph dependencies
RUN curl -O http://webgraph.di.unimi.it/webgraph-deps.tar.gz
RUN tar xvfz webgraph-deps.tar.gz

# Download missing LAW dependency
RUN curl -O http://law.di.unimi.it/software/download/law-2.5-bin.tar.gz
RUN tar xvfz law-2.5-bin.tar.gz
RUN cp law-2.5/law-2.5.jar .

WORKDIR /graph
COPY compress_graph .
#!/bin/bash
DATASET=release_to_obj

# Download the edge list
curl -O https://annex.softwareheritage.org/public/dataset/swh-graph-2019-01-28/edges/$DATASET.csv.gz

# Uncompress the edge list
gunzip $DATASET.csv.gz

# Compute the node list
tr ' ' '\n' <$DATASET.csv | sort -u >$DATASET.nodes

# Build a function (MPH) that maps node names to node numbers in
# lexicographic order (output: $DATASET.mph)
java -cp /app/'*' it.unimi.dsi.sux4j.mph.GOVMinimalPerfectHashFunction $DATASET.mph /data/$DATASET.nodes

# Build the graph in BVGraph format (output: $DATASET.*)
java -cp /app/'*' it.unimi.dsi.webgraph.ScatteredArcsASCIIGraph -f $DATASET.mph $DATASET </data/$DATASET.csv

# Create a symmetrized version of the graph (output: $DATASET-s.*)
java -cp /app/'*' it.unimi.dsi.webgraph.Transform symmetrizeOffline $DATASET $DATASET-s

# Find a better permutation through Layered LPA (output: $DATASET.llpa)
java -cp /app/'*' it.unimi.dsi.law.graph.LayeredLabelPropagation $DATASET-s $DATASET.llpa

# Permute the graph accordingly (output: $DATASET-compr.*)
java -cp /app/'*' it.unimi.dsi.webgraph.Transform mapOffline $DATASET $DATASET-compr $DATASET.llpa

# Compute graph statistics (output: $DATASET.stats)
java -cp /app/'*' it.unimi.dsi.webgraph.Stats -s $DATASET

# Clean up the current directory
mkdir $DATASET
mv $DATASET.* $DATASET
mv $DATASET-s.* $DATASET
mv $DATASET-compr.* $DATASET
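As a quick sanity check, the node-list step of the script can be tried by hand on a tiny made-up edge list (the toy.csv / toy.nodes filenames are just for illustration; the Java steps are skipped here since they need the WebGraph jars):

```shell
#!/bin/bash
# Toy edge list: one "src dst" pair per line, as in the SWH edge dumps.
printf '%s\n' 'rel1 rev1' 'rel2 rev1' 'rel2 rev2' > toy.csv

# Same pipeline as in compress_graph: split every edge into its two
# endpoints, then sort and deduplicate to get the node list.
tr ' ' '\n' < toy.csv | sort -u > toy.nodes

cat toy.nodes    # rel1, rel2, rev1, rev2 -- one unique node per line
```

With the real image, the whole pipeline runs as something like `docker build -t webgraph . && docker run -v "$PWD:/data" webgraph ./compress_graph` (hypothetical image tag and volume mount; the script reads its input from /data, matching the Dockerfile above).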
I also took the time to dive more into research papers and SWH infrastructure:
A few days before the official application period, I began writing my proposal so I could get early feedback once submissions opened. Here is the final PDF: proposal.pdf.
My mentor gave me access to a first VM with 140GB of RAM and 72 vCPUs so I could run some early experiments. But I quickly became limited by disk space (the first time I've had more RAM than disk space :D), so they gave me access to another, beefier VM: 2TB of RAM and 128 vCPUs, with 1TB of disk storage (for 2 weeks)!
At the same time, they heard about a new reference paper on the graph compression problem: Compressing Graphs and Indexes with Recursive Graph Bisection. I went through it quickly, but as the paper states: "we do not release the datasets and our source code due to corporate restrictions". Since I couldn't find an open-source implementation either, I contacted the authors for insights into theirs, but didn't get any response.
Back to the WebGraph framework, I tried using the -big version (64-bit), since some parts of the DAG contain billions of nodes/edges, but I ran into a runtime exception. To get some results anyway, I fell back to the 32-bit version to see whether the exception persisted. The process ran fine but was really slow; when I contacted the authors of the WebGraph framework, they pointed out a useless redirection that was severely slowing down the process (because of swapping).
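For reference, the 64-bit classes live in the it.unimi.dsi.big.webgraph package rather than it.unimi.dsi.webgraph, so switching between the two versions is mostly a matter of swapping the package prefix. A hypothetical sketch (the command is only echoed, not executed; the classpath and symmetrizeOffline arguments mirror the compress_graph script above, and I'm assuming the big package mirrors the Transform class):

```shell
#!/bin/bash
DATASET=release_to_obj

# Int-indexed class used in compress_graph (node IDs limited to ~2^31):
SMALL=it.unimi.dsi.webgraph.Transform
# Long-indexed "big" counterpart for graphs beyond that limit:
BIG=it.unimi.dsi.big.webgraph.Transform

# Echo instead of executing, since this is just a sketch:
CMD="java -cp /app/'*' $BIG symmetrizeOffline $DATASET $DATASET-s"
echo "$CMD"
```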
It's time to gather some data now!