Graph archive:
- file contents (≈ 200 TB, individually compressed files)
  - stored in a key-value object storage that uses the intrinsic node
    identifiers of the Merkle DAG as keys (see the sketch below)
  - allows easy horizontal scaling
- rest of the graph (≈ 5 TB, .gz files)
  - stored in a relational database management system (RDBMS): one table
    per type of node (directories, revisions, snapshots, etc.)
  - each table uses the intrinsic node identifier as primary key
  - no particular reason to store this part of the graph in an RDBMS, but
    graph database technology is not yet ready for a graph of this size
    and shape
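A minimal sketch of the key-value layout described above, assuming a plain
Python dict stands in for the distributed object storage; the ContentStorage
class and the sha1-only keying are illustrative, not the actual
swh-objstorage API:

    import hashlib
    import zlib

    class ContentStorage:
        """Toy key-value store: file contents keyed by their intrinsic id."""

        def __init__(self):
            self._objects = {}  # hex digest -> individually compressed content

        def add(self, content: bytes) -> str:
            key = hashlib.sha1(content).hexdigest()  # intrinsic node identifier
            self._objects[key] = zlib.compress(content)
            return key

        def get(self, key: str) -> bytes:
            return zlib.decompress(self._objects[key])

    storage = ContentStorage()
    key = storage.add(b"print('hello')\n")
    assert storage.get(key) == b"print('hello')\n"

Because keys are derived from the content itself, identical files
deduplicate automatically and the key space can be sharded across machines,
which is what makes horizontal scaling easy.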
RDBMS specific:
- master/slave replication
  - enables data from the master server to be copied to slave servers
  - used to improve recovery guarantees and performance:
    - load balancing: all writes happen on the master server, but read
      operations can take place on any slave server
    - point-in-time recovery
- prevent hash collisions by using multiple checksum algorithms (sha1,
  sha256, sha1_git, blake2s256)
Archiver
https://docs.softwareheritage.org/devel/swh-archiver/archiver-blueprint.html
Goal: keep track of how many copies of a given file content exist and
where each of them is.
- enforce retention policies
  - "each file content must exist in at least 3 copies"
  - periodically sweep all objects to check adherence, and make copies
    if necessary
- heal object corruption (due to storage media decay)
  - periodically sweep all copies of objects and verify checksums (and
    use healthy copies to repair objects if data corruption is found)
Characteristics
- append-only archival
- peer-to-peer topology: every storage can be used as a source or a
destination
- asynchronous archival: if new objects are added during an archival run
  (which makes copies of existing objects), these new objects will only
  be copied in a future archival run
- integrity at archival time: before adding new objects, verify that they
  can be decompressed and that their checksums can be computed
- persistent archival status (see the sketch after this list):
  - locations = {master, slave_1, ..., slave_n}; each <obj, dst> pair has:
    - status: missing, ongoing, present, corrupted
    - mtime: timestamp of last status change
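A sketch of the persistent archival status and a retention check over it,
as described above; the dictionary layout and the needs_archival helper are
assumptions for illustration, not the swh-archiver schema:

    from datetime import datetime, timezone

    RETENTION_COPIES = 3  # "each file content must exist in at least 3 copies"

    # one entry per <obj, dst> pair: current status + time of last change
    archival_status = {
        ("obj_a", "master"):  {"status": "present",
                               "mtime": datetime.now(timezone.utc)},
        ("obj_a", "slave_1"): {"status": "missing",
                               "mtime": datetime.now(timezone.utc)},
        ("obj_a", "slave_2"): {"status": "corrupted",
                               "mtime": datetime.now(timezone.utc)},
    }

    def needs_archival(obj_id: str) -> bool:
        """True if fewer than RETENTION_COPIES locations hold a healthy copy."""
        present = sum(1 for (obj, _dst), entry in archival_status.items()
                      if obj == obj_id and entry["status"] == "present")
        return present < RETENTION_COPIES

    print(needs_archival("obj_a"))  # True: only 1 of 3 required copies present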
Implementation
- archiver director
  - runs periodically via cron
  - groups objects in need of archival into batches and spawns an
    archiver worker for each batch
- archiver worker
  - executed on demand (Celery worker) to archive a set of objects
  - computes a map: (source, destination) -> contents (see the sketch
    after this list)
  - launches a copier for each transfer in the batch
- archiver copier
  - executed on demand by archiver workers to transfer file batches from
    a given source to a given destination
  - the copying process ensures that concurrent transfers of the same
    file by multiple copiers do not result in corrupted files (atomic
    operation + file granularity)
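A sketch of the (source, destination) -> contents map computed by the
archiver worker before launching copiers; the grouping logic and the shape
of the status dictionary are illustrative assumptions:

    from collections import defaultdict

    def plan_transfers(batch, status, locations):
        """Group a batch of objects into (source, destination) -> [objects]."""
        transfers = defaultdict(list)
        for obj in batch:
            healthy = [dst for dst in locations
                       if status.get((obj, dst), {}).get("status") == "present"]
            missing = [dst for dst in locations if dst not in healthy]
            if not healthy:
                continue  # no copy to read from; left for manual recovery
            for dst in missing:
                transfers[(healthy[0], dst)].append(obj)
        return transfers

    status = {("obj_a", "master"): {"status": "present"},
              ("obj_a", "slave_1"): {"status": "missing"}}
    print(plan_transfers(["obj_a"], status, ["master", "slave_1"]))
    # each (source, destination) pair is then handed to one archiver copier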
Indexer
https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html
Goal: extract information from several types of SWH objects:
- content: mimetype, ctags, language, fossology-license, metadata
- revision: metadata
The indexer stores this information in a database (expected to be called
swh-indexer-dev).
Examples of supported intrinsic metadata: codemeta.json, pom.xml,
package.json, PKG-INFO, .gemspec
These metadata are translated using CodeMeta's crosswalk table
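A sketch of how a crosswalk-style mapping translates ecosystem-specific
metadata (here, a package.json fragment) into CodeMeta terms; the mapping
shown is a small hand-picked subset for illustration, not the full crosswalk
table used by swh-indexer:

    # partial, illustrative crosswalk: npm package.json field -> CodeMeta term
    NPM_TO_CODEMETA = {
        "name": "name",
        "description": "description",
        "license": "license",
        "homepage": "url",
    }

    def translate_npm(package_json: dict) -> dict:
        """Translate known package.json fields into a CodeMeta-style dict."""
        codemeta = {"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}
        for src_field, term in NPM_TO_CODEMETA.items():
            if src_field in package_json:
                codemeta[term] = package_json[src_field]
        return codemeta

    print(translate_npm({"name": "left-pad", "license": "WTFPL"}))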
Journal
Goal: log changes to the archive
Publish-subscribe support using Apache Kafka.
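A sketch of publishing an archive change to Kafka, using the kafka-python
client as an assumed stand-in for swh-journal's own writer; the topic name
and message layout are hypothetical:

    import json
    from kafka import KafkaProducer  # assumed client: pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
    )

    # hypothetical topic and payload: one event per object added to the archive
    producer.send("swh.journal.objects.content", {
        "sha1": "94a9ed024d3859793618152ea559a168bbcbb5e2",
        "length": 42,
        "status": "visible",
    })
    producer.flush()

Downstream consumers subscribe to the topics they care about and replay
changes at their own pace.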
Listers
https://docs.softwareheritage.org/devel/swh-lister/tutorial.html
Goal: enumerate the software origins available at a source code
distribution place.
Current supported listers: bitbucket, debian, github, gitlab, npm,
pypi
A lister follows these steps:
- Network request to a service endpoint
- Convert the response to a canonical format
- Setup work queue to fetch/ingest the repositories
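A sketch of the three steps above for a hypothetical forge API; the endpoint
URL, response shape, and the scheduling helper are assumptions (real listers
hand work to swh-scheduler):

    import requests  # assumed HTTP client

    def schedule_load_task(origin_type: str, origin_url: str) -> None:
        """Stand-in for enqueueing a loading task via the scheduler."""
        print(f"scheduling {origin_type} load of {origin_url}")

    def list_origins(endpoint: str):
        # 1. network request to a service endpoint
        response = requests.get(endpoint, timeout=30)
        response.raise_for_status()

        # 2. convert the response to a canonical format: (type, url) origins
        origins = [("git", repo["clone_url"]) for repo in response.json()]

        # 3. set up the work queue to fetch/ingest the repositories
        for origin_type, origin_url in origins:
            schedule_load_task(origin_type, origin_url)
        return origins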
Loaders
Goal: read and import/update a source code origin in the SWH
archive
Current supported loaders: debian, dir, git, mercurial, pypi, svn,
tar
Model
Goal: implementation of the Merkle DAG used to store artifacts. Defines
persistent identifiers (PID).
hashutil.py: defines a MultiHash class in charge of computing all SWH
hashes (supports sha1, sha256, sha1_git, blake2s256)
- intrinsic identifier: the sha1, sha1_git, and sha256 checksums of the
  object's data (see the sketch below)
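A sketch of computing these checksums with the standard library; the
sha1_git value hashes a git-style "blob <length>\0" header followed by the
content, and the function name is illustrative rather than the actual
MultiHash interface:

    import hashlib

    def swh_hashes(content: bytes) -> dict:
        """Compute the checksums used as intrinsic identifiers (illustrative)."""
        git_header = b"blob %d\x00" % len(content)  # git-style blob header
        return {
            "sha1": hashlib.sha1(content).hexdigest(),
            "sha1_git": hashlib.sha1(git_header + content).hexdigest(),
            "sha256": hashlib.sha256(content).hexdigest(),
            "blake2s256": hashlib.blake2s(content, digest_size=32).hexdigest(),
        }

    print(swh_hashes(b"hello\n"))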
Data model
Crawling-related information:
- origins: (type, url) where type is git, svn, ...
- snapshots (aka branches, stable/dev packages, ...)
- visits: link origins with snapshots (every time an origin is consulted,
  a new visit object is created with the corresponding snapshot and
  timestamp)
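A small sketch of how a visit links an origin to the snapshot observed at a
given time; the dataclasses are illustrative, not the swh-model classes:

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class Origin:
        type: str  # git, svn, ...
        url: str

    @dataclass
    class Visit:
        origin: Origin
        snapshot_id: str  # intrinsic identifier of the snapshot that was seen
        date: datetime    # when the origin was consulted

    visit = Visit(
        origin=Origin(type="git", url="https://example.org/repo.git"),
        snapshot_id="0" * 40,  # dummy identifier for illustration
        date=datetime.now(timezone.utc),
    )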
The SWH archive is a single Merkle Directed Acyclic Graph. Properties
inherited from this:
- data deduplication
- SWH data unified as a whole
Persistent identifier (PID)
Every SWH object can be uniquely identified by an intrinsic
identifier that is guaranteed to remain stable over time.
Syntax:
swh:<scheme_version>:<object_type>:<object_id>
Examples:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505
swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d
object_id: sha1 of the object's content and metadata
The SWH web app can be used to resolve a PID: https://archive.softwareheritage.org/<identifier>
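A sketch of parsing and re-assembling the core PID syntax shown above;
qualifiers and any validation beyond the four colon-separated fields are
ignored:

    from typing import NamedTuple

    class PID(NamedTuple):
        scheme_version: int
        object_type: str  # cnt, dir, rev, ...
        object_id: str    # 40-character hex digest

    def parse_pid(pid: str) -> PID:
        prefix, version, object_type, object_id = pid.split(":")
        assert prefix == "swh"
        return PID(int(version), object_type, object_id)

    def format_pid(pid: PID) -> str:
        return f"swh:{pid.scheme_version}:{pid.object_type}:{pid.object_id}"

    pid = parse_pid("swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2")
    assert format_pid(pid) == "swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2"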
Objstorage
Goal: API to manipulate SWH object storage (add, restore, get, check,
delete, etc.)
Scheduler
Goal: keep track of scheduled tasks (e.g., the next listing/loading)
Implementation is based on Celery.
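A sketch of a periodic task declared with Celery, the mechanism the
scheduler builds on; the app name, broker URL, and task body are
assumptions:

    from celery import Celery

    app = Celery("swh_scheduler_sketch", broker="amqp://localhost")

    # run the (hypothetical) lister task once a day; real task routing is
    # handled by swh-scheduler, this only illustrates periodic scheduling
    app.conf.beat_schedule = {
        "list-example-forge-daily": {
            "task": "swh_scheduler_sketch.list_forge",
            "schedule": 24 * 60 * 60,  # seconds
        },
    }

    @app.task(name="swh_scheduler_sketch.list_forge")
    def list_forge():
        print("enumerating origins on the forge")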
Storage
https://docs.softwareheritage.org/devel/swh-storage/sql-storage.html
https://docs.softwareheritage.org/devel/swh-storage/archive-copies.html
Goal: abstraction layer over the archive to access artifacts and
their metadata
Vault
Goal: allow retrieving parts of the archive as self-contained
bundles (e.g., individual releases, entire repository snapshots,
etc.)
Web
Goal: web apps to browse the archive