Infrastructure

Graph archive:

RDBMS specific:

Archiver

https://docs.softwareheritage.org/devel/swh-archiver/archiver-blueprint.html

Goal: keep track of how many copies of a given file content exist and where each copy is stored.
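The bookkeeping described in this goal can be sketched as a mapping from a content identifier to the set of places holding a copy. This is only an illustration: the class name CopyTracker, its methods, and the location names are hypothetical and are not taken from swh-archiver.

```python
from collections import defaultdict


class CopyTracker:
    """Hypothetical in-memory index: content id -> set of storage locations."""

    def __init__(self):
        self._locations = defaultdict(set)

    def record_copy(self, content_id, location):
        # Recording the same (content, location) pair twice is a no-op.
        self._locations[content_id].add(location)

    def copies(self, content_id):
        # Number of distinct locations known to hold this content.
        return len(self._locations.get(content_id, ()))

    def locations(self, content_id):
        # Sorted list of locations, for stable output.
        return sorted(self._locations.get(content_id, ()))

    def under_replicated(self, minimum):
        # Content ids that have fewer than `minimum` known copies.
        return sorted(
            cid for cid, locs in self._locations.items() if len(locs) < minimum
        )


tracker = CopyTracker()
tracker.record_copy("content-a", "storage-1")  # illustrative names only
tracker.record_copy("content-a", "storage-2")
tracker.record_copy("content-b", "storage-1")
```

With this sketch, tracker.under_replicated(2) lists the contents that still need another copy.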

Characteristics

Implementation

Indexer

https://docs.softwareheritage.org/devel/swh-indexer/metadata-workflow.html

Goal: extract information from multiple SWH objects:

The indexer stores this information in a database (expected to be called swh-indexer-dev).

Intrinsic metadata

Examples of supported intrinsic metadata: codemeta.json, pom.xml, package.json, PKG-INFO, .gemspec

These metadata are translated using CodeMeta's crosswalk table

Journal

Goal: log changes to the archive

Publish-subscribe support using Apache Kafka.
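The publish-subscribe idea can be illustrated without a broker. The sketch below is a plain in-memory append-only log per topic, with consumers reading from an offset. It does not use Kafka or any swh-journal API; the class name Journal, the topic name, and the message shape are assumptions made for illustration.

```python
class Journal:
    """Hypothetical in-memory stand-in for an append-only change log."""

    def __init__(self):
        self._topics = {}

    def publish(self, topic, message):
        # Append the message to the topic's log and return its offset.
        log = self._topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset=0):
        # Return every message at or after `offset`; the log is never mutated,
        # so several consumers can read independently at their own pace.
        return list(self._topics.get(topic, [])[offset:])


journal = Journal()
first = journal.publish("content", {"id": "a"})   # illustrative message
second = journal.publish("content", {"id": "b"})
```

A consumer that has already processed offset 0 would call journal.read("content", 1) to pick up only the newer change.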

Listers

https://docs.softwareheritage.org/devel/swh-lister/tutorial.html

Goal: enumerate the software origins available at a source code distribution place.

Currently supported listers: bitbucket, debian, github, gitlab, npm, pypi

A lister follows these steps:

Loaders

Goal: read and import/update a source code origin in the SWH archive

Currently supported loaders: debian, dir, git, mercurial, pypi, svn, tar

Model

Goal: implementation of the Merkle DAG used to store artifacts. Also defines the persistent identifier (PID).

Data model

Crawling-related information:

The SWH archive is a single Merkle Directed Acyclic Graph (DAG). Inherited properties from this:
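One property any Merkle DAG provides by construction is that a node's identifier depends only on its content and on the identifiers of its children, so identical objects collapse into a single node. The sketch below shows the principle with SHA1 from the Python standard library. The serialization used here is a simplified assumption and is not the format SWH actually hashes; the function names are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    # Leaf node: the identifier is a hash of the raw bytes.
    return hashlib.sha1(data).hexdigest()


def directory_id(entries: dict) -> str:
    # Inner node: hash the (name, child id) pairs in sorted order, so the
    # identifier does not depend on the order entries were added in.
    h = hashlib.sha1()
    for name in sorted(entries):
        h.update(name.encode("utf-8") + b"\0" + entries[name].encode("ascii") + b"\n")
    return h.hexdigest()


readme = content_id(b"hello\n")
same_readme = content_id(b"hello\n")   # same bytes, same identifier
other = content_id(b"changed\n")

dir_v1 = directory_id({"README": readme, "LICENSE": other})
dir_v1_again = directory_id({"LICENSE": other, "README": same_readme})
dir_v2 = directory_id({"README": other, "LICENSE": other})
```

Here dir_v1 and dir_v1_again are equal, while dir_v2 differs because one of its children changed: a change in a leaf propagates to every ancestor identifier.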

Persistent identifier (PID)

Every SWH object can be uniquely identified by an intrinsic identifier that is guaranteed to remain stable over time.

Syntax: swh:<scheme_version>:<object_type>:<object_id>

Examples:

object_id: SHA1 hash computed from the object's content and metadata

SWH web app can be used to resolve a PID: https://archive.softwareheritage.org/<identifier>
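The syntax and the resolver URL above can be exercised with a small parser. This is a sketch under stated assumptions: the names PID, parse_pid, and resolve_url are hypothetical, the pattern assumes object_id is a 40-character lowercase hexadecimal SHA1, and the identifier used in the example is a placeholder rather than a real archived object.

```python
import re
from typing import NamedTuple

# swh:<scheme_version>:<object_type>:<object_id>
# Assumption: object_id is a 40-character lowercase hex SHA1 digest.
PID_RE = re.compile(
    r"swh:(?P<version>[0-9]+):(?P<type>[a-z]+):(?P<id>[0-9a-f]{40})"
)


class PID(NamedTuple):
    scheme_version: int
    object_type: str
    object_id: str


def parse_pid(text: str) -> PID:
    # fullmatch rejects trailing characters, including a trailing newline.
    match = PID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid PID: {text!r}")
    return PID(int(match.group("version")), match.group("type"), match.group("id"))


def resolve_url(text: str) -> str:
    # Validate first, then append the identifier to the web app base URL.
    parse_pid(text)
    return "https://archive.softwareheritage.org/" + text


# Placeholder identifier for illustration; "cnt" is used as the type code.
example = "swh:1:cnt:0123456789abcdef0123456789abcdef01234567"
pid = parse_pid(example)
url = resolve_url(example)
```

Validating before building the URL keeps malformed strings from being sent to the resolver.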

Objstorage

Goal: API to manipulate SWH object storage (add, restore, get, check, delete, etc.)

Scheduler

Goal: keep track of scheduled tasks (e.g., the next listing or loading)

Implementation is based on Celery.

Storage

https://docs.softwareheritage.org/devel/swh-storage/sql-storage.html

https://docs.softwareheritage.org/devel/swh-storage/archive-copies.html

Goal: abstraction layer over the archive to access artifacts and their metadata

Vault

Goal: retrieve parts of the archive as self-contained bundles (e.g., individual releases, entire repository snapshots, etc.)

Web

Goal: web apps to browse the archive