Life of a repository

This document describes how our backend systems clone and update repositories from a code host.

High level

A site admin adds a code host configuration.
repo-updater periodically syncs all repository metadata from configured code hosts.
We poll the code host's API based on the configuration.
We add/update/remove entries in our repo table.
All repositories in our repo table are tracked by a scheduler on repo-updater, which ensures they are cloned and updated on gitserver.

Our guiding principle is to ensure all repositories configured by a site administrator are cloned and up to date. However, we need to avoid overloading a code host with API and Git requests.

Services

Repo Updater

repo-updater is a singleton service. It is responsible for:

Communicating with code host APIs to coordinate the state we synchronize from them.
Maintaining the repo table which other services read.
Scheduling clones/fetches on gitserver.
Anything which communicates with a code host API.

Our batch changes and background permissions syncers are also located in repo-updater as they require communication with code host APIs.

Gitserver

gitserver is a scalable stateful service which clones and updates git repositories and can run git commands against them.

All data maintained on this service is from cloning an upstream repository. We shard the set of repositories across the gitserver replicas, but do not support replication. All communication with gitserver from other services should be done via the gitserver client interface.
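
To illustrate what sharding without replication implies, here is a minimal sketch in which each repository name maps deterministically to exactly one replica. The function addrForRepo, the FNV hash, and the modulo scheme are assumptions made for illustration; they are not taken from the gitserver client.

```go
package main

import "hash/fnv"

// addrForRepo picks the single replica that owns a repository.
// Hypothetical helper: the hash function and the modulo scheme are
// illustrative assumptions, not the real gitserver client logic.
func addrForRepo(repoName string, addrs []string) string {
	if len(addrs) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(repoName)) // Write on a hash.Hash never returns an error.
	return addrs[int(h.Sum32()%uint32(len(addrs)))]
}
```

Because every caller computes the same address for the same name, no coordination is needed to find a repository. With a plain modulo scheme like this one, changing the number of replicas would remap most repositories.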

It is responsible for the state of the gitserver_repos table. The main process that handles this is a background job that runs on each gitserver instance; see SyncRepoState.

Discovery

Before we can clone a repository, we must first discover that it exists. A site administrator configures this by adding a code host configuration. A code host typically has an API as well as git endpoints, and its configuration typically specifies how to communicate with the API and which repositories to ask the API for. For example:

{
  "url": "https://github.com",
  "token": "deadbeef",
  "repositoryQuery": ["affiliated"]
}

This is a GitHub code host configuration for github.com using the personal access token deadbeef. It asks GitHub for all affiliated repositories. Follow GithubSource.listRepositoryQuery to find the actual API call we make.
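
As a rough sketch of how such a configuration could be read in Go, the snippet below decodes the three fields from the example into a struct. The type gitHubConnection and the function parseGitHubConnection are hypothetical names for illustration and are not the real schema types.

```go
package main

import (
	"encoding/json"
	"errors"
)

// gitHubConnection is a hypothetical, cut-down view of a GitHub code
// host configuration. It only covers the fields shown in the example.
type gitHubConnection struct {
	URL             string   `json:"url"`
	Token           string   `json:"token"`
	RepositoryQuery []string `json:"repositoryQuery"`
}

// parseGitHubConnection decodes and minimally validates a configuration.
func parseGitHubConnection(raw []byte) (*gitHubConnection, error) {
	var c gitHubConnection
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	if c.URL == "" {
		return nil, errors.New("url is required")
	}
	return &c, nil
}
```

Note that encoding/json accepts only strict JSON, so a configuration containing comments or trailing commas would need to be normalized before it reaches a decoder like this one.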

Discovering the repositories for each code host configuration is abstracted in the Source interface.

// A Source yields repositories to be stored and analysed by Sourcegraph.
// Successive calls to its ListRepos method may yield different results.
type Source interface {
	// ListRepos sends all the repos a source yields over the passed in channel
	// as SourceResults
	ListRepos(context.Context, chan SourceResult)
	// ExternalServices returns the ExternalServices for the Source.
	ExternalServices() ExternalServices
}

Syncing

We keep a list of all repositories on Sourcegraph in the repo table. This provides a code-host-independent list of repositories that we can query quickly. repo-updater periodically syncs each code host connection in the background: it compares the list of configured repos with those in our repo table and ensures that they are consistent. The syncer respects the site config limits userRepos.maxPerSite (20000 by default) and userRepos.maxPerUser (2000 by default). If either limit is exceeded, the code host connection stops syncing until the limit is increased or the excess repositories are removed.

See Syncer.SyncExternalServices for details.
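
The comparison step of a sync can be pictured as a diff between two sets. The sketch below is an illustration under simplifying assumptions: repositories are reduced to a map from a stable code host identifier to a name, and the Diff type and diffRepos function are hypothetical rather than the syncer's real API.

```go
package main

import "sort"

// Diff groups repository names by what a sync would do with them.
// Hypothetical type for illustration only.
type Diff struct {
	Added, Modified, Deleted, Unmodified []string
}

// diffRepos compares what is stored in the repo table with what the
// code host currently yields. Both maps are keyed by a stable
// identifier from the code host and hold the repository name.
func diffRepos(stored, sourced map[string]string) Diff {
	var d Diff
	for id, name := range sourced {
		old, ok := stored[id]
		switch {
		case !ok:
			d.Added = append(d.Added, name)
		case old != name:
			d.Modified = append(d.Modified, name)
		default:
			d.Unmodified = append(d.Unmodified, name)
		}
	}
	for id, name := range stored {
		if _, ok := sourced[id]; !ok {
			d.Deleted = append(d.Deleted, name)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(d.Added)
	sort.Strings(d.Modified)
	sort.Strings(d.Deleted)
	sort.Strings(d.Unmodified)
	return d
}
```

Keying on an identifier rather than on the name lets a rename show up as a modification instead of a delete followed by an add.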

Git Update Scheduler

We can't clone all repositories concurrently due to resource constraints in Sourcegraph and on the code host. So repo-updater has an update scheduler. Cloning and fetching are treated in the same way, but priority is given to newly discovered repositories.

The scheduler is divided into two parts:

updateQueue is a priority queue of repositories to clone/fetch on gitserver.
schedule places a repository onto the updateQueue when it thinks the repository should be updated. This is what paces out updates for a repository. It contains heuristics so that recently updated repositories are checked more frequently.

Repositories can also be placed onto the updateQueue if we receive a webhook indicating the repository has changed. (By default, we don't set up webhooks when integrating into a code host.) When a user directly visits a repository on Sourcegraph, we also enqueue it for update.
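
A priority queue with these properties can be sketched with container/heap. This is a minimal illustration, not the real updateQueue: the two priority levels, the FIFO tie-break, and the method names enqueue and dequeue are assumptions.

```go
package main

import "container/heap"

// Hypothetical priority levels: high for new repositories, webhooks,
// and user visits; low for routine scheduled fetches.
const (
	priorityLow = iota
	priorityHigh
)

type update struct {
	repo     string
	priority int
	seq      uint64 // Insertion order, used as a FIFO tie-break.
	index    int    // Position in the heap, maintained by Swap and Push.
}

// updateQueue is a sketch of an in-memory priority queue of repositories.
type updateQueue struct {
	items   []*update
	byRepo  map[string]*update
	nextSeq uint64
}

func (q *updateQueue) Len() int { return len(q.items) }

func (q *updateQueue) Less(i, j int) bool {
	a, b := q.items[i], q.items[j]
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.seq < b.seq
}

func (q *updateQueue) Swap(i, j int) {
	q.items[i], q.items[j] = q.items[j], q.items[i]
	q.items[i].index = i
	q.items[j].index = j
}

func (q *updateQueue) Push(x any) {
	u := x.(*update)
	u.index = len(q.items)
	q.items = append(q.items, u)
}

func (q *updateQueue) Pop() any {
	n := len(q.items)
	u := q.items[n-1]
	q.items[n-1] = nil
	q.items = q.items[:n-1]
	return u
}

// enqueue adds a repository, or raises its priority if already queued.
func (q *updateQueue) enqueue(repo string, priority int) {
	if q.byRepo == nil {
		q.byRepo = map[string]*update{}
	}
	if u, ok := q.byRepo[repo]; ok {
		if priority > u.priority {
			u.priority = priority
			heap.Fix(q, u.index)
		}
		return
	}
	q.nextSeq++
	u := &update{repo: repo, priority: priority, seq: q.nextSeq}
	q.byRepo[repo] = u
	heap.Push(q, u)
}

// dequeue removes and returns the next repository to update.
func (q *updateQueue) dequeue() (string, bool) {
	if q.Len() == 0 {
		return "", false
	}
	u := heap.Pop(q).(*update)
	delete(q.byRepo, u.repo)
	return u.repo, true
}
```

Tracking queued entries by name means a webhook or user visit for an already queued repository only bumps its priority rather than scheduling a second, redundant fetch.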

The update scheduler has a number of workers equal to the value of conf.GitMaxConcurrentClones, which process the updateQueue and issue git clone/fetch commands via an RPC call to the appropriate gitserver instance. It is important to remember that the updateQueue only exists in memory in repo-updater. gitserver has no knowledge of the queue and only handles requests to update repositories.

See this diagram which shows the relationship between the scheduler and update queue.

Identity Coherence

Repositories can be referenced using an internal ID that is coherent across updates, deletes, and even re-adding the original repository name to Sourcegraph after deleting. This ID refers to the primary key column id in the repo table.