How Sourcegraph auto-indexes source code

Auto-indexing is enabled only in the Cloud environment and is written to work well for the usage patterns found there. Once we have proven that auto-indexing would also be beneficial in private instances, we will consider making the feature available there as well.

Scheduling

The IndexabilityUpdater periodically updates a database table that aggregates code intelligence events into a list of repositories orderable by their popularity (or a close proxy thereof).

The IndexScheduler periodically queries the table maintained by the IndexabilityUpdater for a batch of repositories to index. The ordering expression for this query takes several parameters into account:

  • The time since the last index task was enqueued for this repository
  • The number of precise code intel results for this repository in the last week
  • The number of search-based code intel results for this repository in the last week
  • The ratio of precise code intel results over total code intel results for this repository in the last week
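The actual ordering expression lives in a SQL query, but the intuition behind these signals can be sketched as a scoring function. The field names, weights, and combination below are illustrative assumptions, not the real implementation:

```go
package main

import "fmt"

// repoStats holds the aggregated code intel signals for one repository.
// Field names are illustrative; the real data lives in a Postgres table.
type repoStats struct {
	HoursSinceLastEnqueue float64
	PreciseResults        float64 // precise code intel results, last week
	SearchResults         float64 // search-based code intel results, last week
}

// score ranks repositories: more total activity and more time since the
// last enqueue raise the score, while a high precise/total ratio lowers
// it (repositories already well served by precise intel are less urgent).
func score(s repoStats) float64 {
	total := s.PreciseResults + s.SearchResults
	if total == 0 {
		return 0
	}
	preciseRatio := s.PreciseResults / total
	return s.HoursSinceLastEnqueue * total * (1 - preciseRatio)
}

func main() {
	hot := repoStats{HoursSinceLastEnqueue: 48, PreciseResults: 0, SearchResults: 100}
	served := repoStats{HoursSinceLastEnqueue: 48, PreciseResults: 90, SearchResults: 10}
	// A busy repository with no precise results ranks above one whose
	// traffic is already mostly served by precise code intel.
	fmt.Println(score(hot) > score(served))
}
```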

Once the set of repositories to index has been determined, the steps required to index each repository are determined.

If a user has explicitly configured indexing steps for a repository, the configuration is read either from the database (when configured via the UI) or from the sourcegraph.yaml configuration file in the root of the repository.

If no explicit configuration exists, the steps are inferred from the repository structure. We currently support detection of projects in the following languages:

The steps to index the repository are serialized into an index record and inserted into a task queue to be processed asynchronously by a pool of task executors.

Processing

Because indexing an arbitrary code base may require arbitrary commands to be run (e.g., dependency gathering, compilation steps, code generation, etc.), we process each index job in a Firecracker virtual machine managed by Weave Ignite. These virtual machines are coordinated by the executor service, which is deployed directly on GCP compute nodes.

The executor, deployed externally to the rest of the cluster, makes requests to the executor-queue and gitserver services via proxy routes in the public API, which are protected by a shared token.

When idle, the executor process will periodically poll the executor-queue asking for an index job. If one exists, the executor-queue will open a long-running database transaction in order to lock the record during processing. A periodic heartbeat request between the executor and the executor-queue will ensure that transactions do not stay permanently open if the executor crashes or becomes partitioned from the Sourcegraph instance.

On dequeue, a queued but unlocked row in the lsif_indexes table is locked and the record is transformed into a generic (non-code-intel-specific) task to be sent back to the executor. This payload consists of a sequence of docker and src-cli commands to run.
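The generic payload is essentially an ordered list of commands. A sketch of what one step might look like follows; the struct shape and images are illustrative assumptions, not the wire format:

```go
package main

import "fmt"

// dockerStep is an illustrative sketch of one command in the generic,
// non-code-intel-specific task payload sent to the executor.
type dockerStep struct {
	Image    string   // docker image to run (empty for src-cli steps)
	Commands []string // command and arguments
	Dir      string   // working directory inside the clone
}

// renderStep formats a step for logging; src-cli steps have no image.
func renderStep(s dockerStep) string {
	if s.Image == "" {
		return fmt.Sprintf("run %v in %s", s.Commands, s.Dir)
	}
	return fmt.Sprintf("run %v in %s via image %s", s.Commands, s.Dir, s.Image)
}

func main() {
	steps := []dockerStep{
		{Image: "sourcegraph/lsif-go", Commands: []string{"lsif-go"}, Dir: "/"}, // hypothetical indexer image
		{Commands: []string{"src", "lsif", "upload"}, Dir: "/"},                 // src-cli upload step
	}
	for _, s := range steps {
		fmt.Println(renderStep(s))
	}
}
```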

Once the executor receives a job, it clones the target repository and checks out the target commit. A Firecracker virtual machine is started and the local git clone is copied into it. The commands determined by the executor-queue task translation layer are invoked inside of the virtual machine. Once done, the virtual machine is removed and a request is made to the executor-queue to mark the index as successfully processed.
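The VM lifecycle described above can be sketched as a command sequence. The ignite subcommands (run, cp, exec, rm) exist in Weave Ignite's CLI, but the exact flags and the copy mechanics used by Sourcegraph are assumptions here:

```go
package main

import "fmt"

// vmCommands builds an illustrative command sequence for one index job:
// start a VM, copy the clone in, run each step, and tear the VM down.
func vmCommands(vmName, repoDir string, steps [][]string) []string {
	cmds := []string{
		fmt.Sprintf("ignite run --name %s <vm-image>", vmName),
		fmt.Sprintf("ignite cp %s %s:/repo", repoDir, vmName),
	}
	for _, step := range steps {
		cmd := "ignite exec " + vmName + " --"
		for _, arg := range step {
			cmd += " " + arg
		}
		cmds = append(cmds, cmd)
	}
	return append(cmds, "ignite rm -f "+vmName)
}

func main() {
	for _, c := range vmCommands("index-job-42", "/tmp/clone", [][]string{{"lsif-go"}}) {
		fmt.Println(c)
	}
}
```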

Code appendix