How Sourcegraph auto-indexes source code

Auto-indexing is enabled only in the Cloud environment and is designed to work well for the usage patterns found there. Once we have shown that auto-indexing would also benefit private instances, we will consider making the feature available there as well.

Scheduling

Currently, scheduling is based primarily on repository groups (but the configuration and details are actively being worked on).

Once the set of repositories to index has been determined, the steps required to index each repository are determined.

If a user has explicitly configured indexing steps for a repository, the configuration may be found in the database (configured via the UI) or in the sourcegraph.yaml configuration file in the root of the repository.
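The lookup of explicit configuration can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function and type names are hypothetical, the two reader callbacks stand in for the real database and gitserver access, and checking the database before the repository file is an assumption that the text above does not state.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IndexConfiguration:
    # Hypothetical container: where the configuration was found and its raw contents.
    source: str
    raw: str


def resolve_explicit_configuration(
    repo: str,
    read_from_database: Callable[[str], Optional[str]],
    read_from_repository: Callable[[str, str], Optional[str]],
) -> Optional[IndexConfiguration]:
    """Return the explicit index configuration for repo, or None if there is none."""
    # Assumption: configuration stored via the UI is consulted first.
    raw = read_from_database(repo)
    if raw is not None:
        return IndexConfiguration("database", raw)

    # Otherwise look for sourcegraph.yaml in the root of the repository.
    raw = read_from_repository(repo, "sourcegraph.yaml")
    if raw is not None:
        return IndexConfiguration("repository", raw)

    # No explicit configuration was found.
    return None
```

A return value of None signals that no explicit configuration exists for the repository.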

If no explicit configuration exists, the steps are inferred from the repository structure. We currently support detection of projects in the following languages:

The steps to index the repository are serialized into an index record and inserted into a task queue to be processed asynchronously by a pool of task executors.
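As a rough sketch of this hand-off, the example below serializes a set of steps into a record and places it on an in-memory queue. The field names, the JSON encoding, and the use of Python's queue module are all assumptions made for illustration; the real index record lives in the database and its schema is not described here.

```python
import json
import queue
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Step:
    # Hypothetical shape of one indexing step.
    image: str           # container image to run the commands in
    commands: List[str]  # commands to run, in order
    root: str = ""       # working directory relative to the repository root


@dataclass
class IndexRecord:
    # Hypothetical shape of an index record.
    repository: str
    commit: str
    steps: List[Step]
    state: str = "queued"


def enqueue_index(task_queue: "queue.Queue[str]", record: IndexRecord) -> str:
    """Serialize the record and put it on the queue for asynchronous processing."""
    payload = json.dumps(asdict(record))
    task_queue.put(payload)
    return payload
```

An executor would later take the payload off the queue and decode it with json.loads.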

Processing

Because indexing an arbitrary code base may require arbitrary commands to be run (e.g., dependency gathering, compilation steps, code generation, etc.), we process each index job in a Firecracker virtual machine managed by Weave Ignite. These virtual machines are coordinated by the executor service, which is deployed directly on GCP compute nodes.

The executor, deployed externally to the rest of the cluster, makes requests to the frontend and to gitserver via proxy routes in the frontend protected by a shared token.

When idle, the executor process will periodically poll the frontend asking for an index job from a specific queue (configured via an environment variable on the executor). A periodic heartbeat request between the executor and the frontend will ensure that jobs do not stay permanently locked if the executor crashes or becomes partitioned from the Sourcegraph instance.
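The interaction between dequeueing and heartbeats can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the class, its method names, the timeout policy, and the explicit now arguments (used so the example is deterministic) are all hypothetical, and the real queue is backed by the database rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    id: int
    state: str = "queued"           # "queued" or "processing"
    executor: Optional[str] = None  # name of the executor holding the lock
    last_heartbeat: float = 0.0


class JobQueue:
    def __init__(self, heartbeat_timeout: float) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.jobs: Dict[int, Job] = {}

    def add(self, job_id: int) -> None:
        self.jobs[job_id] = Job(id=job_id)

    def dequeue(self, executor: str, now: float) -> Optional[Job]:
        """Lock and return the oldest queued job, or None if the queue is empty."""
        for job in sorted(self.jobs.values(), key=lambda j: j.id):
            if job.state == "queued":
                job.state = "processing"
                job.executor = executor
                job.last_heartbeat = now
                return job
        return None

    def heartbeat(self, executor: str, job_ids: List[int], now: float) -> None:
        """Record that the executor is still alive and working on these jobs."""
        for job_id in job_ids:
            job = self.jobs.get(job_id)
            if job and job.state == "processing" and job.executor == executor:
                job.last_heartbeat = now

    def release_stale(self, now: float) -> List[int]:
        """Requeue jobs whose executor has not sent a heartbeat within the timeout."""
        released = []
        for job in self.jobs.values():
            stale = now - job.last_heartbeat > self.heartbeat_timeout
            if job.state == "processing" and stale:
                job.state = "queued"
                job.executor = None
                released.append(job.id)
        return released
```

In this model, a job held by a crashed or partitioned executor becomes available again once its last heartbeat is older than the timeout, so another executor can pick it up on its next poll.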

On dequeue, a row from the lsif_indexes table is transformed into a generic (non-code-intel-specific) task to be sent back to the executor. This payload consists of a sequence of docker and src-cli commands to run.
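A simplified version of this transformation might look like the sketch below. The row keys and the Task, DockerStep, and CliStep types are hypothetical and do not mirror the real transformRecord function or the lsif_indexes schema. The final step uses the src-cli lsif upload subcommand with its flags deliberately left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DockerStep:
    image: str           # container image to run
    commands: List[str]  # commands to run inside the container
    dir: str = ""        # working directory relative to the repository root


@dataclass
class CliStep:
    commands: List[str]  # arguments passed to src-cli
    dir: str = ""


@dataclass
class Task:
    # Generic task: nothing here is specific to code intelligence.
    id: int
    repository: str
    commit: str
    docker_steps: List[DockerStep]
    cli_steps: List[CliStep]


def transform_record(row: Dict[str, Any]) -> Task:
    """Turn an index row (hypothetical shape) into a generic executor task."""
    docker_steps = [
        DockerStep(image=s["image"], commands=list(s["commands"]), dir=s.get("root", ""))
        for s in row.get("docker_steps", [])
    ]
    # Assumption: the last step uploads the generated index with src-cli.
    # Flags are omitted rather than guessed.
    cli_steps = [CliStep(commands=["lsif", "upload"], dir=row.get("root", ""))]
    return Task(
        id=row["id"],
        repository=row["repository"],
        commit=row["commit"],
        docker_steps=docker_steps,
        cli_steps=cli_steps,
    )
```

Keeping the task generic means the executor only needs to know how to run docker and src-cli commands, not what an index is.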

Once the executor receives a job, it will clone the target repository and check out the target commit. A Firecracker virtual machine is started and the local git clone is copied into it. The commands encoded in the dequeued job are invoked inside of the virtual machine. Once done, the virtual machine is removed and a request is made to the frontend to mark the index as successfully processed.
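The order of operations above can be sketched as a single handler function. This is an illustration only: the Operations interface and all of its method names are hypothetical stand-ins for git, Ignite, and frontend calls, and removing the virtual machine in a finally block (so that it is also cleaned up when a command fails) is an assumption rather than documented behavior.

```python
from typing import List, Protocol


class Operations(Protocol):
    # Hypothetical interface; each method stands in for a real side effect.
    def clone(self, repository: str, commit: str) -> str: ...
    def start_vm(self, workspace: str) -> str: ...
    def run_in_vm(self, vm: str, command: List[str]) -> None: ...
    def remove_vm(self, vm: str) -> None: ...
    def mark_complete(self, job_id: int) -> None: ...


def handle(
    job_id: int,
    repository: str,
    commit: str,
    commands: List[List[str]],
    ops: Operations,
) -> None:
    """Process one dequeued job from start to finish."""
    # Clone the repository and check out the target commit on the host.
    workspace = ops.clone(repository, commit)

    # Start a virtual machine with the local clone copied into it.
    vm = ops.start_vm(workspace)
    try:
        # Run each command from the dequeued job inside the virtual machine.
        for command in commands:
            ops.run_in_vm(vm, command)
    finally:
        # Assumption: the virtual machine is removed even if a command fails.
        ops.remove_vm(vm)

    # Reached only when every command succeeded.
    ops.mark_complete(job_id)
```

If a command raises, the exception propagates after the virtual machine is removed and the job is not marked complete in this sketch.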

Code appendix

Executor: Handle
Frontend: newExecutorQueueHandler, handleDequeue, handleHeartbeat, handleAddExecutionLogEntry, handleMarkComplete, transformRecord
Firecracker: setupFirecracker, teardownFirecracker, formatFirecrackerCommand