An LSIF indexer produces a file containing the definition, reference, hover, and diagnostic data for a project. Users upload this index file to a Sourcegraph instance, which converts it into an internal format that can support code intelligence queries.
The sequence of actions required to upload and convert this data is shown below (click to enlarge).
The API used to upload an LSIF index is modeled after the S3 multipart upload API. Many LSIF uploads can be fairly large and the network is generally not reliable. To cope with frequent failures of large uploads (and to stay under upload limits in Cloudflare), the upload is broken into multiple, independently gzipped chunks. Each chunk is uploaded in sequence to the instance, where it is concatenated into a single file on the remote end. This allows us to retry chunks independently in the case of an upload failure without sacrificing the entire operation.
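The sketch below illustrates this scheme from the client's side. The endpoint paths, payload shapes, and chunk size are hypothetical, not the actual API; the point is that each part is gzipped and retried independently, so a transient failure only costs the failed part.

```go
package upload

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"net/http"
	"time"
)

const chunkSize = 10 << 20 // hypothetical 10 MiB per part

// uploadIndex splits the raw index into gzipped parts and uploads each part
// with independent retries. Endpoint paths are illustrative only.
func uploadIndex(baseURL string, index []byte) error {
	numParts := (len(index) + chunkSize - 1) / chunkSize

	// Initial request: create the upload record and declare the part count.
	resp, err := http.Post(fmt.Sprintf("%s/upload?numParts=%d", baseURL, numParts), "text/plain", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var uploadID string
	if _, err := fmt.Fscan(resp.Body, &uploadID); err != nil {
		return err
	}

	for i := 0; i < numParts; i++ {
		lo, hi := i*chunkSize, (i+1)*chunkSize
		if hi > len(index) {
			hi = len(index)
		}

		// Gzip this part independently so it can be retried on its own.
		var buf bytes.Buffer
		gw := gzip.NewWriter(&buf)
		if _, err := gw.Write(index[lo:hi]); err != nil {
			return err
		}
		if err := gw.Close(); err != nil {
			return err
		}

		if err := withRetry(3, func() error {
			url := fmt.Sprintf("%s/upload/%s/%d", baseURL, uploadID, i)
			resp, err := http.Post(url, "application/gzip", bytes.NewReader(buf.Bytes()))
			if err != nil {
				return err
			}
			defer resp.Body.Close()
			if resp.StatusCode >= 300 {
				return fmt.Errorf("unexpected status %d", resp.StatusCode)
			}
			return nil
		}); err != nil {
			return err
		}
	}

	// Completion request: the server verifies all parts and concatenates them.
	resp, err = http.Post(fmt.Sprintf("%s/upload/%s/done", baseURL, uploadID), "text/plain", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

// withRetry runs f up to attempts times with a small backoff between tries.
func withRetry(attempts int, f func() error) (err error) {
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil {
			return nil
		}
		time.Sleep(time.Second << i)
	}
	return err
}
```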
An initial request adds an upload into the database in the `uploading` state and records the number of upload chunks it expects to receive. Each subsequent request specifies the upload identifier (returned by the initial request) and the index of the chunk being uploaded. If an upload part successfully makes it to disk, it is marked as received in the upload record. The final request from the client marks the upload as complete. At this point, the frontend ensures that all of the expected chunks have been received and reside on disk. The frontend then asks the bundle manager to concatenate the files, and the upload record is moved from the `uploading` state to the `queued` state, where it becomes visible to the worker process.
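Conceptually, the frontend maintains a small state machine over the upload record. The sketch below is a simplified, in-memory illustration of the transition performed by the completion request; the actual record lives in Postgres, and the field names here are invented.

```go
package upload

import "fmt"

// State of an upload record as described above.
type State string

const (
	Uploading  State = "uploading"
	Queued     State = "queued"
	Processing State = "processing"
	Completed  State = "completed"
	Errored    State = "errored"
)

// Upload is a simplified view of the upload record kept in Postgres.
type Upload struct {
	ID            int
	State         State
	ExpectedParts int
	ReceivedParts map[int]bool // part index -> received and on disk
}

// MarkComplete handles the final request from the client: it verifies that
// every expected chunk was received, asks the bundle manager to concatenate
// the parts, and moves the record from uploading to queued.
func MarkComplete(u *Upload, concatenate func(id int) error) error {
	for i := 0; i < u.ExpectedParts; i++ {
		if !u.ReceivedParts[i] {
			return fmt.Errorf("upload %d: missing part %d", u.ID, i)
		}
	}
	if err := concatenate(u.ID); err != nil {
		return err
	}
	u.State = Queued // now visible to the worker process
	return nil
}
```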
The worker process polls Postgres for upload records in the `queued` state. When such a record is available, it is marked as `processing` and is locked in a transaction to ensure that it is not double-processed by another worker instance. The worker asks the bundle manager for the raw LSIF upload data. Because this data is generally large, it is streamed to the worker while it is being processed (and retry logic inside the bundle manager client will resume the request from the last byte received on transient failures).
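The lock-in-a-transaction pattern described here is typically expressed in Postgres with `SELECT ... FOR UPDATE SKIP LOCKED`. The following is a sketch of such a dequeue step with illustrative table and column names, not the exact query used by the worker.

```go
package worker

import (
	"context"
	"database/sql"
)

// dequeue claims a single queued upload inside a transaction. The row stays
// locked until the transaction commits or rolls back, so no other worker can
// process the same record concurrently.
func dequeue(ctx context.Context, db *sql.DB) (*sql.Tx, int, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, 0, err
	}

	var id int
	err = tx.QueryRowContext(ctx, `
		UPDATE lsif_uploads
		SET state = 'processing', started_at = now()
		WHERE id = (
			SELECT id FROM lsif_uploads
			WHERE state = 'queued'
			ORDER BY uploaded_at
			LIMIT 1
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id
	`).Scan(&id)
	if err != nil {
		tx.Rollback()
		return nil, 0, err // sql.ErrNoRows means nothing is queued
	}

	return tx, id, nil
}
```

Skipping locked rows rather than blocking on them lets multiple worker instances poll the same table without serializing on a single in-flight upload.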
The worker then converts the raw LSIF data into a SQLite database, producing a set of packages that the indexed source code defines and a set of packages that it depends on. This portion of the conversion is omitted from the diagram, as it remains within the worker process (with one exception), but is explained below.
The sets of packages that this index defines and depends on are constructed by reading the package information attached to export and import monikers, respectively, in the correlated data. This data is inserted into Postgres to enable cross-repository definition and reference queries.
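A sketch of that insertion step is shown below, assuming illustrative table names (`lsif_packages`, `lsif_references`) and a minimal package shape of scheme, name, and version.

```go
package worker

import (
	"context"
	"database/sql"
)

// Package identifies a package by scheme (e.g. "npm"), name, and version,
// as carried on LSIF moniker and packageInformation vertices.
type Package struct {
	Scheme, Name, Version string
}

// updateDependencies records which packages this upload provides (from export
// monikers) and which it consumes (from import monikers), so that the frontend
// can answer cross-repository definition and reference queries.
func updateDependencies(ctx context.Context, tx *sql.Tx, uploadID int, defines, dependsOn []Package) error {
	for _, p := range defines {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO lsif_packages (upload_id, scheme, name, version) VALUES ($1, $2, $3, $4)`,
			uploadID, p.Scheme, p.Name, p.Version); err != nil {
			return err
		}
	}
	for _, p := range dependsOn {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO lsif_references (upload_id, scheme, name, version) VALUES ($1, $2, $3, $4)`,
			uploadID, p.Scheme, p.Name, p.Version); err != nil {
			return err
		}
	}
	return nil
}
```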
Duplicate uploads (with the same repository, commit, and root) are removed to prevent the frontend from querying multiple indexes for the same data. This can happen if a user re-uploads the same index, or if an index is re-uploaded as part of a CI step that was re-run. In these cases we prefer to keep the newest upload.
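One way to express this rule, keeping the upload currently being processed and removing older completed uploads for the same repository, commit, and root, is a delete along these lines (table and column names are illustrative):

```go
package worker

import (
	"context"
	"database/sql"
)

// deleteOverlappingDumps removes previously completed uploads that cover the
// same repository, commit, and root as the upload being processed. Because the
// current upload is still in the processing state inside this transaction, it
// is not affected by the delete.
func deleteOverlappingDumps(ctx context.Context, tx *sql.Tx, repositoryID int, commit, root string) error {
	_, err := tx.ExecContext(ctx, `
		DELETE FROM lsif_uploads
		WHERE state = 'completed'
		  AND repository_id = $1
		  AND "commit" = $2
		  AND root = $3
	`, repositoryID, commit, root)
	return err
}
```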
The repository is marked as dirty, which signals a periodically running process to recalculate the set of uploads visible to each commit. This process refreshes the commit graph for the repository stored in Postgres.
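A sketch of the dirty flag, using an illustrative table name, might look like the following; one reason to use an incrementing token rather than a boolean is that the periodic process can then detect uploads that arrive while a recalculation is already in flight.

```go
package worker

import (
	"context"
	"database/sql"
)

// markRepositoryDirty flags the repository so that the periodic commit-graph
// updater knows its set of visible uploads is stale. Each call bumps a token;
// the updater records the token it read before recalculating and only clears
// the flag if the token has not moved since.
func markRepositoryDirty(ctx context.Context, tx *sql.Tx, repositoryID int) error {
	_, err := tx.ExecContext(ctx, `
		INSERT INTO lsif_dirty_repositories (repository_id, dirty_token)
		VALUES ($1, 1)
		ON CONFLICT (repository_id) DO UPDATE
		SET dirty_token = lsif_dirty_repositories.dirty_token + 1
	`, repositoryID)
	return err
}
```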
The SQLite database is sent to the bundle manager in chunks, as described in the previous section.
Finally, if the previous steps have all completed without error, the transaction is committed, moving the upload record from the `processing` state to the `completed` state, where it becomes visible to the frontend to answer code intelligence queries. If an error does occur, the upload record is instead moved to the `errored` state and marked with a failure reason.
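The commit-or-error decision at the end of processing might be sketched as follows, with illustrative column names; the key point is that the `completed` state only becomes visible to the frontend once the whole transaction commits, while a failure abandons the partial work and records the reason.

```go
package worker

import (
	"context"
	"database/sql"
)

// finalize moves the upload out of the processing state. On success the
// surrounding transaction is committed, publishing the completed state to the
// frontend in one atomic step; on failure the transaction is rolled back and
// the record is marked errored with the failure reason.
func finalize(ctx context.Context, db *sql.DB, tx *sql.Tx, uploadID int, processErr error) error {
	if processErr == nil {
		if _, err := tx.ExecContext(ctx,
			`UPDATE lsif_uploads SET state = 'completed', finished_at = now() WHERE id = $1`,
			uploadID); err != nil {
			tx.Rollback()
			return err
		}
		return tx.Commit()
	}

	// Abandon the partial work, then record the failure outside the transaction.
	tx.Rollback()
	_, err := db.ExecContext(ctx,
		`UPDATE lsif_uploads SET state = 'errored', failure_message = $2, finished_at = now() WHERE id = $1`,
		uploadID, processErr.Error())
	return err
}
```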