Life of a search query

This document describes how our backend systems serve search results to clients. There are multiple kinds of searches (e.g. text, repository, file, symbol, diff, commit), but this document will focus on text searches.

Clients

There are a few ways to perform a search with Sourcegraph:

  1. Typing a query into the search bar of the Sourcegraph web application.
  2. Typing a query into your browser's location bar after configuring a browser search engine shortcut.
  3. Using the src CLI command.

Clients use either the Streaming API or the search query in our GraphQL API. Both are exposed in our frontend service.

Frontend

The frontend implements the Streaming API, which is used by the browser. Historically we served results via GraphQL, and many clients still use that API. Internally, Sourcegraph search is streaming-based.

First, the frontend takes the query and creates a plan of jobs to execute concurrently. A job is a specific query against a backend. For example, we convert a Sourcegraph query into a Zoekt query for our indexed search backend. The comments in the job creation function are worth reading for more details. Additionally, you can experiment with the debug command at ./dev/internal/cmd/search-plan to understand the jobs that are generated.
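To make the idea of a plan of concurrent jobs concrete, here is a minimal sketch in Go. It is not the frontend's actual job code: the Job interface, staticJob, RunPlan and Collect are hypothetical names chosen for illustration, and the backends are replaced by stand-ins that emit fixed results.

```go
package main

import (
	"context"
	"sync"
)

// Job is a hypothetical interface: one query against one backend.
// Results are streamed through send rather than returned in bulk.
type Job interface {
	Name() string
	Run(ctx context.Context, send func(result string)) error
}

// staticJob is a stand-in backend that streams a fixed list of results.
type staticJob struct {
	name    string
	results []string
}

func (j staticJob) Name() string { return j.name }

func (j staticJob) Run(ctx context.Context, send func(string)) error {
	for _, r := range j.results {
		// Stop early if the caller cancelled the search.
		if err := ctx.Err(); err != nil {
			return err
		}
		send(r)
	}
	return nil
}

// RunPlan runs every job in its own goroutine, serializes calls to send,
// waits for all jobs to finish, and returns the first error recorded.
func RunPlan(ctx context.Context, jobs []Job, send func(string)) error {
	var (
		wg       sync.WaitGroup
		sendMu   sync.Mutex
		errOnce  sync.Once
		firstErr error
	)
	safeSend := func(r string) {
		sendMu.Lock()
		defer sendMu.Unlock()
		send(r)
	}
	for _, j := range jobs {
		wg.Add(1)
		go func(j Job) {
			defer wg.Done()
			if err := j.Run(ctx, safeSend); err != nil {
				errOnce.Do(func() { firstErr = err })
			}
		}(j)
	}
	wg.Wait()
	return firstErr
}

// Collect is a convenience wrapper that gathers streamed results in a slice.
// The order of results across jobs is not deterministic.
func Collect(jobs []Job) ([]string, error) {
	var out []string
	err := RunPlan(context.Background(), jobs, func(r string) {
		out = append(out, r)
	})
	return out, err
}
```

Calling Collect with two staticJob values, one standing in for the indexed backend and one for the unindexed backend, returns the union of their results. The callback-based send mirrors the streaming design described above: results reach the client as soon as any job produces them.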

Most Sourcegraph queries can be translated directly into Zoekt queries without consulting our database of repositories. However, not all repositories or revisions are indexed, so we need to work out what isn't indexed by Zoekt and run unindexed queries for it against Searcher. To do this, the frontend consults an in-memory cache to determine which repository@revision combinations are indexed by Zoekt. See RepoSubsetTextSearch for how Searcher is queried.
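The split between indexed and unindexed work can be sketched as a simple partition. This is an illustration only: RepoRev, IndexedSet and Partition are hypothetical names, and the real in-memory cache has a different shape.

```go
package main

// RepoRev is one repository@revision combination requested by a query.
type RepoRev struct {
	Repo string
	Rev  string
}

// IndexedSet is a hypothetical stand-in for the in-memory cache of what
// Zoekt has indexed: repository name -> set of indexed revisions.
type IndexedSet map[string]map[string]bool

// Partition splits the requested repository@revision pairs into those that
// can be served by Zoekt and those that must go to Searcher.
func Partition(requested []RepoRev, indexed IndexedSet) (zoekt, searcher []RepoRev) {
	for _, rr := range requested {
		// Looking up a missing repo yields a nil map, and indexing a nil
		// map yields false, so unknown repos fall through to Searcher.
		if indexed[rr.Repo][rr.Rev] {
			zoekt = append(zoekt, rr)
		} else {
			searcher = append(searcher, rr)
		}
	}
	return zoekt, searcher
}
```

With only github.com/a/b@HEAD in the cache, a request for github.com/a/b@HEAD, github.com/a/b@dev and github.com/c/d@HEAD sends the first pair to Zoekt and the other two to Searcher.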

zoekt-webserver serves search requests by iterating through matches in the index. It watches the index directory and loads/unloads index files as they come and go.

To decide what to index, zoekt-sourcegraph-indexserver sends an HTTP GET request to the frontend internal API at most once per minute to fetch the list of repository names to index. For each repository, the indexserver compares what Sourcegraph wants indexed (commit, configuration, etc.) to what is already indexed on disk and starts an index job for anything that is missing.
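The comparison between the desired state and the on-disk state might look like the following sketch. IndexState and MissingIndexJobs are hypothetical names, and reducing the configuration to a single fingerprint string is an assumption made to keep the example small.

```go
package main

import "sort"

// IndexState is a hypothetical summary of an index for one repository.
// Two states are equal only if both the commit and the configuration match.
type IndexState struct {
	Commit     string // commit that is (or should be) indexed
	ConfigHash string // assumed fingerprint of the index configuration
}

// MissingIndexJobs returns the repositories whose on-disk index is absent
// or differs from what is wanted. Names are sorted for a stable order.
func MissingIndexJobs(want, have map[string]IndexState) []string {
	var jobs []string
	for repo, w := range want {
		if h, ok := have[repo]; !ok || h != w {
			jobs = append(jobs, repo)
		}
	}
	sort.Strings(jobs)
	return jobs
}
```

A repository that is already indexed at the wanted commit with the wanted configuration produces no job; a repository with a stale commit, a changed configuration, or no index at all does.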

What we index in a repository is affected by admin configuration (branches, large file allow list). For each repository, the indexserver asks the frontend for configuration. It uses git shallow clones via gitserver to fetch the contents of the branches to index. It then calls out to zoekt-git-index, which creates one or more shards containing the indexes used by zoekt-webserver.

Searcher is a horizontally scalable stateless service that performs non-indexed code search. Each request is a search on a single repository (the frontend searches multiple repositories by sending one concurrent request per repository). To serve a search request, it first fetches a zip archive of the repo at the desired commit from gitserver and then iterates through the files in the archive to perform the actual search.
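As a rough illustration of the archive-scanning step, the sketch below searches every file in a zip archive for a literal substring using only the Go standard library. It is not Searcher's implementation: SearchArchive and Match are hypothetical names, fakeFetchArchive stands in for the request to gitserver, and real searches also handle regular expressions, binary files and result limits.

```go
package main

import (
	"archive/zip"
	"bufio"
	"bytes"
	"sort"
	"strings"
)

// Match is one matching line in one file of the archive.
type Match struct {
	Path string
	Line int // 1-based line number
	Text string
}

// fakeFetchArchive stands in for fetching a zip of a repo at a commit from
// gitserver. It builds an in-memory zip from a path -> content map.
func fakeFetchArchive(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic entry order
	for _, p := range paths {
		w, err := zw.Create(p)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(files[p])); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// SearchArchive iterates through the files in the archive and records every
// line that contains needle. Lines longer than bufio.Scanner's default
// buffer make the scan fail with an error.
func SearchArchive(archive []byte, needle string) ([]Match, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	var matches []Match
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		sc := bufio.NewScanner(rc)
		for n := 1; sc.Scan(); n++ {
			if strings.Contains(sc.Text(), needle) {
				matches = append(matches, Match{Path: f.Name, Line: n, Text: sc.Text()})
			}
		}
		scanErr := sc.Err()
		rc.Close()
		if scanErr != nil {
			return nil, scanErr
		}
	}
	return matches, nil
}
```

Because each call works on a single archive and keeps no state between calls, many such requests can run in parallel, which matches the one-request-per-repository model described above.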

Searcher will offload work to Zoekt to speed up response times. It asks gitserver for a diff of what has changed between the indexed commit and the commit in the current request. Using that information, it only needs to search the subset of files that changed; the rest can be searched by Zoekt.
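One way to picture how the diff divides the work is the sketch below. The types and the PlanHybrid function are hypothetical, and the handling of each status is an assumption about how such a split could work: added and modified files must be searched directly, while modified and deleted paths must be ignored in the index because its copy of them is stale.

```go
package main

// Status is the kind of change a diff reports for a path.
type Status byte

const (
	Added    Status = 'A'
	Modified Status = 'M'
	Deleted  Status = 'D'
)

// Change is one entry of a diff between the indexed commit and the
// requested commit.
type Change struct {
	Path   string
	Status Status
}

// PlanHybrid splits a diff into the paths Searcher has to search itself and
// the paths whose indexed results have to be ignored. Every path that does
// not appear in the diff is unchanged and can be served from the index.
func PlanHybrid(changes []Change) (searchUnindexed, ignoreInIndex []string) {
	for _, c := range changes {
		switch c.Status {
		case Added:
			// Not in the index at all: search it directly.
			searchUnindexed = append(searchUnindexed, c.Path)
		case Modified:
			// Indexed copy is stale: ignore it and search the new content.
			searchUnindexed = append(searchUnindexed, c.Path)
			ignoreInIndex = append(ignoreInIndex, c.Path)
		case Deleted:
			// Exists only in the index: its matches must be dropped.
			ignoreInIndex = append(ignoreInIndex, c.Path)
		}
	}
	return searchUnindexed, ignoreInIndex
}
```

When the diff is small relative to the repository, almost all of the search is answered from the index and only a handful of files are scanned directly.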