Life of a search query

This document describes how our backend systems serve search results to clients. There are multiple kinds of searches (e.g. text, repository, file, symbol, diff, commit), but this document will focus on text searches.

Clients

There are a few ways to perform a search with Sourcegraph:

  1. Typing a query into the search bar of the Sourcegraph web application.
  2. Typing a query into your browser's location bar after configuring a browser search engine shortcut.
  3. Using the src CLI command.

In all cases, clients call the search query in our GraphQL API, which is exposed by our frontend service.
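As a rough illustration of what such a client call might carry, the sketch below builds a GraphQL request body for a search. The type graphQLRequest, the function newSearchRequest, and the selected field (matchCount) are assumptions made for illustration; consult the GraphQL schema for the fields that are actually available.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// graphQLRequest is the conventional JSON envelope for a GraphQL call.
// The type name is hypothetical and exists only for this sketch.
type graphQLRequest struct {
	Query     string            `json:"query"`
	Variables map[string]string `json:"variables"`
}

// searchDocument selects a single field to keep the sketch small.
// The selection set is an assumption, not a description of the full schema.
const searchDocument = `query ($query: String!) {
  search(query: $query) {
    results { matchCount }
  }
}`

// newSearchRequest wraps a user-supplied search string in a GraphQL request.
func newSearchRequest(userQuery string) graphQLRequest {
	return graphQLRequest{
		Query:     searchDocument,
		Variables: map[string]string{"query": userQuery},
	}
}

func main() {
	body, err := json.MarshalIndent(newSearchRequest("repo:^github\\.com/a/ foo"), "", "  ")
	if err != nil {
		panic(err)
	}
	// A real client would POST this body to the frontend's GraphQL endpoint.
	fmt.Println(string(body))
}
```

Passing the search string as a GraphQL variable, rather than splicing it into the document, avoids having to escape quotes and backslashes in the user's query.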

Frontend

The frontend implements the GraphQL search resolver here.

First, the frontend resolves which repositories need to be searched. It parses the query for any repository filters and then queries the database for the list of repositories that match those filters. If no filters are provided, then all repositories are searched, as long as the number of repositories doesn't exceed the configured limit. Private instances default to an unlimited number of repositories, but sourcegraph.com has a smaller configured limit ("maxReposToSearch": 400 at the time of writing, but you can check the site config for the current value) because it isn't cost-effective for us to search/index all open source code on GitHub.
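The resolution step can be sketched roughly as follows. This is a simplified in-memory model, not the frontend's actual code: resolveRepos and errTooManyRepos are hypothetical names, the repository list stands in for a database query, and returning an error when the limit is exceeded is an assumption about the behavior.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// errTooManyRepos is a hypothetical error for a resolved set that exceeds the limit.
var errTooManyRepos = errors.New("too many repositories to search")

// resolveRepos returns the names in all that match every regular expression in
// filters. With no filters, every repository matches. A limit of 0 means
// unlimited (the default described for private instances).
func resolveRepos(all, filters []string, limit int) ([]string, error) {
	patterns := make([]*regexp.Regexp, 0, len(filters))
	for _, f := range filters {
		re, err := regexp.Compile(f)
		if err != nil {
			return nil, fmt.Errorf("invalid repo filter %q: %w", f, err)
		}
		patterns = append(patterns, re)
	}

	var matched []string
	for _, name := range all {
		ok := true
		for _, re := range patterns {
			if !re.MatchString(name) {
				ok = false
				break
			}
		}
		if ok {
			matched = append(matched, name)
		}
	}

	// Assumption: exceeding the limit is reported as an error.
	if limit > 0 && len(matched) > limit {
		return nil, errTooManyRepos
	}
	return matched, nil
}

func main() {
	all := []string{"github.com/a/one", "github.com/a/two", "github.com/b/three"}
	repos, err := resolveRepos(all, []string{`^github\.com/a/`}, 400)
	fmt.Println(repos, err)
}
```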

Next, the frontend determines which repository@revision combinations are indexed by zoekt by consulting an in-memory cache that is kept up-to-date with regular asynchronous polling. It concurrently queries zoekt for indexed repositories and queries searcher for non-indexed repositories.
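A minimal sketch of this split-and-fan-out step is shown below. All names here (repoRev, indexedCache, backend, fakeBackend, search) are hypothetical; the Set method stands in for the asynchronous polling that refreshes the cache, and the two backends are stubs rather than real zoekt or searcher clients.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// repoRev is a repository@revision pair.
type repoRev struct{ Repo, Rev string }

// indexedCache is a hypothetical in-memory record of what zoekt has indexed.
type indexedCache struct {
	mu  sync.RWMutex
	set map[repoRev]bool
}

// Set replaces the cache contents; a background poller would call this.
func (c *indexedCache) Set(indexed []repoRev) {
	m := make(map[repoRev]bool, len(indexed))
	for _, rr := range indexed {
		m[rr] = true
	}
	c.mu.Lock()
	c.set = m
	c.mu.Unlock()
}

// partition splits targets into indexed and non-indexed groups.
func (c *indexedCache) partition(targets []repoRev) (indexed, unindexed []repoRev) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, rr := range targets {
		if c.set[rr] {
			indexed = append(indexed, rr)
		} else {
			unindexed = append(unindexed, rr)
		}
	}
	return indexed, unindexed
}

// backend stands in for a call to zoekt or searcher.
type backend func(targets []repoRev) []string

// fakeBackend returns a stub that labels each target with the backend name.
func fakeBackend(name string) backend {
	return func(targets []repoRev) []string {
		var out []string
		for _, rr := range targets {
			out = append(out, name+":"+rr.Repo+"@"+rr.Rev)
		}
		return out
	}
}

// search queries both backends concurrently and merges their results.
func search(c *indexedCache, targets []repoRev, zoekt, searcher backend) []string {
	indexed, unindexed := c.partition(targets)
	var (
		wg   sync.WaitGroup
		a, b []string
	)
	wg.Add(2)
	go func() { defer wg.Done(); a = zoekt(indexed) }()
	go func() { defer wg.Done(); b = searcher(unindexed) }()
	wg.Wait()
	results := append(a, b...)
	sort.Strings(results) // sorted only to make the output deterministic
	return results
}

func main() {
	c := &indexedCache{}
	c.Set([]repoRev{{"a", "HEAD"}})
	targets := []repoRev{{"a", "HEAD"}, {"b", "HEAD"}}
	fmt.Println(search(c, targets, fakeBackend("zoekt"), fakeBackend("searcher")))
}
```

Reading from a cache instead of asking zoekt on every request keeps the indexed/non-indexed decision off the critical path, at the cost of the cache being slightly stale between polls.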

zoekt-webserver serves search requests by iterating through matches in the index. It watches the index directory and loads/unloads index files as they come and go.

To decide what to index, zoekt-sourcegraph-indexserver sends an HTTP GET request to the frontend internal API at most once per minute to fetch the list of repository names to index. For each repository, the indexserver compares what Sourcegraph wants indexed (commit, configuration, etc.) to what is already indexed on disk and starts an index job for anything that is missing.
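The comparison between the desired state and the on-disk state can be sketched as a simple diff. The indexState type, its fields, and missingIndexes are hypothetical; the real indexserver tracks more than a commit and a configuration fingerprint.

```go
package main

import (
	"fmt"
	"sort"
)

// indexState is a hypothetical summary of what an index was built from.
type indexState struct {
	Commit     string // commit the index covers
	ConfigHash string // fingerprint of the indexing configuration
}

// missingIndexes returns the repositories whose on-disk index is absent or
// differs from what is wanted. Each returned name would get an index job.
func missingIndexes(want, onDisk map[string]indexState) []string {
	var jobs []string
	for repo, w := range want {
		have, ok := onDisk[repo]
		if !ok || have != w {
			jobs = append(jobs, repo)
		}
	}
	sort.Strings(jobs) // map iteration order is random; sort for stable output
	return jobs
}

func main() {
	want := map[string]indexState{
		"a": {"c1", "h1"},
		"b": {"c2", "h1"},
		"c": {"c3", "h1"},
	}
	onDisk := map[string]indexState{
		"a": {"c1", "h1"},  // up to date
		"b": {"old", "h1"}, // stale commit
	}
	fmt.Println(missingIndexes(want, onDisk))
}
```

In this example, "a" is left alone because its commit and configuration match, while "b" (stale commit) and "c" (never indexed) would each get a job.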

What we index in a repository is affected by admin configuration (branches, large file allow list). For each repository, the indexserver asks the frontend for that repository's configuration. It fetches git data by calling another internal frontend API, which redirects to the archive on gitserver. When indexing multiple branches, it relies on git shallow clones instead.

Searcher is a horizontally scalable stateless service that performs non-indexed code search. Each request is a search on a single repository (the frontend searches multiple repositories by sending one concurrent request per repository). To serve a search request, it first fetches a zip archive of the repo at the desired commit from gitserver and then iterates through the files in the archive to perform the actual search.
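The core loop of a non-indexed search, iterating through the files of a zip archive and matching each line, might look like the sketch below. It is a simplified illustration, not searcher's implementation: the names lineMatch, archiveFile, buildArchive, and searchArchive are hypothetical, and the archive is built in memory rather than fetched from gitserver.

```go
package main

import (
	"archive/zip"
	"bufio"
	"bytes"
	"fmt"
	"regexp"
)

// lineMatch records one matching line.
type lineMatch struct {
	Path string
	Line int
	Text string
}

// archiveFile is a name/content pair used to build a test archive.
type archiveFile struct{ Name, Content string }

// buildArchive creates an in-memory zip, standing in for the archive that
// would be fetched for a repository at a given commit.
func buildArchive(files []archiveFile) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range files {
		w, err := zw.Create(f.Name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(f.Content)); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// searchArchive scans every file in the zip line by line for the pattern.
// bufio.Scanner rejects very long lines by default, which is acceptable for a sketch.
func searchArchive(data []byte, pattern string) ([]lineMatch, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var matches []lineMatch
	for _, f := range zr.File {
		if f.FileInfo().IsDir() {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		sc := bufio.NewScanner(rc)
		for n := 1; sc.Scan(); n++ {
			if re.MatchString(sc.Text()) {
				matches = append(matches, lineMatch{f.Name, n, sc.Text()})
			}
		}
		err = sc.Err()
		rc.Close()
		if err != nil {
			return nil, err
		}
	}
	return matches, nil
}

func main() {
	data, err := buildArchive([]archiveFile{
		{"main.go", "package main\n// TODO: fix\n"},
		{"README", "nothing here\n"},
	})
	if err != nil {
		panic(err)
	}
	matches, err := searchArchive(data, `TODO`)
	fmt.Println(matches, err)
}
```

Because each call depends only on the archive bytes and the pattern, a function like this holds no state between requests, which is what allows the service to scale horizontally.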