This document describes at a high-level how our backend implements search pagination.
Search pagination is an experimental API we added as part of Sourcegraph 3.9 and is currently only used for programmatic consumption of search results, not in the web UI for example.
Read the life of a search query document first, as search pagination effectively acts as a layer on top of our search architecture. We will only talk broadly about the concepts applied to perform pagination, while the previously mentioned document goes into more depth about the actual codepaths we reference here.
Additionally, you should first read and understand how an end-user would make use of our pagination API by reading the search API documentation.
First, we detect a search query as being paginated at the primary GraphQL entrypoint and initialize a pagination struct here: https://sourcegraph.com/github.com/sourcegraph/[email protected]/paginated-search/-/blob/cmd/frontend/graphqlbackend/search.go#L76-92
Next, the actual pagination begins when
paginatedResults is called: https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24%40sg/paginated-search+paginatedResults
For the purposes of clarity in this document, we will use the following terms:
We use the term “search backend” to describe a function which performs a search for a specific result
type:. The function can perform an indexed (e.g. zoekt text search or symbol search) or unindexed (e.g. commit/diff search). The list of these at the time of writing are:
performCodemodfor code modification find/replace (read-only) results.
The cursor is a base64 opaque string from a clients point of view. It contains metadata that is passed back and forth between the client and the server in order for the server to know where the client left off and where the server should begin looking for more results. Its actual definition in code is here.
Our cursors are considered to be:
UserBtries to make use of
Cursor123it may produce a different result set (e.g. due to repository permissions).
Note: “cursor-based pagination” is an established concept and you can find resources describing different cursor-based pagination approaches elsewhere online.
Shared between both paginated and non-paginated search are the various search backends. These backends take a list of repositories to search through and produce a set of results (
searchResultResolver) and a metadata structure about those results (
searchResultsCommon) describing some properties like if any repositories timed out during the search, if we hit a limit during the search, etc.
In the case of non-paginated search, we perform a basic process:
type:, concurrently invoke the search backend to search across all repositories.
If you want a large number of results with the above, you must specify a larger enough
timeout: parameter in your search query or else you may not get back enough. In some cases, it is not possible to specify a large enough
timeout which has a maximum allowed value of 2 minutes.
Paginated search effectively reimplements the same process described above, just with alterations needed to make the process one that is done between the client and server via multiple requests and in a determistic way.
For a paginated request, we perform the following process:
#2 above, while sounding simple in practice, is where the vast majority of the complexity lies:
To address the above, we use a relatively simple solution which is applicable to all search backends today with tuning parameters we can apply to each search backend to handle their different performance characteristics – while still allowing us to make use of search backends in the future with true pagination support.
The repository pagination planner is what wraps existing search backends to provide result-level pagination.
Consider that you have 10 repositories added to Sourcegraph that are searchable by you (in practice, this would be a much larger number like say 10,000 to 80,000):
[A, B, C, D, E, F, G, H, I, J]
Each search backend is capable of taking a subset of this repository list and searching over it for results. The question is, how many do we search at a time?
Ideally, we would search over N repositories determined by:
And this is exactly what
repoPaginationPlan describes: a plan for executing a search function that searches only over a set of repositories (i.e. the search function offers no pagination or result-level pagination capabilities) to ultimately provide result-level pagination. That is, if you have a function which can provide a complete list of results for a given repository, then the planner can be used to implement result-level pagination on top of that function.
To determine how many of the repositories to search, it uses the following tunable factors:
numTotalReposOnSourcegraph() / searchBucketDivisorrepositories at a given time
searchBucketMin=> search no less than this many at a given time
searchBucketMax=> search no more than this many repository at a given time
The repository pagination planner is also responsible for producing a cursor for where a subsequent request could begin searching again in the globally-sorted list of repositories. This is tricky because our client is asking for result-level pagination but our search backends effectively only perform repository-level pagination. For example, consider that the planner searches a batch of repositories and finds the following results for repositories (A-Z) and results (0-9):
[A1, A2, A3, B1, B2, B3, C1, C2, C3]
If the client makes an initial paginated request and asks for
first:5 then we need to send them results
B2 and a cursor that indicates we will continue searching again at
B3 so they get the last result in repository
For this reason, the cursor that we give to users in responses has both a global repository offset and result offset within the first repository. In the above example, we would return a cursor with
RepositoryOffset: 1 to indicate that we should resume searching in repository
ResultOffset: 2 to indicate that we should resume searching starting at result 3.