Current limitations of Code Insights

There are a few existing limitations.

If you have strong feedback, please do let us know.

Limitations that are no longer current are documented at the bottom for the benefit of customers who have not yet upgraded.

Performance considerations for a data series running over all repositories

To accurately return historical data for insights running over all of your repositories, the backend service must run a large number of Sourcegraph searches. This means that unlike code insights running over just a few repositories, results are not returned instantly, but more often on the scale of 20-120 minutes, depending on:

  • N: how many repositories you have connected to your instance; in our tests, we used 26,400 repositories
  • q: the performance and resources of your Sourcegraph code insights instance in queries-per-second; in our tests, 7 queries per second was average
  • c: how well we can “compress” repositories so we don’t need to re-run queries every month (e.g., if a repository hasn’t changed in two months); in our tests, c ≈ 2

A very general formula for estimating how long an individual data series (query) will take to run on your instance, in seconds, is N * 1/c * 1/q.

On our test instance, we find a code insight data series takes approximately:

26,400 repositories * 1/2 (compression factor) * 1/7 (queries per second) ≈ 1,886 seconds ≈ 31 minutes
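
The estimate above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name estimate_series_runtime_seconds and its parameter names are illustrative and not part of Sourcegraph, and the inputs are the test-instance values quoted above.

```python
def estimate_series_runtime_seconds(repo_count: int,
                                    compression_factor: float,
                                    queries_per_second: float) -> float:
    """Rough estimate N * 1/c * 1/q, in seconds, for one data series."""
    if compression_factor <= 0 or queries_per_second <= 0:
        raise ValueError("compression_factor and queries_per_second must be positive")
    return repo_count / compression_factor / queries_per_second


# Values from the test instance described above: N = 26,400, c = 2, q = 7.
seconds = estimate_series_runtime_seconds(26_400, 2, 7)
minutes = seconds / 60
print(f"{seconds:.0f} seconds (~{minutes:.0f} minutes)")  # 1886 seconds (~31 minutes)
```

Substituting your own repository count and an observed queries-per-second figure gives a first-order estimate only; the compression factor in particular depends on how often your repositories change.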

The number of insights you have does not affect the overall speed at which they run: it will take the same total time to run all of them whether or not you let each one finish before creating the next one. Insights currently populate in parallel, prioritizing most-recent-in-time datapoints first.

Creating insights over very large repositories (<3.42)

In some cases, depending on the size of the Sourcegraph instance and the size of the repo, you may see odd behavior or timeout errors if you try to create a code insight running over a single large repository. In this case, it’s best to try:

  1. Create the insight, but check the box to “run over all repositories.” (This sends the Insight backfilling jobs to the backend Sourcegraph instance worker which will handle them datapoint-by-datapoint. Running over an individual repository otherwise currently runs the jobs in bulk to generate its live preview.)
  2. After the insight has finished running, filter the insight to the specific repo you originally wanted to use. The filter resolves instantly.

If this does not solve your problem, please reach out directly to your Sourcegraph contact or in your shared Slack channel, as we are currently working on experimental solutions to further improve our handling of large repositories.

Accuracy considerations for an insight query returning a large result set

If you create an insight with a search query whose result set is large enough to exceed the search timeout (generally more than 1,000,000 results), non-historical data points may report undercounted numbers. This behaviour is tracked in this issue. It happens because non-historical data points are recorded with a single global search query, whereas backfilling runs per-repo queries. For a large result set (e.g. a query for test with millions of results), the global query can hit the global search timeout before all results are counted. You can find more information on search timeouts in the docs.

You can check whether this issue may be affecting your query by running it in the Search UI at /search with count:all. If the search takes about 60s (or whatever maximum timeout your instance is configured with) to return its results, it will time out in insights as well. The reported duration may be slightly above or below 60s; for example, you could see 60.02s.

In this case, you may want to try:

  • Using a more granular query
  • Changing your site configuration so that the timeout is increased, provided your instance setup allows it. More information on timeouts.
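
The manual check described above (compare the duration of a count:all search against the timeout) can be sketched as a small helper. This is a hypothetical illustration, not a Sourcegraph API: the function name, the 60-second default, and the 0.5-second tolerance are all assumptions, and you would pass in the duration you observed in the Search UI.

```python
def likely_hits_search_timeout(elapsed_seconds: float,
                               timeout_seconds: float = 60.0,
                               tolerance_seconds: float = 0.5) -> bool:
    """Return True if an observed search duration is at or near the timeout.

    The tolerance is an assumed margin, since reported durations can land
    slightly above or below the configured timeout (e.g. 60.02s).
    """
    return elapsed_seconds >= timeout_seconds - tolerance_seconds


print(likely_hits_search_timeout(60.02))  # True: ran right up to the timeout
print(likely_hits_search_timeout(12.4))   # False: finished well within it
```

If your instance uses a different maximum timeout, pass that value as timeout_seconds instead of relying on the default.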

Feature parity limitations

Features currently available only on insights over all your repositories

  • Filtering insights (available in 3.41+): we do not yet allow filtering for insights that run over explicitly defined lists of repositories, except for “detect and track” insights.

Features currently available only on insights over explicitly defined repository lists

Because these insights need to run dramatically fewer queries than insights over thousands of repositories, you will have access to a number of features not yet supported for insights over all repositories. These are:

  • Live previews: showing the preview of your insight in real time
  • [Released] Dynamic x-axis ranges (available in 3.35+): set a custom amount of historical data you care about
  • [Released] Editing data series queries after creation (available in 3.35+): before 3.35, for insights over all repositories, you must make a new insight if you wish to run a different query
  • [Released] “Diff click” (available in 3.36+): click a datapoint on your insight to open a diff search showing the changes that contributed to the difference between that datapoint and the prior one

Limitations specific to “Detect and track patterns” insights (automatically generated data series)

Please see Current limitations of automatically generated data series.

There are currently a few subtle differences in how code insights and Sourcegraph web app searches handle defaults when searching over all repositories. Refer to Common reasons code insights may not match search results.

Known bugs

Known bugs we plan to fix are tracked in our GitHub repository here.

Older versions’ limitations

Version 3.30 (July 2021) or older

Search-based Code Insights can only run over ~50-70 repositories

Because this version of the prototype runs on frontend API calls to Sourcegraph searches, it may run slowly (or possibly time out) if you’re using it over many repositories or with many data series for each insight.

The max match count is 5,000 matches per repository

The current limit on searching over historical versions of repositories, which is an unindexed search, is 5,000 results per repository. If there are more than 5,000 matches, the search stops and returns a count of 5,000, and the code insight graph will calculate the overall chart using 5,000 as the match count for that repository. (This means if you query over two repositories and one of them hits this limit, the value shown on the graph will be 5,000 + [the match count in the other repository]).

This limit was lifted in the August 2021 release of Sourcegraph 3.31.
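
The effect of the 5,000-match cap in version 3.30 and older can be illustrated with a short sketch. The function name, repository names, and match counts below are made up for illustration; only the cap value comes from the description above.

```python
from typing import Mapping

MAX_MATCHES_PER_REPO = 5_000  # per-repository cap in version 3.30 and older


def charted_match_count(matches_per_repo: Mapping[str, int]) -> int:
    """Sum per-repository match counts, capping each repository at the limit."""
    return sum(min(count, MAX_MATCHES_PER_REPO)
               for count in matches_per_repo.values())


# Hypothetical true match counts for two repositories.
true_counts = {"repo-a": 7_200, "repo-b": 130}
print(charted_match_count(true_counts))  # 5130, although the true total is 7330
```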