Sourcegraph Architecture Overview
This is a high-level overview of the Sourcegraph architecture, intended to show how our services fit together.
Here are the services that compose Sourcegraph.
Application data is stored in our PostgreSQL database.
Session data is stored in Redis.
Typically there are multiple replicas running in production to scale with load.
frontend tends to use a large amount of memory. For example, our search architecture performs a scatter and gather across the search backends in the frontend. Gathering the results can use a lot of memory, even though the final result set returned to the user is much smaller. There are several similar cases, since the frontend has a monolithic architecture. We also haven’t optimized for memory usage, because it hasn’t caused problems in production: we can simply scale the frontend out.
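The scatter and gather pattern, and why it costs memory, can be illustrated with a minimal sketch. This is not the frontend's actual code: `scatter_gather`, `search_one`, and the `(repo, commit)` target tuples are hypothetical names chosen for illustration, and a thread pool stands in for calls to real search backends.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Target = Tuple[str, str]  # hypothetical (repo, commit) pair


def scatter_gather(
    targets: Sequence[Target],
    search_one: Callable[[Target], List[str]],
    limit: int,
    max_workers: int = 4,
) -> List[str]:
    """Fan a search out over all targets, then merge and truncate."""
    # Scatter: run one backend search per target concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_target = list(pool.map(search_one, targets))

    # Gather: every backend's full result list is held in memory here,
    # even though only `limit` matches are returned to the caller.
    gathered = [match for matches in per_target for match in matches]
    return gathered[:limit]
```

Note that the peak memory is proportional to the total number of matches across all backends, not to `limit`, which mirrors the behavior described above.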
Proxies all requests to github.com to keep track of rate limits and prevent triggering abuse mechanisms.
There is only one replica running in production. However, we can run multiple replicas to increase our rate limits (the rate limit is per IP).
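As a rough illustration of what keeping track of rate limits can involve, the sketch below records the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that the GitHub REST API returns. The `RateLimitMonitor` class and its methods are hypothetical and are not taken from the proxy's implementation; headers are assumed to arrive as a plain dict with exactly these key spellings.

```python
from typing import Mapping, Optional


class RateLimitMonitor:
    """Hypothetical tracker for GitHub rate limit response headers."""

    def __init__(self) -> None:
        self.remaining: Optional[int] = None  # requests left in the window
        self.reset_at: Optional[int] = None   # window reset, epoch seconds

    def update(self, headers: Mapping[str, str]) -> None:
        """Record the rate limit state reported by one response."""
        remaining = headers.get("X-RateLimit-Remaining")
        reset = headers.get("X-RateLimit-Reset")
        if remaining is not None:
            self.remaining = int(remaining)
        if reset is not None:
            self.reset_at = int(reset)

    def seconds_until_allowed(self, now: float) -> float:
        """Return how long to wait before the next request (0 if none)."""
        if self.remaining is None or self.remaining > 0:
            return 0.0
        if self.reset_at is None:
            return 0.0
        return max(0.0, self.reset_at - now)
```

Because the limit is per IP, each replica would need its own monitor in this sketch.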
Mirrors repositories from their code host. All other Sourcegraph services talk to gitserver when they need data from git. Requests for fetch operations, however, should go through repo-updater.
gitserver’s memory usage consists of short lived git subprocesses.
This is an IO- and compute-heavy service, since most Sourcegraph requests trigger one or more git commands. As such, requests for a given repo are sharded to a specific replica, which allows us to scale the service out horizontally.
The service is stateful (maintaining git clones). However, it only contains data mirrored from upstream code hosts.
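One simple way to send all requests for a repo to the same replica is to hash the repo name. The sketch below assumes this hash-modulo scheme purely for illustration; the text does not specify how gitserver actually assigns repos, and `replica_for_repo` and the replica names are hypothetical.

```python
import hashlib
from typing import Sequence


def replica_for_repo(repo: str, replicas: Sequence[str]) -> str:
    """Map a repo name to one replica, deterministically (assumed scheme)."""
    if not replicas:
        raise ValueError("no replicas configured")
    # A stable hash is needed: Python's built-in hash() is randomized
    # per process, so it would not agree across callers.
    digest = hashlib.sha256(repo.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```

A usage example: `replica_for_repo("github.com/gorilla/mux", ["gitserver-0", "gitserver-1"])` always returns the same entry for the same inputs. With plain modulo hashing, changing the number of replicas remaps most repos, which matters for a stateful service holding clones.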
Sourcegraph extensions add features to Sourcegraph, including language support. Many extensions rely, in turn, on language servers (implementing the Language Server Protocol) to provide code intelligence (hover tooltips, jump to definition, find references).
Periodically runs saved searches and sends notification emails. Only one replica should be running.
Repo-updater (which may be renamed, since it does more than update repos) tracks the state of repos and is responsible for automatically scheduling updates (“git fetch” runs) using gitserver. Other services that want updates or fetches should ask repo-updater rather than calling gitserver directly, so that repo-updater can take their requests into account. Only one replica should be running.
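To show why routing update requests through a single scheduler is useful, here is a minimal sketch of a due-time queue. It is an assumption-laden illustration, not repo-updater's real design: `UpdateScheduler`, `enqueue`, `next_due`, and the fixed `interval` are all hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple


class UpdateScheduler:
    """Hypothetical scheduler: periodic updates plus on-demand requests."""

    def __init__(self, interval: float) -> None:
        self.interval = interval                   # seconds between updates
        self._heap: List[Tuple[float, str]] = []   # (due time, repo)
        self._due: Dict[str, float] = {}           # repo -> current due time

    def enqueue(self, repo: str, now: float) -> None:
        """Ask for `repo` to be updated as soon as possible."""
        self._due[repo] = now
        heapq.heappush(self._heap, (now, repo))

    def next_due(self, now: float) -> Optional[str]:
        """Return one repo whose update is due, or None if nothing is due."""
        while self._heap and self._heap[0][0] <= now:
            due, repo = heapq.heappop(self._heap)
            if self._due.get(repo) != due:
                continue  # stale entry superseded by a later enqueue
            # Reschedule the periodic update before handing the repo out.
            self._due[repo] = now + self.interval
            heapq.heappush(self._heap, (self._due[repo], repo))
            return repo
        return None
```

Because every request passes through `enqueue`, an on-demand fetch resets the repo's periodic timer, so the scheduler does not fetch the same repo again immediately afterwards.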
Provides on-demand search for repositories. It scans through a git archive fetched from gitserver to find results.
This service should be scaled up as the number of concurrent on-demand searches grows. For a search, the frontend scatters the search for each repo@commit across the replicas and then gathers the results. Like gitserver, this is an IO- and compute-bound service. However, its state is a cache, which can be lost at any time.
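Scanning an archive for matches can be sketched as follows. This is only an illustration of the idea: it assumes the archive is a tar file held in memory and uses a hypothetical `search_archive` function; the real service's archive format, caching, and matching logic are not described here.

```python
import io
import re
import tarfile
from typing import List, Tuple

Match = Tuple[str, int, str]  # (file path, 1-based line number, line text)


def search_archive(archive_bytes: bytes, pattern: str) -> List[Match]:
    """Return every line in the tar archive that matches `pattern`."""
    regex = re.compile(pattern)
    matches: List[Match] = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and so on
            handle = tar.extractfile(member)
            if handle is None:
                continue
            text = handle.read().decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    matches.append((member.name, lineno, line))
    return matches
```

Nothing in this sketch needs to persist between searches, which is consistent with the service's state being a disposable cache.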
Provides search results for repositories that have been indexed.
This service can only have one replica. Large customers typically provision a large node for it, since it is memory and CPU heavy. Note: We could shard across multiple replicas to scale out. However, we haven’t had a customer where this is necessary, so we haven’t written the code for it yet.
Indexes symbols in repositories using Ctags. Similar in architecture to searcher, except over ctags output.
Syntect is a Rust service that is responsible for syntax highlighting.
Horizontally scalable, but typically only one replica is necessary.
We publish browser extensions for Chrome, Firefox, and Safari that provide code intelligence (hover tooltips, jump to definition, find references) when browsing code on code hosts. By default they work for open-source code, but they also work for private code if your company has a Sourcegraph deployment.
They use GraphQL APIs exposed by the frontend to fetch data.
Editor extensions
Our editor extensions currently provide lightweight hooks into Sourcegraph.