How to add monitoring

This guide documents how to add monitoring to Sourcegraph's source code. Sourcegraph employees should also refer to the handbook's monitoring section for Sourcegraph-specific documentation. The developing observability page contains relevant documentation as well.

Metrics

On the service side, metrics should be made available over HTTP for Prometheus to scrape. By default, Prometheus expects metrics to be exported on $SERVICEPORT/metrics. For example, if you run your local Sourcegraph dev server, metrics should be available on http://localhost:$SERVICEPORT/metrics. How this is configured varies across Sourcegraph deployment options; see tracking a new service.

Tracking a new service

In deploy-sourcegraph, Prometheus uses the Kubernetes API to discover endpoints to scrape. Just add the following annotations to your service definition:

metadata:
  annotations:
    prometheus.io/port: "$SERVICEPORT" # replace with the port your service runs on
    sourcegraph.prometheus/scrape: "true"
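
To make the effect of these annotations concrete, here is a simplified Go sketch of how annotation-based discovery could turn them into a scrape URL. It is not Prometheus code: the function scrapeURL, the example address, and the rule that a missing port annotation means "do not scrape" are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net"
)

// scrapeURL mimics, in simplified form, how the annotations above could be
// mapped to a scrape target. Hypothetical helper, not part of Prometheus.
func scrapeURL(host string, annotations map[string]string) (string, bool) {
	if annotations["sourcegraph.prometheus/scrape"] != "true" {
		return "", false // the service has not opted in to scraping
	}
	port := annotations["prometheus.io/port"]
	if port == "" {
		return "", false // assumption: no port annotation means no target
	}
	return "http://" + net.JoinHostPort(host, port) + "/metrics", true
}

func main() {
	url, ok := scrapeURL("10.0.0.7", map[string]string{
		"prometheus.io/port":            "6060",
		"sourcegraph.prometheus/scrape": "true",
	})
	fmt.Println(url, ok)
}
```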

In deploy-sourcegraph-docker, Prometheus relies on targets defined in the prometheus_targets configuration file—you will need to add your service here.

Alerts, dashboards, and documentation

Creating alerts, dashboards, and documentation for monitoring is powered by the Sourcegraph monitoring generator, which requires monitoring to be defined in our monitoring definitions package. The generator provides many features and integrations with the Sourcegraph monitoring ecosystem for free.

This section documents how to develop monitoring definitions for a Sourcegraph service. To get started, you should read:

the Sourcegraph monitoring pillars for some of the principles we try to uphold when developing monitoring
relevant reference documentation for the monitoring generator

Set up an observable

Monitoring is built around "observables": things you wish to observe. The generator API exposes this concept through the Observable type.

You can decide where to put your new observable by looking for an existing dashboard that your information should go in. Think: "when this number shows something bad, which service logs are likely to be most relevant?" If you are just editing an existing observable, it already has a home, so you can skip this step.

Existing dashboards can be viewed by either:

Visiting Grafana on an existing Sourcegraph instance that you have site admin permissions for, e.g. example.sourcegraph.com/-/debug/grafana—see the metrics for site administrators documentation for more details.
Running the monitoring stack locally

Once you have found a home for your observable, open that service's monitoring definition (e.g. monitoring/frontend.go, monitoring/git_server.go) in your editor. Declare your Observable by doing one of the following:

adding it to an existing Row in the file
adding a new Row
adding a new Group entirely

Here's an example Observable that we will use throughout this guide to get you started:

{
  Name:        "some_metric_behaviour",
  Description: "some behaviour of a metric",
}
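
To show how an Observable sits inside a Row and a Group, the sketch below uses simplified stand-in types. They are assumptions for illustration: the real types live in the monitoring definitions package and have more fields, and the names existing_metric and observableNames are hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for the generator's types (assumed shapes, fewer fields).
type Observable struct {
	Name        string
	Description string
}

// Row is a horizontal group of observables on a dashboard.
type Row []Observable

// Group is a titled collection of rows.
type Group struct {
	Title string
	Rows  []Row
}

// observableNames lists every observable in a group, row by row.
func observableNames(g Group) []string {
	var names []string
	for _, row := range g.Rows {
		for _, o := range row {
			names = append(names, o.Name)
		}
	}
	return names
}

func main() {
	g := Group{
		Title: "Search",
		Rows: []Row{
			{ // an existing row: the new observable is added at the end
				{Name: "existing_metric", Description: "an existing metric"},
				{Name: "some_metric_behaviour", Description: "some behaviour of a metric"},
			},
		},
	}
	fmt.Println(observableNames(g))
}
```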

Write a query

Use the Grafana Explore page on a Sourcegraph instance where you have site administrator access (/-/debug/grafana/explore) to start writing your Prometheus query.

{
    Name:        "some_metric_behaviour",
-   Description: "some behaviour of a metric",
+   Description: "some behaviour of a metric over 5m",
+   Query:       `histogram_quantile(0.99, sum by (le)(rate(search_request_duration{status="success"}[5m])))`,
}

Make sure to update your description to reflect the query you end up with where relevant.

Configure panel options

Panel options can be used to customize the visualization of your observable in Grafana. This step is optional, but highly recommended.

There are intentionally few panel options, to keep things simple. The most common use is changing the Grafana display from plain numbers to a unit like seconds:

{
    Name:        "some_metric_behaviour",
    Description: "some behaviour of a metric over 5m",
    Query:       `histogram_quantile(0.99, sum by (le)(rate(search_request_duration{status="success"}[5m])))`,
+   Panel:       monitoring.Panel().LegendFormat("duration").Unit(monitoring.Seconds),
}

The default monitoring.Panel() configures a panel for your observable using recommended defaults, and provides a set of recommended customization options through ObservablePanel.

Additional customizations can be made to your observable's panel using ObservablePanel.With() and ObservablePanelOption.

Add an alert

Alerts can be defined at two levels: warning and critical. They are used to provide Sourcegraph health notifications for site administrators. This step is optional, but highly recommended. If you opt not to include an alert, you must explicitly set NoAlert: true and provide relevant documentation for this observable.

To get started, refer to understanding alerts for what your alert should indicate. Then make a guess about what a good or bad value for your query is—it's OK if this isn't perfect, just do your best. You can then use the ObservableAlertDefinition to add an alert to your Observable, for example:

{
    Name:        "some_metric_behaviour",
    Description: "some behaviour of a metric over 5m",
    Query:       `histogram_quantile(0.99, sum by (le)(rate(search_request_duration{status="success"}[5m])))`,
    Panel:       monitoring.Panel().LegendFormat("duration").Unit(monitoring.Seconds),
+   Warning:     monitoring.Alert().GreaterOrEqual(20),
}

Options like only alerting after a certain duration (.For(time.Duration)) are also available—refer to the monitoring library reference.

Add documentation

It's best if you also add some Markdown documentation with your best guess of what someone might consider doing if they observe the alert firing (again, just your best guess is good enough here):

{
    Name:        "some_metric_behaviour",
    Description: "some behaviour of a metric over 5m",
    Query:       `histogram_quantile(0.99, sum by (le)(rate(search_request_duration{status="success"}[5m])))`,
    Warning:     monitoring.Alert().GreaterOrEqual(20),
    Panel:       monitoring.Panel().LegendFormat("duration").Unit(monitoring.Seconds),
+   PossibleSolutions: `
+       - Look at 'SERVICE' logs for details on the slow search queries.
+   `,
}

If your observable sets NoAlert: true instead, use Interpretation to document how to read its values:

{
    Name:        "some_metric_behaviour",
    Description: "some behaviour of a metric over 5m",
    Query:       `histogram_quantile(0.99, sum by (le)(rate(search_request_duration{status="success"}[5m])))`,
    NoAlert:     true,
    Panel:       monitoring.Panel().LegendFormat("duration").Unit(monitoring.Seconds),
+   Interpretation: `
+       This value might be high under X, Y, and Z conditions.
+   `,
}

Validate your observable

Run the monitoring generator from the root Sourcegraph directory:

RELOAD=false sg run monitoring-generator

This validates your Observable configuration and reports any changes you need to make. If the generator runs successfully, run the monitoring stack locally to check the output and results of your observable by hand.
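
To give a feel for the kind of checks involved, the sketch below validates a simplified stand-in Observable against the requirement described earlier: an observable either defines an alert or sets NoAlert: true with documentation. The types and the exact rules here are assumptions for illustration and do not reproduce the generator's actual validation.

```go
package main

import (
	"errors"
	"fmt"
)

// Alert and Observable are simplified stand-ins (assumed shapes).
type Alert struct{ GreaterOrEqual float64 }

type Observable struct {
	Name, Query    string
	Warning        *Alert
	Critical       *Alert
	NoAlert        bool
	Interpretation string
}

// validate applies a few hypothetical rules modelled on this guide.
func validate(o Observable) error {
	hasAlert := o.Warning != nil || o.Critical != nil
	switch {
	case o.Name == "" || o.Query == "":
		return errors.New("observable needs a Name and a Query")
	case hasAlert && o.NoAlert:
		return errors.New("an alert is defined but NoAlert is set")
	case !hasAlert && !o.NoAlert:
		return errors.New("define an alert or set NoAlert: true")
	case o.NoAlert && o.Interpretation == "":
		return errors.New("observables with NoAlert need an Interpretation")
	}
	return nil
}

func main() {
	ok := Observable{Name: "some_metric_behaviour", Query: "up", Warning: &Alert{GreaterOrEqual: 20}}
	missing := Observable{Name: "some_metric_behaviour", Query: "up"}
	fmt.Println(validate(ok))
	fmt.Println(validate(missing))
}
```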

Once everything looks good, open a pull request with your observable to the main Sourcegraph codebase!

Centralized observability

You can opt in to Sourcegraph Cloud centralized observability's multi-instance overviews dashboard by setting MultiInstance: true on your Observable:

{
    Name:        "some_metric_behaviour",
    Description: "some behaviour of a metric over 5m",
    Query:       `histogram_quantile(0.99, sum by (le)(rate(search_request_duration{status="success"}[5m])))`,
    Panel:       monitoring.Panel().LegendFormat("duration").Unit(monitoring.Seconds),

+   MultiInstance: true,
}

Multi-instance panels are best used on panels with only one or very few time series, since each Cloud instance gets its own, separate time series for the Observable's Query. With hundreds of instances, panels with multiple time series can become unreadable or very slow to load.