Sourcegraph monitoring: metrics reference

High-level alerting metrics

Sourcegraph's metrics include a single high-level metric alert_count which indicates the number of level=critical and level=warning alerts each service has fired over time for each Sourcegraph service. This is the same metric presented on the Overview monitoring dashboard:

Overview monitoring dashboard screenshot

`alert_count`

Description: The number of alerts each service has fired and their severity level. The severity levels are defined as follows:

critical: something is definitively wrong with Sourcegraph.
- Examples: Database inaccessible, running out of disk space, running out of memory.
- Suggested action: Page a site administrator to investigate.
warning: something could be wrong with Sourcegraph.
- Examples: High latency, high search timeouts.
- Suggested action: Email a site administrator to investigate and monitor when convenient.

Values:

Although the values of alert_count are floating-point numbers, only their whole numbers have meaning. For example: 0.5 and 0.7 indicate no alerts are firing, while 1.2 indicates exactly one alert is firing and 3.0 indicates exactly three alerts firing.

Labels:

level: either critical or warning, as defined above.
service_name: the name of the service that fired the alert, one of the following constants:
- "frontend"
- "github-proxy"
- "gitserver"
- "lsif-serer"
- "query-runner"
- "replacer"
- "repo-updater"
- "searcher"
- "symbols"
- "zoekt-indexserver"
- "zoekt-webserver"
- "syntect-server"
name: the name of the alert that the service fired (chosen by the service)
description: a human-readable description of the alert
instance: identifies the Kubernetes pod, Docker container, or host machine from which the alert came.

Complete reference

A complete reference of Sourcegraph's vast set of Prometheus metrics is not yet available. If you are interested in this, please reach out by filing an issue or contacting us at [email protected].

On this page:

Sourcegraph monitoring: metrics reference

High-level alerting metrics

alert_count

Complete reference

`alert_count`