Troubleshoot Sourcegraph with Kubernetes

If Sourcegraph with Kubernetes does not start up or shows unexpected behavior, there are a variety of ways you can determine the root cause of the failure.

See our operations guide for more useful commands and operations.

Common errors

Error: Error from server (Forbidden): error when creating "base/frontend/sourcegraph-frontend.Role.yaml": roles.rbac.authorization.k8s.io "sourcegraph-frontend" is forbidden: attempt to grant extra privileges.

The account you are using to apply the Kubernetes configuration doesn't have sufficient permissions to create roles. This can be resolved by binding the cluster-admin role to your user with the following command:

$ kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole cluster-admin \
  --user $YOUR_EMAIL \
  --namespace $YOUR_NAMESPACE

"kubectl get pv" shows no Persistent Volumes, and/or "kubectl get events" shows a Failed to provision volume with StorageClass "sourcegraph" error.

Make sure a storage class named "sourcegraph" exists in your cluster within the same zone.

$ kubectl get storageclass sourcegraph -o=yaml \
  --namespace $YOUR_NAMESPACE
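
If it does not exist, you can create one. A minimal sketch, assuming a GKE cluster (the provisioner and parameters shown are illustrative and will differ on other cloud providers):

# sourcegraph.StorageClass.yaml (illustrative, GKE)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sourcegraph
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
reclaimPolicy: Retain

$ kubectl apply -f sourcegraph.StorageClass.yaml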

Error: error retrieving RESTMappings to prune: invalid resource networking.k8s.io/v1, Kind=Ingress, Namespaced=true: no matches for kind "Ingress" in version "networking.k8s.io/v1".

Run kubectl version to verify the Client Version matches the Server Version.

Run kubectl get ingresses -A to check whether there is more than one ingress for sourcegraph-frontend. You can delete the duplicate with kubectl delete ingress sourcegraph-frontend --namespace $YOUR_NAMESPACE.

Error: error when creating "base/cadvisor/cadvisor.ClusterRoleBinding.yaml": subjects[0].namespace: Required value

Add namespace: default to the base/cadvisor/cadvisor.ClusterRoleBinding.yaml file under subjects.
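
For example, the subjects section should end up looking something like this (the service account name shown is illustrative; keep the name the file already declares, and use your deployment's namespace if it is not default):

# base/cadvisor/cadvisor.ClusterRoleBinding.yaml (excerpt)
subjects:
  - kind: ServiceAccount
    name: cadvisor
    namespace: default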

Multiple pods are stuck in Pending.

Lack of resources could be a contributing factor. Dump the current cluster state and look for error messages. Below is an example of a message that indicates the cluster is currently under-provisioned.

# dump.txt
  "Reason": "FailedScheduling",
  "Message": "0/3 nodes are available: 1 Insufficient memory, 3 Insufficient cpu.",

ImagePullBackOff / 429 Too Many Requests Errors.

This indicates the instance is getting rate-limited by Docker Hub (link), where our images are stored, as unauthenticated users are limited to 100 image pulls within a 6-hour period. Possible solutions include:

  1. Create a Docker Hub account with a higher rate limit
  2. Configure an imagePullSecrets Kubernetes object that contains your Docker Hub credentials (link to tutorial); see the sketch after this list
  3. Add these credentials to the default service account within the same namespace as your Sourcegraph deployment (link to tutorial)
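
For options 2 and 3, a minimal sketch, assuming your Docker Hub credentials are available in environment variables (the secret name docker-hub-credentials is illustrative):

# Create an image pull secret from your Docker Hub credentials
$ kubectl create secret docker-registry docker-hub-credentials \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=$DOCKER_USERNAME \
  --docker-password=$DOCKER_PASSWORD \
  --namespace $YOUR_NAMESPACE

# Attach the secret to the default service account in the same namespace
$ kubectl patch serviceaccount default \
  --namespace $YOUR_NAMESPACE \
  -p '{"imagePullSecrets": [{"name": "docker-hub-credentials"}]}'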

Alternatively, you can wait until the rate limits are reset.

[OPTIONAL] You can also upgrade your account to a Docker Pro or Team subscription with higher rate limits. (See Docker Hub for more information.)

Irrelevant cAdvisor metrics are causing strange alerts and performance issues.

This is most likely due to cAdvisor picking up other metrics from the cluster. A workaround is available: Filtering cAdvisor metrics.

I don't see any metrics on my Grafana Dashboard.

Missing metrics indicate that Sourcegraph is having issues connecting to the Kubernetes API. For instance, running a Sourcegraph instance as non-privileged prevents services from picking up metrics through the Kubernetes API. One potential solution is to grant Prometheus and cAdvisor root access.
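
A minimal sketch of what granting root access can look like, assuming you patch the container securityContext in the cAdvisor DaemonSet (your deployment may use a privileged overlay instead, and the exact fields required can vary by cluster):

# cadvisor DaemonSet container spec (excerpt, illustrative)
securityContext:
  runAsUser: 0      # run as root so node-level metrics can be read
  privileged: true  # some clusters also require privileged mode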

Which metrics are using the most resources?

  1. Access the UI for Prometheus temporarily with port-forward:
    $ kubectl port-forward svc/prometheus 9090:30090
    
  2. Open http://localhost:9090/ in your browser
    $ open http://localhost:9090
    
  3. Run topk(10, count by (__name__)({__name__=~".+"})) to see which metric names have the most time series

You can't access Sourcegraph.

Make sure the namespace of the ingress-controller is ingress-nginx. See the Troubleshooting ingress-nginx docs for more information.
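
For example, you can check which namespace the controller pods are running in (the label shown is the one the upstream ingress-nginx manifests use):

$ kubectl get pods --all-namespaces \
  -l app.kubernetes.io/name=ingress-nginx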

Healthcheck failing with strconv.Atoi: parsing "{$portName}": invalid syntax error

This can occur when the readiness or liveness probe refers to a port that is not defined. Please ensure the port name is consistent with upstream. For example:

ports:
  - containerPort: 3188
    name: minio
...
livenessProbe:
  httpGet:
    path: /minio/health/live
    port: minio   # this port name MUST exist in the same spec

Service mesh

Known issues when using a service mesh (e.g. Istio, Linkerd).

Error message: Git command [git rev-parse HEAD] failed (stderr: ""): strconv.Atoi: parsing "": invalid syntax

This error occurs because Envoy, the proxy used by Istio, drops proxied trailers for requests made over HTTP/1.1 by default. To resolve this issue, enable trailers in your instance following the examples provided for Kubernetes and Kubernetes with Helm.

Symbols sidebar and hovers are not working

In a service mesh like Istio, communication between services is secured using a feature called mutual Transport Layer Security (mTLS). mTLS relies on services communicating with each other using DNS names, rather than IP addresses, to identify the specific services or pods that the communication is intended for.

To illustrate this, consider the following examples of communication flows between the "frontend" component and the "symbols" component:

Example 1: Approved Communication Flow

  1. Frontend sends a request to http://symbols:3184 using the service's DNS name
  2. The Envoy sidecar intercepts the request
  3. Envoy looks up the upstream service using the DNS name "symbols"
  4. Envoy forwards the request to the symbols component

Example 2: Disapproved Communication Flow

  1. Frontend sends a request to http://symbol_pod_ip:3184
  2. The Envoy sidecar intercepts the request
  3. Envoy tries to look up the upstream service using the IP address symbol_pod_ip
  4. Envoy is unable to find the upstream service because it is an IP address, not a DNS name
  5. Envoy will not forward the request to the symbols component

To resolve this issue, set the SYMBOLS_URL environment variable on the frontend to the address of the symbols service and redeploy the frontend:

SYMBOLS_URL=http://symbols:3184

Please make sure the old frontend pods are removed.
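
A minimal sketch of how this can look in the frontend Deployment manifest, assuming the deployment is named sourcegraph-frontend (the container name and file layout are illustrative; adjust to match your manifests):

# sourcegraph-frontend.Deployment.yaml (excerpt)
containers:
  - name: frontend
    env:
      - name: SYMBOLS_URL
        value: http://symbols:3184

Restarting the rollout replaces the old frontend pods:

$ kubectl rollout restart deployment/sourcegraph-frontend \
  --namespace $YOUR_NAMESPACE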

Squirrel.LocalCodeIntel http status 502

The issue described is related to the Code Intel hover feature, where it may get stuck in a loading state or return a 502 error with the message `Squirrel.LocalCodeIntel http status 502`. This is caused by the same issue described in [Symbols sidebar and hovers are not working](#symbols-sidebar-and-hovers-are-not-working). See that section for the solution.

Help request

Still need additional help? Please contact us using one of the methods listed below: