Troubleshooting

Troubleshooting notes for Quarks Operator

Things can go wrong. This section provides tips for debugging issues in your cluster and in applications deployed with quarks-operator.

You should also check the known issues for the release you are using.

Also, check out these awesome tools that can help you while debugging your cluster.

Kubernetes events

The quarks-operator streams the results of its actions through logs and Kubernetes events. If you notice issues, or are just unsure about what is happening, first check the events of all namespaces (or of the namespace where you are deploying your BOSH release):

$> kubectl get events -A
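To narrow the output down, you can restrict events to a single namespace, sort them by creation time, or show only warnings. A sketch (the namespace name is a placeholder):

```shell
# Events for one namespace, oldest first; replace "my-namespace"
# with the namespace your BOSH deployment lives in.
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp

# Only warnings, which usually point directly at the failing resource.
kubectl get events -A --field-selector type=Warning
```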

Checking logs

Catching logs from all the quarks-operator components can be challenging; we suggest using stern.

For example, to stream all the logs of all pods in your cluster in your terminal:

$> stern --all-namespaces .

With kubectl, you can check all container logs with:

$> kubectl logs -n namespace -f pod --all-containers
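If a container has already crashed and restarted, the current logs may not show the failure. kubectl can fetch the logs of the previous instance of a container (pod, namespace, and container names here are placeholders):

```shell
# Logs of the previous (crashed) instance of a specific container.
kubectl logs -n my-namespace my-pod -c my-container --previous
```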

Debugging BOSH Releases

The quarks-operator exposes a debug parameter that lets you hook into a BOSH release before it starts. This way you can open a shell in a pod for further debugging. To do so, set quarks.debug: true on the part of the deployment you want to debug, for example in your ConfigMap:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: quarks-gora-manifest
data:
  manifest: |
    ---
    name: quarks-gora-deployment
    releases:
    - name: quarks-gora
        ....
    instance_groups:
    - name: quarks-gora
      jobs:
      - name: quarks-gora
        release: quarks-gora
        properties:
          quarks:
            debug: true
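With debug enabled, the job holds before starting, so you can open a shell in the instance-group pod and investigate by hand. A sketch for the quarks-gora example above; the pod, namespace, and container names are illustrative and depend on how the operator names the resources in your cluster:

```shell
# Open a shell in the (hypothetical) quarks-gora instance-group pod.
kubectl exec -it -n my-namespace quarks-gora-0 -c quarks-gora -- /bin/bash
```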

Checking containers state

We strongly suggest using k9s. It gives you a quick glance at your cluster state and lets you quickly inspect pod container statuses.

In most cases, while debugging a BOSH deployment you will have to check the state of all the containers running in a pod.

You can do that in k9s by navigating to the pod and pressing [Enter]. With plain kubectl, you can describe a resource to get all its events and, in the case of pods, an overview of all the containers belonging to it:

$> kubectl describe pod my-bosh-release

or, if you are checking a QuarksSecret state:

$> kubectl describe qsec my-quarks-secret

My pod looks stuck

This can happen for various reasons; here are a few suggestions:

  • Check whether your pod has a container with the wait-for prefix. This means it is waiting for a specific service, which can be identified by the container name.
  • Check whether your pod depends on a secret that hasn't been generated yet. If the secret is generated by a QuarksSecret, check the events related to the QuarksSecret that should generate it (with kubectl describe ... or kubectl events).
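To see at a glance which containers a stuck pod is waiting on, you can list its container names together with their states. A sketch using JSONPath (pod and namespace names are placeholders):

```shell
# Print each init container and regular container of a pod with its state.
kubectl get pod my-pod -n my-namespace -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'
```

Containers with a wait-for prefix that stay in a waiting state point at the service dependency that is not ready yet.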

Cluster CA

The cf-operator assumes that the cluster root CA is also used for signing CSRs via the certificates.k8s.io API and will embed this CA in the generated certificate secrets. If your cluster is set up to use a different cluster-signing CA the generated certificates will have the wrong CA embedded. See https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/ for more information on cluster trust.
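To check which CA ended up embedded in a generated certificate secret, you can decode the secret and inspect the certificate. A sketch; the secret name and the ca data key are illustrative and depend on your deployment:

```shell
# Decode the embedded CA and print its issuer and subject.
kubectl get secret my-cert-secret -n my-namespace -o jsonpath='{.data.ca}' \
  | base64 -d | openssl x509 -noout -issuer -subject
```

If the issuer does not match your cluster-signing CA, the generated certificates will not be trusted by workloads that only trust that CA.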

Recovering from a crash

If the operator pod crashes from unrecoverable errors, it cannot be restarted in the same namespace before the existing mutating webhook configuration for that namespace is removed. The operator uses mutating webhooks to modify pods on the fly and Kubernetes fails to create pods if the webhook server is unreachable. The webhook configurations are installed cluster wide and don’t belong to a single namespace, just like custom resources.

To remove the webhook configurations for the cf-operator namespace run:

CF_OPERATOR_NAMESPACE=cf-operator
kubectl delete mutatingwebhookconfiguration "cf-operator-hook-$CF_OPERATOR_NAMESPACE"
kubectl delete validatingwebhookconfiguration "cf-operator-hook-$CF_OPERATOR_NAMESPACE"
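Afterwards you can verify that no stale webhook configurations for the operator remain before restarting it:

```shell
# List any remaining webhook configurations belonging to the operator.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
  | grep cf-operator-hook
```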

From Kubernetes 1.15 onwards, it is possible to instead patch the webhook configurations for the cf-operator namespace via:

CF_OPERATOR_NAMESPACE=cf-operator
kubectl patch mutatingwebhookconfigurations "cf-operator-hook-$CF_OPERATOR_NAMESPACE" -p '
webhooks:
- name: mutate-pods.quarks.cloudfoundry.org
  objectSelector:
    matchExpressions:
    - key: name
      operator: NotIn
      values:
      - "cf-operator"
'

See also how to monitor admission webhooks if you are debugging the changes introduced by a Quarks operator webhook.