For BOSH instance groups, one can increase the instance count, which results in a k8s statefulset with that number set as replica count. There is no guarantee that pod replicas in a statefulset will be started on different Kubernetes nodes, but affinity can be used to control pod placement.
Even more redundancy can be achieved by using multiple availability zones, as described in the qsts controller docs.
In short, multi-AZ uses labeled nodes and spawns as many pods as the replica count in each AZ. Quarks creates a separate statefulset in each AZ.
For things to work correctly across versions and AZs, we need ClusterIP Services that select for Instance Group (from Services And DNS Addresses).
The services select pods via the pod-ordinal label.
Service selector example for “nats-0” service:
Quarks also generates an additional `nats` service for all of them.
When the pod is created, the Quarks statefulset pod mutator sets the pod-ordinal label to the numeric suffix of the pod’s name.
The pod ordinal is later passed to BOSH job template rendering.
It’s always 0 for errands.
A new statefulset in another AZ will start again with pod-ordinal 0.
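As an illustration of how the ordinal can be read from a pod name, here is a minimal sketch in Go. The helper `podOrdinal` is hypothetical and is not the Quarks pod mutator itself; it only assumes that the ordinal is the number after the last dash in the pod name, as statefulsets name their pods.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// podOrdinal returns the numeric suffix after the last "-" in a pod name.
// Hypothetical helper for illustration, not taken from the Quarks code base.
func podOrdinal(podName string) (int, error) {
	i := strings.LastIndex(podName, "-")
	if i < 0 || i == len(podName)-1 {
		return 0, fmt.Errorf("pod name %q has no ordinal suffix", podName)
	}
	n, err := strconv.Atoi(podName[i+1:])
	if err != nil {
		return 0, fmt.Errorf("pod name %q has a non-numeric suffix: %w", podName, err)
	}
	return n, nil
}

func main() {
	// The pod names below are examples; the last one assumes an AZ suffix in the name.
	for _, name := range []string{"nats-0", "nats-1", "nats-z1-0"} {
		ordinal, err := podOrdinal(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// "pod-ordinal" is the label key described in the text above.
		fmt.Printf("%s -> pod-ordinal=%d\n", name, ordinal)
	}
}
```

Note that a pod in a second statefulset, such as the assumed `nats-z1-0`, yields ordinal 0 again, which matches the statement that each AZ restarts its ordinals at 0.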
Zone Index (az-index)
Each statefulset belongs to one AZ.
The Quarks statefulset (qsts) reconciler sets the az-index label when creating the statefulsets.
It’s 0 if the instance group does not have any AZs. Otherwise it starts at zero and increments by one for each AZ.
Example: Given the zones “z1” and “z2”, Quarks will use the zone indexes “z0” and “z1” in resource names. The az-index labels will contain “0” and “1”.
Note that the az-index pod label starts at 0, while the AZ_INDEX env var on containers starts at 1.
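The three numbering schemes are easy to mix up, so the following Go sketch lists them side by side for the example zones. The type `zoneInfo` and the function `zoneInfos` are made up for this illustration; only the values (resource name suffix “z0”, label “0”, env var 1) come from the text above.

```go
package main

import "fmt"

// zoneInfo collects the per-AZ values described in the text.
// The struct and its field names are illustrative assumptions.
type zoneInfo struct {
	Zone         string // zone name from the manifest, e.g. "z1"
	NameSuffix   string // zone index used in resource names, e.g. "z0"
	AZIndexLabel string // value of the az-index pod label, starts at "0"
	AZIndexEnv   int    // value of the AZ_INDEX env var, starts at 1
}

// zoneInfos derives the values for each configured zone, in manifest order.
func zoneInfos(zones []string) []zoneInfo {
	infos := make([]zoneInfo, 0, len(zones))
	for i, z := range zones {
		infos = append(infos, zoneInfo{
			Zone:         z,
			NameSuffix:   fmt.Sprintf("z%d", i),
			AZIndexLabel: fmt.Sprintf("%d", i),
			AZIndexEnv:   i + 1,
		})
	}
	return infos
}

func main() {
	for _, info := range zoneInfos([]string{"z1", "z2"}) {
		fmt.Printf("zone=%s suffix=%s az-index=%s AZ_INDEX=%d\n",
			info.Zone, info.NameSuffix, info.AZIndexLabel, info.AZIndexEnv)
	}
}
```

The sketch only covers instance groups that have AZs configured; what AZ_INDEX contains without AZs is not stated here and is left out on purpose.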
The replica count is initially set to the Instances number from the instance group manifest.
The QuarksStatefulSet reconciler might also overwrite the replica count env variable if injectReplicaEnv is true.
The template-render sub-command will increase the replica count to match the given pod-ordinal, if necessary.
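A minimal Go sketch of that last adjustment follows. It assumes that “match” means the replica count must be large enough to include the given ordinal, that is, at least pod-ordinal plus one. The function name `effectiveReplicas` is hypothetical and the exact rule in template-render may differ.

```go
package main

import "fmt"

// effectiveReplicas returns the replica count used for rendering.
// Assumption: a pod with ordinal N implies at least N+1 replicas.
// Hypothetical helper, not the template-render implementation.
func effectiveReplicas(instances, podOrdinal int) int {
	if podOrdinal >= instances {
		return podOrdinal + 1
	}
	return instances
}

func main() {
	// Manifest says 3 instances; pod ordinal 1 fits, ordinal 5 does not.
	fmt.Println(effectiveReplicas(3, 1))
	fmt.Println(effectiveReplicas(3, 5))
}
```

Growing the count this way keeps rendering from failing for a pod whose ordinal lies beyond the number of instances in the manifest, for example while a scale-down is in progress.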
BOSH Job Template Rendering
This is how BOSH manifest variables are translated into job template rendering variables:
- Quarks operator builds ig manifests, using the instance-group sub-command
- Quarks operator creates resources, like qsts, from the manifest
- Quarks statefulset manages statefulsets and adapts values when reconciling
- sts starts pods
- Quarks sts pod mutator adapts values
- pod labels are mounted as env variables
- environment variables are used as args when calling the template-render sub-command in an init container
The instance-group and the template-render sub-commands need to build an array of all possible BOSH job spec properties for every instance of the instance group.
Template rendering computes a ‘spec index’ from az-index and pod-ordinal to find the matching spec.
spec.bootstrap: True if this instance is the first instance of its group. (BOSH jobs)
The pod that starts first in a statefulset must have the bootstrap flag. It’s used to initialize databases and such.
After the initial deployment, the last pod has the bootstrap flag. If the pod is restarted, the bootstrap flag stays the same, since the pod ordinal doesn’t change in a statefulset.
Bootstrap is only run once, regardless of the number of AZs.
spec.index: Instance index. Use spec.bootstrap to determine the first instead of checking whether the index is 0. Additionally, there is no guarantee that instances will be numbered consecutively, so that there are no gaps between different indices. (BOSH jobs)
In the past, using the replica count led to unnecessary restarts, so a very large value is used instead:
// azindex  podOrdinal  specIndex
// 0        0           0
// 0        1           1
// 1        0           0
// 1        1           1
// 2        0           10000
// 2        1           10001
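The table is consistent with a simple rule, sketched below in Go. The formula is inferred from the six rows and is not copied from the Quarks source: it assumes the azindex column is 1-based (with 0 meaning “no AZs”) and that the “very large value” is 10000. The names `specIndex` and `azStride` are chosen for this example.

```go
package main

import "fmt"

// azStride is the large, fixed multiplier per AZ, as seen in the table.
// A fixed value keeps spec indexes stable when the replica count changes.
const azStride = 10000

// specIndex maps (azIndex, podOrdinal) to the index into the array of specs.
// Assumption: azIndex 0 means no AZs, azIndex 1 is the first AZ.
func specIndex(azIndex, podOrdinal int) int {
	if azIndex < 1 {
		return podOrdinal
	}
	return (azIndex-1)*azStride + podOrdinal
}

func main() {
	fmt.Println("azindex podOrdinal specIndex")
	for _, row := range [][2]int{{0, 0}, {0, 1}, {1, 0}, {1, 1}, {2, 0}, {2, 1}} {
		fmt.Printf("%-7d %-10d %d\n", row[0], row[1], specIndex(row[0], row[1]))
	}
}
```

With a multiplier derived from the replica count, scaling an instance group would shift the spec index of every pod in the second and later AZs; a constant stride avoids that, at the cost of gaps between indexes.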
spec.address: Default network address (IPv4, IPv6 or DNS record) for the instance. (BOSH jobs)
In Quarks, the spec.address should always be the advertisable address of the pod.
The address is the name of the pod, which already includes the pod-ordinal and optionally the az-index.
Some BOSH jobs, like the NATS release, use this to find their local interface. This works since spec.address matches the hostname entry in /etc/hosts and resolves to the local IP address.
Other Usage of Pod Ordinal
The pod ordinal is also available to:
- bosh-pre-start Init Containers
- bpm-pre-start Init Containers
- bpm Process Container
Removes startup-ordinal and just uses pod-ordinal.
Introduces startup-ordinal to fix bugs from 6.1. Keeps workaround.
Open problems with updating kubecf
- nats release binding to spec.address
- diego-cell rep-rep, too?
Had a workaround for HA.