Design for exponential backoff

Signed-off-by: irbekrm <irbekrm@gmail.com>
This commit is contained in:
irbekrm 2022-01-18 09:09:58 +00:00
parent af47ae4e5b
commit 184a16da10

View File

@ -0,0 +1,231 @@
---
title: Exponential Backoff for Certificate Issuance
authors:
- "@irbekrm"
reviewers:
- @jetstack/team-cert-manager
approvers:
- @jetstack/team-cert-manager
editor: "@irbekrm"
creation-date: 2022-01-18
last-updated: 2022-01-19
status: implementable
---
# Exponential Backoff for Certificate Issuance
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Must Not](#must-not)
- [Proposal](#proposal)
* [API Changes](#api-changes)
* [Examples](#examples)
- [Issuance fails then succeeds](#Issuance-fails-then-succeeds)
- [Manually triggered reissuance succeeds](#Manually-triggered-reissuance-succeeds)
- [Manually triggered reissuance fails](#Manually-triggered-reissuance-fails)
- [Example certificate statuses](#example-certificate-statuses)
* [Upgrading](#upgrading)
* [Test Plan](#test-plan)
* [Assumptions](#assumptions)
* [Feature Gate](#feature-gate)
* [Misc](#misc)
<!-- /toc -->
## Summary
Implement a way to apply exponential backoff when retrying failed certificate
issuances. Issuance in this context refers to the period of time during which
the `Issuing` condition on a `Certificate` is set to true and for which a new
set of issuance-specific resources (`CertificateRequest`s, `Order`s, `Challenge`s etc) is
created.
## Motivation
Currently failed issuances are retried once an hour without a backoff or time limit. This means that 1) continuous failures in large installations can overwhelm external services 2) rate limits can be easily hit in case of longer lasting issuance problems (see [Let'sEncrypt rate limts](https://letsencrypt.org/docs/rate-limits/))
### Goals
- Ensure that retrying failed issuances does not overwhelm external services and is less likely to hit rate limits by adding exponentially increasing delays between issuance retries
- Ensure that when the backoff is being applied, users have a way to find out when the issuance will be next retried and that the backoff mechanism does not introduce extra complexity for debugging and fits in with the already existing 1 hour backoffs
- Ensure that [`cmctl renew`](https://cert-manager.io/docs/usage/cmctl/#renew) can still be used to manually force an immediate issuance attempt, so that in cases where the issuance was failing due to a setup error (i.e DNS setup) and a user believes that they have fixed it, they have a way to verify the fix without waiting up to 32h
### Non-Goals
- Make the backoff period configurable as this would add a lot of extra complexity. For context Kubernetes pod crashloopbackoff period is _not_ configurable (although it is a very demanded feature [k/k#57291](https://github.com/kubernetes/kubernetes/issues/57291))
- Make it possible to reset the backoff period (However, it would be possible to force re-issuance to be retried immediately using `cmctl renew` and, if that succeeded, the backoff would be reset)
### Must Not
- Cause all `Certificate`s whose issuances are currently failing to be re-issued at once after cert-manager controller restart
- Cause all `Certificate`s whose issuances are currently failing to be re-issued at once after upgrading to a cert-manager version that implements exponential backoff
## Proposal
Exponential backoff will be implemented by exponentially increasing the delays between a failed issuance ([`Issuing` condition set to false in `certificates-issuing` controller](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L341)) and a new issuance ([`Issuing` condition set to true in `certificates-trigger` controller](https://github.com/jetstack/cert-manager/blob/d5503c2ed2df272ec1bd94ebd223408fad29df1f/pkg/controller/certificates/trigger/trigger_controller.go#L184)). From a user perspective, this will correspond to the delay between a `CertificateRequest` having failed and new `CertificateRequest`s being created.
A new `IssuanceAttempts` status field will be added to `Certificate` that will be used to record the number of failed issuances for this [revision](https://github.com/jetstack/cert-manager/blob/5019aaacfcaa4da2cba32218fdf583eede00f224/pkg/apis/certmanager/v1/types_certificate.go#L411-L426).
Similarly to [`status.LastFailureTime`](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L385-L391), `status.IssuanceAttempts` field will only be set for a `Certificate` whose issuance is currently failing and will be removed after a successful issuance.
`IssuanceAttempts` will be set by [`certificates-issuing` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/issuing) after a failed issuance by either bumping the already existing value by 1 or setting it to 1 (first failure). In case of a succeeded issuance, `certificates-issuing` controller will ensure that `status.IssuanceAttempts` is not set.
The delay till the next issuance will then be calculated by [`certificates-trigger` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger) using the formula `if status.LastFailureTime != nil then next_issuance_attempt_time = status.LastFailureTime + time.Hour x 2 ^ (status.IssuanceAttempts- 1)` (binary exponential- so the sequence will be 1h, 2h, 4h, 8h etc). This ensures that the first delay is 1 hour from the last failure time which is the current behaviour. In case of continuous failures, the delay should keep increasing up to a maximum backoff period of 32h, after which it should be retried every 32h whilst the failures persist.
`certificates-trigger` controller should also throw an event on the `Certificate` with a message that includes the calculated delay period.
### API changes
A new `IssuanceAttempts` field will be added to `Certificate`'s status.
```
type CertificateStatus {
// EXISTING FIELDS
// ...
// NEW FIELDS
// IssuanceAttempts represents the number of times certificate issuance has been attempted and failed during this revision.
// This field is used to calculate the backoff period after which issuance will be attempted again.
IssuanceAttempts int `json:issuanceAttempts,omitempty`
}
```
### Examples
Large part of the these examples show what is already the _current_ behaviour, the only changes are the parts where `IssuanceAttempts` field is being managed and where the delay between issuances is calculated with an exponential backoff.
#### Issuance fails then succeeds:
1. A `CertificateRequest` with [revision number](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L409-L424) 20 fails. This is the 3rd failed issuance for this revision
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the [`Issuing` condition](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L480-L495) to false ([here-ish](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L326-L351))
3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be status.LastFailureTime + 2h ^ (3 - 1), so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
4. In 4 hours, `Certificate` gets reconciled again and `certificates-trigger` controller sets the `Issuing` condition to true. This time the `CertificateRequest` succeeds.
5. `certificates-issuing` controller reconciles the `Certificate` and the succeeded `CertificateRequest` and removes the `status.IssuanceAttempts` as well as `status.LastFailureTime` and `Issuing` condition
6. `certificates-trigger` controller determines that backoff is not needed and re-queues `Certificate` to be renewed based on `status.RenewalTime`
#### Manually triggered reissuance succeeds
1. A `CertificateRequest` with revision 20 fails. This is the 3rd failed issuance for revision 20
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false
3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (3 - 1)`, so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
4. User fixes the reason for failure (i.e some networking setup) and runs `cmctl renew <certificate-name>` to force immediate re-issuance, which [adds `Issuing` condition to the `Certificate`](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/cmd/ctl/pkg/renew/renew.go#L203) thus signalling the other controllers that issuance is in progress and bypassing the `certificates-issuing` controller's [check for whether a backoff is needed](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger/trigger_controller.go#L158-L163)
5. A new `CertificateRequest` is created and succeeds
6. `certificates-issuing` controller reconciles the `Certificate` and the succeeded`CertificateRequest` and removes the `status.IssuanceAttempts` as well as `status.LastFailureTime` and `Issuing` condition
7. `certificates-trigger` controller parses the `Certificate`, determines that backoff is not needed and requeues `Certificate` to be renewed based on `status.RenewalTime`
#### Manually triggered reissuance fails
1. A `CertificateRequest` with revision 20 fails. This is the 3rd failed issuance for revision 20
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false
3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (3 - 1)`, so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
4. User thinks that they have fixed the failure (i.e some networking setup) and runs `cmctl renew <certificate-name>` to force immediate re-issuance, which [adds `Issuing` condition to the `Certificate`](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/cmd/ctl/pkg/renew/renew.go#L203) thus signalling the other controllers that issuance is in progress and bypassing the `certificates-issuing` controller's [check for whether a backoff is needed](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger/trigger_controller.go#L158-L163)
5. A new `CertificateRequest` is created and fails again
6. `certificates-issuing` controller reconciles the `Certificate` and the failed `CertificateRequest`, bumps `status.IssuanceAttempts` to 4, sets the `Issuing` condition to false and sets `status.LastFailureTime` to now
7. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (4 - 1)`, so roughly in 8 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 8 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
#### Example certificate statuses
(These examples are based on what the statuses already look like after a failed/succeeded issuance. The only change is the `issuanceAttempts` field)
1. A `Certificate` where issuance has failed 3 times for revision 20:
```
Status:
Conditions:
- LastTransitionTime: <timestamp>
Message: <message>
ObservedGeneration: 1
Reason: Ready
Status: "True"
Type: Ready
- LastTransitionTime: <timestamp>
Message: <message>
ObservedGeneration: 1
Reason: Failed
Status: "False"
Type: Issuing # Issuing condition remains set, but false after a failed issuance
NotAfter: <timestamp>
NotBefore: <timestamp>
RenewalTime: <timestamp>
IssuanceAttempts: 3
LastFailureTime: <timestamp> # Last failed issuance (i.e when a `CertificateRequest` failed)
Revision: 19 # The latest succeeded revision. Currently we are attempting revision 20
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning IssuanceFailed 4s cert-manager Certificate issuance has failed. It will be retried in 4 hours.
```
2. A `Certificate` where the latest issuance for revision 20 succeeded and no issuances are being attempted now:
```
Status:
Conditions:
- LastTransitionTime: <timestamp>
Message: <message>
ObservedGeneration: 1
Reason: Ready
Status: "True"
Type: Ready
NotAfter: <timestamp>
NotBefore: <timestamp>
RenewalTime: <timestamp>
Revision: 20
Events:
```
### Test Plan
The example flows described in [Examples](#Examples) and [Upgrading](#Upgrading) will be tested via integration tests ([similar to the current integration tests for certificates](https://github.com/jetstack/cert-manager/tree/master/test/integration/certificates))
### Upgrading
Upgrading to a cert-manager version that implements exponential backoff or downgrading to one that does not, should not require any extra steps or cause unnecessary re-issuances.
To ensure that `Certificate`s whose issuance is currently failing don't get renewed all at once after upgrading to cert-manager version that implements exponential backoff, `certificates-trigger` controller should fall back to 1 hour delay for all `Certificate`s that have `status.LastFailureTime` set, but don't have the `status.IssuanceAttempts` set.
## Assumptions
- Although work on this was prompted by wanting to limit calls to ACME servers, users of other types of issuers will benefit from it
- 1h is an acceptable initial delay between issuances (keeping this to 1h ensures that at least for the first retry attempt, we keep the current behaviour, however perhaps for short lived certs it would be useful to start with a shorter initial delay?)
- 32h is an acceptable maximum delay between issuances
- In case of exponential backoff being applied, an event on the `Certificate` as well as controller logs will be sufficient for users trying to debug this
## Feature Gate
The current assumption is that exponential backoff would _not_ be placed behind a feature gate, however this should be considered.
Some reasons for putting it behind a feature gate:
- Would the API fields added for this feature change?
- Would the delay periods (min, max and how they are calculated) change and could this change be breaking?
If the feature gate was to be implemented, it would mean adding a new flag to controller binary (i.e `--enable-exponential-backoff`) and adding some if-statements to `certificates-issuing` and `certificates-trigger` controllers to not add the `status.IssuanceAttempts` field and not parse it, unless the feature gate is enabled. (Question: would `IssuanceAttempts` still be added to the v1 `Certificate` API?)
## Misc
- Slack conversation about storing the current delay in memory vs using a status field https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1642178582273300