timeouts proposal
Signed-off-by: Ashley Davis <ashley.davis@jetstack.io> Co-authored-by: Maël Valais <mael@vls.dev>
This commit is contained in:
parent
c16b3cca7b
commit
ab09488c22
317
design/20220614-timeouts.md
Normal file
317
design/20220614-timeouts.md
Normal file
@ -0,0 +1,317 @@
|
||||
# Increasing Timeouts for Issuer Operations
|
||||
|
||||
- [Release Signoff Checklist](#release-signoff-checklist)
|
||||
- [Summary](#summary)
|
||||
- [Motivation](#motivation)
|
||||
- [Goals](#goals)
|
||||
- [Non-Goals](#non-goals)
|
||||
- [Proposal](#proposal)
|
||||
- [Risks and Mitigations](#risks-and-mitigations)
|
||||
- [Design Details](#design-details)
|
||||
- [Test Plan](#test-plan)
|
||||
- [Graduation Criteria](#graduation-criteria)
|
||||
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
|
||||
- [Supported Versions](#supported-versions)
|
||||
- [Production Readiness](#production-readiness)
|
||||
- [Drawbacks](#drawbacks)
|
||||
- [Alternatives](#alternatives)
|
||||
|
||||
## Release Signoff Checklist
|
||||
|
||||
This checklist contains actions which must be completed before a PR implementing this design can be merged.
|
||||
|
||||
|
||||
- [ ] This design doc has been discussed and approved
|
||||
- [ ] Test plan has been agreed upon and the tests implemented
|
||||
- [ ] Feature gate status has been agreed upon (whether the new functionality will be placed behind a feature gate or not)
|
||||
- [ ] Graduation criteria is in place if required (if the new functionality is placed behind a feature gate, how will it graduate between stages)
|
||||
- [ ] User-facing documentation has been PR-ed against the release branch in [cert-manager/website]
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Users of several different issuers (including some deployments of Venafi ([5108][], [4893][]), and ACME
|
||||
issuers ([4452][]) such as Sectigo ([comment][4893-comment]) and ZeroSSL ([comment][website-583-comment]) have been running
|
||||
into timeout errors which manifest as `context deadline exceeded` errors in logs.
|
||||
|
||||
These errors are caused because the relevant cert-manager controllers hardcode short timeouts into their sync/reconcile
|
||||
functions, and some operations which must be done during a sync/reconcile can take longer than that short timeout.
|
||||
|
||||
This manifests during issuer creation and during certificate issuance. Put simply: for some issuers, initialisation or
|
||||
certificate issuance just takes longer than the maximum amount of time we currently allow.
|
||||
|
||||
[5108]: https://github.com/cert-manager/cert-manager/issues/5108
|
||||
[4893]: https://github.com/cert-manager/cert-manager/issues/4893
|
||||
[4452]: https://github.com/cert-manager/cert-manager/issues/4452
|
||||
[4893-comment]: https://github.com/cert-manager/cert-manager/issues/4893#issuecomment-1115143729
|
||||
[website-583-comment]: https://github.com/cert-manager/website/issues/583#issuecomment-1011664334
|
||||
|
||||
## Motivation
|
||||
|
||||
Making a change here is justified since there are users today who want to use certain standards-compliant ACME issuers are
|
||||
who're unable to do so with cert-manager, through no fault of their own.
|
||||
|
||||
There's likely to be strong demand for this; in the various issues linked in the summary above, we've seen a huge amount
|
||||
of engagement through reactions to posts (which isn't a perfect indicator of demand but these issues are unusually
|
||||
popular relative to what we normally see).
|
||||
|
||||
This design will largely talk about ACME since ACME issuer users are almost certainly by far the biggest section of the
|
||||
current cert-manager userbase, but the principles here apply equally to the Venafi issuer, where an instance of
|
||||
Venafi TPP might be deployed on-prem and could be slow for any number of reasons. We should address Venafi issuers in
|
||||
a similar way, but in a separate piece of work.
|
||||
|
||||
### Goals
|
||||
|
||||
#### Remain Good Cluster Citizens
|
||||
|
||||
The justification for a short default timeout on each sync/reconcile operation is that we avoid a situation whereby we
|
||||
create many "hanging" threads which could overload the node on which cert-manager is running.
|
||||
|
||||
The simplest solution to this issue would be to simply remove all timeouts or to make them all very long (say, 1 hour),
|
||||
but that could lead to high resource use.
|
||||
|
||||
#### Merge Before 1.9 and Backport
|
||||
|
||||
This timeout issue is present in all currently supported versions of cert-manager (1.7 and 1.8 at the time of writing).
|
||||
Ideally we'd like to get this fix merged before we release 1.9 (currently scheduled for Jul 06, 2022) and to backport
|
||||
to other supported releases, so that we'll be able to support ACME issuers like Sectigo in multiple versions.
|
||||
|
||||
### Non-Goals
|
||||
|
||||
#### Individually Configurable Timeouts for Every Operation or API Call
|
||||
|
||||
Having such timeouts might be attractive at some point, but specifically we want a broader approach for this solution
|
||||
pending some evidence that more specific timeout configuration is a thing that people actually want.
|
||||
|
||||
#### Adding New Timeouts Where We Don't Currently Have Them
|
||||
|
||||
While we should almost certainly have an upper-limit timeout for every single controller to prevent runaway resource
|
||||
usage through infinitely hanging sync operations, it would be preferable to keep the scope of this PR minimal so that
|
||||
it'll pass the threshold for simplicity to be backported.
|
||||
|
||||
## Proposal
|
||||
|
||||
Note that this proposal is intentionally kept simple in terms of the scope of what we should actually change.
|
||||
|
||||
This is because we take an _extremely_ conservative approach to backports where we seek to absolutely minimise the
|
||||
amount and complexity of changes we introduce to already released versions.
|
||||
|
||||
Hence this proposal will focus on a simple approach that would apply everywhere, and which we could build upon in any
|
||||
future releases if we so chose.
|
||||
|
||||
Note also that the specific numeric values for timeouts are given and justified later after the changes are proposed.
|
||||
This makes the timeouts easier to change. Timeouts are instead given Greek letter names such as `α` and `β`.
|
||||
|
||||
An executive summary of the changes would be to remove a few middleware timeouts and then increase the hardcoded
|
||||
defaults we currently have in a few places. Really all of the below is just saying that in long form. The justification
|
||||
is in the [Design Details](#design-details) section below.
|
||||
|
||||
### Increase Some Existing Controller Timeouts
|
||||
|
||||
Note that links in this section are for the `release-1.8` branch but these changes would also apply similarly to our
|
||||
other currently-supported version of cert-manager (v1.7) and also for the `master` branch.
|
||||
|
||||
We propose to increase the existing timeouts for the [issuers][] and [clusterissuers][] controllers to
|
||||
`α` seconds.
|
||||
|
||||
[issuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/issuers/sync.go#L45
|
||||
[clusterissuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/clusterissuers/sync.go#L45
|
||||
|
||||
### Remove Middleware Timeouts
|
||||
|
||||
We propose to remove the context timeouts we currently have in our ACME logging middleware; that would look like
|
||||
(or be) [this unmerged commit][b1953].
|
||||
|
||||
The affected controllers appear to be those which have an `accounts.Getter`:
|
||||
|
||||
- [acmechallenges](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmechallenges/controller.go#L48)
|
||||
- [acmeorders](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmeorders/controller.go#L50)
|
||||
|
||||
These timeouts have two issues. One is that the location they're added is unintuitive; the timeouts are added in
|
||||
_logging_ middleware which which doesn't otherwise mention that it also introduces timeouts.
|
||||
|
||||
That's confusing; we might reasonably expect a timeout on writing the logs themselves (i.e. the actual operation of
|
||||
writing to a log) but this functionality doesn't manage that.
|
||||
|
||||
The second issue is that these timeouts effectively duplicate HTTP client timeouts.
|
||||
|
||||
HTTP client timeouts belong on the underlying HTTP client; that's where we could set more finegrained controls such as
|
||||
TLS handshake, dialer and overall HTTP request timeouts. HTTP client timeouts are [desirable](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/),
|
||||
and we [actually already have them](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75)
|
||||
for our ACME clients.
|
||||
|
||||
It's worth noting that if we simply removed the logging middleware timeout, the HTTP client timeouts would start to
|
||||
take effect. They're longer than the context timeout and so wouldn't ever be relevant on cert-manager today, but would
|
||||
be relevant without the middleware context timeout. We also propose below to tweak those HTTP timeouts in any case.
|
||||
|
||||
[b1953]: https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28
|
||||
|
||||
### Increase ACME HTTP Timeouts
|
||||
|
||||
We propose to update the overall timeout for our HTTP clients for ACME requests to `β` seconds and - at least for now -
|
||||
**not** to make this configurable by users.
|
||||
|
||||
As mentioned above, we already have HTTP timeouts on the HTTP clients we build for use with ACME clients, as seen
|
||||
[here](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75).
|
||||
|
||||
The dialer and TLS handshake timeouts are set to 30 and 10 seconds respectively, and both are likely fine to keep as
|
||||
they are and in any case unlikely to be a problem for people experiencing the issues detailed in [#5080](https://github.com/cert-manager/cert-manager/issues/5080).
|
||||
|
||||
The overall HTTP timeout (see [this blog](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/)
|
||||
for a diagram showing the different client timeouts) is [set to 30s](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L73)
|
||||
which is too short to address the issues we're trying fix here.
|
||||
|
||||
### Risks and Mitigations
|
||||
|
||||
The risk here is that the timeouts we have already, as currently configured, might _already_ be the last line of defence
|
||||
for some controllers in some clusters. By that we mean that by increasing the timeout we could pass some critical
|
||||
threshold where pods start to run out of resources and crash.
|
||||
|
||||
This could be in a cluster where someone is trying to create many slow-to-create Issuers at once, or to issue many
|
||||
Certificates at once from issuers which are slow, or to issue many Certificates from a single issuer which becomes
|
||||
overloaded and slows down.
|
||||
|
||||
## Design Details
|
||||
|
||||
Most of the above proposals involve tweaking integers or else they have [a commit](https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28)
|
||||
which can be viewed. The actual PR here won't be difficult to understand - that's intentional, to minimise complexity
|
||||
of a change we want to backport.
|
||||
|
||||
The proposed value for `β` - the ACME HTTP timeout - is 90s. That's longer than the 60s which seems to be around the
|
||||
upper limit of what people have reported seeing with Sectigo, which seems to be on the slower side.
|
||||
|
||||
The proposed value for `α` - the context timeout - is 120s. This is long enough to fit in a slow ACME request and then
|
||||
do other work relating to issuance.
|
||||
|
||||
### Justification
|
||||
|
||||
#### Crossplane
|
||||
|
||||
The above values are mostly inspired by crossplane.io, which uses context timeouts very heavily in basically all
|
||||
controllers (which is something that cert-manager should likely aspire to).
|
||||
|
||||
Here's a selection of timeouts in various crossplane controllers:
|
||||
|
||||
| Timeout | Reconciliation loop |
|
||||
|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| 2 mins | [definition/reconciler.go#L55](https://github.com/crossplane/crossplane/blob/60fc7df4b3c9d0f11e9ea719b63b52bec56d3853/internal/controller/apiextensions/definition/reconciler.go#L55) |
|
||||
| 2 mins | [definition/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/definition/reconciler.go#L43) |
|
||||
| 2 mins | [composition/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/apiextensions/composition/reconciler.go#L45) |
|
||||
| 2 mins | [binding/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/binding/reconciler.go#L46) |
|
||||
| 2 mins | [namespace/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/namespace/reconciler.go#L43) |
|
||||
| 2 mins | [roles/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/roles/reconciler.go#L45) |
|
||||
| 2 mins | [composite/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/composite/reconciler.go#L45) |
|
||||
| 1 min | [offered/reconciler.go#L53](https://github.com/crossplane/crossplane/blob/2c9872543c5074928ba739e147b02e09cd1f3090/internal/controller/apiextensions/offered/reconciler.go#L53) |
|
||||
| 1 min | [claim/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/claim/reconciler.go#L46) |
|
||||
| 1 min | [manager/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/manager/reconciler.go#L45) |
|
||||
| 1 min | [resolver/reconciler.go#L47](https://github.com/crossplane/crossplane/blob/243f1f47ca0b17503b08ef77770f41a89defeb82/internal/controller/pkg/resolver/reconciler.go#L47) |
|
||||
| 3 mins | [revision/reconciler.go#L54](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/revision/reconciler.go#L54) |
|
||||
|
||||
Crossplane also has an [open issue](https://github.com/crossplane/crossplane/issues/2564) relating to making this value
|
||||
configurable.
|
||||
|
||||
The idea here is "if it's good enough for crossplane why should it not be good enough for us?"
|
||||
|
||||
The current cert-manager timeouts are arbitrary. Likely the crossplane timeouts are also arbitrary. We can at least
|
||||
have confidence that a big project with a tonne of controllers and CRDs is using longer timeouts and clearly not seeing
|
||||
world-ending problems, and people want to _increase_ the timeouts from that base too, as envidenced by the above open
|
||||
issue.
|
||||
|
||||
Another relevant timeout is certbot, which has a [45s](https://github.com/certbot/certbot/blob/295fc5e33a68c945d2f62e84ed8e6aaecfe93102/acme/acme/client.py#L46)
|
||||
limit on network requests by default and a 90s limit on polling after the fact. We go higher here because people have
|
||||
reported their requests taking longer than 45
|
||||
|
||||
#### Kubernetes
|
||||
|
||||
In investigation we couldn't find any evidence that there are timeouts used in kubernetes controllers such as the Deployments
|
||||
controller. That implies they don't think this topic is particularly scary, which in turn provides some comfort that picking
|
||||
a higher timeout isn't likely to bring about the end of the world.
|
||||
|
||||
#### User Reports
|
||||
|
||||
User reports of time taken vary slightly:
|
||||
|
||||
- One user reported times of around 30s: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1144561276
|
||||
- Another reported similar: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1143889347
|
||||
- One said "at least 60s, sometimes more": https://github.com/cert-manager/cert-manager/issues/5080#issuecomment-1137259557
|
||||
|
||||
Since the largest time is 'sometimes more than 60s', our lowest timeout should likely be at least 60s.
|
||||
|
||||
#### General Justifications
|
||||
|
||||
There's a risk with work like this that we can focus too much on what's already present. The 10s timeout which is set for these
|
||||
controllers on `master` was committed with absolutely no justification we can find in the PR, commit message, design doc or anywhere else.
|
||||
|
||||
We should the same amount of attachment to that 10 second timeout as there was justification given for it when it was added originally - absolutely none.
|
||||
|
||||
That's not to say it was a mistake or bad to add these timeouts originally - the controllers which have them are better than almost every other controller
|
||||
we have just by virtue of having a timeout at all!
|
||||
|
||||
### Test Plan
|
||||
|
||||
As far as we can tell, none of this is tested currently, nor would we plan to add any new tests in the PR.
|
||||
|
||||
Down the road it would be great to add some tests for this stuff but that feels like a bigger project than this short
|
||||
term bug fix.
|
||||
|
||||
### Graduation Criteria
|
||||
|
||||
There's nothing to feature gate and no graduation required.
|
||||
|
||||
### Upgrade / Downgrade Strategy
|
||||
|
||||
A user upgrading will notice no changes except that the failing ACME issuers they currently have should start working.
|
||||
|
||||
A user downgrading to an unsupported version (1.6 and below) could start to see timeout issues since the timeouts in
|
||||
unsupported releases will be lower than those in 1.7 and above.
|
||||
|
||||
### Supported Versions
|
||||
|
||||
cert-manager 1.7+
|
||||
|
||||
K8s support irrelevant
|
||||
|
||||
## Production Readiness
|
||||
|
||||
### How can this feature be enabled / disabled for an existing cert-manager installation?
|
||||
|
||||
This will take effect by default in all versions and won't be able to be disabled.
|
||||
|
||||
### Does this feature depend on any specific services running in the cluster?
|
||||
|
||||
Yes, cert-manager
|
||||
|
||||
### Will enabling / using this feature result in new API calls (i.e to Kubernetes apiserver or external services)?
|
||||
|
||||
This change will result in fewer API calls to external services and to Kubernetes. Since issuance on slow ACME issuers
|
||||
will be more likely to succeed, we won't have long retry loops where we'll continually retry and get timed out.
|
||||
|
||||
Put another way: instead of 20 requests where 19 of them time out and one succeeds, we'll just have one request that
|
||||
succeeds.
|
||||
|
||||
### Will enabling / using this feature result in increasing size or count of the existing API objects?
|
||||
|
||||
Potentially fewer Orders and Challenges, since we'll fail less often.
|
||||
|
||||
### Will enabling / using this feature result in significant increase of resource usage? (CPU, RAM...)
|
||||
|
||||
If there's a massive queue of blocked requests there could be some increase in resource usage. Hard to say how it would
|
||||
change — if anything there's likely to be a _decrease_ in total resource usage since longer-running requests will
|
||||
succeed and we won't need to pay the cost of running a whole Sync function again after an ACME request times out.
|
||||
|
||||
## Drawbacks
|
||||
|
||||
Once we bump the default max timeout we can't realistically reduce the default again, since that could be perceived as
|
||||
backwards-incompatible.
|
||||
|
||||
To reduce timeouts we'd likely need to add a flag for configuring the timeouts and let cluster admins make the choice.
|
||||
|
||||
## Alternatives
|
||||
|
||||
The obvious alternative is to add a flag or multiple flags to control these timeouts. That was rejected because it's a
|
||||
larger change and we want to backport these changes.
|
||||
|
||||
Another option is to increase the hardcoded timeouts everywhere _and_ add a flag to configure this on master, so that
|
||||
the fix is applied on older branches but we have the functionality of a flag. This was rejected because it's more work
|
||||
and we don't have any evidence that we need to add a flag. We'd also have to face the difficulty of deciding how to
|
||||
scope the flags (i.e. should there be a contextTimeout which applies to all controllers everywhere? Is that useful?)
|
||||
Loading…
Reference in New Issue
Block a user