timeouts proposal

Signed-off-by: Ashley Davis <ashley.davis@jetstack.io> Co-authored-by: Maël Valais <mael@vls.dev>
2022-06-14 16:33:54 +01:00 · 2022-06-14 16:33:54 +01:00 · ab09488c22
commit ab09488c22
parent c16b3cca7b
1 changed files with 317 additions and 0 deletions
--- a/design/20220614-timeouts.md
+++ b/design/20220614-timeouts.md
@ -0,0 +1,317 @@
+# Increasing Timeouts for Issuer Operations
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Supported Versions](#supported-versions)
+- [Production Readiness](#production-readiness)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+## Release Signoff Checklist
+
+This checklist contains actions which must be completed before a PR implementing this design can be merged.
+
+
+- [ ] This design doc has been discussed and approved
+- [ ] Test plan has been agreed upon and the tests implemented
+- [ ] Feature gate status has been agreed upon (whether the new functionality will be placed behind a feature gate or not)
+- [ ] Graduation criteria is in place if required (if the new functionality is placed behind a feature gate, how will it graduate between stages)
+- [ ] User-facing documentation has been PR-ed against the release branch in [cert-manager/website]
+
+
+## Summary
+
+Users of several different issuers (including some deployments of Venafi ([5108][], [4893][]), and ACME
+issuers ([4452][]) such as Sectigo ([comment][4893-comment]) and ZeroSSL ([comment][website-583-comment]) have been running
+into timeout errors which manifest as `context deadline exceeded` errors in logs.
+
+These errors are caused because the relevant cert-manager controllers hardcode short timeouts into their sync/reconcile
+functions, and some operations which must be done during a sync/reconcile can take longer than that short timeout.
+
+This manifests during issuer creation and during certificate issuance. Put simply: for some issuers, initialisation or
+certificate issuance just takes longer than the maximum amount of time we currently allow.
+
+[5108]: https://github.com/cert-manager/cert-manager/issues/5108
+[4893]: https://github.com/cert-manager/cert-manager/issues/4893
+[4452]: https://github.com/cert-manager/cert-manager/issues/4452
+[4893-comment]: https://github.com/cert-manager/cert-manager/issues/4893#issuecomment-1115143729
+[website-583-comment]: https://github.com/cert-manager/website/issues/583#issuecomment-1011664334
+
+## Motivation
+
+Making a change here is justified since there are users today who want to use certain standards-compliant ACME issuers are
+who're unable to do so with cert-manager, through no fault of their own.
+
+There's likely to be strong demand for this; in the various issues linked in the summary above, we've seen a huge amount
+of engagement through reactions to posts (which isn't a perfect indicator of demand but these issues are unusually
+popular relative to what we normally see).
+
+This design will largely talk about ACME since ACME issuer users are almost certainly by far the biggest section of the
+current cert-manager userbase, but the principles here apply equally to the Venafi issuer, where an instance of
+Venafi TPP might be deployed on-prem and could be slow for any number of reasons. We should address Venafi issuers in
+a similar way, but in a separate piece of work.
+
+### Goals
+
+#### Remain Good Cluster Citizens
+
+The justification for a short default timeout on each sync/reconcile operation is that we avoid a situation whereby we
+create many "hanging" threads which could overload the node on which cert-manager is running.
+
+The simplest solution to this issue would be to simply remove all timeouts or to make them all very long (say, 1 hour),
+but that could lead to high resource use.
+
+#### Merge Before 1.9 and Backport
+
+This timeout issue is present in all currently supported versions of cert-manager (1.7 and 1.8 at the time of writing).
+Ideally we'd like to get this fix merged before we release 1.9 (currently scheduled for Jul 06, 2022) and to backport
+to other supported releases, so that we'll be able to support ACME issuers like Sectigo in multiple versions.
+
+### Non-Goals
+
+#### Individually Configurable Timeouts for Every Operation or API Call
+
+Having such timeouts might be attractive at some point, but specifically we want a broader approach for this solution
+pending some evidence that more specific timeout configuration is a thing that people actually want.
+
+#### Adding New Timeouts Where We Don't Currently Have Them
+
+While we should almost certainly have an upper-limit timeout for every single controller to prevent runaway resource
+usage through infinitely hanging sync operations, it would be preferable to keep the scope of this PR minimal so that
+it'll pass the threshold for simplicity to be backported.
+
+## Proposal
+
+Note that this proposal is intentionally kept simple in terms of the scope of what we should actually change.
+
+This is because we take an _extremely_ conservative approach to backports where we seek to absolutely minimise the
+amount and complexity of changes we introduce to already released versions.
+
+Hence this proposal will focus on a simple approach that would apply everywhere, and which we could build upon in any
+future releases if we so chose.
+
+Note also that the specific numeric values for timeouts are given and justified later after the changes are proposed.
+This makes the timeouts easier to change. Timeouts are instead given Greek letter names such as `α` and `β`.
+
+An executive summary of the changes would be to remove a few middleware timeouts and then increase the hardcoded
+defaults we currently have in a few places. Really all of the below is just saying that in long form. The justification
+is in the [Design Details](#design-details) section below.
+
+### Increase Some Existing Controller Timeouts
+
+Note that links in this section are for the `release-1.8` branch but these changes would also apply similarly to our
+other currently-supported version of cert-manager (v1.7) and also for the `master` branch.
+
+We propose to increase the existing timeouts for the [issuers][] and [clusterissuers][] controllers to
+`α` seconds.
+
+[issuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/issuers/sync.go#L45
+[clusterissuers]: https://github.com/cert-manager/cert-manager/blob/2918f68664be98f93b8092afbbc503e42d9b7588/pkg/controller/clusterissuers/sync.go#L45
+
+### Remove Middleware Timeouts
+
+We propose to remove the context timeouts we currently have in our ACME logging middleware; that would look like
+(or be) [this unmerged commit][b1953].
+
+The affected controllers appear to be those which have an `accounts.Getter`:
+
+- [acmechallenges](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmechallenges/controller.go#L48)
+- [acmeorders](https://github.com/cert-manager/cert-manager/blob/c16b3cca7b418ba0d0b2bf1066514b8762984517/pkg/controller/acmeorders/controller.go#L50)
+
+These timeouts have two issues. One is that the location they're added is unintuitive; the timeouts are added in
+_logging_ middleware which which doesn't otherwise mention that it also introduces timeouts.
+
+That's confusing; we might reasonably expect a timeout on writing the logs themselves (i.e. the actual operation of
+writing to a log) but this functionality doesn't manage that.
+
+The second issue is that these timeouts effectively duplicate HTTP client timeouts.
+
+HTTP client timeouts belong on the underlying HTTP client; that's where we could set more finegrained controls such as
+TLS handshake, dialer and overall HTTP request timeouts. HTTP client timeouts are [desirable](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/),
+and we [actually already have them](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75)
+for our ACME clients.
+
+It's worth noting that if we simply removed the logging middleware timeout, the HTTP client timeouts would start to
+take effect. They're longer than the context timeout and so wouldn't ever be relevant on cert-manager today, but would
+be relevant without the middleware context timeout. We also propose below to tweak those HTTP timeouts in any case.
+
+[b1953]: https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28
+
+### Increase ACME HTTP Timeouts
+
+We propose to update the overall timeout for our HTTP clients for ACME requests to `β` seconds and - at least for now -
+**not** to make this configurable by users.
+
+As mentioned above, we already have HTTP timeouts on the HTTP clients we build for use with ACME clients, as seen
+[here](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L58-L75).
+
+The dialer and TLS handshake timeouts are set to 30 and 10 seconds respectively, and both are likely fine to keep as
+they are and in any case unlikely to be a problem for people experiencing the issues detailed in [#5080](https://github.com/cert-manager/cert-manager/issues/5080).
+
+The overall HTTP timeout (see [this blog](https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/)
+for a diagram showing the different client timeouts) is [set to 30s](https://github.com/cert-manager/cert-manager/blob/e116d416f3b14863d05753739cbdf72d66923357/pkg/acme/accounts/client.go#L73)
+which is too short to address the issues we're trying fix here.
+
+### Risks and Mitigations
+
+The risk here is that the timeouts we have already, as currently configured, might _already_ be the last line of defence
+for some controllers in some clusters. By that we mean that by increasing the timeout we could pass some critical
+threshold where pods start to run out of resources and crash.
+
+This could be in a cluster where someone is trying to create many slow-to-create Issuers at once, or to issue many
+Certificates at once from issuers which are slow, or to issue many Certificates from a single issuer which becomes
+overloaded and slows down.
+
+## Design Details
+
+Most of the above proposals involve tweaking integers or else they have [a commit](https://github.com/cert-manager/cert-manager/pull/5206/commits/b195382a67c5e548752407d12851b84b4d6b8e28)
+which can be viewed. The actual PR here won't be difficult to understand - that's intentional, to minimise complexity
+of a change we want to backport.
+
+The proposed value for `β` - the ACME HTTP timeout - is 90s. That's longer than the 60s which seems to be around the
+upper limit of what people have reported seeing with Sectigo, which seems to be on the slower side.
+
+The proposed value for `α` - the context timeout - is 120s. This is long enough to fit in a slow ACME request and then
+do other work relating to issuance.
+
+### Justification
+
+#### Crossplane
+
+The above values are mostly inspired by crossplane.io, which uses context timeouts very heavily in basically all
+controllers (which is something that cert-manager should likely aspire to).
+
+Here's a selection of timeouts in various crossplane controllers:
+
+| Timeout |                                                                                   Reconciliation loop                                                                                   |
+|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| 2 mins  | [definition/reconciler.go#L55](https://github.com/crossplane/crossplane/blob/60fc7df4b3c9d0f11e9ea719b63b52bec56d3853/internal/controller/apiextensions/definition/reconciler.go#L55)   |
+| 2 mins  | [definition/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/definition/reconciler.go#L43)            |
+| 2 mins  | [composition/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/apiextensions/composition/reconciler.go#L45) |
+| 2 mins  | [binding/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/binding/reconciler.go#L46)         |
+| 2 mins  | [namespace/reconciler.go#L43](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/namespace/reconciler.go#L43)              |
+| 2 mins  | [roles/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/134ec72e0a5649147bb95e76d063d3df8c0b4e5f/internal/controller/rbac/provider/roles/reconciler.go#L45)             |
+| 2 mins  | [composite/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/composite/reconciler.go#L45)     |
+| 1 min   | [offered/reconciler.go#L53](https://github.com/crossplane/crossplane/blob/2c9872543c5074928ba739e147b02e09cd1f3090/internal/controller/apiextensions/offered/reconciler.go#L53)         |
+| 1 min   | [claim/reconciler.go#L46](https://github.com/crossplane/crossplane/blob/dd23304b466690f89a82b57825ca8a9870767972/internal/controller/apiextensions/claim/reconciler.go#L46)             |
+| 1 min   | [manager/reconciler.go#L45](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/manager/reconciler.go#L45)                   |
+| 1 min   | [resolver/reconciler.go#L47](https://github.com/crossplane/crossplane/blob/243f1f47ca0b17503b08ef77770f41a89defeb82/internal/controller/pkg/resolver/reconciler.go#L47)                 |
+| 3 mins  | [revision/reconciler.go#L54](https://github.com/crossplane/crossplane/blob/1344a86018f03fcbba6b9365c63696dcd47ebcf6/internal/controller/pkg/revision/reconciler.go#L54)                 |
+
+Crossplane also has an [open issue](https://github.com/crossplane/crossplane/issues/2564) relating to making this value
+configurable.
+
+The idea here is "if it's good enough for crossplane why should it not be good enough for us?"
+
+The current cert-manager timeouts are arbitrary. Likely the crossplane timeouts are also arbitrary. We can at least
+have confidence that a big project with a tonne of controllers and CRDs is using longer timeouts and clearly not seeing
+world-ending problems, and people want to _increase_ the timeouts from that base too, as envidenced by the above open
+issue.
+
+Another relevant timeout is certbot, which has a [45s](https://github.com/certbot/certbot/blob/295fc5e33a68c945d2f62e84ed8e6aaecfe93102/acme/acme/client.py#L46)
+limit on network requests by default and a 90s limit on polling after the fact. We go higher here because people have
+reported their requests taking longer than 45
+
+#### Kubernetes
+
+In investigation we couldn't find any evidence that there are timeouts used in kubernetes controllers such as the Deployments
+controller. That implies they don't think this topic is particularly scary, which in turn provides some comfort that picking
+a higher timeout isn't likely to bring about the end of the world.
+
+#### User Reports
+
+User reports of time taken vary slightly:
+
+- One user reported times of around 30s: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1144561276
+- Another reported similar: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1143889347
+- One said "at least 60s, sometimes more": https://github.com/cert-manager/cert-manager/issues/5080#issuecomment-1137259557
+
+Since the largest time is 'sometimes more than 60s', our lowest timeout should likely be at least 60s.
+
+#### General Justifications
+
+There's a risk with work like this that we can focus too much on what's already present. The 10s timeout which is set for these
+controllers on `master` was committed with absolutely no justification we can find in the PR, commit message, design doc or anywhere else.
+
+We should the same amount of attachment to that 10 second timeout as there was justification given for it when it was added originally - absolutely none.
+
+That's not to say it was a mistake or bad to add these timeouts originally - the controllers which have them are better than almost every other controller
+we have just by virtue of having a timeout at all!
+
+### Test Plan
+
+As far as we can tell, none of this is tested currently, nor would we plan to add any new tests in the PR.
+
+Down the road it would be great to add some tests for this stuff but that feels like a bigger project than this short
+term bug fix.
+
+### Graduation Criteria
+
+There's nothing to feature gate and no graduation required.
+
+### Upgrade / Downgrade Strategy
+
+A user upgrading will notice no changes except that the failing ACME issuers they currently have should start working.
+
+A user downgrading to an unsupported version (1.6 and below) could start to see timeout issues since the timeouts in
+unsupported releases will be lower than those in 1.7 and above.
+
+### Supported Versions
+
+cert-manager 1.7+
+
+K8s support irrelevant
+
+## Production Readiness
+
+### How can this feature be enabled / disabled for an existing cert-manager installation?
+
+This will take effect by default in all versions and won't be able to be disabled.
+
+### Does this feature depend on any specific services running in the cluster?
+
+Yes, cert-manager
+
+### Will enabling / using this feature result in new API calls (i.e to Kubernetes apiserver or external services)?
+
+This change will result in fewer API calls to external services and to Kubernetes. Since issuance on slow ACME issuers
+will be more likely to succeed, we won't have long retry loops where we'll continually retry and get timed out.
+
+Put another way: instead of 20 requests where 19 of them time out and one succeeds, we'll just have one request that
+succeeds.
+
+### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+Potentially fewer Orders and Challenges, since we'll fail less often.
+
+### Will enabling / using this feature result in significant increase of resource usage? (CPU, RAM...)
+
+If there's a massive queue of blocked requests there could be some increase in resource usage. Hard to say how it would
+change — if anything there's likely to be a _decrease_ in total resource usage since longer-running requests will
+succeed and we won't need to pay the cost of running a whole Sync function again after an ACME request times out.
+
+## Drawbacks
+
+Once we bump the default max timeout we can't realistically reduce the default again, since that could be perceived as
+backwards-incompatible.
+
+To reduce timeouts we'd likely need to add a flag for configuring the timeouts and let cluster admins make the choice.
+
+## Alternatives
+
+The obvious alternative is to add a flag or multiple flags to control these timeouts. That was rejected because it's a
+larger change and we want to backport these changes.
+
+Another option is to increase the hardcoded timeouts everywhere _and_ add a flag to configure this on master, so that
+the fix is applied on older branches but we have the functionality of a flag. This was rejected because it's more work
+and we don't have any evidence that we need to add a flag. We'd also have to face the difficulty of deciding how to
+scope the flags (i.e. should there be a contextTimeout which applies to all controllers everywhere? Is that useful?)