19 KiB
Increasing Timeouts for Issuer Operations
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness
- Drawbacks
- Alternatives
Release Signoff Checklist
This checklist contains actions which must be completed before a PR implementing this design can be merged.
- This design doc has been discussed and approved
- Test plan has been agreed upon and the tests implemented
- Feature gate status has been agreed upon (whether the new functionality will be placed behind a feature gate or not)
- Graduation criteria is in place if required (if the new functionality is placed behind a feature gate, how will it graduate between stages)
- User-facing documentation has been PR-ed against the release branch in [cert-manager/website]
Summary
Users of several different issuers (including some deployments of Venafi (5108, 4893), and ACME
issuers (4452) such as Sectigo (comment) and ZeroSSL (comment) have been running
into timeout errors which manifest as context deadline exceeded errors in logs.
These errors are caused because the relevant cert-manager controllers hardcode short timeouts into their sync/reconcile functions, and some operations which must be done during a sync/reconcile can take longer than that short timeout.
This manifests during issuer creation and during certificate issuance. Put simply: for some issuers, initialisation or certificate issuance just takes longer than the maximum amount of time we currently allow.
Motivation
Making a change here is justified since there are users today who want to use certain standards-compliant ACME issuers are who're unable to do so with cert-manager, through no fault of their own.
There's likely to be strong demand for this; in the various issues linked in the summary above, we've seen a huge amount of engagement through reactions to posts (which isn't a perfect indicator of demand but these issues are unusually popular relative to what we normally see).
This design will largely talk about ACME since ACME issuer users are almost certainly by far the biggest section of the current cert-manager user base, but the principles here apply equally to the Venafi issuer, where an instance of Venafi TPP might be deployed on-prem and could be slow for any number of reasons. We should address Venafi issuers in a similar way, but in a separate piece of work.
Goals
Remain Good Cluster Citizens
The justification for a short default timeout on each sync/reconcile operation is that we avoid a situation whereby we create many "hanging" threads which could overload the node on which cert-manager is running.
The simplest solution to this issue would be to simply remove all timeouts or to make them all very long (say, 1 hour), but that could lead to high resource use.
Merge Before 1.9 and Backport
This timeout issue is present in all currently supported versions of cert-manager (1.7 and 1.8 at the time of writing). Ideally we'd like to get this fix merged before we release 1.9 (currently scheduled for Jul 06, 2022) and to backport to other supported releases, so that we'll be able to support ACME issuers like Sectigo in multiple versions.
Non-Goals
Individually Configurable Timeouts for Every Operation or API Call
Having such timeouts might be attractive at some point, but specifically we want a broader approach for this solution pending some evidence that more specific timeout configuration is a thing that people actually want.
Adding New Timeouts Where We Don't Currently Have Them
While we should almost certainly have an upper-limit timeout for every single controller to prevent runaway resource usage through infinitely hanging sync operations, it would be preferable to keep the scope of this PR minimal so that it'll pass the threshold for simplicity to be backported.
Proposal
Note that this proposal is intentionally kept simple in terms of the scope of what we should actually change.
This is because we take an extremely conservative approach to backports where we seek to absolutely minimise the amount and complexity of changes we introduce to already released versions.
Hence this proposal will focus on a simple approach that would apply everywhere, and which we could build upon in any future releases if we so chose.
Note also that the specific numeric values for timeouts are given and justified later after the changes are proposed.
This makes the timeouts easier to change. Timeouts are instead given Greek letter names such as α and β.
An executive summary of the changes would be to remove a few middleware timeouts and then increase the hardcoded defaults we currently have in a few places. Really all of the below is just saying that in long form. The justification is in the Design Details section below.
Increase Some Existing Controller Timeouts
Note that links in this section are for the release-1.8 branch but these changes would also apply similarly to our
other currently-supported version of cert-manager (v1.7) and also for the master branch.
We propose to increase the existing timeouts for the issuers and clusterissuers controllers to
α seconds.
Remove Middleware Timeouts
We propose to remove the context timeouts we currently have in our ACME logging middleware; that would look like (or be) this unmerged commit.
The affected controllers appear to be those which have an accounts.Getter:
These timeouts have two issues. One is that the location they're added is unintuitive; the timeouts are added in logging middleware which doesn't otherwise mention that it also introduces timeouts.
That's confusing; we might reasonably expect a timeout on writing the logs themselves (i.e. the actual operation of writing to a log) but this functionality doesn't manage that.
The second issue is that these timeouts effectively duplicate HTTP client timeouts.
HTTP client timeouts belong on the underlying HTTP client; that's where we could set more fine-grained controls such as TLS handshake, dialer and overall HTTP request timeouts. HTTP client timeouts are desirable, and we actually already have them for our ACME clients.
It's worth noting that if we simply removed the logging middleware timeout, the HTTP client timeouts would start to take effect. They're longer than the context timeout and so wouldn't ever be relevant on cert-manager today, but would be relevant without the middleware context timeout. We also propose below to tweak those HTTP timeouts in any case.
Increase ACME HTTP Timeouts
We propose to update the overall timeout for our HTTP clients for ACME requests to β seconds and - at least for now -
not to make this configurable by users.
As mentioned above, we already have HTTP timeouts on the HTTP clients we build for use with ACME clients, as seen
in BuildHTTPClient.
The dialer and TLS handshake timeouts are set to 30 and 10 seconds respectively, and both are likely fine to keep as they are and in any case unlikely to be a problem for people experiencing the issues detailed in #5080.
The overall HTTP timeout (see this blog for a diagram showing the different client timeouts) is set to 30s which is too short to address the issues we're trying fix here.
Risks and Mitigations
The risk here is that the timeouts we have already, as currently configured, might already be the last line of defence for some controllers in some clusters. By that we mean that by increasing the timeout we could pass some critical threshold where pods start to run out of resources and crash.
This could be in a cluster where someone is trying to create many slow-to-create Issuers at once, or to issue many Certificates at once from issuers which are slow, or to issue many Certificates from a single issuer which becomes overloaded and slows down.
Design Details
Most of the above proposals involve tweaking integers or else they have a commit which can be viewed. The actual PR here won't be difficult to understand - that's intentional, to minimise complexity of a change we want to backport.
The proposed value for β - the ACME HTTP timeout - is 90s. That's longer than the 60s which seems to be around the
upper limit of what people have reported seeing with Sectigo, which seems to be on the slower side.
The proposed value for α - the context timeout - is 120s. This is long enough to fit in a slow ACME request and then
do other work relating to issuance.
Justification
Crossplane
The above values are mostly inspired by crossplane.io, which uses context timeouts very heavily in basically all controllers (which is something that cert-manager should likely aspire to).
Here's a selection of timeouts in various crossplane controllers:
| Timeout | Reconciliation loop |
|---|---|
| 2 mins | definition/reconciler.go#L55 |
| 2 mins | definition/reconciler.go#L43 |
| 2 mins | composition/reconciler.go#L45 |
| 2 mins | binding/reconciler.go#L46 |
| 2 mins | namespace/reconciler.go#L43 |
| 2 mins | roles/reconciler.go#L45 |
| 2 mins | composite/reconciler.go#L45 |
| 1 min | offered/reconciler.go#L53 |
| 1 min | claim/reconciler.go#L46 |
| 1 min | manager/reconciler.go#L45 |
| 1 min | resolver/reconciler.go#L47 |
| 3 mins | revision/reconciler.go#L54 |
Crossplane also has an open issue relating to making this value configurable.
The idea here is "if it's good enough for crossplane why should it not be good enough for us?"
The current cert-manager timeouts are arbitrary. Likely the crossplane timeouts are also arbitrary. We can at least have confidence that a big project with a tonne of controllers and CRDs is using longer timeouts and clearly not seeing world-ending problems, and people want to increase the timeouts from that base too, as evidenced by the above open issue.
Another relevant timeout is certbot, which has a 45s limit on network requests by default and a 90s limit on polling after the fact. We go higher here because people have reported their requests taking longer than 45
Kubernetes
In investigation we couldn't find any evidence that there are timeouts used in kubernetes controllers such as the Deployments controller. That implies they don't think this topic is particularly scary, which in turn provides some comfort that picking a higher timeout isn't likely to bring about the end of the world.
User Reports
User reports of time taken vary slightly:
- One user reported times of around 30s: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1144561276
- Another reported similar: https://github.com/cert-manager/cert-manager/issues/4452#issuecomment-1143889347
- One said "at least 60s, sometimes more": https://github.com/cert-manager/cert-manager/issues/5080#issuecomment-1137259557
Since the largest time is 'sometimes more than 60s', our lowest timeout should likely be at least 60s.
General Justifications
There's a risk with work like this that we can focus too much on what's already present. The 10s timeout which is set for these
controllers on master was committed with absolutely no justification we can find in the PR, commit message, design doc or anywhere else.
We should the same amount of attachment to that 10 second timeout as there was justification given for it when it was added originally - absolutely none.
That's not to say it was a mistake or bad to add these timeouts originally - the controllers which have them are better than almost every other controller we have just by virtue of having a timeout at all!
Test Plan
As far as we can tell, none of this is tested currently, nor would we plan to add any new tests in the PR.
Down the road it would be great to add some tests for this stuff but that feels like a bigger project than this short term bug fix.
Graduation Criteria
There's nothing to feature gate and no graduation required.
Upgrade / Downgrade Strategy
A user upgrading will notice no changes except that the failing ACME issuers they currently have should start working.
A user downgrading to an unsupported version (1.6 and below) could start to see timeout issues since the timeouts in unsupported releases will be lower than those in 1.7 and above.
Supported Versions
cert-manager 1.7+
K8s support irrelevant
Production Readiness
How can this feature be enabled / disabled for an existing cert-manager installation?
This will take effect by default in all versions and won't be able to be disabled.
Does this feature depend on any specific services running in the cluster?
Yes, cert-manager
Will enabling / using this feature result in new API calls (i.e to Kubernetes apiserver or external services)?
This change will result in fewer API calls to external services and to Kubernetes. Since issuance on slow ACME issuers will be more likely to succeed, we won't have long retry loops where we'll continually retry and get timed out.
Put another way: instead of 20 requests where 19 of them time out and one succeeds, we'll just have one request that succeeds.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Potentially fewer Orders and Challenges, since we'll fail less often.
Will enabling / using this feature result in significant increase of resource usage? (CPU, RAM...)
If there's a massive queue of blocked requests there could be some increase in resource usage. Hard to say how it would change — if anything there's likely to be a decrease in total resource usage since longer-running requests will succeed and we won't need to pay the cost of running a whole Sync function again after an ACME request times out.
Drawbacks
Once we bump the default max timeout we can't realistically reduce the default again, since that could be perceived as backwards-incompatible.
To reduce timeouts we'd likely need to add a flag for configuring the timeouts and let cluster admins make the choice.
Alternatives
The obvious alternative is to add a flag or multiple flags to control these timeouts. That was rejected because it's a larger change and we want to backport these changes.
Another option is to increase the hardcoded timeouts everywhere and add a flag to configure this on master, so that the fix is applied on older branches but we have the functionality of a flag. This was rejected because it's more work and we don't have any evidence that we need to add a flag. We'd also have to face the difficulty of deciding how to scope the flags (i.e. should there be a contextTimeout which applies to all controllers everywhere? Is that useful?)