From 21d5698a90014f880130ea487751663d06f11650 Mon Sep 17 00:00:00 2001 From: Mridul Muralidharan Date: Wed, 20 Mar 2024 08:50:14 +0800 Subject: [PATCH] [CELEBORN-1339] Mark connection as timedOut in TransportClient.close ### What changes were proposed in this pull request? Importing details from https://github.com/apache/spark/pull/43162: -- This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use. This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue. Looking at the history of the code I believe this was an oversight in https://github.com/apache/spark/pull/9853. -- ### Why are the changes needed? We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there. (I am yet to file the overall TLS support jira yet, but this is enabling work). ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests Closes #2400 from mridulm/add-SPARK-45375. Authored-by: Mridul Muralidharan Signed-off-by: SteNicholas --- .../apache/celeborn/common/network/client/TransportClient.java | 3 +++ 1 file changed, 3 insertions(+) diff --git a/common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java b/common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java index 88aba6992..f4b62d872 100644 --- a/common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java +++ b/common/src/main/java/org/apache/celeborn/common/network/client/TransportClient.java @@ -317,6 +317,9 @@ public class TransportClient implements Closeable { @Override public void close() { + // Mark the connection as timed out, so we do not return a connection that's being closed + // from the TransportClientFactory if closing takes some time (e.g. with SSL) + this.timedOut = true; // close is a local operation and should finish with milliseconds; timeout just to be safe channel.close().awaitUninterruptibly(10, TimeUnit.SECONDS); }