From 362865f2ce313dbec4798ed752bb2ddf825f5bbf Mon Sep 17 00:00:00 2001
From: sychen
Date: Fri, 11 Oct 2024 14:13:49 +0800
Subject: [PATCH] [CELEBORN-1571] Fix flaky test - pushdata timeout will add to pushExcludedWorker

### What changes were proposed in this pull request?

Filter out workers whose status is `WORKER_UNKNOWN` from `excludedWorkers` in `PushDataTimeoutTest` before asserting that every remaining entry is `PUSH_DATA_TIMEOUT_PRIMARY` or `PUSH_DATA_TIMEOUT_REPLICA`.

### Why are the changes needed?

When a worker's port is already in use, the driver's recorded status for that worker can change from shutdown to unknown. The test then finds a `WORKER_UNKNOWN` entry in `excludedWorkers` and fails.

https://github.com/apache/celeborn/actions/runs/10465286274/job/28980278764

```java
- celeborn spark integration test - pushdata timeout will add to pushExcludedWorkers *** FAILED ***
  WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_PRIMARY, and WORKER_UNKNOWN did not equal PUSH_DATA_TIMEOUT_REPLICA (PushDataTimeoutTest.scala:150)
```

unit-tests.log

```
24/08/20 05:28:30,400 INFO [celeborn-dispatcher-7] Master: Receive ReportNodeFailure [
Host: localhost
RpcPort: 41487
PushPort: 34259
FetchPort: 45713
ReplicatePort: 35107
InternalPort: 41487
24/08/20 05:29:29,414 WARN [celeborn-client-lifecycle-manager-change-partition-executor-3] WorkerStatusTracker: Reporting failed workers:
Host:localhost:RpcPort:42267:PushPort:43741:FetchPort:46483:ReplicatePort:43587   PUSH_DATA_TIMEOUT_PRIMARY   2024-08-19T22:29:29.414-0700
Current unknown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487   2024-08-19T22:29:29.108-0700
Current shutdown workers:
Host:localhost:RpcPort:41487:PushPort:34259:FetchPort:45713:ReplicatePort:35107:InternalPort:41487
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #2697 from cxzl25/CELEBORN-1571.

Authored-by: sychen
Signed-off-by: Shuang
---
 .../apache/celeborn/tests/spark/PushDataTimeoutTest.scala | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala b/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
index 5bf7c1303..ea398968c 100644
--- a/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
+++ b/tests/spark-it/src/test/scala/org/apache/celeborn/tests/spark/PushDataTimeoutTest.scala
@@ -144,9 +144,12 @@ class PushDataTimeoutTest extends AnyFunSuite
       .getLifecycleManager
       .workerStatusTracker
       .excludedWorkers
+      .asScala.filter { case (_, (code, _)) =>
+        code != StatusCode.WORKER_UNKNOWN
+      }.toMap
 
-    assert(excludedWorkers.size() > 0)
-    excludedWorkers.asScala.foreach { case (_, (code, _)) =>
+    assert(excludedWorkers.size > 0)
+    excludedWorkers.foreach { case (_, (code, _)) =>
       assert(code == StatusCode.PUSH_DATA_TIMEOUT_PRIMARY ||
         code == StatusCode.PUSH_DATA_TIMEOUT_REPLICA)
     }
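
The filtering step in this patch can be tried outside Celeborn with a minimal, self-contained Scala sketch. Everything Celeborn-specific here is an assumption made for illustration: `StatusCode` is a hypothetical stand-in enumeration, the map's value type `(StatusCode.Value, Long)` (a status plus a timestamp) and the string worker keys are guesses at the shape, and `withoutUnknown` is a hypothetical helper name. It uses `scala.collection.JavaConverters`, which is deprecated in Scala 2.13 in favour of `scala.jdk.CollectionConverters`.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Hypothetical stand-in for Celeborn's StatusCode; only the values used here.
object StatusCode extends Enumeration {
  val PUSH_DATA_TIMEOUT_PRIMARY, PUSH_DATA_TIMEOUT_REPLICA, WORKER_UNKNOWN = Value
}

object ExcludedWorkersFilterSketch {
  // Assumed shape: worker id -> (status code, timestamp).
  // Drops WORKER_UNKNOWN entries and returns an immutable Scala Map,
  // which is why the patched assertions call `.size` rather than `.size()`.
  def withoutUnknown(
      excluded: java.util.Map[String, (StatusCode.Value, Long)])
      : Map[String, (StatusCode.Value, Long)] =
    excluded.asScala.filter { case (_, (code, _)) =>
      code != StatusCode.WORKER_UNKNOWN
    }.toMap

  def main(args: Array[String]): Unit = {
    val raw = new ConcurrentHashMap[String, (StatusCode.Value, Long)]()
    raw.put("localhost:42267", (StatusCode.PUSH_DATA_TIMEOUT_PRIMARY, 1L))
    raw.put("localhost:41487", (StatusCode.WORKER_UNKNOWN, 2L))

    val excludedWorkers = withoutUnknown(raw)

    // Same assertions as the patched test, now applied to the filtered map.
    assert(excludedWorkers.size > 0)
    excludedWorkers.foreach { case (_, (code, _)) =>
      assert(code == StatusCode.PUSH_DATA_TIMEOUT_PRIMARY ||
        code == StatusCode.PUSH_DATA_TIMEOUT_REPLICA)
    }
    println(excludedWorkers.keys.mkString(","))
  }
}
```

Filtering before asserting keeps the test focused on push-data timeouts: a worker that happens to be reported as unknown no longer fails the per-entry check, while the `size > 0` assertion still requires at least one timeout entry.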