celeborn/tests
gaoyajun02 6a097944cf [CELEBORN-2042] Fix FetchFailure handling when TaskSetManager is not found
### What changes were proposed in this pull request?
Fixes the FetchFailure handling logic in the `shouldReportShuffleFetchFailure` method so that it properly handles cases where the TaskSetManager cannot be found for a given task ID.

### Why are the changes needed?
The current implementation incorrectly reports FetchFailure when TaskSetManager is not found, which leads to false positive failures in normal fault tolerance scenarios. This happens because:
1. Executor loss: when executors are lost due to resource preemption or failures, the associated TaskSetManager is cleaned up and is no longer available for lookup.
2. Stage cancellation: cancelled or completed stages may have their TaskSetManager removed.

These are all normal scenarios in Spark's fault tolerance mechanism and should not be treated as shuffle failures. The current behavior can cause unnecessary job failures and make it harder to debug actual shuffle issues.
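
The intended decision can be illustrated with a minimal, self-contained sketch. It is not the Celeborn or Spark code: `FetchFailureSketch`, `TaskSetHandle`, `register`, `unregister`, and `findTaskSet` are hypothetical names introduced only for illustration, and the branch taken when a TaskSetManager is found is reduced to returning `true`, whereas the real method may apply further checks that are not modelled here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the decision described above; not the actual Celeborn code.
class FetchFailureSketch {

    // Stand-in for Spark's TaskSetManager (assumption: only a stage name is tracked).
    static final class TaskSetHandle {
        final String stageName;

        TaskSetHandle(String stageName) {
            this.stageName = stageName;
        }
    }

    // Hypothetical registry mapping a task ID to the task set that owns it.
    private final Map<Long, TaskSetHandle> taskSetsByTaskId = new ConcurrentHashMap<>();

    void register(long taskId, String stageName) {
        taskSetsByTaskId.put(taskId, new TaskSetHandle(stageName));
    }

    // Models cleanup after executor loss or stage cancellation.
    void unregister(long taskId) {
        taskSetsByTaskId.remove(taskId);
    }

    Optional<TaskSetHandle> findTaskSet(long taskId) {
        return Optional.ofNullable(taskSetsByTaskId.get(taskId));
    }

    boolean shouldReportShuffleFetchFailure(long taskId) {
        Optional<TaskSetHandle> taskSet = findTaskSet(taskId);
        if (!taskSet.isPresent()) {
            // A missing task set is treated as a normal fault tolerance case,
            // so no FetchFailure is reported.
            return false;
        }
        // Simplification: any further checks in the real method are not modelled.
        return true;
    }

    public static void main(String[] args) {
        FetchFailureSketch sketch = new FetchFailureSketch();
        sketch.register(1L, "stage-0");
        System.out.println(sketch.shouldReportShuffleFetchFailure(1L)); // task set known
        sketch.unregister(1L);
        System.out.println(sketch.shouldReportShuffleFetchFailure(1L)); // task set cleaned up
    }
}
```

In this sketch, a task whose task set has been removed no longer triggers a report, which mirrors the executor-loss and stage-cancellation cases listed above.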

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests and long-running production validation.

Closes #3339 from gaoyajun02/CELEBORN-2042.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-18 10:22:10 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| flink-it | [CELEBORN-1912] Client should send heartbeat to worker for processing heartbeat to avoid reading idleness of worker which enables heartbeat | 2025-05-08 10:09:50 +08:00 |
| kubernetes-it | [CELEBORN-1565] Introduce warn-unused-import in Scala | 2024-08-29 13:43:17 +08:00 |
| mr-it | [CELEBORN-1434] Support MRAppMasterWithCeleborn to disable job recovery and job reduce slow start by default | 2024-05-22 15:32:41 +08:00 |
| spark-it | [CELEBORN-2042] Fix FetchFailure handling when TaskSetManager is not found | 2025-06-18 10:22:10 -07:00 |
| tez-it | [CELEBORN-1737] Support build tez client package | 2024-12-30 11:01:19 +08:00 |