celeborn/tests
mingji 75b697d815 [CELEBORN-1838] Interrupt spark task should not report fetch failure
Backport https://github.com/apache/celeborn/pull/3070 to main branch.
## What changes were proposed in this pull request?
Do not trigger a fetch failure if a Spark task attempt is interrupted (for example, when speculation is enabled). Do not trigger a fetch failure if the getReducerFileGroup RPC times out. This PR is intended for the celeborn-0.5 branch.
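
The two rules above can be read as a small decision function. The following is only a minimal sketch of that idea, not the code changed by this PR: `FetchFailurePolicy`, `hasCause`, `shouldReportFetchFailure`, and the `taskInterrupted` flag are hypothetical names, and treating a `java.util.concurrent.TimeoutException` in the cause chain as "the getReducerFileGroup RPC timed out" is an assumption made for illustration.

```scala
import java.io.IOException
import java.util.concurrent.TimeoutException

// Hypothetical helper, not part of Celeborn: decides whether a shuffle read
// error should be surfaced to Spark as a fetch failure.
object FetchFailurePolicy {

  // Walk the cause chain and check whether any throwable matches the predicate.
  @annotation.tailrec
  def hasCause(t: Throwable, matches: Throwable => Boolean): Boolean =
    if (t == null) false
    else if (matches(t)) true
    else hasCause(t.getCause, matches)

  // taskInterrupted: assumed to be supplied by the caller, for example when
  // the task attempt was killed because a speculative copy finished first.
  def shouldReportFetchFailure(error: Throwable, taskInterrupted: Boolean): Boolean = {
    val interrupted =
      taskInterrupted || hasCause(error, _.isInstanceOf[InterruptedException])
    // Assumption: an RPC timeout shows up as a TimeoutException in the chain.
    val rpcTimedOut = hasCause(error, _.isInstanceOf[TimeoutException])
    // Only errors that are neither interruptions nor RPC timeouts are
    // treated as real fetch failures.
    !interrupted && !rpcTimedOut
  }
}
```

With this shape, an interrupted attempt or a timed-out RPC fails only the current task attempt, while a genuine data loss such as `new IOException("chunk lost")` would still be reported and can trigger a stage re-run.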

## Why are the changes needed?
Avoid unnecessary fetch failures and stage re-runs.

## Does this PR introduce any user-facing change?
NO.

## How was this patch tested?
1. GA.
2. Manually tested on a cluster with Spark speculation tasks.

Here is the test case:
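```scala
// Run in spark-shell, where `sc` is the SparkContext.
sc.parallelize(1 to 100, 100)
  .flatMap(i => (1 to 150000).iterator.map(num => num))
  .groupBy(i => i, 100)
  .map(i => {
    // Slow down the tasks that hold keys 1 to 4 so that they become stragglers.
    if (i._1 < 5) {
      Thread.sleep(15000)
    }
    i
  })
  .repartition(400)
  .count
```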
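
The test case relies on speculation being enabled on the cluster. The exact settings used for the manual test are not stated here, so the following is only a sketch with illustrative values: the keys are standard Spark speculation properties, while `SpeculationSettings`, `toSubmitArgs`, and the chosen values are assumptions.

```scala
// Hypothetical helper: collects speculation-related Spark properties and
// renders them as --conf arguments for spark-shell or spark-submit.
object SpeculationSettings {

  // Values are illustrative assumptions, not the ones used in the manual test.
  val settings: Seq[(String, String)] = Seq(
    "spark.speculation" -> "true",          // enable speculative task attempts
    "spark.speculation.interval" -> "100ms", // how often to check for stragglers
    "spark.speculation.multiplier" -> "1.5", // how much slower than the median counts as slow
    "spark.speculation.quantile" -> "0.75"   // fraction of tasks that must finish first
  )

  // Render each property as a pair of command-line tokens: --conf key=value.
  def toSubmitArgs(props: Seq[(String, String)] = settings): Seq[String] =
    props.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

For example, `SpeculationSettings.toSubmitArgs().mkString(" ")` yields a string that can be appended to a `spark-shell` command line before pasting the test case.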

<img width="1384" alt="Screenshot 2025-01-18 16 16 16" src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb" /> <img width="1393" alt="Screenshot 2025-01-18 16 16 22" src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530" /> <img width="1258" alt="Screenshot 2025-01-18 16 19 15" src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6" /> <img width="892" alt="Screenshot 2025-01-18 16 17 27" src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0" />

Closes #3070 from FMX/branch-0.5-b1838.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>

Closes #3080 from turboFei/b1838.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-01-23 14:46:36 +08:00
..
flink-it [MINOR] Rename org.apache.celeborn.plugin.flink.readclient to org.apache.celeborn.plugin.flink.client 2025-01-03 20:53:54 +08:00
kubernetes-it [CELEBORN-1565] Introduce warn-unused-import in Scala 2024-08-29 13:43:17 +08:00
mr-it [CELEBORN-1434] Support MRAppMasterWithCeleborn to disable job recovery and job reduce slow start by default 2024-05-22 15:32:41 +08:00
spark-it [CELEBORN-1838] Interrupt spark task should not report fetch failure 2025-01-23 14:46:36 +08:00
tez-it [CELEBORN-1737] Support build tez client package 2024-12-30 11:01:19 +08:00