celeborn/project
mingji 75b697d815 [CELEBORN-1838] Interrupt spark task should not report fetch failure
Backport https://github.com/apache/celeborn/pull/3070 to main branch.
## What changes were proposed in this pull request?
Do not trigger a fetch failure if a Spark task attempt is interrupted (with speculation enabled). Do not trigger a fetch failure if the getReducerFileGroup RPC times out. This PR is intended for the celeborn-0.5 branch.
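
The rule described above can be sketched as a small classification step. This is only an illustration under stated assumptions, not the code changed by this PR: `FetchFailurePolicy`, `isBenign`, and `shouldReportFetchFailure` are hypothetical names, and treating `InterruptedException`, `InterruptedIOException`, and `TimeoutException` as the "interrupted" and "RPC timeout" signals is an assumption about how those conditions surface.

```scala
import java.io.InterruptedIOException
import java.util.concurrent.TimeoutException
import scala.annotation.tailrec

// Hypothetical helper, not part of Celeborn: decides whether an error seen
// while reading shuffle data should be escalated as a fetch failure.
object FetchFailurePolicy {

  // True if `t`, or any cause in its chain, looks like a task interrupt or an
  // RPC timeout. The exception types checked here are an assumption.
  @tailrec
  def isBenign(t: Throwable): Boolean = t match {
    case null => false
    case _: InterruptedException | _: InterruptedIOException | _: TimeoutException => true
    case other => isBenign(other.getCause)
  }

  // Report a fetch failure only for errors that are neither interrupts nor timeouts.
  def shouldReportFetchFailure(t: Throwable): Boolean = !isBenign(t)
}
```

With this sketch, an `InterruptedException` raised when a speculative attempt is killed would not be reported, while a plain `java.io.IOException` from a failed read still would. Walking the cause chain matters because such errors are often wrapped in another exception before they reach the caller.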

## Why are the changes needed?
Avoid unnecessary fetch failures and stage re-runs.

## Does this PR introduce any user-facing change?
NO.

## How was this patch tested?
1. GA.
2. Manually tested on cluster with spark speculation tasks.

Here is the test case:
```scala
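// Run in spark-shell, where `sc` is the active SparkContext.
sc.parallelize(1 to 100, 100)
  .flatMap(i => (1 to 150000).iterator.map(num => num))
  .groupBy(i => i, 100)
  .map(i => {
    if (i._1 < 5) {
      // Delay the groups with keys 1 to 4 so that their tasks become stragglers.
      Thread.sleep(15000)
    }
    i
  })
  .repartition(400)
  .count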
```
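
The test above only produces speculative attempts if speculation is enabled on the cluster. The PR does not state which settings were used, so the following is a sketch with assumed values: the keys are standard Spark speculation properties, while the values, `SpeculationSettings`, and `toSubmitArgs` are illustrative only.

```scala
// Hypothetical helper for reproducing the test; the values are assumptions,
// not the settings used by the PR author.
object SpeculationSettings {

  val conf: Map[String, String] = Map(
    "spark.speculation" -> "true",           // enable speculative execution
    "spark.speculation.interval" -> "100ms", // how often to check for stragglers
    "spark.speculation.multiplier" -> "1.5", // how much slower than the median counts as slow
    "spark.speculation.quantile" -> "0.75"   // fraction of tasks that must finish first
  )

  // Render the settings as `--conf key=value` arguments, sorted by key.
  def toSubmitArgs(settings: Map[String, String]): Seq[String] =
    settings.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

Calling `SpeculationSettings.toSubmitArgs(SpeculationSettings.conf).mkString(" ")` gives a string that can be appended to a `spark-shell` or `spark-submit` command line.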

<img width="1384" alt="Screenshot 2025-01-18 16 16 16" src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb" />
<img width="1393" alt="Screenshot 2025-01-18 16 16 22" src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530" />
<img width="1258" alt="Screenshot 2025-01-18 16 19 15" src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6" />
<img width="892" alt="Screenshot 2025-01-18 16 17 27" src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0" />

Closes #3070 from FMX/branch-0.5-b1838.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>

Closes #3080 from turboFei/b1838.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-01-23 14:46:36 +08:00
..
AddMetaInfLicenseFiles.scala [CELEBORN-1204] Update NOTICE year 2024 2024-01-02 15:55:52 +08:00
build.properties
BuildTools.scala
CelebornBuild.scala [CELEBORN-1838] Interrupt spark task should not report fetch failure 2025-01-23 14:46:36 +08:00
JDKTools.scala [CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake 2023-11-28 20:45:08 +08:00
plugins.sbt [CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7 2024-10-24 11:16:37 +08:00