Backport https://github.com/apache/celeborn/pull/3070 to main branch.

## What changes were proposed in this pull request?

Do not trigger a fetch failure if a Spark task attempt is interrupted (speculation enabled).
Do not trigger a fetch failure if the `getReducerFileGroup` RPC times out.

This PR is intended for the celeborn-0.5 branch.

## Why are the changes needed?

Avoid unnecessary fetch failures and stage re-runs.

## Does this PR introduce any user-facing change?

No.

## How was this patch tested?

1. GA.
2. Manually tested on a cluster with Spark speculation tasks.

Here is the test case:

```scala
// Run in spark-shell, where `sc` is the SparkContext.
sc.parallelize(1 to 100, 100)
  .flatMap(i => (1 to 150000).iterator.map(num => num))
  .groupBy(i => i, 100)
  .map { group =>
    // Delay the groups with key < 5 by 15 s to create stragglers.
    if (group._1 < 5) {
      Thread.sleep(15000)
    }
    group
  }
  .repartition(400)
  .count
```

<img width="1384" alt="Screenshot 2025-01-18 16 16 16" src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb" />
<img width="1393" alt="Screenshot 2025-01-18 16 16 22" src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530" />
<img width="1258" alt="Screenshot 2025-01-18 16 19 15" src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6" />
<img width="892" alt="Screenshot 2025-01-18 16 17 27" src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0" />

Closes #3070 from FMX/branch-0.5-b1838.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>

Closes #3080 from turboFei/b1838.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>

||
|---|---|---|
| AddMetaInfLicenseFiles.scala | ||
| build.properties | ||
| BuildTools.scala | ||
| CelebornBuild.scala | ||
| JDKTools.scala | ||
| plugins.sbt | ||