### What changes were proposed in this pull request?

Currently, Celeborn relies on replication to handle shuffle data loss in the Celeborn shuffle reader. This PR implements an alternative solution: Spark stage resubmission.

Design doc: https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?

Spark stage resubmission uses fewer resources than replication, and some Celeborn users have asked for it.

### Does this PR introduce _any_ user-facing change?

Yes. A new config, `celeborn.client.fetch.throwsFetchFailure`, is introduced to enable this feature.

### How was this patch tested?

Two UTs are attached, and we also tested it in Ant Group's dev Spark cluster.

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
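
As an illustration of the user-facing change described above, enabling the feature for a Spark job might look like the following `spark-defaults.conf` fragment. This is a sketch under assumptions: only the key `celeborn.client.fetch.throwsFetchFailure` comes from this PR; the `spark.` prefix (the usual way Celeborn client settings are passed through Spark) and the boolean value `true` are assumed, not confirmed by the description.

```properties
# Assumption: Celeborn client settings are passed through Spark with a "spark." prefix.
# Assumption: the option is a boolean; "true" asks the shuffle reader to surface a
# fetch failure to Spark so the stage is resubmitted instead of relying on replication.
spark.celeborn.client.fetch.throwsFetchFailure=true
```

The same setting could presumably be passed on the command line with `--conf`; check the design doc linked above for the exact semantics before relying on it.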
| Name |
|---|
| src |
| pom.xml |