celeborn/tests
Erik.fang aee41555c6 [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure
### What changes were proposed in this pull request?
Currently, Celeborn relies on replication to handle shuffle data loss on the Celeborn shuffle reader side. This PR implements an alternative solution based on Spark stage resubmission.

Design doc:
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?
Spark stage resubmission uses fewer resources than replication, and some Celeborn users have asked for it.

### Does this PR introduce _any_ user-facing change?
Yes. A new config, celeborn.client.fetch.throwsFetchFailure, is introduced to enable this feature.
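
As a rough illustration of how the new switch might be passed to a Spark job, the sketch below builds the setting and renders it as `spark-submit` arguments. Only the key `celeborn.client.fetch.throwsFetchFailure` comes from this PR. The `spark.` prefix is an assumption about how Celeborn client settings are forwarded through Spark, and the class and method names (`FetchFailureConf`, `sparkSettings`, `toSubmitArgs`) are hypothetical helpers, not part of Celeborn.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FetchFailureConf {
    // Config key named in this PR description.
    static final String THROWS_FETCH_FAILURE = "celeborn.client.fetch.throwsFetchFailure";

    // Assumption: Celeborn client keys are passed via Spark with a "spark." prefix.
    static final String SPARK_PREFIX = "spark.";

    // Hypothetical helper: builds the Spark-side setting for the feature flag.
    static Map<String, String> sparkSettings(boolean enabled) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(SPARK_PREFIX + THROWS_FETCH_FAILURE, Boolean.toString(enabled));
        return conf;
    }

    // Hypothetical helper: renders the settings as "--conf key=value" arguments.
    static String toSubmitArgs(Map<String, String> conf) {
        return conf.entrySet().stream()
            .map(e -> "--conf " + e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toSubmitArgs(sparkSettings(true)));
    }
}
```

Under these assumptions, running `main` prints `--conf spark.celeborn.client.fetch.throwsFetchFailure=true`. The exact key and prefix should be checked against the Celeborn configuration reference for the version in use.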

### How was this patch tested?
Two unit tests are added, and the change was also tested on Ant Group's dev Spark cluster.

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-26 16:47:58 +08:00
..
flink-it [CELEBORN-1022][TEST] Update log level from FATAL to ERROR for console output in unit tests 2023-10-09 15:56:05 +08:00
kubernetes-it [CELEBORN-983] Rename PrometheusMetric configuration 2023-10-13 13:28:58 +08:00
mr-it [CELEBORN-856] Add mapreduce integration test 2023-11-22 14:36:29 +08:00
spark-it [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure 2023-11-26 16:47:58 +08:00