celeborn/client-spark
CodingCat 0b5a09a9f7 [CELEBORN-1896] delete data from failed to fetch shuffles
### What changes were proposed in this pull request?

This is joint work with YutingWang98.

Currently, we have to wait for Spark's shuffle object GC to clean up the disk space occupied by Celeborn shuffles.

As a result, if a shuffle fails to be fetched and is retried, the disk space occupied by the failed attempt cannot be reclaimed until that GC happens. We hit this issue internally when dealing with shuffles at the level of hundreds of TB in a single Spark application: any hiccup in the servers can double or even triple the disk usage.

This PR implements a mechanism to delete the files of failed-to-fetch shuffles.

The main idea is simple: LifecycleManager triggers the cleanup when it allocates a new Celeborn shuffle id for a retried shuffle write stage.

The tricky part is to avoid deleting shuffle files that are still referenced by multiple downstream stages. To handle this, the PR introduces RunningStageManager, which tracks the dependencies between stages.
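
To illustrate the dependency check described above, here is a minimal sketch of reference tracking between running stages and the shuffles they read. It is not the RunningStageManager from this PR: the class name `ShuffleReferenceTracker` and all of its methods are hypothetical, and it assumes that a stage which failed to fetch is reported as finished before its upstream shuffle is considered for deletion.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; names and behavior are assumptions, not the PR's code. */
final class ShuffleReferenceTracker {
  // Celeborn shuffle id -> ids of running stages that still read from it
  private final Map<Integer, Set<Integer>> readersByShuffle = new ConcurrentHashMap<>();

  /** Record that a running stage reads the given upstream shuffles. */
  void stageStarted(int stageId, Set<Integer> upstreamShuffleIds) {
    for (int shuffleId : upstreamShuffleIds) {
      readersByShuffle
          .computeIfAbsent(shuffleId, k -> ConcurrentHashMap.newKeySet())
          .add(stageId);
    }
  }

  /** Drop a stage from every shuffle it was reading (finished or failed). */
  void stageFinished(int stageId) {
    readersByShuffle.values().forEach(readers -> readers.remove(stageId));
  }

  /** A shuffle's files are safe to delete only if no running stage reads it. */
  boolean canDelete(int shuffleId) {
    Set<Integer> readers = readersByShuffle.get(shuffleId);
    return readers == null || readers.isEmpty();
  }
}
```

In this sketch, the caller would check `canDelete(oldShuffleId)` at the point where a new shuffle id is requested for the retried write stage, and skip the cleanup while another downstream stage is still reading the old shuffle.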

### Why are the changes needed?

To save disk space.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

We tested manually by deleting some files.

![image](https://github.com/user-attachments/assets/4136cd52-78b2-44e7-8244-db3c5bf9d9c4)

From the screenshot above, we can see that we originally have shuffles 0 and 1. After shuffle 1 hits a chunk fetch failure, it triggers a retry of shuffle 0 (as shuffle 2), and by this point shuffle 0 has already been deleted from the workers.

![image](https://github.com/user-attachments/assets/7d3b4d90-ae5a-4a54-8dec-a5005850ef0a)

In the logs, we can see that, in the middle of the application, the unregister shuffle request was sent for shuffle 0.

Closes #3109 from CodingCat/delete_fi.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-21 11:23:11 +08:00
..
common [CELEBORN-1896] delete data from failed to fetch shuffles 2025-05-21 11:23:11 +08:00
spark-2 [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application 2025-05-19 07:20:00 -07:00
spark-2-shaded [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted 2025-04-01 08:29:21 -07:00
spark-3 [CELEBORN-1896] delete data from failed to fetch shuffles 2025-05-21 11:23:11 +08:00
spark-3-columnar-common [CELEBORN-1413][FOLLOWUP] Rename celeborn-client-spark-3-4 back to celeborn-client-spark-3 2025-03-04 22:25:10 +08:00
spark-3-columnar-shuffle [CELEBORN-1413][FOLLOWUP] Rename celeborn-client-spark-3-4 back to celeborn-client-spark-3 2025-03-04 22:25:10 +08:00
spark-3-shaded [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted 2025-04-01 08:29:21 -07:00
spark-3.5-columnar-shuffle [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 2024-09-05 16:52:47 +08:00
spark-4-columnar-shuffle [CELEBORN-1413] Support Spark 4.0 2024-12-24 18:12:27 +08:00
spark-4-shaded [CELEBORN-1413][FOLLOWUP] Rename celeborn-client-spark-3-4 back to celeborn-client-spark-3 2025-03-04 22:25:10 +08:00