celeborn

CodingCat 0b5a09a9f7 [CELEBORN-1896] delete data from failed to fetch shuffles		2025-05-21 11:23:11 +08:00

### What changes were proposed in this pull request?

This is joint work with YutingWang98.

Currently, we have to wait for the Spark shuffle object to be garbage-collected before the disk space occupied by Celeborn shuffles is cleaned up. As a result, if a shuffle fails to be fetched and is retried, the disk space occupied by the failed attempt cannot be reclaimed. We hit this issue internally when dealing with shuffles at the level of hundreds of TB in a single Spark application: any hiccup in the servers can double or even triple the disk usage.

This PR implements a mechanism to delete files from failed-to-fetch shuffles. The main idea is simple: cleanup is triggered in LifecycleManager when it applies for a new Celeborn shuffle id for a retried shuffle write stage. The tricky part is to avoid deleting shuffle files that are referenced by multiple downstream stages. For this, the PR introduces RunningStageManager to track the dependencies between stages.

### Why are the changes needed?

To save disk space.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.

### How was this patch tested?

We manually deleted some files.

![image](https://github.com/user-attachments/assets/4136cd52-78b2-44e7-8244-db3c5bf9d9c4)

From the screenshot above, we can see that we originally had shuffles 0 and 1. After shuffle 1 hit a chunk fetch failure, it triggered a retry of shuffle 0 (as shuffle 2), and at that moment shuffle 0 had already been deleted from the workers.

![image](https://github.com/user-attachments/assets/7d3b4d90-ae5a-4a54-8dec-a5005850ef0a)

In the logs, we can see that the unregister shuffle request for shuffle 0 was sent in the middle of the application.

Closes #3109 from CodingCat/delete_fi.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
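
The "do not delete a shuffle that other downstream stages still read" rule from the commit message can be pictured with a small reference tracker. This is only a minimal sketch under stated assumptions, not the code from the PR: the class `ShuffleReferenceTracker` and its methods `stageStarted`, `stageFinished`, and `canDelete` are hypothetical names, and the real RunningStageManager may track dependencies quite differently.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is NOT the RunningStageManager from the PR.
public class ShuffleReferenceTracker {
    // shuffle id -> ids of running stages that still read from that shuffle
    private final Map<Integer, Set<Integer>> readersByShuffle = new HashMap<>();

    // Record that a running stage reads the given upstream shuffles.
    public synchronized void stageStarted(int stageId, Collection<Integer> upstreamShuffleIds) {
        for (int shuffleId : upstreamShuffleIds) {
            readersByShuffle.computeIfAbsent(shuffleId, k -> new HashSet<>()).add(stageId);
        }
    }

    // Remove a finished stage from every shuffle it was reading.
    public synchronized void stageFinished(int stageId) {
        readersByShuffle.values().forEach(readers -> readers.remove(stageId));
        readersByShuffle.values().removeIf(Set::isEmpty);
    }

    // Assumed rule: a failed-to-fetch shuffle may be deleted only when no running
    // stage other than the one that reported the fetch failure still reads it.
    public synchronized boolean canDelete(int shuffleId, int failedReaderStageId) {
        Set<Integer> readers =
            readersByShuffle.getOrDefault(shuffleId, Collections.<Integer>emptySet());
        for (int reader : readers) {
            if (reader != failedReaderStageId) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, a caller would consult `canDelete` before unregistering the old shuffle when a new shuffle id is requested for the retried write stage. If a second running stage still reads the same shuffle, deletion is deferred until that stage finishes.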
common	[CELEBORN-1896] delete data from failed to fetch shuffles	2025-05-21 11:23:11 +08:00
spark-2	[CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application	2025-05-19 07:20:00 -07:00
spark-2-shaded	[CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted	2025-04-01 08:29:21 -07:00
spark-3	[CELEBORN-1896] delete data from failed to fetch shuffles	2025-05-21 11:23:11 +08:00
spark-3-columnar-common	[CELEBORN-1413][FOLLOWUP] Rename `celeborn-client-spark-3-4` back to `celeborn-client-spark-3`	2025-03-04 22:25:10 +08:00
spark-3-columnar-shuffle	[CELEBORN-1413][FOLLOWUP] Rename `celeborn-client-spark-3-4` back to `celeborn-client-spark-3`	2025-03-04 22:25:10 +08:00
spark-3-shaded	[CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted	2025-04-01 08:29:21 -07:00
spark-3.5-columnar-shuffle	[CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5	2024-09-05 16:52:47 +08:00
spark-4-columnar-shuffle	[CELEBORN-1413] Support Spark 4.0	2024-12-24 18:12:27 +08:00
spark-4-shaded	[CELEBORN-1413][FOLLOWUP] Rename `celeborn-client-spark-3-4` back to `celeborn-client-spark-3`	2025-03-04 22:25:10 +08:00