celeborn

Wang, Fei 5e12b7d607 [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted

### What changes were proposed in this pull request?

For Spark applications using Celeborn, if the GetReducerFileGroupResponse is larger than the threshold, the Spark driver broadcasts it to the executors. This prevents the driver from becoming the bottleneck when sending out multiple copies of the GetReducerFileGroupResponse (one per executor).
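
The size-based decision described above can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: the class, the `Reply` type, the threshold value, and the `broadcast-<id>` handle format are all hypothetical, chosen only to show why replying with a small handle keeps the driver's per-executor traffic roughly constant.

```java
import java.nio.charset.StandardCharsets;

public class BroadcastDecisionSketch {
    // Hypothetical threshold in bytes; the commit message does not state the real value.
    static final int HYPOTHETICAL_THRESHOLD_BYTES = 512 * 1024;

    /** Reply sent from the driver to a single executor (hypothetical type). */
    static final class Reply {
        final boolean broadcast; // true when the payload is only a broadcast handle
        final byte[] payload;    // the full response, or a small handle

        Reply(boolean broadcast, byte[] payload) {
            this.broadcast = broadcast;
            this.payload = payload;
        }
    }

    /** Chooses between a direct reply and a small broadcast handle. */
    static Reply buildReply(byte[] serializedResponse, boolean broadcastEnabled, long broadcastId) {
        if (broadcastEnabled && serializedResponse.length > HYPOTHETICAL_THRESHOLD_BYTES) {
            // Large response: reply with a handle only; executors would fetch
            // the actual content through the broadcast mechanism instead.
            byte[] handle = ("broadcast-" + broadcastId).getBytes(StandardCharsets.UTF_8);
            return new Reply(true, handle);
        }
        // Small response, or feature disabled: send the response itself.
        return new Reply(false, serializedResponse);
    }

    /** Total reply bytes the driver sends when every executor asks once. */
    static long driverReplyBytes(byte[] serializedResponse, boolean broadcastEnabled, int executors) {
        long total = 0;
        for (int i = 0; i < executors; i++) {
            total += buildReply(serializedResponse, broadcastEnabled, 1L).payload.length;
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] large = new byte[4 * 1024 * 1024]; // stand-in for a large serialized response
        System.out.println("direct replies:    " + driverReplyBytes(large, false, 1000) + " bytes");
        System.out.println("broadcast replies: " + driverReplyBytes(large, true, 1000) + " bytes");
    }
}
```

In this sketch the direct path grows with both response size and executor count, while the broadcast path grows only with executor count, since each reply carries just the handle.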

### Why are the changes needed?
To prevent the driver from being the bottleneck in sending out multiple copies of the GetReducerFileGroupResponse (one per executor).

### Does this PR introduce _any_ user-facing change?
No, the feature is not enabled by default.

### How was this patch tested?

Unit tests (UT).

Cluster testing with `spark.celeborn.client.spark.shuffle.getReducerFileGroup.broadcast.enabled=true`.

The broadcast response size should always be about 1 KB.
![image](https://github.com/user-attachments/assets/d5d1b751-762d-43c8-8a84-0674630a5638)
![image](https://github.com/user-attachments/assets/4841a29e-5d11-4932-9fa5-f6e78b7bc521)
The application succeeded.
![image](https://github.com/user-attachments/assets/9b570f70-1433-4457-90ae-b8292e5476ba)

Closes #3158 from turboFei/broadcast_rgf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>

2025-04-01 08:29:21 -07:00

src/main/resources/META-INF	[INFRA] Remove incubator/incubating for graduation	2024-03-27 13:54:47 +08:00
pom.xml	[CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted	2025-04-01 08:29:21 -07:00