celeborn/docs/configuration
Wang, Fei 5e12b7d607 [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted
### What changes were proposed in this pull request?

For spark celeborn application, if the GetReducerFileGroupResponse is larger than the threshold, Spark driver would broadcast the GetReducerFileGroupResponse to the executors, it prevents the driver from being the bottleneck in sending out multiple copies of the GetReducerFileGroupResponse (one per executor).

### Why are the changes needed?
To prevent the driver from being the bottleneck in sending out multiple copies of the GetReducerFileGroupResponse (one per executor).

### Does this PR introduce _any_ user-facing change?
No, the feature is not enabled by defaults.

### How was this patch tested?

UT.

Cluster testing with `spark.celeborn.client.spark.shuffle.getReducerFileGroup.broadcast.enabled=true`.

The broadcast response size should be always about 1kb.
![image](https://github.com/user-attachments/assets/d5d1b751-762d-43c8-8a84-0674630a5638)
![image](https://github.com/user-attachments/assets/4841a29e-5d11-4932-9fa5-f6e78b7bc521)
Application succeed.
![image](https://github.com/user-attachments/assets/9b570f70-1433-4457-90ae-b8292e5476ba)

Closes #3158 from turboFei/broadcast_rgf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-04-01 08:29:21 -07:00
..
client.md [CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted 2025-04-01 08:29:21 -07:00
columnar-shuffle.md [CELEBORN-1051] Add isDynamic property for CelebornConf 2024-02-20 14:20:44 +08:00
ha.md [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1 2024-05-30 17:22:22 +08:00
index.md [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE 2024-12-23 10:31:37 +08:00
master.md [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle 2025-03-24 22:05:45 +08:00
metrics.md [CELEBORN-1745] Remove application top disk usage code 2024-11-28 10:55:34 +08:00
network-module.md [CELEBORN-1353] Document Celeborn security - authentication and SSL support 2024-04-30 14:37:56 +08:00
network.md [MINOR] Change config versions 2025-03-11 07:39:32 +08:00
quota.md [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle 2025-03-24 22:05:45 +08:00
worker.md [CELEBORN-1792][FOLLOWUP] Keep resume for a while after resumeByPinnedMemory 2025-03-05 09:37:59 +08:00