### What changes were proposed in this pull request?

Add a size limit on in-flight push data by introducing a new configuration, `celeborn.client.push.maxBytesInFlight.perWorker/total`, which defaults to `celeborn.client.push.buffer.max.size * celeborn.client.push.maxReqsInFlight.perWorker/total`. For backward compatibility, also add a switch: `celeborn.client.push.maxReqsInFlight.enabled`.

### Why are the changes needed?

Celeborn already supports limiting the number of in-flight push requests via `celeborn.client.push.maxReqsInFlight.perWorker/total`. This is a good constraint on memory usage when most requests do not exceed `celeborn.client.push.buffer.max.size`. However, in a vectorized shuffle (such as Blaze and Gluten), a single request can be much larger than the max buffer size, so a limit on request count alone allows too much in-flight data and results in OOM.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new client configurations.

### How was this patch tested?

Tested in a local environment.

Closes #3248 from DDDominik/CELEBORN-1917.

Lead-authored-by: DDDominik <1015545832@qq.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: DDDominik <zhuangxian@kuaishou.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
| Name |
|---|
| benchmarks |
| src |
| pom.xml |
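The byte-based limit described in the commit message can be illustrated with a minimal sketch. It is not Celeborn's actual implementation: the class `InFlightByteLimiter` and its methods are hypothetical, and the rule that an oversized request is admitted only when nothing else is in flight is an assumption made here so that a request larger than the limit cannot stall forever. The sketch only shows how a default of buffer size times request count translates into a byte budget, and why that budget bounds memory even when individual requests exceed the buffer size.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of a per-worker in-flight byte limiter.
 * Not Celeborn's real class; names and admission rules are assumptions.
 */
final class InFlightByteLimiter {
    private final long maxBytesInFlight;
    private final AtomicLong bytesInFlight = new AtomicLong();

    /**
     * Derives the byte budget the way the commit message describes the default:
     * max buffer size multiplied by the max number of in-flight requests.
     */
    InFlightByteLimiter(long bufferMaxSizeBytes, int maxReqsInFlight) {
        this.maxBytesInFlight =
            Math.multiplyExact(bufferMaxSizeBytes, (long) maxReqsInFlight);
    }

    long maxBytesInFlight() {
        return maxBytesInFlight;
    }

    long bytesInFlight() {
        return bytesInFlight.get();
    }

    /**
     * Tries to reserve room for one push request.
     * Returns false when the request would exceed the budget, unless nothing
     * is currently in flight (assumption: lets an oversized request proceed alone).
     */
    boolean tryAcquire(long requestBytes) {
        while (true) {
            long current = bytesInFlight.get();
            boolean fits = current + requestBytes <= maxBytesInFlight;
            boolean idle = current == 0;
            if (!fits && !idle) {
                return false;
            }
            if (bytesInFlight.compareAndSet(current, current + requestBytes)) {
                return true;
            }
        }
    }

    /** Returns the reserved bytes once the push request has completed. */
    void release(long requestBytes) {
        bytesInFlight.addAndGet(-requestBytes);
    }
}
```

With an illustrative 64 KiB buffer size and 32 requests per worker, the budget is 2 MiB. A count-based limit would let 32 vectorized requests of several megabytes each proceed at once, while the byte budget rejects further requests as soon as the reserved total would pass 2 MiB.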