### What changes were proposed in this pull request?

`SlotsAllocator` supports a policy that lets the master fall back to roundrobin slot assignment when there are no available slots.

### Why are the changes needed?

When the selected workers have no available slots, the loadaware policy could throw `MasterNotLeaderException`. The master should therefore fall back to roundrobin when no slots are available. This situation can occur when the partition size has increased sharply in a short period of time.

```
Caused by: org.apache.celeborn.common.haclient.MasterNotLeaderException: Master:xx.xx.xx.xx:9099 is not the leader. Suggested leader is Master:xx.xx.xx.xx:9099. Exception:bound must be positive.
	at org.apache.celeborn.service.deploy.master.clustermeta.ha.HAHelper.sendFailure(HAHelper.java:58)
	at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:236)
	at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:314)
	... 7 more
Caused by: java.lang.IllegalArgumentException: bound must be positive
	at java.util.Random.nextInt(Random.java:388)
	at org.apache.celeborn.service.deploy.master.SlotsAllocator.roundRobin(SlotsAllocator.java:202)
	at org.apache.celeborn.service.deploy.master.SlotsAllocator.offerSlotsLoadAware(SlotsAllocator.java:151)
	at org.apache.celeborn.service.deploy.master.Master.$anonfun$handleRequestSlots$1(Master.scala:598)
	at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:199)
	at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:189)
	at org.apache.celeborn.service.deploy.master.Master.handleRequestSlots(Master.scala:587)
	at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$12(Master.scala:314)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:233)
	... 8 more
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SlotsAllocatorSuiteJ#testAllocateSlotsWithNoAvailableSlots`

Closes #2108 from SteNicholas/CELEBORN-1136.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
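The failure in the stack trace comes from `java.util.Random.nextInt(int)` being called with a bound of zero, which happens when the list of candidate workers is empty. The following is a minimal sketch of the fallback idea only, not Celeborn's actual `SlotsAllocator` code: the `Worker` class, `candidates`, and `roundRobin` are hypothetical names introduced for illustration. It assumes the fallback simply reuses the full worker list when no worker reports available slots.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names and structure are assumptions,
// not the real org.apache.celeborn SlotsAllocator.
public class SlotsFallbackSketch {

  // Hypothetical worker model: a name plus its number of free slots.
  public static final class Worker {
    public final String name;
    public final int availableSlots;

    public Worker(String name, int availableSlots) {
      this.name = name;
      this.availableSlots = availableSlots;
    }
  }

  // Prefer workers with free slots; if none have any, fall back to all workers
  // so that the round-robin step never receives an empty list.
  public static List<Worker> candidates(List<Worker> workers) {
    List<Worker> usable = new ArrayList<>();
    for (Worker w : workers) {
      if (w.availableSlots > 0) {
        usable.add(w);
      }
    }
    return usable.isEmpty() ? workers : usable;
  }

  // Assign each partition to a worker in rotation, starting at a random index.
  public static List<String> roundRobin(List<Worker> workers, int partitions, Random rand) {
    if (workers.isEmpty()) {
      throw new IllegalStateException("no workers registered");
    }
    int start = rand.nextInt(workers.size()); // bound is > 0 here
    List<String> assignment = new ArrayList<>();
    for (int i = 0; i < partitions; i++) {
      assignment.add(workers.get((start + i) % workers.size()).name);
    }
    return assignment;
  }

  public static void main(String[] args) {
    // Reproduce the root cause: a zero bound is rejected by Random.nextInt.
    try {
      new Random().nextInt(0);
    } catch (IllegalArgumentException e) {
      System.out.println("nextInt(0) failed: " + e.getMessage());
    }

    // Both workers are full, so the candidate list falls back to all workers.
    List<Worker> workers = Arrays.asList(new Worker("w1", 0), new Worker("w2", 0));
    System.out.println(roundRobin(candidates(workers), 4, new Random()));
  }
}
```

Guarding the candidate list before the random start index is chosen keeps the request from failing with an unrelated-looking `MasterNotLeaderException`, at the cost of assigning partitions to workers that are already full.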