celeborn/master
zky.zhoukeyong a5dfd67d5b
[CELEBORN-1034] Offer slots uses random range of available workers instead of shuffling
### What changes were proposed in this pull request?
In the original design, (primary worker, replica worker) pairs tend to stay stable. For example,
for primary PartitionLocations on Worker A, the replica PartitionLocations will generally all be on
Worker B when all workers are healthy and working well. This way, one Worker
has only one (or very few) connections to other workers' replicate netty servers.

However, https://github.com/apache/incubator-celeborn/pull/1790 calls `Collections.shuffle(availableWorkers)`,
which breaks this stable pairing and causes the number of replica connections to increase dramatically:
![image](https://github.com/apache/incubator-celeborn/assets/948245/013c7bc8-a224-413e-9c0c-519ae76c9d32)

### Why are the changes needed?
This PR refines the logic for selecting a limited number of workers: instead of shuffling the list,
Master just randomly picks a range of available workers.
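
The idea can be illustrated with a minimal sketch. This is not the code from this patch: the class name `RandomRangeSketch`, the method `pickRange`, and its parameters are hypothetical, and it assumes that "range" means a contiguous window of the list and that a non-positive `limit` means "no limit". Because the window keeps the original list order, neighbouring workers stay adjacent, which is what keeps (primary, replica) pairs stable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Celeborn.
final class RandomRangeSketch {
  private RandomRangeSketch() {}

  /**
   * Returns `limit` consecutive workers starting at a random offset.
   * The relative order of the input list is preserved (no shuffling).
   */
  static <T> List<T> pickRange(List<T> availableWorkers, int limit, Random rand) {
    int size = availableWorkers.size();
    // Assumption: a non-positive or oversized limit selects every worker.
    if (limit <= 0 || limit >= size) {
      return new ArrayList<>(availableWorkers);
    }
    // Random start chosen so that the window [start, start + limit) fits in the list.
    int start = rand.nextInt(size - limit + 1);
    return new ArrayList<>(availableWorkers.subList(start, start + limit));
  }
}
```

For example, with workers `[A, B, C, D, E]` and a limit of 3, the result is one of `[A, B, C]`, `[B, C, D]`, or `[C, D, E]`, never a permutation such as `[D, A, C]`.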

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1975 from waitinfuture/1034.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-18 17:00:03 +08:00
..
src [CELEBORN-1034] Offer slots uses random range of available workers instead of shuffling 2023-10-18 17:00:03 +08:00
pom.xml [CELEBORN-1006] Add support for Apache Hadoop 2.x in Celeborn build 2023-09-25 20:15:02 +08:00