celeborn/master
onebox-li 0e53a3d552 [CELEBORN-932] Fix worker register after gracefaully restart
### What changes were proposed in this pull request?
Worker will firstly register failed after worker gracefully restart in HA mode, it will be really registered after one heartbeat.
<img width="889" alt="image" src="https://github.com/apache/incubator-celeborn/assets/19429353/371aa0e0-b2e9-4c1f-9e40-276dc1460219">
This is because master here uses same `requestId` to submit request,  causing the second request not be processed correctly, due to Ratis `RetryCache`.
Master logs like below:
(worker gracefully stop)
Master: Receive ReportNodeFailure
(worker start)
Master: Received RegisterWorker request
Master: Received heartbeat from unknown worker
Master: Registered worker

So here improve AbstractMetaManager#updateRegisterWorkerMeta to cover `WorkerRemove` logic. For back compatibility and possible inconsistencies during rolling upgrade, temporarily fix duplicate requestId and keep remove function. And we can try to remove `WorkerRemove` logic in the future version.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #1863 from onebox-li/fix-restart-register.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-11 21:23:28 +08:00
..
src [CELEBORN-932] Fix worker register after gracefaully restart 2023-09-11 21:23:28 +08:00
pom.xml [CELEBORN-666] Define protobuf-maven-plugin in the root pom.xml 2023-06-12 19:46:46 +08:00