1423 Commits

Author SHA1 Message Date
jiaoqingbo
e1656616ad [MINOR] Fix typo in CelebornConf
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1821 from jiaoqingbo/fixtypo-doc.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-21 14:14:15 +08:00
liangyongyuan
30d979f685 [CELEBORN-899] Fix potential NPE in ShuffleClientImpl#revive
### What changes were proposed in this pull request?
After obtaining the results of reviveBatch, determine whether it contains the corresponding partitionId.

### Why are the changes needed?
This may cause an NPE in some versions of JDK 8. The decompilation result is as follows:
![image](https://github.com/apache/incubator-celeborn/assets/46274164/be947d3f-0da2-4cd7-8be1-e160ced92b6d)
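
A minimal sketch of this kind of guard, assuming the revive results arrive as a `Map<Integer, Integer>` from partition id to a status code. The class `ReviveResultCheck`, the method `isReviveSuccessful`, and the `SUCCESS` constant are hypothetical names used for illustration and are not taken from `ShuffleClientImpl`:

```java
import java.util.HashMap;
import java.util.Map;

class ReviveResultCheck {
  // Hypothetical status code, for illustration only.
  static final int SUCCESS = 0;

  // Returns true only if the batch result has an entry for partitionId
  // and that entry equals SUCCESS.
  static boolean isReviveSuccessful(Map<Integer, Integer> results, int partitionId) {
    if (results == null || !results.containsKey(partitionId)) {
      // Without this check, unboxing a missing (null) entry would throw an NPE.
      return false;
    }
    Integer status = results.get(partitionId);
    return status != null && status == SUCCESS;
  }

  public static void main(String[] args) {
    Map<Integer, Integer> results = new HashMap<>();
    results.put(1, SUCCESS);
    System.out.println(isReviveSuccessful(results, 1));
    System.out.println(isReviveSuccessful(results, 2));
  }
}
```

The point of the sketch is only the order of operations: look up the key first, then unbox.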

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Through existing UTs.

Closes #1819 from lyy-pineapple/fix-npe.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-17 11:01:23 +08:00
Fu Chen
4f47d0a56b [CELEBORN-898][INFRA] Fix java.lang.NoClassDefFoundError: org/hamcrest/SelfDescribing for SBT CI
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, I came across an issue in the SBT CI process that can result in failure due to the `NoClassDefFoundError` exception.

```
[error] Uncaught exception when running org.apache.celeborn.common.unsafe.PlatformUtilSuite: java.lang.NoClassDefFoundError: org/hamcrest/SelfDescribing
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: org/hamcrest/SelfDescribing
[error]    at java.lang.ClassLoader.defineClass1(Native Method)
[error]    at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
[error]    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[error]    at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
[error]    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
[error]    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
[error]    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
[error]    at java.security.AccessController.doPrivileged(Native Method)
[error]    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
[error]    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
[error]    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
[error]    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
[error]    at org.junit.runner.Computer.getSuite(Computer.java:28)
[error]    at org.junit.runner.Request.classes(Request.java:77)
[error]    at org.junit.runner.Request.classes(Request.java:92)
[error]    at com.novocode.junit.JUnitTask.execute(JUnitTask.java:52)
[error]    at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[error]    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error]    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error]    at java.lang.Thread.run(Thread.java:750)
[error] Caused by: sbt.ForkMain$ForkError: java.lang.ClassNotFoundException: org.hamcrest.SelfDescribing
[error]    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
[error]    at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
[error]    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
[error]    at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
[error]    at java.lang.ClassLoader.defineClass1(Native Method)
[error]    at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
[error]    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
[error]    at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
[error]    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
[error]    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
[error]    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
[error]    at java.security.AccessController.doPrivileged(Native Method)
[error]    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
```

Upon further investigation, I found that the root cause is that SBT is sometimes unable to resolve Maven dependencies cached within GA.

```shell
./build/sbt "show celeborn-common/update"
```

```
[info] 		org.hamcrest:hamcrest-core:1.3:default: (MISSING) Artifact(hamcrest-core, jar, jar, None, Vector(), Some(file:/home/runner/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar), Map(), None, false)
```

This PR addresses the random issue by disabling the Maven cache for SBT CI.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

https://github.com/apache/incubator-celeborn/pull/1797 passes GA after the Maven cache is disabled.

Closes #1818 from cfmcgrady/sbt-ci.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-08-16 11:16:21 +08:00
Fu Chen
6af3b50508 [CELEBORN-884][BUILD] Consolidate all dependencies into a global object Dependencies
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Consolidate all sbt dependencies into a global object `Dependencies`, similar to Maven's dependencyManagement, to improve dependency management.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1802 from cfmcgrady/sbt-dependencies.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-08-16 10:42:59 +08:00
zky.zhoukeyong
57fdbf08c2 [CELEBORN-897] Set celeborn.network.memory.allocator.allowCache default to false
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
I tested 1.1T and 3.3T shuffle, as well as 3T TPCDS, with the thread cache on and off in the shared `PooledByteBufAllocator`, and found no difference:

| Benchmark | Cache On | Cache Off |
| --- | --- | --- |
| 1.1T Shuffle | 3.7min/1.9min | 3.7min/1.9min |
| 3.3T Shuffle | 12min/6.7min | 12min/6.2min |
| 3T TPCDS | 2645s | 2644s |

Since the configuration has a big influence on direct memory usage (see https://github.com/apache/incubator-celeborn/pull/1716), it is necessary to set the default value to false.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1817 from waitinfuture/897.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-16 00:26:39 +08:00
e
1ec1ba7061 [FOLLOWUP][MINOR] Add an alternative for CLIENT_RESERVE_SLOTS_RACKAWAE_ENABLED
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1816 from jiaoqingbo/typo-conf-followup.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 20:11:15 +08:00
Angerszhuuuu
17de30009b [CELEBORN-847] Support use RESTful API to trigger worker exit and exitImmediately
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1768 from AngersZhuuuu/CELEBORN-847.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 20:04:26 +08:00
e
4a4a37ed17 [MINOR] Fix typo in CelebornConf
### What changes were proposed in this pull request?

Fix typo in CelebornConf

### Why are the changes needed?

Fix typo in CelebornConf

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Passing GA

Closes #1813 from jiaoqingbo/typo-conf.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 10:32:08 +08:00
liangbowen
1bf93991bc [CELEBORN-893][DOC] Fix Spark patch list text in Readme
### What changes were proposed in this pull request?

- Fix the text of Spark patch list

### Why are the changes needed?

Before:
<img width="909" alt="image" src="https://github.com/apache/incubator-celeborn/assets/1935105/1d402df1-3a68-4810-8f84-8ab61a38314c">

After:
<img width="908" alt="image" src="https://github.com/apache/incubator-celeborn/assets/1935105/2c733568-a08a-4951-bd5a-f4a444a28833">

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Screenshots attached.

Closes #1810 from bowenliang123/readme-patch.

Authored-by: liangbowen <liangbowen@gf.com.cn>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-14 14:54:58 +08:00
e
307872d4f7 [CELEBORN-892][TEST] Fix statistics error of commitFiles method
### What changes were proposed in this pull request?

Fix statistics error of commitFiles method
res1 should be res2

### Why are the changes needed?

Fix statistics error of commitFiles method

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

passing GA

Closes #1809 from jiaoqingbo/892.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-14 12:08:11 +08:00
zky.zhoukeyong
3a64a8b7cb [CELEBORN-890][BUG] PushHandler should check whether FileWriter has closed to avoid data lost
### What changes were proposed in this pull request?
This PR fixes a bug that may cause data loss in rare cases.

### Why are the changes needed?
I received a bug report from a user that, in an extreme case, a small amount of data is lost. I
reproduced the bug under the following conditions:
1. Shuffle data size for one partition id is relatively large, for example 400GB
2. `celeborn.client.shuffle.partitionSplit.mode` is set to HARD
3. `celeborn.client.shuffle.batchHandleCommitPartition.enabled` is enabled

In the meantime, there are warning messages in the worker's log:
```
23/08/11 17:10:04,501 WARN [push-server-6-44] PushDataHandler: Append data failed for task(shuffle application_1691635581416_0021-0, map 746, attempt 0), caused by AlreadyClosedException, endedAttempt -1, error message: FileWriter has already closed!, fileName /mnt/disk1/celeborn-worker/shuffle_data/application_1691635581416_0021/0/0-107-0
23/08/11 17:12:04,445 WARN [push-server-6-35] PushDataHandler: Append data failed for task(shuffle application_1691635581416_0021-0, map 3016, attempt 0), caused by AlreadyClosedException, endedAttempt -1, error message: FileWriter has already closed!, fileName /mnt/disk3/celeborn-worker/shuffle_data/application_1691635581416_0021/0/0-356-0
```

![image](https://github.com/apache/incubator-celeborn/assets/948245/c05f25ba-4b24-4483-8baf-96915e40da17)

After digging into it, I found that the data loss happens as follows:
1. For some partition id on some worker, the file size exceeds `celeborn.client.shuffle.partitionSplit.threshold`, so
   `CommitManager` in `LifecycleManager` triggers `CommitFiles` because `batchHandleCommitPartition` is enabled.
2. Before `CommitFiles` finishes, `PushDataHandler` receives `PushData` or `PushMergedData`, finds that the partition has not been committed yet, and prepares to call `fileWriter.incrementPendingWrites()` and `callback.onSuccess`.
3. Before `PushDataHandler` calls `fileWriter.incrementPendingWrites()`, `CommitFiles` finishes and the FileWriter
    closes successfully.
4. `PushDataHandler` then calls `fileWriter.incrementPendingWrites()` and `callback.onSuccess`. From this point on,
    `ShuffleClient` considers the `PushData` successful. However, when `PushDataHandler` calls `fileWriter.write()`, it
    finds the writer already closed and throws the above exception. The exception is ignored, so the data is lost.

This PR fixes this by checking whether the FileWriter has closed after calling `incrementPendingWrites`. If it has,
`PushDataHandler` calls `onFailure`.
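
The check described above can be sketched as follows. This is a simplified model, not the actual worker code: `SketchFileWriter`, `PushCallback`, and `handlePush` are hypothetical stand-ins for `FileWriter`, the push callback, and `PushDataHandler`, and the real classes carry more state and have different signatures. Decrementing the pending-write counter on the failure path is also an assumption of this sketch.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the push callback.
interface PushCallback {
  void onSuccess();
  void onFailure(Exception e);
}

// Hypothetical, heavily simplified stand-in for FileWriter.
class SketchFileWriter {
  private final AtomicBoolean closed = new AtomicBoolean(false);
  private final AtomicInteger pendingWrites = new AtomicInteger(0);

  void incrementPendingWrites() { pendingWrites.incrementAndGet(); }
  void decrementPendingWrites() { pendingWrites.decrementAndGet(); }
  int pendingWrites() { return pendingWrites.get(); }
  boolean isClosed() { return closed.get(); }
  void close() { closed.set(true); }
}

class PushHandlerSketch {
  // Returns true if the data was accepted for writing.
  static boolean handlePush(SketchFileWriter writer, PushCallback callback) {
    writer.incrementPendingWrites();
    // Re-check after incrementing: a commit may have closed the writer in between.
    if (writer.isClosed()) {
      writer.decrementPendingWrites(); // assumption: undo the increment on failure
      callback.onFailure(new IllegalStateException("FileWriter has already closed!"));
      return false;
    }
    callback.onSuccess();
    // The actual write would follow here.
    return true;
  }

  public static void main(String[] args) {
    SketchFileWriter writer = new SketchFileWriter();
    PushCallback printing = new PushCallback() {
      @Override public void onSuccess() { System.out.println("push acknowledged"); }
      @Override public void onFailure(Exception e) {
        System.out.println("push rejected: " + e.getMessage());
      }
    };
    handlePush(writer, printing); // writer still open
    writer.close();               // simulates CommitFiles finishing
    handlePush(writer, printing); // writer closed, so the push is rejected
  }
}
```

With the failure reported through `onFailure`, the client no longer treats the push as successful when the writer has already closed.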

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1808 from waitinfuture/890.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-12 00:13:34 +08:00
mingji
7d0e257001 [CELEBORN-846][FOLLOWUP] Fix broken link caused by unknown RPC
### What changes were proposed in this pull request?
Keep the ReleaseSlots RPC to make sure that a 0.3 client can work with 0.3.1-SNAPSHOT and 0.4.0-SNAPSHOT.
This PR needs to be merged into main and branch-0.3.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1794 from FMX/CELEBORN-846-FOLLOWUP.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-11 22:00:51 +08:00
Cheng Pan
e137d0e1e7 [CELEBORN-887] Option --config should take effect in celeborn-daemon.sh script
### What changes were proposed in this pull request?
I find it a little difficult to use `celeborn-daemon.sh` to get instance status, so I polished the usage and fixed `--config` loading.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Polish the `celeborn-daemon.sh` usage

### How was this patch tested?
Manual test.

Closes #1805 from onebox-li/improve-script.

Lead-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Leo Li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-11 21:57:21 +08:00
Fu Chen
efc334a6aa [CELEBORN-877][FOLLOWUP][DOC] Expand 'note' blocks by default in the docs sbt.md
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1806 from cfmcgrady/sbt-docs-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-11 21:54:24 +08:00
Fu Chen
6f1bb41646 [CELEBORN-796] Support for globally disable thread-local cache in the shared PooledByteBufAllocator
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

Yes, the thread local cache of shared `PooledByteBufAllocator` can be disabled by setting `celeborn.network.memory.allocator.allowCache=false`

### How was this patch tested?

Pass GA

Closes #1716 from cfmcgrady/allow-cache.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-11 21:49:09 +08:00
mingji
54622814cc [CELEBORN-885][SPARK] Shade RoaringBitmap to avoid dependency conflicts
### What changes were proposed in this pull request?
Shade RoaringBitmap to avoid dependency conflicts.

### Why are the changes needed?
Some users report that the Celeborn client introduces RoaringBitmap conflicts.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1803 from FMX/CELEBORN-885.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-11 12:48:12 +08:00
Fu Chen
516bdc7e08 [CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
zwangsheng
63df84593e [CELEBORN-883][WORKER] Optimized configuration checks during MemoryManager initialization
### What changes were proposed in this pull request?
1. Expose the config check logic during `MemoryManager#initialization` in the user configuration doc.
2. Add Preconditions Error Message
3. Add unit test to make sure that part of the logic isn't altered by mistake

### Why are the changes needed?
User-friendly

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Add Unit Test

Closes #1801 from zwangsheng/CELEBORN-883.

Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: zwangsheng <2213335496@qq.com>
2023-08-11 10:46:00 +08:00
Fu Chen
0d1261632d [CELEBORN-880] Remove sbt compiler plugin genjavadoc-plugin
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

The plugin may generate unexpected source files in the project root directory. We need to refine this feature if we want to generate Javadoc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1799 from cfmcgrady/sbt-compiler-plugin.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-08-09 10:16:33 +08:00
mingji
3ec218878a [CELEBORN-876] Enhance log to find out failed workers if data lost
### What changes were proposed in this pull request?
1. Log offer slots results from LifecycleManager.
2. Log change partition results from LifecycleManager.
3. Log reserve slots results.
4. Log fetch file group failure instead of data loss.

### Why are the changes needed?
If data loss happens, we need to find out which worker caused the failure. So we need to check the reserve slots results from LifecycleManager.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #1798 from FMX/CELEBORN-876.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-08 18:20:41 +08:00
Shuang
4ae9e2476f [CELEBORN-878][FLINK] Convert all IOException to PartitionUnRetryAbleException when openStream/read file
### What changes were proposed in this pull request?
1. Wrap IOException in PartitionUnRetryAbleException when fetching
2. Improve message logging when an open stream/read data error occurs

### Why are the changes needed?
When opening a stream, many different IOExceptions may be encountered, such as NoSuchFileException, FileNotFoundException, FileCorruptedException, etc. These checked exceptions should be wrapped in PartitionUnRetryAbleException to let the client choose to regenerate the data.
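
A minimal sketch of the wrapping pattern, under stated assumptions: `PartitionUnRetryAbleSketchException` is a hypothetical stand-in for Celeborn's `PartitionUnRetryAbleException` (it extends `RuntimeException` here only to keep the sketch short), and `StreamOpener` and `openStream` are illustrative names rather than the real client API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical stand-in for PartitionUnRetryAbleException.
class PartitionUnRetryAbleSketchException extends RuntimeException {
  PartitionUnRetryAbleSketchException(String message, Throwable cause) {
    super(message, cause);
  }
}

class OpenStreamSketch {
  // Hypothetical file-opening step that may fail with any IOException subtype.
  interface StreamOpener {
    String open(String fileName) throws IOException;
  }

  static String openStream(StreamOpener opener, String fileName) {
    try {
      return opener.open(fileName);
    } catch (IOException e) {
      // Wrap every IOException subtype and keep the original as the cause,
      // so the log message still shows what actually went wrong.
      throw new PartitionUnRetryAbleSketchException(
          "Failed to open stream for " + fileName + ": " + e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    try {
      openStream(name -> { throw new FileNotFoundException(name); }, "missing-file");
    } catch (PartitionUnRetryAbleSketchException e) {
      System.out.println(e.getMessage() + " (cause: " + e.getCause().getClass().getName() + ")");
    }
  }
}
```

Catching the common supertype means a new `IOException` subtype does not need its own handler to reach the same recovery path.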

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT & Manual test

Closes #1796 from RexXiong/CELEBORN-878-IO-Exception.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-08 16:15:51 +08:00
Kerwin Zhang
4fb3f31a2d [CELEBORN-870][FOLLOWUP][DOC] Document on usage together with Gluten (#1793)
2023-08-08 10:37:13 +08:00
zky.zhoukeyong
6ea1ee2ec4 [CELEBORN-152] Add config to limit max workers when offering slots
### What changes were proposed in this pull request?
Add a config to limit the max number of workers when offering slots. The config can be set on both
the server side and the client side. Celeborn chooses the smaller positive value of the client and master configs.

### Why are the changes needed?
For large Celeborn clusters, users may want to limit the number of workers that
a shuffle can spread across. The reasons are:

1. One worker failure will not affect all applications
2. One huge shuffle will not affect all applications
3. It's more efficient to limit a shuffle to a restricted number of workers, say 100, than
   to spread it across a large number of workers, say 1000, because the number of network connections
   for pushing data is `number of ShuffleClient` * `number of allocated Workers`

The recommended number of Workers should depend on workload and Worker hardware,
and this can be configured per application, so it's relatively flexible.
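
The selection rule and the connection count can be sketched as below. The names `MaxWorkersSketch`, `effectiveMaxWorkers`, and `pushConnections` are hypothetical, and treating a non-positive value as "not set" (with `-1` meaning no limit) is an assumption of this sketch rather than the documented behavior.

```java
class MaxWorkersSketch {
  // Picks the smaller positive value; a non-positive value is treated as "not set".
  static int effectiveMaxWorkers(int clientMax, int masterMax) {
    if (clientMax > 0 && masterMax > 0) {
      return Math.min(clientMax, masterMax);
    }
    if (clientMax > 0) {
      return clientMax;
    }
    if (masterMax > 0) {
      return masterMax;
    }
    return -1; // assumption: neither side sets a limit
  }

  // Connections for pushing data:
  // number of ShuffleClient * number of allocated Workers.
  static long pushConnections(long shuffleClients, long allocatedWorkers) {
    return shuffleClients * allocatedWorkers;
  }

  public static void main(String[] args) {
    long clients = 2000; // hypothetical number of ShuffleClients
    System.out.println(effectiveMaxWorkers(100, 500));
    System.out.println(pushConnections(clients, 100));
    System.out.println(pushConnections(clients, 1000));
  }
}
```

For a hypothetical 2000 ShuffleClients, limiting the shuffle to 100 workers instead of 1000 reduces the connection count by a factor of ten.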

### Does this PR introduce _any_ user-facing change?
No, added a new configuration.

### How was this patch tested?
Added ITs and passes GA.

Closes #1790 from waitinfuture/152.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-07 10:13:53 +08:00
mingji
ea39a9372a [CELEBORN-760] Convert OpenStream and StreamHandler to Pb
### What changes were proposed in this pull request?
Merge OpenStream and StreamHandler to transport messages to enhance celeborn's compatibility.

### Why are the changes needed?
1. Improve flexibility to change RPC.
2. Compatible with 0.2 client.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1750 from FMX/CELEBORN-760.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-05 13:58:08 +08:00
Fu Chen
d786e0ecf5 [CELEBORN-874] Enrich Fetch log
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1791 from cfmcgrady/enrich-fetch-log.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-05 12:15:30 +08:00
mingji
efc9a875e9 [CELEBORN-863] Persist committed file infos to support worker recovery
### What changes were proposed in this pull request?
Support worker recovery after a crash when the worker has graceful shutdown enabled.

1. Persist committed file infos to LevelDB.
2. Load LevelDB when the worker starts.
3. Clean expired file infos in LevelDB.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster. After testing on a cluster, I found that 8k file infos consume about 2MB of disk space, and that the disk space can be reclaimed shortly after the shuffle expires.

Closes #1779 from FMX/CELEBORN-863.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-04 23:58:47 +08:00
zky.zhoukeyong
fe37405899 [CELEBORN-712][FOLLOWUP] Fix Utils.makeReducerKey
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA

Closes #1792 from waitinfuture/712-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-04 20:40:14 +08:00
xiyu.zk
35fe63e4a9 [CELEBORN-870][DOC] Document on usage together with Gluten
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1784 from kerwin-zk/gluten_celeborn.

Lead-authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Co-authored-by: Kerwin Zhang <xiyu.zk@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-04 11:32:13 +08:00
zwangsheng
6e9a98a28f [CELEBORN-872][MASTER] Extract the same allocation logic for both loadaware and roundrobin
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Reduce duplicate code segments and improve code readability and maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit Test

Closes #1786 from zwangsheng/CELEBORN-872.

Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-03 20:14:45 +08:00
Fu Chen
39ab731b85 [CELEBORN-875][FOLLOWUP] Enhance DataPushQueueSuiteJ for thread safety and prevent NullPointerException
### What changes were proposed in this pull request?

1. Replaced `HashMap` with `ConcurrentHashMap` for `partitionBatchIdMap` to ensure thread safety during parallel data processing
2. Put the partition id and batch id into `partitionBatchIdMap` before adding the task, to prevent a possible NPE
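
The ordering in the two points above can be sketched as follows. This is not the test code from `DataPushQueueSuiteJ`: the class `PushQueueOrderingSketch`, the `tasks` queue, and the `submit`/`consumeOne` methods are hypothetical names, and only `partitionBatchIdMap` comes from the description.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class PushQueueOrderingSketch {
  // Shared between the producing thread and the consuming pusher thread,
  // so both containers are thread-safe.
  static final Map<Integer, Integer> partitionBatchIdMap = new ConcurrentHashMap<>();
  static final Queue<Integer> tasks = new ConcurrentLinkedQueue<>();

  static void submit(int partitionId, int batchId) {
    // Record the expected batch id first, so a consumer that picks the task up
    // immediately always finds an entry in the map.
    partitionBatchIdMap.put(partitionId, batchId);
    tasks.add(partitionId);
  }

  // Returns the batch id recorded for the next queued partition, or null if nothing is queued.
  static Integer consumeOne() {
    Integer partitionId = tasks.poll();
    if (partitionId == null) {
      return null;
    }
    // With the reversed order in submit(), this lookup could return null
    // and a later unboxing would throw an NPE.
    return partitionBatchIdMap.get(partitionId);
  }

  public static void main(String[] args) {
    submit(3, 7);
    System.out.println(consumeOne());
  }
}
```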

### Why are the changes needed?

To fix the NPE:

https://github.com/apache/incubator-celeborn/actions/runs/5734532048/job/15540863715?pr=1785

```
Exception in thread "DataPusher-0" java.lang.NullPointerException
	at org.apache.celeborn.client.write.DataPushQueueSuiteJ$1.pushData(DataPushQueueSuiteJ.java:121)
	at org.apache.celeborn.client.write.DataPusher$1.run(DataPusher.java:125)
Error: The operation was canceled.
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1789 from cfmcgrady/celeborn-875-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 21:52:53 +08:00
mingji
2b79c37381 [CELEBORN-852][FOLLOWUP] Add active connection count metrics to grafana dashboard
### What changes were proposed in this pull request?
Add active connections count metrics to grafana dashboard.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Yes, new metric chart in the grafana dashboard.

### How was this patch tested?
Cluster.

Closes #1783 from FMX/CELEBORN-852.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 21:24:57 +08:00
zky.zhoukeyong
3ee0674058 [CELEBORN-869][FOLLOWUP][DOC] Document on Integrating Celeborn
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1788 from waitinfuture/869-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 18:17:17 +08:00
Keyong Zhou
8c473c038b [CELEBORN-869][DOC] Document on Integrating Celeborn
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1787 from waitinfuture/869.

Lead-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 17:22:41 +08:00
zky.zhoukeyong
bee8648421 [CELEBORN-864][DOC] Document on blacklist
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1782 from waitinfuture/864.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-01 21:23:55 +08:00
caojiaqing
3e266c0cf6 [CELEBORN-852] Adding new metrics to record the number of registered …
### What changes were proposed in this pull request?
Adding new metrics to record the number of registered connections

### Why are the changes needed?
Monitor the number of active connections on worker nodes

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
no

Closes #1773 from JQ-Cao/852.

Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-01 21:21:13 +08:00
zwangsheng
5e6a23fd88 [CELEBORN-868][MASTER] Adjust logic of SlotsAllocator#offerSlotsLoadAware fallback to roundrobin
### What changes were proposed in this pull request?
Fallback in following order:

1. usableDisks is empty (no need to call iter)
2. in the replicated case, first check usableDisks == 1 for a fast fallback
3. count distinct worker
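
One way to read this order is sketched below. It is an interpretation, not the code in `SlotsAllocator`: the class `FallbackSketch` and the method `shouldFallbackToRoundRobin` are hypothetical, each usable disk is represented only by the id of the worker that owns it, and the assumption that replication needs disks on at least two distinct workers is this sketch's reading of step 3.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class FallbackSketch {
  // workerOfEachUsableDisk holds one entry per usable disk: the id of its worker.
  static boolean shouldFallbackToRoundRobin(
      List<String> workerOfEachUsableDisk, boolean shouldReplicate) {
    // 1. No usable disks: nothing to iterate over.
    if (workerOfEachUsableDisk.isEmpty()) {
      return true;
    }
    if (shouldReplicate) {
      // 2. Cheap check first: a single usable disk cannot host both replicas.
      if (workerOfEachUsableDisk.size() == 1) {
        return true;
      }
      // 3. More expensive check: count the distinct workers behind the disks.
      return new HashSet<>(workerOfEachUsableDisk).size() <= 1;
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(shouldFallbackToRoundRobin(Arrays.asList("w1", "w1"), true));
    System.out.println(shouldFallbackToRoundRobin(Arrays.asList("w1", "w2"), true));
  }
}
```

Ordering the checks from cheapest to most expensive lets the common cases return before the distinct-worker count is computed.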

### Why are the changes needed?
Clarify the logic here.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit Test

Closes #1781 from zwangsheng/CELEBORN-868.

Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-01 20:39:23 +08:00
Fu Chen
6ba4b7e138 [CELEBORN-850][INFRA] Add SBT CI
### What changes were proposed in this pull request?

This PR adds new GitHub Actions workflows to enable Continuous Integration using SBT based on #1764

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1771 from cfmcgrady/sbt-ci.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-01 18:14:58 +08:00
Fu Chen
40e416c95c [CELEBORN-843][BUILD] sbt support flink-related module build/test
### What changes were proposed in this pull request?

This PR adds packaging and testing support for Flink-related modules using SBT based on #1757

### Why are the changes needed?

Improve project build speed.

Running flink-it tests with -Pflink-1.14:

```shell
sbt:celeborn> project flink-it
sbt:flink-it> clean
sbt:flink-it> test
[success] Total time: 136 s (02:16), completed 2023-7-27 11:55:10
```

Running flink-it tests with -Pflink-1.17:

```shell
$ ./build/sbt -Pflink-1.17
sbt:celeborn> project flink-it
sbt:flink-it> clean
sbt:flink-it> test
[success] Total time: 168 s (02:48), completed 2023-7-27 11:28:35
```

Packaging and shading the Flink 1.14 client:

```shell
$ ./build/sbt -Pflink-1.14
sbt:celeborn> clean
sbt:celeborn> project celeborn-client-flink-1_14-shaded
sbt:celeborn-client-flink-1_14-shaded> assembly
[success] Total time: 35 s, completed 2023-7-27 11:51:54
```

Packaging and shading the Flink 1.17 client:

```shell
$ ./build/sbt -Pflink-1.17
sbt:celeborn> clean
sbt:celeborn> project celeborn-client-flink-1_17-shaded
sbt:celeborn-client-flink-1_17-shaded> assembly
[success] Total time: 39 s, completed 2023-7-27 11:49:20
```

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

tested locally

Closes #1764 from cfmcgrady/sbt-flink.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-01 12:29:29 +08:00
Fu Chen
f2fc520d04 [CELEBORN-867][BUILD] Add maven local repository to sbt respositories
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1764#issuecomment-1659463340

Before this PR:
```shell
> ./build/sbt -Pspark-3.3 "clean; show celeborn-common/csrResolvers"
[info] * FileRepository(local, Patterns(ivyPatterns=Vector(/Users/fchen/.ivy2/local/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), artifactPatterns=Vector(/Users/fchen/.ivy2/local/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), isMavenCompatible=false, descriptorOptional=false, skipConsistencyCheck=false), FileConfiguration(true, None))
[info] * private: file:/dev/null
[info] * aliyun-maven: https://maven.aliyun.com/nexus/content/groups/public/
[info] * huawei-central: https://mirrors.huaweicloud.com/repository/maven/
[success] Total time: 0 s, completed 2023-8-1 11:40:25
```

```shell
> ./build/sbt -Pspark-3.3 "clean; show celeborn-client-spark-3/managedClasspath" | grep spark
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-core_2.12/3.3.2/spark-core_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-sql_2.12/3.3.2/spark-sql_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-launcher_2.12/3.3.2/spark-launcher_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-kvstore_2.12/3.3.2/spark-kvstore_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-network-common_2.12/3.3.2/spark-network-common_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-network-shuffle_2.12/3.3.2/spark-network-shuffle_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-unsafe_2.12/3.3.2/spark-unsafe_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-tags_2.12/3.3.2/spark-tags_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-sketch_2.12/3.3.2/spark-sketch_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/Library/Caches/Coursier/v1/https/mirrors.huaweicloud.com/repository/maven/org/apache/spark/spark-catalyst_2.12/3.3.2/spark-catalyst_2.12-3.3.2.jar)
```

after this PR

```shell
> ./build/sbt -Pspark-3.3 "clean; show celeborn-common/csrResolvers"
[success] Total time: 0 s, completed 2023-8-1 11:45:02
[info] * FileRepository(local, Patterns(ivyPatterns=Vector(/Users/fchen/.ivy2/local/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), artifactPatterns=Vector(/Users/fchen/.ivy2/local/[organisation]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)([branch]/)[revision]/[type]s/[artifact](-[classifier]).[ext]), isMavenCompatible=false, descriptorOptional=false, skipConsistencyCheck=false), FileConfiguration(true, None))
[info] * mavenLocal: file:/Users/fchen/.m2/repository/
[info] * private: file:/dev/null
[info] * aliyun-maven: https://maven.aliyun.com/nexus/content/groups/public/
[info] * huawei-central: https://mirrors.huaweicloud.com/repository/maven/
[success] Total time: 0 s, completed 2023-8-1 11:45:02
```

```shell
> ./build/sbt -Pspark-3.3 "clean; show celeborn-client-spark-3/managedClasspath" | grep spark
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-core_2.12/3.3.2/spark-core_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-sql_2.12/3.3.2/spark-sql_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-launcher_2.12/3.3.2/spark-launcher_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-kvstore_2.12/3.3.2/spark-kvstore_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-network-common_2.12/3.3.2/spark-network-common_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-network-shuffle_2.12/3.3.2/spark-network-shuffle_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-unsafe_2.12/3.3.2/spark-unsafe_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-tags_2.12/3.3.2/spark-tags_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-sketch_2.12/3.3.2/spark-sketch_2.12-3.3.2.jar)
[info] * Attributed(/Users/fchen/.m2/repository/org/apache/spark/spark-catalyst_2.12/3.3.2/spark-catalyst_2.12-3.3.2.jar)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1780 from cfmcgrady/sbt-maven-local.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-01 12:28:18 +08:00
zky.zhoukeyong
6cd1355488 [CELEBORN-726][FOLLOWUP] Amend method names
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes GA

Closes #1776 from waitinfuture/method.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-31 20:14:41 +08:00
zky.zhoukeyong
3593adf12d [CELEBORN-860][DOC] Document on ShuffleClient
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1778 from waitinfuture/860-1.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-31 20:07:20 +08:00
zky.zhoukeyong
37a9c633b3 [CELEBORN-853][DOC] Document on LifecycleManager
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1775 from waitinfuture/853.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-31 17:36:42 +08:00
Fu Chen
f869ab25b6 [CELEBORN-857][TEST] Refine DataPushQueueSuiteJ
### What changes were proposed in this pull request?

1. This PR proposes renaming the class `DataPushQueueSuitJ` to `DataPushQueueSuiteJ` so that it is picked up by the test suite. The rename is required to comply with our maven-surefire-plugin rule

5f0295e9f3/pom.xml (L543-L551)

2. This PR fixes a potential logic bug in the test: tasks within `DataPushQueue` may inadvertently be consumed by the `DataPusher`'s built-in thread `DataPusher-${taskId}`, leading to test suite failures.

![Screenshot 2023-07-31 12.08.06 PM](https://github.com/apache/incubator-celeborn/assets/8537877/b7a294a5-a12b-474a-b43d-233998bc7f31)

![Screenshot 2023-07-31 12.07.49 PM](https://github.com/apache/incubator-celeborn/assets/8537877/c585ed00-0111-4aab-863a-e7984ed8a298)
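
The consumption problem described in item 2 can be illustrated with a minimal, self-contained sketch. It does not use Celeborn's classes: a plain `LinkedBlockingQueue` stands in for `DataPushQueue`, and the hypothetical thread `consumer-sketch` stands in for the built-in `DataPusher-${taskId}` thread. The point is only that a second consumer on the same queue can leave the test thread with fewer tasks than it enqueued.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueConsumerSketch {
    // Returns how many tasks the "test" thread can still take from the queue
    // after an optional background consumer has finished running.
    static int visibleToTest(boolean startBackgroundConsumer) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            queue.add(i); // enqueue five stand-in tasks
        }
        if (startBackgroundConsumer) {
            // Hypothetical stand-in for a built-in consumer thread.
            Thread consumer = new Thread(() -> {
                while (queue.poll() != null) {
                    // drain every task it can see
                }
            }, "consumer-sketch");
            consumer.start();
            try {
                consumer.join(); // wait, so the outcome is deterministic
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        int seen = 0;
        while (queue.poll() != null) {
            seen++; // tasks left for the test thread
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("without background consumer: " + visibleToTest(false));
        System.out.println("with background consumer: " + visibleToTest(true));
    }
}
```

In this sketch the test thread sees all five tasks when it is the only consumer and none once the background thread has drained the queue. In a real test the two consumers would race, so the count would vary from run to run.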

### Why are the changes needed?

Fix the bug in `DataPushQueueSuiteJ`.
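
Why the one-letter rename matters can be sketched as follows. The include pattern `**/*SuiteJ.java` used here is an assumption made for illustration; the actual rule is whatever the referenced `pom.xml` lines define. The sketch uses the JDK's glob matcher, not the surefire plugin itself, to show that a misspelled suffix silently falls outside such a pattern.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class SuiteNameSketch {
    // Hypothetical include pattern; the real rule lives in pom.xml.
    static final PathMatcher INCLUDE =
        FileSystems.getDefault().getPathMatcher("glob:**/*SuiteJ.java");

    // True if a test source file would be selected by the assumed pattern.
    static boolean isPickedUp(String relativePath) {
        return INCLUDE.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        // The directory name below is a placeholder, not the real module path.
        System.out.println(isPickedUp("client/DataPushQueueSuitJ.java"));
        System.out.println(isPickedUp("client/DataPushQueueSuiteJ.java"));
    }
}
```

Under the assumed pattern, only the second file name matches, so a suite with the old name would compile but not be selected for execution.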

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1774 from cfmcgrady/refine-data-push-queue-suite.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-31 15:43:43 +08:00
Fu Chen
5f0295e9f3 [CELEBORN-836][BUILD] Initial support sbt
### What changes were proposed in this pull request?

This PR introduces an SBT build system implementation that operates independently from the current Maven build system. Unlike https://github.com/apache/incubator-celeborn/pull/1627, this implementation does not depend on `pom.xml`.

The implementation enables packaging and testing functionalities for server-related modules and Spark-related modules using SBT.

Flink-related build/test support, sbt build documentation, continuous integration, and plugins will be submitted in separate PRs.

### Why are the changes needed?

improve project build speed

packaging the project

```shell
$ ./build/sbt
sbt:celeborn> clean
[success] Total time: 1 s, completed 2023-7-25 16:36:12
sbt:celeborn> package
[success] Total time: 28 s, completed 2023-7-25 16:36:46
```

packaging and shading the spark 3.3 client

```shell
$ ./build/sbt -Pspark-3.3
sbt:celeborn> clean
[success] Total time: 1 s, completed 2023-7-25 16:39:11
sbt:celeborn> project celeborn-client-spark-3-shaded
sbt:celeborn-client-spark-3-shaded> assembly
[success] Total time: 37 s, completed 2023-7-25 16:40:03
```

packaging and shading the spark 2.4 client

```shell
$ ./build/sbt -Pspark-2.4
sbt:celeborn> clean
[success] Total time: 1 s, completed 2023-7-25 16:41:06
sbt:celeborn> project celeborn-client-spark-2-shaded
sbt:celeborn-client-spark-2-shaded> assembly
[success] Total time: 36 s, completed 2023-7-25 16:41:53
```

running server-related tests

```shell
$ ./build/sbt clean test
[success] Total time: 350 s (05:50), completed 2023-7-25 16:48:58
```

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

tested locally

Closes #1757 from cfmcgrady/pure-sbt.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-28 10:40:04 +08:00
zky.zhoukeyong
b36ea39001 [CELEBORN-834][DOC] Add fault tolerant document
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1769 from waitinfuture/834.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-28 10:39:08 +08:00
zky.zhoukeyong
41509d6e7e [CELEBORN-849][DOC] Document on Master
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1772 from waitinfuture/849.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-27 22:09:43 +08:00
Angerszhuuuu
5cb73ed3b4 [CELEBORN-851] Mention Celeborn 0.4 server requires 0.3 or above clients
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1770 from AngersZhuuuu/CELEBORN-851.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 18:07:44 +08:00
Angerszhuuuu
0db2150731 [CELEBORN-808] Remove unnecessary RssShuffleManager in 0.4.0
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1731 from AngersZhuuuu/CELEBORN-808.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 17:47:44 +08:00
Angerszhuuuu
e82a8e8992 [CELEBORN-846] Remove unused updateReleaseSlotsMeta in master side
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

CELEBORN-791 removed the sending of `ReleaseSlotsRequest` from the worker, so the Master no longer needs to handle it.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1767 from AngersZhuuuu/CELEBORN-846.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 17:46:00 +08:00
Angerszhuuuu
bacfb54447 [CELEBORN-832] Support use RESTful API to trigger worker decommission
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1759 from AngersZhuuuu/CELEBORN-832.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 15:40:14 +08:00