Commit Graph

149 Commits

Author SHA1 Message Date
mingji
113311df3e [CELEBORN-1081][FOLLOWUP] Remove UNKNOWN_DISK and allocate all slots to disk
### What changes were proposed in this pull request?
1. Remove UNKNOWN_DISK from StorageInfo.
2. Enable load-aware slots allocation when there is HDFS.

### Why are the changes needed?
To support the application's config about available storage types.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
GA and Cluster.

Closes #2098 from FMX/B1081-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-28 11:26:00 +08:00
SteNicholas
60871750e4 [CELEBORN-1136] Support policy for master to assign slots fallback to roundrobin with no available slots
### What changes were proposed in this pull request?

`SlotsAllocator` supports a policy that lets the master fall back to round-robin slot assignment when no slots are available.

### Why are the changes needed?

When the selected workers have no available slots, the load-aware policy can throw `MasterNotLeaderException`. The master should instead fall back to round-robin assignment in that case. Running out of available slots can happen when the partition size grows sharply within a short period of time.
```
Caused by: org.apache.celeborn.common.haclient.MasterNotLeaderException: Master:xx.xx.xx.xx:9099 is not the leader. Suggested leader is Master:xx.xx.xx.xx:9099. Exception:bound must be positive.
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HAHelper.sendFailure(HAHelper.java:58)
    at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:236)
    at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:314)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: bound must be positive
    at java.util.Random.nextInt(Random.java:388)
    at org.apache.celeborn.service.deploy.master.SlotsAllocator.roundRobin(SlotsAllocator.java:202)
    at org.apache.celeborn.service.deploy.master.SlotsAllocator.offerSlotsLoadAware(SlotsAllocator.java:151)
    at org.apache.celeborn.service.deploy.master.Master.$anonfun$handleRequestSlots$1(Master.scala:598)
    at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:199)
    at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:189)
    at org.apache.celeborn.service.deploy.master.Master.handleRequestSlots(Master.scala:587)
    at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$12(Master.scala:314)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:233)
    ... 8 more
```
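
The `bound must be positive` cause in the trace above is what `java.util.Random.nextInt(int)` throws for a non-positive bound, for example when the candidate list is empty. The following is a minimal sketch of guarding that call. `RoundRobinStart` and `pickStartIndex` are hypothetical names for illustration and are not part of Celeborn's `SlotsAllocator`.

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not Celeborn's SlotsAllocator.
final class RoundRobinStart {
  private static final Random RAND = new Random();

  // Returns a random start index into `workers`, or -1 when the list is empty.
  // Random.nextInt(bound) throws IllegalArgumentException for bound <= 0,
  // so the empty case has to be handled before the call.
  static int pickStartIndex(List<String> workers) {
    if (workers.isEmpty()) {
      return -1;
    }
    return RAND.nextInt(workers.size());
  }

  // Reproduces the failure mode from the stack trace in isolation.
  static boolean unguardedCallThrows() {
    try {
      new Random().nextInt(0);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }
}
```

A caller that receives `-1` can then decide to fall back to another policy or to return an empty allocation instead of failing the whole request.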

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SlotsAllocatorSuiteJ#testAllocateSlotsWithNoAvailableSlots`

Closes #2108 from SteNicholas/CELEBORN-1136.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-22 14:08:06 +08:00
Shuang
931880a82d [CELEBORN-1112] Inform celeborn application is shutdown, then celeborn cluster can release resource immediately
### What changes were proposed in this pull request?
Unregister the application from the Celeborn master after the application stops, so that the master expires the related shuffle resources immediately, saving resources.

### Why are the changes needed?
Currently the Celeborn master expires an application's shuffle resources only when the application timeout check fires,
which greatly delays the release of resources and is not conducive to saving them.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
PASS GA

Closes #2075 from RexXiong/CELEBORN-1112.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 20:46:51 +08:00
SteNicholas
d2582919ad [CELEBORN-1110] Support celeborn.worker.storage.disk.reserve.ratio to configure worker reserved ratio for each disk
### What changes were proposed in this pull request?

Support `celeborn.worker.storage.disk.reserve.ratio` to configure worker reserved ratio for each disk.

### Why are the changes needed?

`CelebornConf` already supports configuring the space a Celeborn worker reserves on each disk, as an absolute size. `CelebornConf` could also support `celeborn.worker.storage.disk.reserve.ratio` to configure the reserved ratio for each disk. The minimum usable size for each disk should be the larger of the reserved space and the space calculated from the reserved ratio.
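
The rule in the last sentence can be written as a one-line calculation. This is a sketch under assumptions: `DiskReserve`, `minimumUsableSize` and its parameter names are hypothetical, and the way Celeborn rounds the ratio-based value is not stated in this description.

```java
// Hypothetical helper for illustration; names are not taken from Celeborn.
final class DiskReserve {
  // totalSpaceBytes:  capacity of the disk
  // reserveSizeBytes: the absolute reserved size
  // reserveRatio:     the reserved fraction of the disk, e.g. 0.25
  // Returns the larger of the absolute reservation and the ratio-based one.
  static long minimumUsableSize(long totalSpaceBytes, long reserveSizeBytes, double reserveRatio) {
    long reservedByRatio = (long) (totalSpaceBytes * reserveRatio);
    return Math.max(reserveSizeBytes, reservedByRatio);
  }
}
```

For a disk of 1000 bytes with a ratio of 0.25, an absolute reservation of 50 bytes is overridden by the ratio (250 bytes), while an absolute reservation of 400 bytes wins over it.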

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SlotsAllocatorSuiteJ`

Closes #2071 from SteNicholas/CELEBORN-1110.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-08 12:39:25 +08:00
SteNicholas
52eddc59f3 [CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support excluding a worker manually by worker id. The worker is added to the manually excluded workers.

### Why are the changes needed?

At present, Celeborn excludes workers only when a client-side shuffle fetch or push fails. Excluding a worker manually is necessary for maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
mingji
5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1. Support `celeborn.storage.activeTypes` in the client.
2. The master ignores slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable client application to select storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00
SteNicholas
b45b63f9a5 [CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker
### What changes were proposed in this pull request?

Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.

### Why are the changes needed?

At present, the metric `ResourceConsumption` monitors each user's quota usage on the Celeborn Master. Usage on the Celeborn Worker needs to be monitored as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2059 from SteNicholas/CELEBORN-247.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-01 15:48:31 +08:00
SteNicholas
df40a28959 [CELEBORN-1032][FOLLOWUP] Use scheduleWithFixedDelay instead of scheduleAtFixedRate in threads pool of master and worker
### What changes were proposed in this pull request?

Use `scheduleWithFixedDelay` instead of `scheduleAtFixedRate` in thread pool of Celeborn Master and Worker.
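
For context, the two `ScheduledExecutorService` methods differ in how the next run is timed: `scheduleAtFixedRate` targets start times at `initialDelay + n * period`, so runs that take longer than the period are followed by late runs back to back, while `scheduleWithFixedDelay` waits the given delay after each run finishes. The sketch below only shows the call shape. `FixedDelayDemo` is a hypothetical class and the no-op task stands in for real periodic work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class for illustration; not Celeborn code.
final class FixedDelayDemo {
  // Schedules a task that counts down a latch, with `delayMillis` between the end
  // of one run and the start of the next. Returns true once `runs` runs completed.
  static boolean runWithFixedDelay(int runs, long delayMillis) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    CountDownLatch latch = new CountDownLatch(runs);
    try {
      pool.scheduleWithFixedDelay(latch::countDown, 0, delayMillis, TimeUnit.MILLISECONDS);
      return latch.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      pool.shutdownNow();
    }
  }
}
```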

### Why are the changes needed?

Follow up #1970.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2048 from SteNicholas/CELEBORN-1032.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-27 11:20:28 +08:00
xleoken
37f83a4a03 [CELEBORN-1087] Remove SimpleStateMachineStorageUtil in master module
### What changes were proposed in this pull request?

To complement the functionality of Ratis, we added the `SimpleStateMachineStorageUtil` class to the master module. It contains two functions, `findLatestSnapshot` and `getSmDir`. We can implement these functions in a more elegant way.

refer to https://github.com/apache/ratis/tree/master/ratis-examples

### How was this patch tested?

Tested locally with 3 masters and 1 worker.

**Test case one**

After patch, we can get the correct tmp directory as before.
![image](https://github.com/apache/incubator-celeborn/assets/95013770/c801c159-9f3b-4197-806a-306083d5101c)

**Test case two**
I stopped one of the masters, cleaned up its Ratis data directory, and then restarted it; everything worked fine.

Closes #2037 from xleoken/rm-ratis.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-26 20:49:57 +08:00
SteNicholas
49ea881037 [MINOR] Remove unnecessary increment index of Master#timeoutDeadWorkers
### What changes were proposed in this pull request?

Remove the unnecessary index increment in `Master#timeoutDeadWorkers`.

### Why are the changes needed?

The index increment in `Master#timeoutDeadWorkers` is unnecessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2027 from SteNicholas/timeout-dead-workers.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 22:18:39 +08:00
xleoken
2758df713e [CELEBORN-1030] Improve the logic of delete md5 files when initializing SimpleStateMachineStorage
### What changes were proposed in this pull request?

Based on ratis-2.0.0, we had to delete md5 files when initializing `SimpleStateMachineStorage`. Since RATIS-1752, Ratis already cleans up md5 files itself, so we can simplify the initialization.

Remove `MasterStateMachineSuiteJ#testSnapshotCleanup`; the cleanup of snapshots and md5 files is already tested in
https://github.com/apache/ratis/blob/release-2.5.1/ratis-test/src/test/java/org/apache/ratis/server/storage/TestRaftStorage.java#L221

**links:**

https://issues.apache.org/jira/browse/RATIS-1752

https://github.com/apache/ratis/blob/release-2.5.1/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SimpleStateMachineStorage.java#L105

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

local test.

Closes #1966 from xleoken/patch.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
liangbowen
434492bd98 [CELEBORN-1065] Prevent the local variable 'time' declared in one 'switch' branch and used in another
### What changes were proposed in this pull request?
- Minor code improvement in `MetaHandler`
  - Local variable 'time' declared in one 'switch' branch `AppHeartbeat` and used in another branch `WorkerHeartbeat`
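
In Java, all `case` groups of a classic `switch` statement share one block scope, which is what makes a variable declared in one branch visible in a later one. One common way to avoid the pattern is to give each branch its own braces. The sketch below illustrates that approach. `SwitchScope` and its enum are hypothetical and do not reproduce the actual `MetaHandler` code or the exact fix applied in this PR.

```java
// Hypothetical example for illustration; not the MetaHandler code.
final class SwitchScope {
  enum Type { APP_HEARTBEAT, WORKER_HEARTBEAT }

  static String handle(Type type, long now) {
    switch (type) {
      case APP_HEARTBEAT: {
        long time = now; // scoped to this branch only
        return "app@" + time;
      }
      case WORKER_HEARTBEAT: {
        long time = now + 1; // an independent variable with the same name
        return "worker@" + time;
      }
      default:
        throw new IllegalArgumentException("unknown type: " + type);
    }
  }
}
```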

### Why are the changes needed?

- Incorrect code pattern.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI tests.

Closes #2012 from bowenliang123/time.

Authored-by: liangbowen <liangbowen@gf.com.cn>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-23 11:15:18 +08:00
Mridul Muralidharan
05c9ed0563 [CELEBORN-1069] Avoid double brace initialization
Avoid double brace initialization. See more [here](https://errorprone.info/bugpattern/DoubleBraceInitialization)
Note that in this case, there is no actual functional or performance issue - since it is happening in test cases.

Once fixed, error-prone can catch future violations as part of build as Celeborn evolves.
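
For readers unfamiliar with the pattern, the sketch below contrasts double brace initialization with a plain alternative. The outer braces declare an anonymous subclass and the inner braces are an instance initializer, so the resulting object is not a plain `HashMap`. `DoubleBraceDemo` is a hypothetical class written for illustration and is not taken from the Celeborn test code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example for illustration; not taken from the Celeborn tests.
final class DoubleBraceDemo {
  // Double brace initialization: an anonymous subclass of HashMap whose
  // instance initializer calls put().
  static Map<String, Integer> doubleBrace() {
    return new HashMap<String, Integer>() {{
      put("a", 1);
    }};
  }

  // Plain alternative: same contents, no extra class.
  static Map<String, Integer> plain() {
    Map<String, Integer> m = new HashMap<>();
    m.put("a", 1);
    return m;
  }
}
```

Both maps are equal by content, but only the second one has `HashMap` as its runtime class.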

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #2020 from mridulm/avoid-double-brace-initialization.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-23 11:15:15 +08:00
SteNicholas
7276dd024c [CELEBORN-1035] Expose RunningApplicationCount, PartitionWritten and PartitionFileCount metric by Celeborn master
### What changes were proposed in this pull request?

Meta manager records `appHeartbeatTime`, `partitionTotalWritten` and `partitionTotalFileCount`, which are useful to monitor the application heartbeat and shuffle partition. `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics are exposed by Celeborn master to monitor the application and shuffle partition.

### Why are the changes needed?

`Master` exposes `RunningApplicationCount`, `PartitionWritten` and `PartitionFileCount` metrics.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

Internal tests.

Closes #1976 from SteNicholas/CELEBORN-1035.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 22:07:17 +08:00
zky.zhoukeyong
a5dfd67d5b [CELEBORN-1034] Offer slots uses random range of available workers instead of shuffling
### What changes were proposed in this pull request?
In the original design, (primary worker, replica worker) pairs tend to stay stable. For example,
for primary PartitionLocations on Worker A, the replica PartitionLocations will all be on
Worker B in general scenarios, i.e. when all workers are healthy and work well. This way, one Worker
has only one (or very few) connections to other workers' replicate Netty servers.

However, https://github.com/apache/incubator-celeborn/pull/1790 calls `Collections.shuffle(availableWorkers)`,
causing the number of replica connections to increase dramatically:
![image](https://github.com/apache/incubator-celeborn/assets/948245/013c7bc8-a224-413e-9c0c-519ae76c9d32)

### Why are the changes needed?
This PR refines the logic for selecting a limited number of workers: instead of shuffling,
the Master just randomly picks a range of available workers.
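
The idea of picking a range instead of shuffling can be sketched as follows. This is an illustration under assumptions, not the code in `SlotsAllocator`: `WorkerRange` and `pickRange` are hypothetical names, and wrapping around the end of the list is an assumption made here to keep the example simple. Because the selected workers keep their relative order, neighbouring workers stay neighbours, which is what keeps (primary, replica) pairs stable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not Celeborn's SlotsAllocator.
final class WorkerRange {
  // Picks up to `limit` consecutive workers starting at a random index,
  // wrapping around the end of the list (the wrap-around is an assumption).
  static <T> List<T> pickRange(List<T> workers, int limit, Random rand) {
    int n = workers.size();
    if (n == 0 || limit <= 0) {
      return new ArrayList<>();
    }
    int count = Math.min(limit, n);
    int start = rand.nextInt(n);
    List<T> picked = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      picked.add(workers.get((start + i) % n));
    }
    return picked;
  }
}
```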

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1975 from waitinfuture/1034.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-18 17:00:03 +08:00
mingji
69defcad7f [CELEBORN-1021] Celeborn support arbitary Ratis configs and client rpc timeout
### What changes were proposed in this pull request?
1. To support arbitrary Ratis configs
2. To support Ratis client rpc timeout

### Why are the changes needed?
After some digging, I found that Celeborn never changes the default timeout configuration of the Ratis client.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1969 from FMX/CELEBORN-1021.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-18 10:26:11 +08:00
xleoken
f6dcfaa37f [CELEBORN-1044] Enhance the check of parameter array length
### What changes were proposed in this pull request?

We can't get any response from `/conf` when the master is started with the default Celeborn conf.

![e8c649b733e0c8495bb6555dfb7c5e58_13063594_image-2023-10-17-11-37-15-261](https://github.com/apache/incubator-celeborn/assets/95013770/a6de4496-f53f-46ad-94b6-e02adaa6fbfc)

**Internal Exception**
```
empty.max
java.lang.UnsupportedOperationException: empty.max
	at scala.collection.TraversableOnce.max(TraversableOnce.scala:275)
	at scala.collection.TraversableOnce.max$(TraversableOnce.scala:273)
	at scala.collection.AbstractTraversable.max(Traversable.scala:108)
	at org.apache.celeborn.server.common.HttpService.getConf(HttpService.scala:36)
	at org.apache.celeborn.service.deploy.master.MasterSuite.$anonfun$new$1(MasterSuite.scala:46)
```
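
The `empty.max` error above comes from calling Scala's `max` on an empty collection. Java behaves in a similar way: `Collections.max` throws `NoSuchElementException` for an empty collection. The sketch below shows a Java analog of guarding such a call. `SafeMax` and `maxLength` are hypothetical, and the sketch does not reproduce `HttpService.getConf` or the exact check added by this PR.

```java
import java.util.Collections;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical Java analog for illustration; not the Scala HttpService code.
final class SafeMax {
  // Returns the length of the longest key, or 0 when there are no keys.
  static int maxLength(List<String> keys) {
    return keys.stream().mapToInt(String::length).max().orElse(0);
  }

  // Shows that the unguarded call fails on an empty collection.
  static boolean unguardedMaxThrows() {
    try {
      Collections.max(Collections.<Integer>emptyList());
      return false;
    } catch (NoSuchElementException e) {
      return true;
    }
  }
}
```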

### Why are the changes needed?

Bug.

### How was this patch tested?

Local

Closes #1995 from xleoken/patch5.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-17 20:52:36 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The ports bound by `celeborn.metrics.master.prometheus.port` and `celeborn.metrics.worker.prometheus.port` not only serve Prometheus metrics but also provide some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
xleoken
304b796475 [MINOR] Fix wrong description about app list
### What changes were proposed in this pull request?

Fix wrong description about app list.

### How was this patch tested?

local.

Closes #1979 from xleoken/patch2.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:43:03 +08:00
onebox-li
a47f6169d8 [MINOR] Fix some typos
### What changes were proposed in this pull request?
Fix some typos

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
-

Closes #1983 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:34:07 +08:00
zml1206
74a5acfd7b [CELEBORN-1038] Clean up deprecated api
### What changes were proposed in this pull request?
Replace `org.apache.commons.io.Charsets.UTF_8` with `java.nio.charset.StandardCharsets.UTF_8`.
Replace `Assert.assertEquals` with `Assert.assertArrayEquals`.
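
The second replacement matters because Java arrays do not override `equals`, so comparing two arrays through `equals` checks identity rather than contents. The sketch below uses only the JDK to show both points. `CharsetDemo` is a hypothetical class, and `Arrays.equals` stands in for the element-wise comparison that an array-aware assertion performs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical example for illustration; uses only JDK classes.
final class CharsetDemo {
  // Encodes with the JDK constant instead of the deprecated commons-io one.
  static byte[] encode(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Element-wise comparison of two byte arrays.
  static boolean sameContent(byte[] a, byte[] b) {
    return Arrays.equals(a, b);
  }
}
```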

### Why are the changes needed?

Clean up deprecated api.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Existing UT.

Closes #1980 from zml1206/CELEBORN-1038.

Authored-by: zml1206 <zhuml1206@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-12 17:14:08 +08:00
SteNicholas
84a318f716 [CELEBORN-1033] MasterNotLeaderException should provide the cause of exception
### What changes were proposed in this pull request?

`HAHelper#sendFailure` sends only a `MasterNotLeaderException` without a cause, so the actual exception behind the `MasterNotLeaderException` is not available for troubleshooting.

### Why are the changes needed?

`MasterNotLeaderException` should provide the cause of the exception.
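
Carrying the cause is the standard Java exception-chaining pattern: the wrapping exception passes the original `Throwable` to its superclass constructor so that it appears as a `Caused by:` section in stack traces. The sketch below illustrates the pattern with a hypothetical `NotLeaderException`. It is not the real `MasterNotLeaderException`, whose constructor signature is not shown in this description.

```java
// Hypothetical classes for illustration; not the real MasterNotLeaderException.
final class CauseDemo {
  static class NotLeaderException extends RuntimeException {
    NotLeaderException(String message, Throwable cause) {
      super(message, cause); // keeps the original exception reachable via getCause()
    }
  }

  // Wraps the underlying failure instead of dropping it.
  static RuntimeException wrap(String suggestedLeader, Throwable cause) {
    return new NotLeaderException("not the leader, suggested leader is " + suggestedLeader, cause);
  }
}
```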

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterClientSuiteJ`

Closes #1972 from SteNicholas/CELEBORN-1033.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-11 20:18:58 +08:00
sychen
f6d27609b8 [CELEBORN-1028] Make prometheus path configurable
### What changes were proposed in this pull request?
Add `celeborn.metrics.prometheus.path` to make the Prometheus path configurable.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1965 from cxzl25/CELEBORN-1028.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 18:37:44 +08:00
Fu Chen
b2412d0774 [CELEBORN-1022][TEST] Update log level from FATAL to ERROR for console output in unit tests
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

1. This is developer-friendly for debugging unit tests in IntelliJ IDEA. For example, Netty's memory leak reports are logged at the error level and won't cause unit tests to be marked as fatal.

```
23/10/09 09:57:26,422 ERROR [fetch-server-52-2] ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
	io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.lang.Thread.run(Thread.java:750)
```

2. This won't increase console output or affect the stability of CI.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1958 from cfmcgrady/ut-console-log-level.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 15:56:05 +08:00
xleoken
ccea1bda39 [MINOR] Fix integer overflow in expression
### What changes were proposed in this pull request?

Fix an integer overflow in an expression: `64 * 1024 * 1024 * 1024` is evaluated as an `int` and results in `0`, not the intended long value.
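
The overflow is easy to reproduce in isolation. All four operands are `int` literals, so the product is computed in 32-bit arithmetic, and 2^36 wraps to `0` before any widening to `long` takes place. Making the first operand a `long` literal is one way to fix it. `OverflowDemo` is a hypothetical class written for illustration.

```java
// Hypothetical example for illustration.
final class OverflowDemo {
  // All operands are int: the product overflows to 0 and is widened afterwards.
  static long intArithmetic() {
    return 64 * 1024 * 1024 * 1024;
  }

  // The first operand is a long literal, so the whole product is computed as long.
  static long longArithmetic() {
    return 64L * 1024 * 1024 * 1024;
  }
}
```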

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1951 from xleoken/patch.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-30 21:43:47 +08:00
sychen
5310bcaf6b [CELEBORN-313] Add rest endpoint to show master group info
### What changes were proposed in this pull request?

<img width="1347" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/43d10bff-6878-4591-9461-889494d797f9">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

```bash
./bin/celeborn-ratis sh -Draft.rpc.type=NETTY  group info   -peers clb-1:9872,clb-2:9873,clb-3:9874
```

```
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

```bash
curl http://clb-3:9983/masterGroupInfo
```

```
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

Closes #1946 from cxzl25/CELEBORN-313.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:08:31 +08:00
sychen
16bf2aeeaa [CELEBORN-1013] Shutdown master if initialized failed
### What changes were proposed in this pull request?
```java
23/09/28 14:48:12,512 ERROR [main] Master: Initialize master failed.
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:461)
	at sun.nio.ch.Net.bind(Net.java:453)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
```

### Why are the changes needed?
For example, when the HTTP service port (`celeborn.metrics.master.prometheus.port`) is already occupied, binding fails and master startup fails. However, because the thread started by Raft is not a daemon thread, the master process keeps running.

d461a01a53/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java (L283-L290)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1945 from cxzl25/CELEBORN-1013.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:02:59 +08:00
sychen
a9ed7f6a39 [CELEBORN-986] Use formatted log instead of string concat
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
GA

Closes #1941 from cxzl25/CELEBORN-986.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-28 16:29:58 +08:00
Mridul Muralidharan
3a41db360b [CELEBORN-1006] Add support for Apache Hadoop 2.x in Celeborn build
Add support for Apache Hadoop 2.x in Celeborn build
Developers need to only specify their `hadoop.version`, and the build will pick the right profile internally based on the version to add the relevant dependencies.

[hadoop-client-api](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api) and [hadoop-client-runtime](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-runtime) were introduced in hadoop 3.x, while hadoop 2.x had [hadoop-client](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client)
Celeborn depends on the former, and so requires hadoop 3.x to build.

Apache Spark dropped support for Hadoop 2.x only in the recent v3.5 ([SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452)). Given this, deployments on supported platforms such as Spark 3.4 and older that run on Hadoop 2.x would need to pull in Hadoop 3.x just for Celeborn.

This PR uses `hadoop-client` when `hadoop.version` is specified as 2.x - and preserves existing behavior when `hadoop.version` is 3.x

Note - while using `hadoop-client` in 3.x is an option, hadoop community recommendation is to rely on `hadoop-client-api`/`hadoop-client-runtime`, hence making an effort to leverage that as much as possible.

### Does this PR introduce _any_ user-facing change?
Adds support for using 2.x for hadoop.version

### How was this patch tested?
Three combinations were tested:

* Default, without overriding hadoop.version

Dependencies:
```
$ build/mvn dependency:list 2>&1 | grep hadoop | sort | uniq
[INFO]    org.apache.hadoop:hadoop-client-api:jar:3.2.4:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.2.4:compile
```

Will update this section again based on test suite results (which are ongoing)

* Setting hadoop.version to newer 3.3.0 explicitly

Dependencies:
```
$ ARGS="-Pspark-3.1 -Dhadoop.version=3.3.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | sort | uniq
[INFO]    org.apache.hadoop:hadoop-client-api:jar:3.3.0:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.3.0:compile
```

* Setting hadoop.version to older 2.10.0

Dependencies:
```
$ ARGS="-Pspark-3.1 -Dhadoop.version=2.10.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | grep compile | sort | uniq
[INFO]    org.apache.hadoop:hadoop-auth:jar:2.10.0:compile -- module hadoop.auth (auto)
[INFO]    org.apache.hadoop:hadoop-client:jar:2.10.0:compile -- module hadoop.client (auto)
[INFO]    org.apache.hadoop:hadoop-common:jar:2.10.0:compile -- module hadoop.common (auto)
[INFO]    org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile -- module hadoop.hdfs.client (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.10.0:compile -- module hadoop.mapreduce.client.app (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.10.0:compile -- module hadoop.mapreduce.client.common (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile -- module hadoop.mapreduce.client.core (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.10.0:compile
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.10.0:compile -- module hadoop.mapreduce.client.shuffle (auto)
[INFO]    org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:compile -- module hadoop.yarn.api (auto)
[INFO]    org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile -- module hadoop.yarn.common (auto)
```

For each of the cases above, build and tests pass with the corresponding `ARGS`.

Closes #1936 from mridulm/main.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-25 20:15:02 +08:00
ming.li
428e2660bc [CELEBORN-990] Add exception handler when calling CelebornHadoopUtils.getHadoopFS
Add an exception handler when calling `CelebornHadoopUtils.getHadoopFS(conf)` on Master and Worker, to avoid concealing HDFS initialization exception information.

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1923 from leemingzixxoo/main.

Authored-by: ming.li <ming.li@dmall.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-19 19:44:59 +08:00
Shuang
615479c442 [CELEBORN-468] Timeout useless lostWorkers/shutdownWorkers meta
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
If a Worker is lost, or is lost after a graceful shutdown, the Master retains its lostWorkers/shutdownWorkers meta permanently.
This meta causes noisy messages in LifecycleManager, so it is better to delete it after a while.
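
Expiring stale entries after a timeout can be sketched as below. This is a toy model under assumptions, not the Master's meta handling: `LostWorkerExpiry` and `expire` are hypothetical, and the map from worker id to the time the worker was recorded as lost is an assumed representation.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the Celeborn Master's meta code.
final class LostWorkerExpiry {
  // `lostWorkers` maps a worker id to the time (epoch millis) at which the worker
  // was recorded as lost. Removes entries older than `timeoutMillis` and returns
  // how many entries were removed.
  static int expire(Map<String, Long> lostWorkers, long nowMillis, long timeoutMillis) {
    int before = lostWorkers.size();
    lostWorkers.values().removeIf(lostAt -> nowMillis - lostAt > timeoutMillis);
    return before - lostWorkers.size();
  }
}
```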

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT & E2E test

Closes #1916 from RexXiong/CELEBORN-468.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 18:39:43 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a cluster of 4 nodes, each with 16 CPUs, 64 GB of memory, and 4 ESSDs.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s vs MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
onebox-li
0e53a3d552 [CELEBORN-932] Fix worker register after gracefaully restart
### What changes were proposed in this pull request?
In HA mode, a worker's first registration fails after the worker gracefully restarts; the worker is only actually registered after one heartbeat.
<img width="889" alt="image" src="https://github.com/apache/incubator-celeborn/assets/19429353/371aa0e0-b2e9-4c1f-9e40-276dc1460219">
This is because the master uses the same `requestId` to submit both requests, so the second request is not processed correctly because of the Ratis `RetryCache`.
The master logs look like this:
(worker gracefully stop)
Master: Receive ReportNodeFailure
(worker start)
Master: Received RegisterWorker request
Master: Received heartbeat from unknown worker
Master: Registered worker

So this PR improves AbstractMetaManager#updateRegisterWorkerMeta to cover the `WorkerRemove` logic. For backward compatibility and to avoid possible inconsistencies during a rolling upgrade, it temporarily fixes the duplicate requestId and keeps the remove function. We can try to remove the `WorkerRemove` logic in a future version.
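
Why a reused `requestId` hides the second request can be shown with a toy retry cache. This is a simplified illustration and not the Ratis implementation; it assumes the cache replays the stored reply for any request id it has already seen.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified illustration of a retry cache; this is not the Ratis implementation.
final class RetryCacheSketch {
  private final Map<String, String> repliesByRequestId = new HashMap<>();

  // A request whose id was already seen gets the cached reply and is not re-executed.
  String submit(String requestId, Supplier<String> action) {
    return repliesByRequestId.computeIfAbsent(requestId, id -> action.get());
  }
}
```

Under that assumption, two different operations submitted with the same id behave like one operation plus a retry, which is why giving each request a distinct id avoids the problem.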

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #1863 from onebox-li/fix-restart-register.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-11 21:23:28 +08:00
sychen
cb8ace406b [CELEBORN-960] Exclude workers without healthy disks
### What changes were proposed in this pull request?
The master checks the number of healthy disks in the worker and decides whether to exclude it.
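
A minimal sketch of such a check follows. The `Worker` type and the `hdfsEnabled` parameter are hypothetical; treating HDFS as a reason to keep workers without local disks is an assumption drawn from the motivation described in this commit.

```java
import java.util.List;
import java.util.stream.Collectors;

final class HealthyDiskFilterSketch {
  // Hypothetical worker view: an id plus the number of healthy disks it reported.
  static final class Worker {
    final String id;
    final int healthyDisks;

    Worker(String id, int healthyDisks) {
      this.id = id;
      this.healthyDisks = healthyDisks;
    }
  }

  // Keep workers with at least one healthy disk. When HDFS is enabled
  // (an assumption of this sketch), local disks are not required.
  static List<String> available(List<Worker> workers, boolean hdfsEnabled) {
    return workers.stream()
        .filter(w -> hdfsEnabled || w.healthyDisks > 0)
        .map(w -> w.id)
        .collect(Collectors.toList());
  }
}
```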

### Why are the changes needed?

When the disks of all workers are unhealthy, HDFS is not enabled, and the master does not exclude those workers, the Spark client's call to checkWorkersAvailable still reports workers as available, and the shuffle write ultimately fails without falling back.

```java
23/09/08 23:20:44 ERROR LifecycleManager: Aggregated error of reserveSlots for shuffleId 9 failure:
 [reserveSlots] Failed to reserve buffers for shuffleId 9 from worker Host:1.2.3.4:RpcPort:55803:PushPort:55805:FetchPort:55807:ReplicatePort:55806. Reason: Local storage has no available dirs!
23/09/08 23:20:44 ERROR LifecycleManager: Retry reserve slots for 9 failed caused by not enough slots.
23/09/08 23:20:44 WARN LifecycleManager: Reserve buffers for 9 still fail after retrying, clear buffers.
23/09/08 23:20:44 ERROR LifecycleManager: reserve buffer for 9 failed, reply to all.
23/09/08 23:20:44 ERROR ShuffleClientImpl: LifecycleManager request slots return RESERVE_SLOTS_FAILED, retry again, remain retry times 0.
23/09/08 23:20:47 WARN TaskSetManager: Lost task 8.0 in stage 27.0 (TID 89) (1.2.3.4 executor driver): TaskKilled (Stage cancelled)
23/09/08 23:20:59 ERROR MasterClient: Send rpc with failure, has tried 15, max try 15!
org.apache.celeborn.common.exception.CelebornException: Exception thrown in awaitResult:
	at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:229)
	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
	at org.apache.celeborn.common.client.MasterClient.sendMessageInner(MasterClient.java:150)
	at org.apache.celeborn.common.client.MasterClient.askSync(MasterClient.java:118)
	at org.apache.celeborn.client.LifecycleManager.requestMasterRequestSlots(LifecycleManager.scala:1033)
	at org.apache.celeborn.client.LifecycleManager.requestMasterRequestSlotsWithRetry(LifecycleManager.scala:1022)
	at org.apache.celeborn.client.LifecycleManager.org$apache$celeborn$client$LifecycleManager$$offerAndReserveSlots(LifecycleManager.scala:402)
	at org.apache.celeborn.client.LifecycleManager$$anonfun$receiveAndReply$1.applyOrElse(LifecycleManager.scala:210)
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
local test
```
23/09/08 23:23:27 WARN CelebornShuffleFallbackPolicyRunner: No workers available for current user `default`.`default`.
23/09/08 23:23:27 WARN SparkShuffleManager: Fallback to vanilla Spark SortShuffleManager for shuffle: 10
23/09/08 23:23:28 WARN CelebornShuffleFallbackPolicyRunner: No workers available for current user `default`.`default`.
23/09/08 23:23:28 WARN SparkShuffleManager: Fallback to vanilla Spark SortShuffleManager for shuffle: 11
100000
Time taken: 0.192 seconds, Fetched 1 row(s)
```

Closes #1893 from cxzl25/CELEBORN-960.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-09 18:52:25 +08:00
SteNicholas
4625484d2c [CELEBORN-830] Check available workers in CelebornShuffleFallbackPolicyRunner
### What changes were proposed in this pull request?

`CelebornShuffleFallbackPolicyRunner` should check not only the quota but also whether the cluster has available workers. If there are no available workers, fall back to external shuffle.
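
The combined decision can be sketched as a single predicate. The method and parameter names are hypothetical and the real runner may evaluate further conditions.

```java
final class FallbackPolicySketch {
  // Fall back when the quota is exhausted or when no worker is available.
  static boolean shouldFallback(boolean quotaAvailable, boolean workersAvailable) {
    return !quotaAvailable || !workersAvailable;
  }
}
```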

### Why are the changes needed?

`CelebornShuffleFallbackPolicyRunner` adds a check for available workers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `SparkShuffleManagerSuite#testClusterNotAvailableWithAvailableWorkers`

Closes #1814 from SteNicholas/CELEBORN-830.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-29 16:56:56 +08:00
Keyong Zhou
1d04a23289 [CELEBORN-920] Worker sends its load to Master through heartbeat
### What changes were proposed in this pull request?

Adding a flag that indicates high load to the worker's heartbeat allows the master to schedule workers better.

### Why are the changes needed?

In our production environment, there is a node with abnormally high load, but the master is not aware of this situation. It assigned numerous jobs to this node, and as a result, the stability of these jobs has been affected.
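
One way the master could use such a flag is sketched below. All names are hypothetical, and skipping flagged workers entirely is an assumption of this sketch rather than the scheduling rule this commit implements.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HighLoadHeartbeatSketch {
  // worker id -> latest high-load flag received in its heartbeat
  private final Map<String, Boolean> highLoadByWorker = new LinkedHashMap<>();

  void onHeartbeat(String workerId, boolean highLoad) {
    highLoadByWorker.put(workerId, highLoad);
  }

  // Workers eligible for new slots: those not currently flagged as highly loaded.
  List<String> schedulable() {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Boolean> e : highLoadByWorker.entrySet()) {
      if (!e.getValue()) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
```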

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

UT

Closes #1840 from JQ-Cao/920.

Lead-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-26 13:58:37 +08:00
mingji
7d0e257001 [CELEBORN-846][FOLLOWUP] Fix broken link caused by unknown RPC
### What changes were proposed in this pull request?
Keep the ReleaseSlots RPC to make sure that a 0.3 client can work with 0.3.1-SNAPSHOT and 0.4.0-SNAPSHOT.
This PR needs to be merged into main and branch-0.3.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1794 from FMX/CELEBORN-846-FOLLOWUP.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-11 22:00:51 +08:00
zky.zhoukeyong
6ea1ee2ec4 [CELEBORN-152] Add config to limit max workers when offering slots
### What changes were proposed in this pull request?
Add a config to limit the maximum number of workers when offering slots. The config can be set on both
the server side and the client side. Celeborn chooses the smaller positive value of the client and master configs.
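
The "smaller positive value" rule can be sketched as follows. The method name is hypothetical, and returning -1 when neither side sets a positive limit is a convention assumed only for this sketch.

```java
final class MaxWorkersSketch {
  // Returns the smaller positive limit; a non-positive value is treated as "not set".
  // Returns -1 when neither side sets a positive limit (a convention assumed here).
  static int effectiveMaxWorkers(int clientLimit, int masterLimit) {
    if (clientLimit <= 0) {
      return masterLimit > 0 ? masterLimit : -1;
    }
    if (masterLimit <= 0) {
      return clientLimit;
    }
    return Math.min(clientLimit, masterLimit);
  }
}
```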

### Why are the changes needed?
For large Celeborn clusters, users may want to limit the number of workers that
a shuffle can spread across. The reasons are:

1. One worker failure will not affect all applications
2. One huge shuffle will not affect all applications
3. It's more efficient to limit a shuffle to a restricted number of workers, say 100, than
   to spread it across a large number of workers, say 1000, because the number of network connections
   for pushing data is `number of ShuffleClient` * `number of allocated Workers`

The recommended number of Workers should depend on workload and Worker hardware,
and this can be configured per application, so it's relatively flexible.
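
The connection count from reason 3 is simple arithmetic. The sketch below uses a hypothetical figure of 2000 ShuffleClient instances, which gives 200,000 connections when the shuffle is limited to 100 workers and 2,000,000 when it spreads over 1000.

```java
final class PushConnectionsSketch {
  // Connections for pushing data: number of ShuffleClient * number of allocated Workers.
  static long connections(long shuffleClients, long allocatedWorkers) {
    return shuffleClients * allocatedWorkers;
  }

  public static void main(String[] args) {
    long clients = 2000; // hypothetical number of ShuffleClient instances
    System.out.println(connections(clients, 100));  // shuffle limited to 100 workers
    System.out.println(connections(clients, 1000)); // shuffle spread over 1000 workers
  }
}
```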

### Does this PR introduce _any_ user-facing change?
No, added a new configuration.

### How was this patch tested?
Added ITs and passes GA.

Closes #1790 from waitinfuture/152.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-07 10:13:53 +08:00
zwangsheng
6e9a98a28f [CELEBORN-872][MASTER] Extract the same allocation logic for both loadaware and roundrobin
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Reduce duplicate code segments, improve code readability, and reduce maintenance difficulty.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit Test

Closes #1786 from zwangsheng/CELEBORN-872.

Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-03 20:14:45 +08:00
zwangsheng
5e6a23fd88 [CELEBORN-868][MASTER] Adjust logic of SlotsAllocator#offerSlotsLoadAware fallback to roundrobin
### What changes were proposed in this pull request?
Fall back in the following order:

1. usableDisks is empty (no need to call iter)
2. In the replication case, fall back fast when usableDisks == 1
3. Count distinct workers
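
The order of checks can be sketched as below. The `Disk` type and method name are hypothetical, and the threshold in step 3 (falling back when replication is on and the usable disks belong to a single worker) is an assumption, since the text only says that distinct workers are counted.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LoadAwareFallbackSketch {
  // Hypothetical disk view: only the owning worker matters for this check.
  static final class Disk {
    final String workerId;

    Disk(String workerId) {
      this.workerId = workerId;
    }
  }

  static boolean shouldFallback(List<Disk> usableDisks, boolean shouldReplicate) {
    // 1. No usable disks: fall back without iterating.
    if (usableDisks.isEmpty()) {
      return true;
    }
    // 2. Replication with a single usable disk: fast fallback.
    if (shouldReplicate && usableDisks.size() == 1) {
      return true;
    }
    // 3. Count distinct workers (assumed rule: replication needs at least two).
    if (shouldReplicate) {
      Set<String> workers = new HashSet<>();
      for (Disk d : usableDisks) {
        workers.add(d.workerId);
      }
      return workers.size() <= 1;
    }
    return false;
  }
}
```

Ordering the checks from cheapest to most expensive means the set of distinct workers is built only when the two constant-time checks do not already decide the outcome.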

### Why are the changes needed?
Make the logic here clearer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit Test

Closes #1781 from zwangsheng/CELEBORN-868.

Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-01 20:39:23 +08:00
Angerszhuuuu
e82a8e8992 [CELEBORN-846] Remove unused updateReleaseSlotsMeta in master side
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

CELEBORN-791 removed sending the ReleaseSlotsRequest from worker, so Master is not required to handle it.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1767 from AngersZhuuuu/CELEBORN-846.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 17:46:00 +08:00
Angerszhuuuu
2ab88f773a [CELEBORN-819] Worker close should pass close status to support handle graceful shutdown and decommission
### What changes were proposed in this pull request?
Pass the exit kind to each component. Depending on the exit kind:

- GRACEFUL_SHUTDOWN: behaves as the original code does when graceful == true
- Others: clean the LevelDB file.
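
The dispatch on exit kind can be sketched as follows. Only `GRACEFUL_SHUTDOWN` is named in the text; the other enum constants and the method name are placeholders for illustration.

```java
final class ExitKindSketch {
  // Only GRACEFUL_SHUTDOWN is named in the text; the other constants are placeholders.
  enum ExitKind { GRACEFUL_SHUTDOWN, DECOMMISSION, EXIT_IMMEDIATELY }

  // Whether a component should delete its LevelDB file when it closes.
  static boolean shouldCleanLevelDb(ExitKind kind) {
    return kind != ExitKind.GRACEFUL_SHUTDOWN;
  }
}
```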

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1748 from AngersZhuuuu/CELEBORN-819.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-25 14:54:01 +08:00
Angerszhuuuu
76201c92f8 [CELEBORN-820] Merge service shutdown and close method
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1742 from AngersZhuuuu/CELEBORN-820.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-22 21:04:29 +08:00
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
mingji
a4687716d2 [CELEBORN-791] Remove slots allocation simulation in master and use active slots sent from worker's heartbeat
### What changes were proposed in this pull request?
The master no longer simulates slot allocation and instead uses the active slots sent by the worker.

### Why are the changes needed?
I have observed that a new worker might be allocated more slots than other workers when the round-robin slot allocation algorithm is used.
There is a logic error in processing heartbeats from workers. The master updates a disk info's active slots to max(the current disk info's active slots, the active slots in the disk info sent by the worker). If a huge shuffle is registered, the master allocates more slots than a disk's max slots and marks the excess as unknown disk slots, but the worker counts the unknown disk slots as active slots and reports them to the master. The slot release logic then cannot distinguish unknown slots from a plain count, so a release does not decrease active slots properly.
Because of this gap between worker and master, I think it's OK to remove the slot allocation simulation from the master and use the active slots reported by the worker.
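
The effect of the `max` rule can be seen in a two-line comparison. The method names are hypothetical; the sketch only contrasts the old merge rule described above with taking the worker's report as is.

```java
final class ActiveSlotsSketch {
  // Old rule described above: keep the larger of the master's count and the worker's report.
  static long mergeWithMax(long masterCount, long workerReported) {
    return Math.max(masterCount, workerReported);
  }

  // After the change: the worker's reported active slots are used directly.
  static long useWorkerReport(long masterCount, long workerReported) {
    return workerReported;
  }
}
```

With the `max` rule, an inflated count on the master never shrinks back to the worker's lower report, which matches the imbalance described above.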

Before this patch:
<img width="928" alt="Screenshot 2023-07-12 16 51 15" src="https://github.com/apache/incubator-celeborn/assets/4150993/9c8a46d9-26a8-42f5-a956-938273277c9b">

After this patch:
<img width="509" alt="Screenshot 2023-07-12 16 25 52" src="https://github.com/apache/incubator-celeborn/assets/4150993/c49b3d91-14ea-4eb8-9b71-9aab73541faf">

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1710 from FMX/CELEBORN-791.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:40:55 +08:00
Angerszhuuuu
b4dfb0352b [CELEBORN-733] Clean unused GetBlacklist & GetBlacklistResponse
### What changes were proposed in this pull request?
Clean unused GetBlacklist & GetBlacklistResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1656 from AngersZhuuuu/CELEBORN-733.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-11 15:58:20 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix Celeborn on HDFS possibly cleaning in-use app directories
### What changes were proposed in this pull request?
Make the Celeborn leader clean expired app dirs on HDFS when an application is lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, so a newly created worker will try to delete any app directories unknown to it.
This causes in-use app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remaining RSS-related class names, filenames, etc.
### What changes were proposed in this pull request?
Rename remaining RSS-related class names, filenames, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Cheng Pan
a1be02b4fa [CELEBORN-757] Improve metrics method signature and code style
### What changes were proposed in this pull request?

- gauge method definition improvement. i.e.

  before
  ```
  def addGauge[T](name: String, f: Unit => T, labels: Map[String, String])
  ```
  after
  ```
  def addGauge[T](name: String, labels: Map[String, String])(f: () => T)
  ```
  which improves the caller-side code style
  ```
  addGauge(name, labels) { () =>
    ...
  }
  ```

- remove unnecessary Java/Scala collection conversion. Since Scala 2.11 does not support SAM, the extra implicit function is required.

- leverage Logging to defer message evaluation

- UPPER_CASE string constants
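
Deferred message evaluation, mentioned in the list above, can be illustrated with a Java analogue. The commit itself concerns Scala's `Logging`; the class below is hypothetical and uses a `Supplier` so that the message is built only when the level is enabled.

```java
import java.util.function.Supplier;

final class LazyLogSketch {
  private final boolean debugEnabled;
  int evaluations = 0; // counts how often a message was actually built

  LazyLogSketch(boolean debugEnabled) {
    this.debugEnabled = debugEnabled;
  }

  // The supplier is only invoked when debug logging is enabled.
  void debug(Supplier<String> message) {
    if (debugEnabled) {
      evaluations++;
      System.out.println(message.get());
    }
  }
}
```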

### Why are the changes needed?

Improve code quality and (maybe) performance

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1670 from pan3793/CELEBORN-757.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 11:56:43 +08:00
Fu Chen
adbd38a926 [CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00