### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR addresses an NPE that occurs when the `Worker#registered` member is accessed before it is initialized.
The problem arises because the `TransportChannelHandler` may start serving requests before the worker is registered.
```
24/02/01 15:07:32,090 WARN [push-server-6-6] TransportChannelHandler: Exception in connection from /xx.xx.xx.xx:xxx
java.lang.NullPointerException
at org.apache.celeborn.service.deploy.worker.PushDataHandler.checkRegistered(PushDataHandler.scala:714)
at org.apache.celeborn.common.network.server.TransportRequestHandler.checkRegistered(TransportRequestHandler.java:82)
at org.apache.celeborn.common.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:76)
at org.apache.celeborn.common.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:151)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at org.apache.celeborn.common.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:74)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)
```
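One way to guard against this kind of startup race is to treat an uninitialized flag as "not registered". The sketch below is illustrative only and is not necessarily the fix applied in this PR; `RegistrationGuard`, `init`, and the nullable field are hypothetical names, not Celeborn classes.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the worker-side state consulted by the request handler.
final class RegistrationGuard {
  // Stays null until the worker finishes initialization (assumption for illustration).
  private volatile AtomicBoolean registered;

  // Called once the worker has set up its registration state.
  void init(boolean initiallyRegistered) {
    registered = new AtomicBoolean(initiallyRegistered);
  }

  // Null-safe check: a handler that runs before init() sees "not registered"
  // instead of throwing a NullPointerException.
  boolean checkRegistered() {
    AtomicBoolean flag = registered;
    return flag != null && flag.get();
  }

  public static void main(String[] args) {
    RegistrationGuard guard = new RegistrationGuard();
    System.out.println("before init: " + guard.checkRegistered());
    guard.init(true);
    System.out.println("after init: " + guard.checkRegistered());
  }
}
```

Reading the field into a local variable first avoids a second read racing with `init`.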
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #2274 from cfmcgrady/check-registered.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Introduce the `ReplicateDataFailNonCriticalCauseCount` metric in the Grafana dashboard. Follow-up to #2323.
### Why are the changes needed?
The `ReplicateDataFailNonCriticalCauseCount` metric should be supported in the Grafana dashboard via `celeborn-dashboard.json`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/6e50cc2c7af34692babcc2809066e147)
Closes #2332 from SteNicholas/CELEBORN-1282.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
When an `IdleStateEvent` is received, the configuration items of the corresponding module are also logged.
### Why are the changes needed?
Currently, when an `IdleStateEvent` is received, only the timeout is logged; the corresponding configuration items are not.
```
24/02/26 04:12:08,062 [data-client-5-8] ERROR TransportChannelHandler: Connection to /XXX:YYY has been quiet for 240000 ms while there are outstanding requests.
```
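A minimal sketch of the idea, assuming the log line simply appends the name of the setting that controls the timeout. The helper `IdleTimeoutMessage` and the exact wording are hypothetical, and the key used in `main` is an assumed example for a module named `data`, not a quote from this PR.

```java
// Hypothetical helper that formats the idle-connection log line with its config key.
final class IdleTimeoutMessage {
  // configKey is whichever setting controls the idle timeout for the module.
  static String format(String remoteAddress, long quietMillis, String configKey) {
    return String.format(
        "Connection to %s has been quiet for %d ms while there are outstanding requests. "
            + "Consider adjusting %s if this is unexpected.",
        remoteAddress, quietMillis, configKey);
  }

  public static void main(String[] args) {
    // Assumed key name, used only for illustration.
    System.out.println(format("/XXX:YYY", 240000L, "celeborn.data.io.connectionTimeout"));
  }
}
```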
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2329 from cxzl25/CELEBORN-1288.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2321 from cxzl25/CELEBORN-1281.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
### Why are the changes needed?
`web_lint.yml` and `style.yml` use the same group, which causes one of their CIs to fail to run.
```
Canceling since a higher priority waiting request for 'style-refs/pull/PR_ID/merge' exists
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2331 from cxzl25/CELEBORN-1240-FOLLOWUP.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shaoyun Chen <csy@apache.org>
### What changes were proposed in this pull request?
After upgrading to 0.4.0 and backporting the app-level consumption [PR](https://github.com/apache/incubator-celeborn/pull/1174), the WorkerInfo consumption contains a very large amount of information.
Because debug-level logging is enabled for the master, the master prints very large slot information and gets stuck.
This PR fixes the issue.
<img width="1700" alt="Screenshot 2024-02-26 17 36 17" src="https://github.com/apache/incubator-celeborn/assets/46485123/9631f9dd-ee69-4de9-aaf4-c0c7f706cb73">
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #2330 from AngersZhuuuu/CELEBORN-1291.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Optimize the handling of exceptions during the push of replica data, now only throwing PUSH_DATA_CONNECTION_EXCEPTION_REPLICA in specific scenarios.
### Why are the changes needed?
When handling exceptions related to pushing replica data in the worker, unmatched exceptions, such as 'file already closed,' are uniformly transformed into REPLICATE_DATA_CONNECTION_EXCEPTION_COUNT and returned to the client. The client then excludes the peer node based on this count, which may not be appropriate in certain scenarios. For instance, in the case of an exception like 'file already closed,' it typically occurs during multiple splits and commitFile operations. Excluding a large number of nodes under such circumstances is clearly not in line with expectations.
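The distinction described above can be sketched as a small classifier. This is a hypothetical illustration, not the worker's actual code: the `Outcome` enum and the choice of which exception types count as connection failures are assumptions.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.nio.channels.ClosedChannelException;

// Hypothetical classifier for failures seen while pushing data to the replica.
final class ReplicaFailureClassifier {
  // Hypothetical outcome type; the real status codes live in Celeborn's protocol classes.
  enum Outcome { CONNECTION_EXCEPTION, NON_CRITICAL }

  // Only failures that point at the peer connection are reported as connection
  // exceptions; everything else (for example "file already closed") is non-critical,
  // so the client has no reason to exclude the peer.
  static Outcome classify(Throwable error) {
    if (error instanceof ConnectException || error instanceof ClosedChannelException) {
      return Outcome.CONNECTION_EXCEPTION;
    }
    return Outcome.NON_CRITICAL;
  }

  public static void main(String[] args) {
    System.out.println(classify(new ConnectException("connection refused")));
    System.out.println(classify(new IOException("file already closed")));
  }
}
```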

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Through existing UTs.
Closes #2323 from lyy-pineapple/CELEBORN-1282.
Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Add `celeborn.quota.enabled` at Master and Client side to enable checking quota
### Why are the changes needed?
`celeborn.quota.enabled` should be added in Master and Client side to enable quota check for Celeborn Master and Client.
### Does this PR introduce _any_ user-facing change?
Adds the `master` and `client` categories to `celeborn.quota.enabled`.
### How was this patch tested?
No.
Closes #2318 from AngersZhuuuu/CELEBORN-1277.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Add a `tenantConfig.getConfigs().isEmpty()` check in `getTenantUserConfigFromCache`.
### Why are the changes needed?
Add a `tenantConfig.getConfigs().isEmpty()` check in `getTenantUserConfigFromCache`.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
PASS GA
Closes #2324 from jiaoqingbo/1285.
Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve both combine and sort operation of shuffle read for `CelebornShuffleReader` to reduce the number of spills to disk.
### Why are the changes needed?
After the shuffle reader obtains the block, it will first perform a combine operation, and then perform a sort operation. It is known that both combine and sort may generate temporary files, so the performance may be poor when both sort and combine are used. In fact, combine operations can be performed during the sort process, and we can avoid the combine spill file.
Backport: [[SPARK-46512][CORE] Optimize shuffle reading when both sort and combine are used](https://github.com/apache/spark/pull/44512)
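To illustrate why combining during the sort avoids a separate combine pass, here is a small in-memory sketch. It is an analogy only: `SortWithCombine` is a hypothetical class, and it uses a `TreeMap` rather than the external sorter the reader actually relies on.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: merge values for equal keys while building sorted output.
final class SortWithCombine {
  // Each record is inserted into a sorted map and combined with any existing value
  // for the same key, so there is no "combine first, then sort" intermediate result.
  static Map<String, Integer> sortAndCombine(List<Map.Entry<String, Integer>> records) {
    TreeMap<String, Integer> sorted = new TreeMap<>();
    for (Map.Entry<String, Integer> record : records) {
      sorted.merge(record.getKey(), record.getValue(), Integer::sum);
    }
    return sorted;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> records = new ArrayList<>();
    records.add(new AbstractMap.SimpleEntry<>("b", 1));
    records.add(new AbstractMap.SimpleEntry<>("a", 2));
    records.add(new AbstractMap.SimpleEntry<>("b", 3));
    System.out.println(sortAndCombine(records));
  }
}
```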
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA and cluster.
Closes #2326 from SteNicholas/CELEBORN-1287.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Update netty to latest version.
### Why are the changes needed?
[Netty 4.1.107.Final](https://netty.io/news/2024/02/13/4-1-107-Final.html) was released two weeks ago and appears to include many useful changes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #2328 from turboFei/netty_bump.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
https://github.com/apache/incubator-celeborn/pull/2292#discussion_r1497160753
Based on the above discussion, this PR removes the additional secured port. The existing port will be used for secured communication when auth is enabled.
### Why are the changes needed?
These changes are for enabling authentication
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
This removed additional secured port.
Closes #2327 from otterc/CELEBORN-1257.
Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Avoid contention in `TransportClientFactory` and get or create clientPools quickly.
### Why are the changes needed?
Avoid contention for getting or creating clientPools, and clean up the code.
Backport: [[SPARK-38555][NETWORK][SHUFFLE] Avoid contention and get or create clientPools quickly in the TransportClientFactory](https://github.com/apache/spark/pull/35860)
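A minimal sketch of the pattern, assuming the pools are kept in a `ConcurrentHashMap` keyed by address. `ClientPoolRegistry` and its nested `ClientPool` are simplified, hypothetical stand-ins rather than the factory's real classes.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry showing get-or-create with a single atomic map operation.
final class ClientPoolRegistry {
  // Simplified stand-in for the per-address pool of clients.
  static final class ClientPool {
    final Object[] clients;

    ClientPool(int size) {
      clients = new Object[size];
    }
  }

  private final ConcurrentHashMap<String, ClientPool> pools = new ConcurrentHashMap<>();
  private final int connectionsPerPeer;

  ClientPoolRegistry(int connectionsPerPeer) {
    this.connectionsPerPeer = connectionsPerPeer;
  }

  // computeIfAbsent creates the pool at most once per address and otherwise returns
  // the existing pool, replacing a separate get / putIfAbsent / get sequence.
  ClientPool getOrCreate(String address) {
    return pools.computeIfAbsent(address, key -> new ClientPool(connectionsPerPeer));
  }
}
```

Callers that request the same address concurrently all receive the same pool instance, and no speculative pool is allocated and then discarded.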
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2322 from SteNicholas/CELEBORN-1283.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Change the default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` from `LEVELDB` to `ROCKSDB`.
### Why are the changes needed?
Because LevelDB support will be removed, the default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` is changed from LEVELDB to ROCKSDB in preparation for the LevelDB deprecation.
Backport:
[[SPARK-45351][CORE] Change spark.shuffle.service.db.backend default value to ROCKSDB](https://github.com/apache/spark/pull/43142)
[[SPARK-45413][CORE] Add warning for prepare drop LevelDB support](https://github.com/apache/spark/pull/43217)
### Does this PR introduce _any_ user-facing change?
The default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` is changed from `LEVELDB` to `ROCKSDB`.
### How was this patch tested?
No.
Closes #2320 from SteNicholas/CELEBORN-1280.
Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Refer to [SPARK-37984](https://issues.apache.org/jira/browse/SPARK-37984) / https://github.com/apache/spark/pull/35276
Avoid calculating all outstanding requests to improve performance
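The gain comes from asking "is anything outstanding?" instead of "how many are outstanding?". The sketch below shows the difference under assumed field names; `OutstandingRequests` is a hypothetical class and not the response handler itself.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical holder for in-flight requests, with assumed field names.
final class OutstandingRequests {
  final Map<Long, String> outstandingRpcs = new ConcurrentHashMap<>();
  final Map<Long, String> outstandingFetches = new ConcurrentHashMap<>();
  final Queue<String> streamCallbacks = new ConcurrentLinkedQueue<>();

  // Computes a full total; ConcurrentLinkedQueue.size() has to traverse the queue.
  int numOutstandingRequests() {
    return outstandingRpcs.size() + outstandingFetches.size() + streamCallbacks.size();
  }

  // Short-circuits on the first non-empty collection and never computes a total.
  boolean hasOutstandingRequests() {
    return !outstandingRpcs.isEmpty()
        || !outstandingFetches.isEmpty()
        || !streamCallbacks.isEmpty();
  }
}
```

A caller that previously tested `numOutstandingRequests() > 0` can call `hasOutstandingRequests()` and get the same answer with less work.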
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not needed.
Closes #2319 from turboFei/SPARK-37984_backport.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce a REST API, `/listDynamicConfigs`, for listing dynamic configs. The result of `/listDynamicConfigs` is as follows:
```
=========================== Dynamic Configuration ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 100000
celeborn.worker.flusher.buffer.size 64k
=========================== SYSTEM ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 200000
celeborn.worker.flusher.buffer.size 128k
=========================== TENANT ============================
=========================== Tenant: tenantId1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 300000
celeborn.worker.flusher.buffer.size 256k
=========================== TENANT_USER ============================
=========================== Tenant: tenantId1, Name: user1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 400000
celeborn.worker.flusher.buffer.size 512k
```
### Why are the changes needed?
Celeborn currently supports dynamic configuration with `ConfigService`. It is recommended to introduce a REST API for dynamic configuration management.
### Does this PR introduce _any_ user-facing change?
- Introduce a REST API for listing dynamic configuration: `/listDynamicConfigs?level=[system|tenant|tenant_user]&tenant=tenantId1&name=user1`.
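As a usage sketch, the request URI for this endpoint can be assembled as follows. `ListDynamicConfigsUrl` is a hypothetical helper, and the base URL passed in `main` (host and port) is an assumption that must be replaced with the real master HTTP address.

```java
import java.net.URI;

// Hypothetical helper that assembles the /listDynamicConfigs request URI.
final class ListDynamicConfigsUrl {
  // tenant and name are optional and omitted when null.
  // Assumes all values are already URL-safe; no percent-encoding is applied.
  static URI build(String baseUrl, String level, String tenant, String name) {
    StringBuilder query = new StringBuilder("level=").append(level);
    if (tenant != null) {
      query.append("&tenant=").append(tenant);
    }
    if (name != null) {
      query.append("&name=").append(name);
    }
    return URI.create(baseUrl + "/listDynamicConfigs?" + query);
  }

  public static void main(String[] args) {
    // Assumed host and port, for illustration only.
    System.out.println(build("http://localhost:9098", "tenant_user", "tenantId1", "user1"));
  }
}
```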
### How was this patch tested?
- `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration`
Closes #2311 from SteNicholas/CELEBORN-1056.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Refer: [SPARK-28160](https://issues.apache.org/jira/browse/SPARK-28160) / https://github.com/apache/spark/pull/24964
`ByteBuffer.allocate` may throw `OutOfMemoryError` when the response is large but not enough memory is available. However, when this happens, `TransportClient.sendRpcSync` will just hang forever if the timeout is set to unlimited.
### Why are the changes needed?
To catch the exception of `ByteBuffer.allocate` in corner case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Quote the local test in https://github.com/apache/spark/pull/24964
```
I tested in my IDE by setting the value of size to -1 to verify the result. Without this patch, it won't be finished until timeout (May hang forever if timeout set to MAX_INT), or the expected IllegalArgumentException will be caught.
@Override
public void onSuccess(ByteBuffer response) {
  try {
    int size = response.remaining();
    ByteBuffer copy = ByteBuffer.allocate(size); // set size to -1 in runtime when debug
    copy.put(response);
    // flip "copy" to make it readable
    copy.flip();
    result.set(copy);
  } catch (Throwable t) {
    result.setException(t);
  }
}
```
Closes #2316 from turboFei/fix_transport_client_onsucess.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: chenfu <chenfu@xiaohongshu.com>
### What changes were proposed in this pull request?
Move checkQuotaSpaceAvailable from Quota to QuotaManager
### Why are the changes needed?
Put method in correct place
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #2317 from AngersZhuuuu/CELEBORN-1276.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
This PR does 3 things:
1. Remove the unnecessary conf `QUOTA_MANAGER`, since quota management is implemented with `ConfigService`, which already has a conf to indicate the implementation.
2. Move the quota manager to the Master side, since only the Master uses it.
3. Support using `FsConfigService` in the quota manager and support a default system level.
### Why are the changes needed?
1. Users who do not have a quota configured should often still get a default quota that applies to them.
2. The quota manager should support refresh.
3. `QuotaManager` should support integration with `ConfigService`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #2298 from AngersZhuuuu/CELEBORN-1239.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
This PR is a follow-up to CELEBORN-1241 / https://github.com/apache/incubator-celeborn/pull/2246
For `SingleMasterMetaManager`, the given `CelebornRackResolver` is not used and a new one is created in the constructor.
Each `CelebornRackResolver` has its own `master-rack-resolver-refresher` thread pool, so there is also a duplicated thread pool issue.
### Why are the changes needed?
Fix duplicated `CelebornRackResolver` issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not needed.
Closes #2315 from turboFei/rack_resolver.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Fix some typos.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not needed.
Closes #2314 from turboFei/fix_typo.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename `celeborn.worker.sortPartition.reservedMemory.enabled` to `celeborn.worker.sortPartition.prefetch.enabled`. Address [r1469066327](https://github.com/apache/incubator-celeborn/pull/2264/files#r1469066327) of pan3793.
### Why are the changes needed?
`celeborn.worker.sortPartition.reservedMemory.enabled` is misleading: the option controls whether to prefetch the original partition files during the first sequential read, leveraging the Linux PageCache mechanism to speed up subsequent random reads of them. The name `celeborn.worker.sortPartition.prefetch.enabled` is more accurate.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2312 from SteNicholas/CELEBORN-1254.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Just fix minor typo:
```
ls **/scala/**/*.java
common/src/main/scala/org/apache/celeborn/common/meta/WorkerEventInfo.java common/src/main/scala/org/apache/celeborn/common/meta/WorkerStatus.java
```
After this:
```
ls **/scala/**/*.java
zsh: no matches found: **/scala/**/*.java
ls **/java/**/*.scala
zsh: no matches found: **/java/**/*.scala
```
### Why are the changes needed?
Fix code format.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #2310 from turboFei/scala_java.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Add licenses of web module.
2. Rat excludes `node_modules`.
### Why are the changes needed?
Licenses of frontend files in web module should be added.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local test.
Closes #2306 from tiny-dust/CELEBORN-1249.
Lead-authored-by: tiny-dust <idioticzhou@foxmail.com>
Co-authored-by: 周顺顺 <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Remove an unused class.
### Why are the changes needed?
#2289
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
Closes #2301 from kerwin-zk/issue-1265.
Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Since we support `ConfigService`, many configurations can be dynamic. This PR adds an `isDynamic` property for `CelebornConf`.
### Why are the changes needed?
Make the configuration doc clearer.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #2308 from AngersZhuuuu/CELEBORN-1051.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Improve tenant user level dynamic configuration interface of `ConfigService` including:
- Renames `getRawTenantUserConfig` to `getRawTenantUserConfigFromCache`.
- Renames `getTenantUserConfig` to `getTenantUserConfigFromCache`.
### Why are the changes needed?
The naming of tenant user level dynamic configuration interface of `ConfigService` needs to be consistent with other interfaces which names with `FromCache`.
### Does this PR introduce _any_ user-facing change?
- Renames `getRawTenantUserConfig` to `getRawTenantUserConfigFromCache`.
- Renames `getTenantUserConfig` to `getTenantUserConfigFromCache`.
### How was this patch tested?
- `ConfigServiceSuiteJ`
Closes #2307 from SteNicholas/CELEBORN-1264.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
https://github.com/apache/incubator-celeborn/pull/2145 and https://github.com/apache/incubator-celeborn/pull/2162 changed the behavior that retried commit files should use the same epoch. This PR reverts that behavior.
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passes UTs.
Closes #2299 from waitinfuture/1272.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve the implementation of `ConfigService` including:
- Removes `celeborn.dynamicConfig.enabled`.
- Changes `celeborn.dynamicConfig.store.backend` to optional.
- Renames `refreshAllCache` to `refreshCache` in `ConfigService`.
- Checks whether the dynamic config file exists and is a file in `FsConfigServiceImpl`.
### Why are the changes needed?
Whether dynamic config is enabled can be determined by whether `celeborn.dynamicConfig.store.backend` is provided, instead of by `celeborn.dynamicConfig.enabled`. The `refreshAllCache` interface can be renamed to `refreshCache` and simply throw an exception. Meanwhile, `FsConfigServiceImpl` should check whether the dynamic config file exists and is a file.
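Both checks can be sketched in a few lines. This is a hypothetical illustration: `DynamicConfigFileCheck`, its method names, and the use of `IllegalArgumentException` are assumptions, not the behavior of `FsConfigServiceImpl`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper mirroring the two checks described above.
final class DynamicConfigFileCheck {
  // Dynamic config counts as enabled exactly when a store backend is configured.
  static boolean isDynamicConfigEnabled(Optional<String> storeBackend) {
    return storeBackend.isPresent();
  }

  // Fails fast when the path is missing or is not a regular file (e.g. a directory).
  static Path requireRegularFile(Path configPath) {
    if (!Files.exists(configPath)) {
      throw new IllegalArgumentException("Dynamic config file does not exist: " + configPath);
    }
    if (!Files.isRegularFile(configPath)) {
      throw new IllegalArgumentException("Dynamic config path is not a file: " + configPath);
    }
    return configPath;
  }

  public static void main(String[] args) {
    System.out.println(isDynamicConfigEnabled(Optional.empty()));
    try {
      requireRegularFile(Paths.get(System.getProperty("java.io.tmpdir")));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```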
### Does this PR introduce _any_ user-facing change?
- Renames `refreshAllCache` to `refreshCache` in `ConfigService`.
### How was this patch tested?
- `ConfigServiceSuiteJ`
Closes #2304 from SteNicholas/CELEBORN-1052.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Fix dependencies failure.
2. Upload yarn cluster directory if MR UT fails.
### Why are the changes needed?
To fix MR UT failures.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA.
Closes #2302 from FMX/B1225-1.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve log of current failed workers for `WorkerStatusTracker#recordWorkerFailure` and `WorkerStatusTracker#handleHeartbeatResponse`.
### Why are the changes needed?
It's recommended to improve the log of current failed workers in `recordWorkerFailure` and `handleHeartbeatResponse` of `WorkerStatusTracker`. Meanwhile, the log level for current failed workers could be WARN.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2290 from SteNicholas/CELEBORN-1266.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Fix `HeartbeatFromApplicationResponse` not including manually excluded workers.
### Why are the changes needed?
`HeartbeatFromApplicationResponse` should include manually excluded workers, otherwise `WorkerStatusTracker` misses the manually excluded workers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local test and GA.
Closes #2297 from SteNicholas/CELEBORN-448.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
`ConfigService` supports user-level config.
### Why are the changes needed?
Support more config use cases; this can later be integrated with the quota manager.
### Does this PR introduce _any_ user-facing change?
With this PR, settings from the config service have three levels:
- User
- Tenant
- System
The user identifier is constructed from the username and tenantId.
If there is no specific setting for the username, the lookup falls back to the tenant-level setting; if the tenant-level setting is not set either, it falls back to the system setting.
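The fallback order can be sketched with three in-memory maps. `LayeredConfigLookup` is a hypothetical model written for illustration; it is not the `ConfigService` API, and the `tenantId + "/" + username` key is an assumed encoding of the user identifier.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory model of the three config levels.
final class LayeredConfigLookup {
  private final Map<String, String> system = new HashMap<>();
  // Keyed by tenantId.
  private final Map<String, Map<String, String>> tenant = new HashMap<>();
  // Keyed by tenantId + "/" + username (assumed encoding of the user identifier).
  private final Map<String, Map<String, String>> user = new HashMap<>();

  void setSystem(String key, String value) {
    system.put(key, value);
  }

  void setTenant(String tenantId, String key, String value) {
    tenant.computeIfAbsent(tenantId, id -> new HashMap<>()).put(key, value);
  }

  void setUser(String tenantId, String username, String key, String value) {
    user.computeIfAbsent(tenantId + "/" + username, id -> new HashMap<>()).put(key, value);
  }

  // Resolution order: user level, then tenant level, then system level.
  Optional<String> get(String tenantId, String username, String key) {
    String value =
        user.getOrDefault(tenantId + "/" + username, Collections.<String, String>emptyMap())
            .get(key);
    if (value == null) {
      value = tenant.getOrDefault(tenantId, Collections.<String, String>emptyMap()).get(key);
    }
    if (value == null) {
      value = system.get(key);
    }
    return Optional.ofNullable(value);
  }
}
```

A key set only at the system level is visible to every user, while a user-level value overrides both the tenant and system values for that user alone.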
### How was this patch tested?
Added UT
Closes #2285 from AngersZhuuuu/CELEBORN-1264.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
To close CELEBORN-1016, fix the issue when parsing an IPv6 host address.
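The difficulty with IPv6 is that the address itself contains colons, so splitting on the first or last `:` is not enough. Below is a minimal sketch of bracket-aware parsing; `HostPort` is a hypothetical class, stripping the brackets from the host is a choice made here, and none of this is claimed to be the parsing code changed by this PR.

```java
// Hypothetical parser for "host:port" and "[ipv6]:port" strings.
final class HostPort {
  final String host;
  final int port;

  HostPort(String host, int port) {
    this.host = host;
    this.port = port;
  }

  static HostPort parse(String address) {
    if (address.startsWith("[")) {
      // Bracketed IPv6 literal: the port follows "]:".
      int close = address.indexOf(']');
      if (close < 0 || close + 1 >= address.length() || address.charAt(close + 1) != ':') {
        throw new IllegalArgumentException("Malformed IPv6 address: " + address);
      }
      return new HostPort(
          address.substring(1, close), Integer.parseInt(address.substring(close + 2)));
    }
    int colon = address.lastIndexOf(':');
    // Exactly one colon is required here; an unbracketed IPv6 literal is ambiguous.
    if (colon < 0 || colon != address.indexOf(':')) {
      throw new IllegalArgumentException("Expected host:port or [ipv6]:port: " + address);
    }
    return new HostPort(
        address.substring(0, colon), Integer.parseInt(address.substring(colon + 1)));
  }

  public static void main(String[] args) {
    HostPort parsed = parse("[::1]:9097");
    System.out.println(parsed.host + " " + parsed.port);
  }
}
```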
### Why are the changes needed?
Fix CELEBORN-1016
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
Closes #2293 from turboFei/CELEBORN-1016_ipv6.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Log info when using columnar hash shuffle writer.
### Why are the changes needed?
To close CELEBORN-1078
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #2294 from turboFei/CELEBORN_1078_columnar_shuffle.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
In this PR, when getting device disk info, we check that each directory is writable, to make sure that the capacity reported to the Celeborn master is correct and does not include non-writable directories.
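A rough sketch of the filtering step, under two stated assumptions: `File.canWrite()` is used as the writability probe (the worker may well perform a stricter check, such as actually creating a file), and each directory sits on its own device so that summing usable space is meaningful. `WritableDirs` is a hypothetical class.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that ignores non-writable directories when reporting capacity.
final class WritableDirs {
  // Keeps only existing directories that the current process may write to.
  static List<File> writableOnly(List<File> dirs) {
    List<File> result = new ArrayList<>();
    for (File dir : dirs) {
      if (dir.isDirectory() && dir.canWrite()) {
        result.add(dir);
      }
    }
    return result;
  }

  // Sums usable space over writable directories only (assumes one directory per device).
  static long reportableCapacity(List<File> dirs) {
    long total = 0L;
    for (File dir : writableOnly(dirs)) {
      total += dir.getUsableSpace();
    }
    return total;
  }

  public static void main(String[] args) {
    File tmp = new File(System.getProperty("java.io.tmpdir"));
    File missing = new File(tmp, "no-such-dir-for-illustration");
    System.out.println(writableOnly(Arrays.asList(tmp, missing)));
  }
}
```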
### Why are the changes needed?
To ignore bad disk when initializing the worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #2233 from turboFei/check_disk_init.
Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
In some scenarios, if Celeborn cannot be used, users want an error to be reported directly instead of falling back.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI
Closes #2291 from kerwin-zk/add-config.
Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix batches read metric for gluten columnar shuffle
### Why are the changes needed?

Due to the fix in [Gluten-4025](https://github.com/oap-project/gluten/pull/4051) for the records read metric issue, the read metric of CelebornShuffleReader does not need additional processing, otherwise the batches read metric will have the issue shown in the graph.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI
Closes #2289 from kerwin-zk/batches-read.
Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
This adds a secured port to Celeborn Master which is used for secure communication with LifecycleManager.
This is part of adding authentication support in Celeborn (see CELEBORN-1011).
This change targets just adding the secured port to Master. The following items from the proposal are still pending:
1. Persisting the app secrets in Ratis.
2. Forwarding secrets to Workers and having ability for the workers to pull registration info from the Master.
3. Secured and internal port in Workers.
4. Secured communication between workers and clients.
In addition, since we are supporting both secured and unsecured communication for backward compatibility and seamless rolling upgrades, there is an additional change needed. An app which registers with the Master can try to talk to the workers on unsecured ports which is a security breach. So, the workers need to know whether an app registered with Master or not and for that Master has to propagate list of un-secured apps to Celeborn workers as well. We can discuss this more with https://issues.apache.org/jira/browse/CELEBORN-1261
### Why are the changes needed?
It is needed for adding authentication support to Celeborn (CELEBORN-1011)
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Added a simple UT.
Closes #2281 from otterc/CELEBORN-1257.
Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix `Worker#computeResourceConsumption` `NullPointerException` for `userResourceConsumption` that does not contain given `userIdentifier`.
### Why are the changes needed?
When `userResourceConsumption` of `workerInfo` does not contain given `userIdentifier`, `Worker#computeResourceConsumption` causes `NullPointerException` for worker dimension resource consumption metrics.
```
24/02/05 17:36:15,983 ERROR [worker-forward-message-scheduler] Utils: Uncaught exception in thread worker-forward-message-scheduler
java.lang.NullPointerException
at org.apache.celeborn.service.deploy.worker.Worker.$anonfun$gaugeResourceConsumption$1(Worker.scala:555)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at org.apache.celeborn.common.metrics.source.GaugeSupplier$$anon$3.getValue(AbstractSource.scala:453)
at org.apache.celeborn.common.metrics.source.AbstractSource.addGauge(AbstractSource.scala:79)
at org.apache.celeborn.common.metrics.source.AbstractSource.addGauge(AbstractSource.scala:99)
at org.apache.celeborn.service.deploy.worker.Worker.gaugeResourceConsumption(Worker.scala:554)
at org.apache.celeborn.service.deploy.worker.Worker.$anonfun$handleResourceConsumption$1(Worker.scala:537)
at org.apache.celeborn.service.deploy.worker.Worker.$anonfun$handleResourceConsumption$1$adapted(Worker.scala:536)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:128)
at org.apache.celeborn.service.deploy.worker.Worker.handleResourceConsumption(Worker.scala:536)
at org.apache.celeborn.service.deploy.worker.Worker.org$apache$celeborn$service$deploy$worker$Worker$$heartbeatToMaster(Worker.scala:362)
at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.$anonfun$run$1(Worker.scala:395)
at org.apache.celeborn.common.util.Utils$.tryLogNonFatalError(Utils.scala:230)
at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.run(Worker.scala:395)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
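The failure mode above is the usual one for map lookups: `get` returns `null` for an absent key, and the result is dereferenced without a check. The following is a minimal Java sketch of that pattern and one possible guard. The names `UserUsage`, `diskBytesWritten`, and `MissingUserGaugeSketch` are hypothetical stand-ins for illustration, not Celeborn's actual classes, and the real fix may differ.

```java
import java.util.Map;

// Hypothetical stand-in for a per-user consumption record; not Celeborn's actual class.
final class UserUsage {
    final long diskBytesWritten;

    UserUsage(long diskBytesWritten) {
        this.diskBytesWritten = diskBytesWritten;
    }
}

final class MissingUserGaugeSketch {
    // Shared "no usage" value returned when a user has no entry yet.
    static final UserUsage EMPTY = new UserUsage(0L);

    // Unsafe: Map.get returns null for an absent key, so the field access throws NPE.
    static long unsafeGauge(Map<String, UserUsage> usageByUser, String user) {
        return usageByUser.get(user).diskBytesWritten;
    }

    // Safe: fall back to EMPTY so the gauge reports 0 instead of throwing.
    static long safeGauge(Map<String, UserUsage> usageByUser, String user) {
        return usageByUser.getOrDefault(user, EMPTY).diskBytesWritten;
    }
}
```

Reporting zero for an unknown user keeps the scheduled gauge evaluation from failing. Whether zero is the right value to expose is a design choice that depends on how the metric is consumed.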
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA and cluster.
Closes #2288 from SteNicholas/CELEBORN-1252.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Delete redundant remove operations and handle timeout requests in the final check.
### Why are the changes needed?
Delete redundant remove operations and handle timeout requests in the final check.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
PASS GA
Closes #2251 from jiaoqingbo/CELEBORN-1244.
Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Support a database-based store backend implementation for dynamic configuration management.
### Why are the changes needed?
Currently Celeborn provides the `FsConfigServiceImpl` implementation of the dynamic config service, which is based on the file system. We could also support a database-based store backend implementation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- `ConfigServiceSuiteJ#testDbConfig`
Closes #2273 from RexXiong/CELEBORN-1054.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the `NullPointerException` in `Worker#computeResourceConsumption` when `subResourceConsumptions` is null.
### Why are the changes needed?
When `subResourceConsumptions` is null, `Worker#computeResourceConsumption` throws a `NullPointerException` while computing the application dimension resource consumption metrics.
```
24/02/04 13:58:13,757 ERROR [worker-forward-message-scheduler] Utils: Uncaught exception in thread worker-forward-message-scheduler
java.lang.NullPointerException
at org.apache.celeborn.service.deploy.worker.Worker.computeResourceConsumption(Worker.scala:581)
at org.apache.celeborn.service.deploy.worker.Worker.$anonfun$gaugeResourceConsumption$1(Worker.scala:555)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at org.apache.celeborn.common.metrics.source.GaugeSupplier$$anon$3.getValue(AbstractSource.scala:453)
at org.apache.celeborn.common.metrics.source.AbstractSource.addGauge(AbstractSource.scala:79)
at org.apache.celeborn.common.metrics.source.AbstractSource.addGauge(AbstractSource.scala:99)
at org.apache.celeborn.service.deploy.worker.Worker.gaugeResourceConsumption(Worker.scala:554)
at org.apache.celeborn.service.deploy.worker.Worker.$anonfun$handleResourceConsumption$1(Worker.scala:537)
at org.apache.celeborn.service.deploy.worker.Worker.$anonfun$handleResourceConsumption$1$adapted(Worker.scala:536)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:128)
at org.apache.celeborn.service.deploy.worker.Worker.handleResourceConsumption(Worker.scala:536)
at org.apache.celeborn.service.deploy.worker.Worker.org$apache$celeborn$service$deploy$worker$Worker$$heartbeatToMaster(Worker.scala:362)
at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.$anonfun$run$1(Worker.scala:395)
at org.apache.celeborn.common.util.Utils$.tryLogNonFatalError(Utils.scala:230)
at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.run(Worker.scala:395)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
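A null nested collection can be handled by normalizing it to an empty one before reading from it. The sketch below illustrates that idea in Java under stated assumptions: `AppUsageSketch`, `usageForApp`, and the `Map<String, Long>` shape are hypothetical and only approximate the role of `subResourceConsumptions`. It is not the code from this PR.

```java
import java.util.Collections;
import java.util.Map;

final class AppUsageSketch {
    // perApp plays the role of a per-application breakdown that may be null
    // when no breakdown was reported (hypothetical shape, values assumed non-null).
    static long usageForApp(Map<String, Long> perApp, String appId) {
        // Normalize null to an empty map so the lookup below cannot throw.
        Map<String, Long> safe =
            (perApp == null) ? Collections.<String, Long>emptyMap() : perApp;
        // An application without an entry contributes zero.
        return safe.getOrDefault(appId, 0L);
    }
}
```

Normalizing at the point of use keeps the guard local. An alternative is to guarantee a non-null collection when the object is constructed, which removes the need for checks in every reader.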
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA and cluster.
Closes #2286 from SteNicholas/CELEBORN-1174.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the issue that Master cannot start up in HA mode.
### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
local tested
Closes #2283 from wxplovecc/fix-ha-internal-port.
Lead-authored-by: 吴祥平 <408317717@qq.com>
Co-authored-by: 吴祥平 <wxp4532@ly.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Bump Spark from 3.3.3 to 3.3.4. Meanwhile, bump the default `spark.version` from 3.3.2 to 3.3.4.
### Why are the changes needed?
Spark 3.3.4 has been released: [Spark 3.3.4 released](https://spark.apache.org/news/spark-3-3-4-released.html). The `spark-3.3` profile can therefore bump Spark from 3.3.3 to 3.3.4.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2284 from SteNicholas/CELEBORN-1262.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
`WorkerSource` supports application dimension `ActiveConnectionCount` metric to record the number of registered connections for each application.
### Why are the changes needed?
The `ActiveConnectionCount` metric currently records the total number of registered connections. It's recommended to support an application dimension `ActiveConnectionCount` metric that records the number of registered connections for each application in Worker, so that users can see the actual number of registered connections per application.
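One way to keep a per-application connection count is a concurrent map from application id to a counter, incremented on registration and decremented on unregistration. The Java sketch below shows that idea only. `AppConnectionCounterSketch` and its method names are hypothetical, and how `WorkerSource` actually tracks and exposes the metric is not shown here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class AppConnectionCounterSketch {
    // One counter per application id; LongAdder tolerates concurrent updates well.
    private final ConcurrentHashMap<String, LongAdder> activeByApp = new ConcurrentHashMap<>();

    // Called when a connection is registered for the given application.
    void onRegistered(String appId) {
        activeByApp.computeIfAbsent(appId, id -> new LongAdder()).increment();
    }

    // Called when a connection is unregistered; unknown applications are ignored.
    void onUnregistered(String appId) {
        LongAdder counter = activeByApp.get(appId);
        if (counter != null) {
            counter.decrement();
        }
    }

    // Current number of registered connections for the application, 0 if unknown.
    long activeConnections(String appId) {
        LongAdder counter = activeByApp.get(appId);
        return counter == null ? 0L : counter.sum();
    }
}
```

A gauge per application could then read `activeConnections(appId)` on each scrape. Entries for finished applications would need to be removed separately to keep the map bounded.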
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2167 from SteNicholas/CELEBORN-1182.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve `Spark Configuration` of `Deploy Spark client` in `deploy.md`.
Fix #2270.
### Why are the changes needed?
It's recommended to improve the `Spark Configuration` section of `Deploy Spark client` in the deployment document by adding the settings for Spark Dynamic Resource Allocation support.
```
# Support Spark Dynamic Resource Allocation
# Required Spark version >= 3.5.0
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
# Required Spark version >= 3.4.0, highly recommended to disable
spark.dynamicAllocation.shuffleTracking.enabled false
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2278 from SteNicholas/CELEBORN-1260.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
A tiny typo fix.
### Why are the changes needed?
We found internally that there is a non-existent profile name, "server", and verified that upstream has the same problem: https://github.com/apache/incubator-celeborn/actions/runs/7721711328/job/21048669317?pr=2273
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI
Closes #2277 from CodingCat/fix_typos.
Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce the `ThreadUtils#shutdown(executor)` method to lower the default `gracePeriod` of `ThreadUtils#shutdown`.
### Why are the changes needed?
The default `gracePeriod` of `ThreadUtils#shutdown` is currently 30 seconds, while most callers of `ThreadUtils#shutdown` pass a `gracePeriod` of 800 milliseconds. The default `gracePeriod` could therefore be changed to 800 milliseconds.
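For readers unfamiliar with the pattern, a shutdown helper with a grace period typically stops accepting new tasks, waits up to the grace period for running tasks to finish, and then forces termination. The Java sketch below shows a single-argument overload that delegates to an 800 millisecond default. `ShutdownSketch` and its signatures are assumptions for illustration, not the actual `ThreadUtils` code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class ShutdownSketch {
    // Default grace period used by the single-argument overload (value taken from the text above).
    static final long DEFAULT_GRACE_PERIOD_MS = 800L;

    // Convenience overload: callers that do not care about the period get the default.
    static boolean shutdown(ExecutorService executor) {
        return shutdown(executor, DEFAULT_GRACE_PERIOD_MS);
    }

    // Returns true if the executor terminated within the grace period.
    static boolean shutdown(ExecutorService executor, long gracePeriodMs) {
        executor.shutdown(); // stop accepting new tasks, let queued ones run
        try {
            if (executor.awaitTermination(gracePeriodMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
            executor.shutdownNow(); // grace period elapsed: interrupt running tasks
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
    }
}
```

With an overload like this, call sites that previously passed 800 milliseconds explicitly can drop the argument, and the longer wait is only paid where a caller asks for it.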
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2276 from SteNicholas/CELEBORN-1259.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce application dimension resource consumption metrics for `ResourceConsumptionSource`.
### Why are the changes needed?
`ResourceConsumption` namespace metrics are currently generated for each user and identified by a metric tag. It's recommended to introduce application dimension resource consumption metrics that expose the per-application resource consumption of Master and Worker. Monitoring resource consumption in the application dimension shows the actual resource consumption of each application.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `WorkerInfoSuite#WorkerInfo toString output`
- `PbSerDeUtilsTest#fromAndToPbResourceConsumption`
- `MasterStateMachineSuitej#testObjSerde`
Closes #2161 from SteNicholas/CELEBORN-1174.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>