### What changes were proposed in this pull request?
`DynamicConfigServiceFactory` now supports a singleton instance.
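A minimal sketch of the singleton idea, assuming a lazily initialized, thread-safe holder. `ConfigServiceHolder` and the `Supplier`-based construction are hypothetical names for illustration, not the actual `DynamicConfigServiceFactory` code.

```java
import java.util.function.Supplier;

/** Hypothetical illustration: caches the first instance that the supplier creates. */
final class ConfigServiceHolder<T> {
  private final Supplier<T> creator;
  private volatile T instance;

  ConfigServiceHolder(Supplier<T> creator) {
    this.creator = creator;
  }

  /** Returns the shared instance, creating it at most once (double-checked locking). */
  T get() {
    T local = instance;
    if (local == null) {
      synchronized (this) {
        local = instance;
        if (local == null) {
          local = creator.get();
          instance = local;
        }
      }
    }
    return local;
  }
}
```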
### Why are the changes needed?
Improve code.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #2635 from leixm/singleton.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Currently, there are three different Jackson versions in the server dependency list.
It is better to align them.
### Why are the changes needed?
To align the dependency versions and reduce conflicts in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GA.
Closes #2620 from turboFei/align_jackson.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
If authentication is enabled, check admin privileges for mutative HTTP requests, such as:
```
POST /api/v1/workers/exclude
POST /api/v1/workers/events
POST /api/v1/workers/exit
```
### Why are the changes needed?
To meet security requirements.
### Does this PR introduce _any_ user-facing change?
Yes. After this PR, if HTTP authentication is enabled, admin privileges are checked for all mutative HTTP requests.
Before this PR, if an API is not defined and the method is `POST/PUT/DELETE/PATCH`, the response status code is `404`.
After this PR, if the admin privileges check fails, the response status code is `403`.
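The behavior described above can be sketched as a check that runs before routing, which is why an undefined mutative API can now answer `403` instead of `404`. This is only an illustration: `AdminCheck`, `preCheck`, and the `0` return value meaning "continue to routing" are hypothetical and not the actual Celeborn filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of an admin check that runs before request routing. */
final class AdminCheck {
  private static final Set<String> MUTATIVE_METHODS =
      new HashSet<>(Arrays.asList("POST", "PUT", "DELETE", "PATCH"));

  /**
   * Returns 403 when authentication is enabled, the method is mutative, and the
   * caller is not an admin. Returns 0 otherwise, meaning routing may continue.
   */
  static int preCheck(boolean authEnabled, String method, boolean isAdmin) {
    boolean mutative = MUTATIVE_METHODS.contains(method.toUpperCase(Locale.ROOT));
    if (authEnabled && mutative && !isAdmin) {
      return 403;
    }
    return 0;
  }
}
```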
### How was this patch tested?
UT.
Closes #2601 from turboFei/admin_check.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR is for [CIP-9 Refine the celeborn RESTful APIs](https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing).
We leverage [openapi-generator](https://github.com/OpenAPITools/openapi-generator) to generate the client and model code.
### Why are the changes needed?
Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
I also saw the need for REST API evolution in the [Apache Celeborn CLI Proposal](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).
### Does this PR introduce _any_ user-facing change?
This PR introduces a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.
### How was this patch tested?
UT.
Closes #2599 from turboFei/cip_9_openapi.
Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
I am implementing the plugin for Bearer token authentication, and I found that, at eBay, the tokens are bound to the client IP.
So I also need to pass the client IP for token validation, and I wonder whether this is a general case.
This PR is a follow-up to HTTP password/token authentication and extends the current interface API.
### Why are the changes needed?
To extend the token authentication use case for when more properties associated with the token are needed.
### Does this PR introduce _any_ user-facing change?
No, this interface `TokenAuthenticationProvider` has not been released.
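As a rough illustration of why extra properties matter for IP-bound tokens, the sketch below validates a token together with the client IP carried alongside it. All names here (`SketchTokenCredential`, `IpBoundTokenValidator`, the `clientIp` key) are hypothetical and do not reflect the real `TokenAuthenticationProvider` signature.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical credential: the token plus request properties such as the client IP. */
final class SketchTokenCredential {
  final String token;
  final Map<String, String> properties;

  SketchTokenCredential(String token, Map<String, String> properties) {
    this.token = token;
    this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
  }
}

/** Hypothetical validator for tokens that are bound to a client IP. */
final class IpBoundTokenValidator {
  static final String CLIENT_IP_KEY = "clientIp"; // assumed property name

  /** Accepts the credential only if the token is known and bound to the same IP. */
  static boolean isValid(SketchTokenCredential credential, Map<String, String> tokenToBoundIp) {
    String boundIp = tokenToBoundIp.get(credential.token);
    return boundIp != null && boundIp.equals(credential.properties.get(CLIENT_IP_KEY));
  }
}
```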
### How was this patch tested?
Not needed.
Closes #2604 from turboFei/auth_properties.
Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Fix authentication support for Apache Flink.
### Why are the changes needed?
Without these changes, Apache Flink applications fail when the Celeborn cluster has authentication enabled.
### Does this PR introduce _any_ user-facing change?
Fixes authentication support for Apache Flink integration.
### How was this patch tested?
This is a forward port and adaptation of changes we made internally (against 0.4) when testing Apache Flink applications against a Celeborn cluster with authentication (and TLS) enabled.
The integration test has been updated to additionally test Flink applications with authentication enabled in the Celeborn cluster.
Closes #2596 from mridulm/fix-flink-auth-support.
Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Partial backport of https://github.com/apache/kyuubi/pull/2634.
It aims to enhance the error message when an exception is thrown in a RESTful API method.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes #2587 from turboFei/RestExceptionMapper.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
`CelebornScalaObjectMapper` supports configuring `FAIL_ON_UNKNOWN_PROPERTIES` to false.
### Why are the changes needed?
`CelebornScalaObjectMapper` fails on unknown properties on the Celeborn server side. Therefore, `CelebornScalaObjectMapper` should support configuring `FAIL_ON_UNKNOWN_PROPERTIES` to false, so that the Celeborn Master/Worker does not fail on unknown properties.
Backport: https://github.com/apache/kyuubi/pull/4691.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2582 from SteNicholas/CELEBORN-1471.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support HTTP authentication for the Celeborn master and worker.
### Why are the changes needed?
Authentication is needed for the Celeborn admin APIs.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces authentication-related config items, but it does not break the current behavior.
### How was this patch tested?
Added UT for BASIC and Bearer authentication.
Closes #2440 from turboFei/http_auth.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Introduce worker decommission metrics and corresponding REST API.
### Why are the changes needed?
In a production environment, our script automatically decommissions a node when certain hardware or environmental problems occur. In that case, we need to distinguish between gracefully shut down nodes and decommissioned nodes.
Separate metrics for shutdown workers and decommissioned workers make operation and maintenance easier.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`
Closes #2535 from leixm/issue_1444.
Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve the parameters, descriptions, and documentation of the Celeborn REST API, including:
1. POST requests use `FormParam` instead of `QueryParam`.
2. Parameter names use lowercase instead of uppercase.
3. The description of `/exclude` aligns with the documentation in `monitoring.md`.
4. The `REST API` documentation adds `Method` and `Parameters` to document the GET/POST method and the corresponding interface.
### Why are the changes needed?
The parameters, descriptions, and documentation of the REST API need improvement after the HTTP server refinement.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes #2495 from SteNicholas/CELEBORN-1317.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `celeborn.master.http.idleTimeout` and `celeborn.worker.http.idleTimeout` to support idle timeout configuration of Jetty for `HttpServer`.
### Why are the changes needed?
`ServerConnector` supports HTTP idle timeout configuration via `jetty.http.idleTimeout`, whose default value is 30000ms, i.e. `jetty.http.idleTimeout=30000`. `HttpServer` should also support idle timeout configuration of Jetty. The idle timeout appears in the log as follows:
```
2024-04-12 16:04:00,926 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(161) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=29999/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout check, elapsed: 29999 ms, remaining: 1 ms
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(161) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout check, elapsed: 30001 ms, remaining: -1 ms
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(168) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout expired
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.FillInterest -FillInterest.java(136) -onFail FillInterest6cc48840{AC.ReadCB2f88da0c{HttpConnection2f88da0c::SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0}}}
java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_162]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.http.HttpParser -HttpParser.java(1883) -close HttpParser{s=START,0 of -1}
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.http.HttpParser -HttpParser.java(1912) -START --> CLOSE
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2455 from SteNicholas/CELEBORN-1385.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Log celeborn config for debugging purposes.
### Why are the changes needed?
Help with debugging.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tested the patch internally.
Closes #2442 from akpatnam25/CELEBORN-1368.
Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`ServerConnector` supports `celeborn.master.http.stopTimeout` and `celeborn.worker.http.stopTimeout`.
### Why are the changes needed?
Jetty `Server` supports `celeborn.master.http.stopTimeout` and `celeborn.worker.http.stopTimeout`, but `ServerConnector` does not, and its default stop timeout is 5min.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local test.
Closes #2437 from SteNicholas/CELEBORN-1317.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove `Incubating` from REST API Documentation.
### Why are the changes needed?
The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. The REST API Documentation should remove `Incubating`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2425 from SteNicholas/CELEBORN-1317.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the UT failure caused by the HTTP server port already being in use.
For the Jetty `HttpServer`, when binding the port fails, the exception is an `IOException` whose cause is a `BindException`, so we should retry in that case.
Before:
```
case e: BindException => // retry to setup mini cluster
```
Now:
```
case e: IOException
    if e.isInstanceOf[BindException] || Option(e.getCause).exists(
      _.isInstanceOf[BindException]) => // retry to setup mini cluster
```
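For readers more familiar with Java, the retry condition can be sketched as the predicate below. `BindFailure` is a hypothetical helper name used only for this illustration; it mirrors the guard in the Scala `case` clause and is not part of the test code itself.

```java
import java.io.IOException;
import java.net.BindException;

/** Sketch of the retry condition: a bind failure, either direct or wrapped as the cause. */
final class BindFailure {
  static boolean isBindFailure(Throwable t) {
    // Only IOExceptions qualify, matching `case e: IOException`.
    if (!(t instanceof IOException)) {
      return false;
    }
    // BindException is itself an IOException, so both the direct and wrapped cases are covered.
    return t instanceof BindException || t.getCause() instanceof BindException;
  }
}
```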
### Why are the changes needed?
To fix the UT failure caused by the HTTP server port already being in use.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Will trigger GA three times.
Closes #2424 from turboFei/set_connector_stop_timeout.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Before, there was no HTTP request spec such as query params, HTTP method, and response media type.
And each API needed its own HttpEndpoint class.
In this PR, we refine the code for the HTTP service and provide a Swagger UI.
Note: this PR does not change the original API request and response behavior, including the metrics APIs.
TODO:
1. define DTO
2. http request authentication
<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">
<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">
### Why are the changes needed?
To close CELEBORN-1317
### Does this PR introduce _any_ user-facing change?
The API stays aligned with the previous behavior.
### How was this patch tested?
UT.
Closes #2371 from turboFei/jetty.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Change DB script column from user to name
### Why are the changes needed?
Change DB script column from user to name
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2340 from AngersZhuuuu/CELEBORN-1297.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Introduce `celeborn.dynamicConfig.store.fs.path` config to configure the path of dynamic config file for fs store backend.
### Why are the changes needed?
`FsConfigServiceImpl` uses `celeborn.quota.configuration.path` to configure the path of dynamic config file for fs store backend at present. The path of dynamic config file should be introduced with `celeborn.dynamicConfig.store.fs.path` instead of quota configuration path.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2337 from SteNicholas/CELEBORN-1296.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add a `tenantConfig.getConfigs().isEmpty()` check in `getTenantUserConfigFromCache`.
### Why are the changes needed?
Add a `tenantConfig.getConfigs().isEmpty()` check in `getTenantUserConfigFromCache`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #2324 from jiaoqingbo/1285.
Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce Rest API of listing dynamic configuration `/listDynamicConfigs` to list the dynamic configs. The result of `/listDynamicConfigs` is as follows:
```
=========================== Dynamic Configuration ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 100000
celeborn.worker.flusher.buffer.size 64k
=========================== SYSTEM ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 200000
celeborn.worker.flusher.buffer.size 128k
=========================== TENANT ============================
=========================== Tenant: tenantId1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 300000
celeborn.worker.flusher.buffer.size 256k
=========================== TENANT_USER ============================
=========================== Tenant: tenantId1, Name: user1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold 400000
celeborn.worker.flusher.buffer.size 512k
```
### Why are the changes needed?
Celeborn supports dynamic configuration with `ConfigService` at present. It's recommended to introduce a REST API for dynamic configuration management.
### Does this PR introduce _any_ user-facing change?
- Introduce Rest API of listing dynamic configuration: `/listDynamicConfigs?level=[system|tenant|tenant_user]&tenant=tenantId1&name=user1`.
### How was this patch tested?
- `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration`
Closes #2311 from SteNicholas/CELEBORN-1056.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
This PR does 3 things:
1. Remove the unnecessary conf QUOTA_MANAGER, since the quota manager is implemented with ConfigService and ConfigService already has a conf that indicates the implementation.
2. Move the quota manager to the Master side, since only the master uses it.
3. Support the quota manager using FsConfigService and support a default at the system level.
### Why are the changes needed?
1. Often, for users who do not have a quota configured, we want a default quota that applies to them.
2. The quota manager should support refresh.
3. The quota manager should support integration with ConfigService.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT.
Closes #2298 from AngersZhuuuu/CELEBORN-1239.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Improve tenant user level dynamic configuration interface of `ConfigService` including:
- Renames `getRawTenantUserConfig` to `getRawTenantUserConfigFromCache`.
- Renames `getTenantUserConfig` to `getTenantUserConfigFromCache`.
### Why are the changes needed?
The naming of the tenant user level dynamic configuration interface of `ConfigService` needs to be consistent with the other interfaces, whose names end with `FromCache`.
### Does this PR introduce _any_ user-facing change?
- Renames `getRawTenantUserConfig` to `getRawTenantUserConfigFromCache`.
- Renames `getTenantUserConfig` to `getTenantUserConfigFromCache`.
### How was this patch tested?
- `ConfigServiceSuiteJ`
Closes #2307 from SteNicholas/CELEBORN-1264.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Improve the implementation of `ConfigService` including:
- Removes `celeborn.dynamicConfig.enabled`.
- Changes `celeborn.dynamicConfig.store.backend` to optional.
- Renames `refreshAllCache` to `refreshCache` in `ConfigService`.
- Checks whether the dynamic config file exists and is a file in `FsConfigServiceImpl`.
### Why are the changes needed?
Whether dynamic config is enabled can be determined by whether `celeborn.dynamicConfig.store.backend` is provided, instead of by `celeborn.dynamicConfig.enabled`. The `refreshAllCache` interface can be renamed to `refreshCache` and simply throw Exception. Meanwhile, `FsConfigServiceImpl` should check whether the dynamic config file exists and is a file.
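The file check mentioned above could look roughly like the following. This is a sketch with a hypothetical helper name (`ConfigFileCheck`), not the actual `FsConfigServiceImpl` code, and how a failed check is reported is left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: a dynamic config path is usable only if it exists and is a regular file. */
final class ConfigFileCheck {
  static boolean isUsableConfigFile(Path path) {
    // Files.isRegularFile already returns false for missing paths; the explicit
    // exists() call just mirrors the two conditions named in the description.
    return Files.exists(path) && Files.isRegularFile(path);
  }
}
```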
### Does this PR introduce _any_ user-facing change?
- Renames `refreshAllCache` to `refreshCache` in `ConfigService`.
### How was this patch tested?
- `ConfigServiceSuiteJ`
Closes #2304 from SteNicholas/CELEBORN-1052.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
`ConfigService` supports user level config.
### Why are the changes needed?
To support more config cases; this can later be integrated with the quota manager.
### Does this PR introduce _any_ user-facing change?
With this PR, a user's settings from the config service have three levels:
- User
- Tenant
- System
The user identifier is constructed from the username and tenantId.
If there is no specific setting for the username, it falls back to the tenant level setting; if the tenant level setting is also not set, it falls back to the system setting.
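The fallback order can be sketched with plain maps. `LevelFallback` and its `resolve` method are hypothetical names for illustration; the real `ConfigService` works with its own config objects rather than raw maps.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the user -> tenant -> system lookup order. */
final class LevelFallback {
  static Optional<String> resolve(
      String key,
      Map<String, String> userLevel,
      Map<String, String> tenantLevel,
      Map<String, String> systemLevel) {
    String value = userLevel.get(key);
    if (value == null) {
      value = tenantLevel.get(key); // no user setting: fall back to tenant
    }
    if (value == null) {
      value = systemLevel.get(key); // no tenant setting either: fall back to system
    }
    return Optional.ofNullable(value);
  }
}
```

An empty `Optional` means no level defines the key, in which case a caller would presumably use the static configuration value.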
### How was this patch tested?
Added UT
Closes #2285 from AngersZhuuuu/CELEBORN-1264.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Support a database based store backend implementation for dynamic configuration management.
### Why are the changes needed?
Currently, Celeborn provides the `FsConfigServiceImpl` implementation of the dynamic config service, which is based on the file system. We could support a database based store backend implementation as well.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- `ConfigServiceSuiteJ#testDbConfig`
Closes #2273 from RexXiong/CELEBORN-1054.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce a `ThreadUtils#shutdown(executor)` method to improve the default `gracePeriod` of `ThreadUtils#shutdown`.
### Why are the changes needed?
The default value of `gracePeriod` for `ThreadUtils#shutdown` is 30 seconds at present. Meanwhile, most callers of `ThreadUtils#shutdown` pass a `gracePeriod` of 800 milliseconds. Therefore, the default `gracePeriod` of `ThreadUtils#shutdown` could be changed to 800 milliseconds.
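A minimal Java sketch of a shutdown helper with an 800 millisecond default grace period follows. `ShutdownSketch` is a hypothetical stand-in; the actual `ThreadUtils#shutdown` is Scala code and its exact termination logic may differ.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a shutdown helper with a short default grace period. */
final class ShutdownSketch {
  static final long DEFAULT_GRACE_PERIOD_MS = 800;

  /** Overload that applies the default grace period. */
  static void shutdown(ExecutorService executor) {
    shutdown(executor, DEFAULT_GRACE_PERIOD_MS);
  }

  /** Stops accepting tasks, waits up to the grace period, then forces termination. */
  static void shutdown(ExecutorService executor, long gracePeriodMs) {
    executor.shutdown();
    try {
      if (!executor.awaitTermination(gracePeriodMs, TimeUnit.MILLISECONDS)) {
        executor.shutdownNow();
      }
    } catch (InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the interrupt status for callers
    }
  }
}
```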
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2276 from SteNicholas/CELEBORN-1259.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Support the Celeborn Master (Leader) managing workers by sending events with the heartbeat.
2. Add a worker status to the worker so that we can know the status of the workers (such as during decommission).
3. Add an HTTP interface for the master to handleWorkerEvent/getWorkerEvent.
### Why are the changes needed?
Currently, we only support managing the status of workers on the worker side. This PR supports the master managing the status of all workers. By sending events such as Decommission/Graceful/Exit with the heartbeat, workers can asynchronously execute the commands from the master. Meanwhile, we cannot know the worker's status during worker decommission, so this PR adds a worker status that tells the exact status of the worker.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #2255 from RexXiong/CELEBORN-1245.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Unify celeborn thread name format with the following pattern:
- client: `celeborn-client-[component]-[function]er`
- service: `[master|worker]-[component]-[function]er`
- other: `celeborn-[component]-[function]er`
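To illustrate the service-side pattern, here is a sketch of a thread factory that builds names from a role, a component, and a function. `NamedThreadFactory`, the numeric suffix, and the sample name `worker-disk-checker` are assumptions made for this example, not names taken from the Celeborn code base.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical factory producing names like "worker-disk-checker-0". */
final class NamedThreadFactory implements ThreadFactory {
  private final String prefix;
  private final AtomicInteger counter = new AtomicInteger();

  NamedThreadFactory(String role, String component, String function) {
    // role is "master" or "worker" for services, per the pattern above.
    this.prefix = role + "-" + component + "-" + function;
  }

  @Override
  public Thread newThread(Runnable task) {
    // The index suffix is an assumption to keep names unique within a pool.
    Thread thread = new Thread(task, prefix + "-" + counter.getAndIncrement());
    thread.setDaemon(true);
    return thread;
  }
}
```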
### Why are the changes needed?
It's recommended to unify the Celeborn thread name format, especially on the client side for applications.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2248 from AngersZhuuuu/CELEBORN-1242.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Format the timestamp when recording worker failure information.
### Why are the changes needed?
Currently the long-typed timestamp is hard to read and confusing without reading the source code.
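As an illustration of the difference, the sketch below turns an epoch-millisecond value into a readable string. The `yyyy-MM-dd HH:mm:ss` pattern and the `TimestampFormat` helper are assumptions; the PR description does not state which format is used.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch: render an epoch-millisecond timestamp in a readable form. */
final class TimestampFormat {
  // Assumed pattern, chosen only for the example.
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

  static String format(long epochMillis, ZoneId zone) {
    return FORMAT.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
  }
}
```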
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2230 from turboFei/date_format.
Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.
### Why are the changes needed?
The `RunningApplicationCount` metric only monitors the count of running applications in the cluster for the master. Meanwhile, the `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore, a `RunningApplicationCount` metric and an `/applications` API could be introduced to record the running applications of a worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2172 from SteNicholas/CELEBORN-1189.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Fix violations of the MissingOverride, DefaultCharset, and UnnecessaryParentheses rules.
2. Exclude generated sources, FutureReturnValueIgnored, TypeParameterUnusedInFormals, and UnusedVariable.
### Why are the changes needed?
```
./build/make-distribution.sh --release
```
We get a lot of WARNINGs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2177 from cxzl25/error_prone_patch.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
If the user does not use Prometheus to collect monitoring metrics but uses some other system, metrics in JSON format would be more user-friendly. This PR supports JSON format for metrics.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
Metrics support JSON format.
### How was this patch tested?
Cluster test.
Closes #2089 from suizhe007/CELEBORN-1122.
Authored-by: qinrui <qr7972@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
### What changes were proposed in this pull request?
Follow up #2100. Mainly moves the code from #2100 from the Scala package to the Java package. Meanwhile, `FsConfigServiceImpl#refresh` should directly return instead of refreshing configs.
### Why are the changes needed?
This PR follows up the dynamic `ConfigService` at `SystemLevel` and `TenantLevel` introduced in #2100; dynamic configuration is a type of configuration that can be changed at runtime as needed. The implementation of `ConfigService` is Java code that was put into a Scala package, so the spotless plugin does not format it well. In this pull request, many code style changes result from moving the package.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`ConfigServiceSuiteJ`.
Closes #2125 from SteNicholas/CELEBORN-1052.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
### What changes were proposed in this pull request?
This PR introduces a dynamic ConfigService at SystemLevel and TenantLevel. Dynamic configuration is a type of configuration that can be changed at runtime as needed. It can be used at system level/tenant level. When applying dynamic configuration, the priority order is as follows: tenant level overrides system level, which in turn overrides static configuration (CelebornConf). This means that if a configuration is defined at the tenant level, it will be used instead of the system level or static configuration (CelebornConf). If the tenant-level configuration is missing, the system-level configuration will be used. If the system-level configuration is also missing, CelebornConf will be used as the default value.
There are several other tasks related to this feature that will be implemented in the future.
- [ ] [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051)
- [ ] [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054)
- [ ] [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056)
### Why are the changes needed?
The current configuration of the server (CelebornConf) is static. When the configuration is changed, the service needs to be restarted. This PR introduces a dynamic configuration solution. The server side can use dynamic configuration as needed. At the same time, tenant-level configuration is expected to be supported in the future (such as tenant-level dynamic quota control), so this PR also supports dynamic tenant-level configuration and provides a default implementation based on the file system.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #2100 from RexXiong/CELEBORN-1052.
Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support excluding a worker manually by worker id. The worker is added to the excluded workers manually.
### Why are the changes needed?
At present, Celeborn supports excluding workers on the shuffle client side on fetch and push failures. It's necessary to exclude workers manually for maintaining the Celeborn cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`
Closes #1997 from SteNicholas/CELEBORN-448.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The `/conf` endpoint returns no response when the master is started with the default Celeborn conf.

**Internal Exception**
```
empty.max
java.lang.UnsupportedOperationException: empty.max
at scala.collection.TraversableOnce.max(TraversableOnce.scala:275)
at scala.collection.TraversableOnce.max$(TraversableOnce.scala:273)
at scala.collection.AbstractTraversable.max(Traversable.scala:108)
at org.apache.celeborn.server.common.HttpService.getConf(HttpService.scala:36)
at org.apache.celeborn.service.deploy.master.MasterSuite.$anonfun$new$1(MasterSuite.scala:46)
```
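The trace shows `max` being called on an empty collection inside `HttpService.getConf`. The actual fix is not shown in this description, so the following is only a sketch of the general guard, assuming the maximum is something like the longest key length used for formatting (an assumption). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of guarding a max computation against empty input.
class EmptyMaxSketch {
    // Length of the longest key, or 0 when there are no keys at all.
    // IntStream.max() returns an OptionalInt, so the empty case must be handled explicitly.
    static int maxKeyLength(List<String> keys) {
        return keys.stream().mapToInt(String::length).max().orElse(0);
    }

    public static void main(String[] args) {
        System.out.println(maxKeyLength(Collections.emptyList()));
        System.out.println(maxKeyLength(Arrays.asList("a.b", "a.b.c.d")));
    }
}
```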
### Why are the changes needed?
Bug.
### How was this patch tested?
Local
Closes #1995 from xleoken/patch5.
Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve the response message for an invalid HTTP request so that it lists the available API providers, as shown below:
- master
```
Invalid uri of the master. Available API providers include:
/applications List all running application's ids of the cluster.
/conf List the conf setting of the master.
/excludedWorkers List all excluded workers of the master.
/help List the available API providers of the master.
/hostnames List all running application's LifecycleManager's hostnames of the cluster.
/listTopDiskUsedApps List the top disk usage application ids. It will return the top disk usage application ids for the cluster.
/lostWorkers List all lost workers of the master.
/masterGroupInfo List master group information of the service. It will list all master's LEADER, FOLLOWER information.
/shuffles List all running shuffle keys of the service. It will return all running shuffle's key of the cluster.
/shutdownWorkers List all shutdown workers of the master.
/threadDump List the current thread dump of the master.
/workerInfo List worker information of the service. It will list all registered workers 's information.
```
- worker
```
Invalid uri of the worker. Available API providers include:
/conf List the conf setting of the worker.
/exit Trigger this worker to exit. Legal types are 'DECOMMISSION', 'GRACEFUL' and 'IMMEDIATELY'
/help List the available API providers of the worker.
/isRegistered Show if the worker is registered to the master success.
/isShutdown Show if the worker is during the process of shutdown.
/listPartitionLocationInfo List all the living PartitionLocation information in that worker.
/listTopDiskUsedApps List the top disk usage application ids. It only return application ids running in that worker.
/shuffles List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker.
/threadDump List the current thread dump of the worker.
/unavailablePeers List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/workerInfo List the worker information of the worker.
```
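A message of the shape shown above could be assembled roughly as follows. This is only a sketch: the class and method names are hypothetical, and the real implementation may pad the columns or source the descriptions differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: builds a help message from a path -> description map.
class ApiHelpSketch {
    static String invalidUriMessage(String role, Map<String, String> apis) {
        StringBuilder sb = new StringBuilder();
        sb.append("Invalid uri of the ").append(role)
          .append(". Available API providers include:\n");
        // A TreeMap iterates in sorted key order, which gives an alphabetical listing.
        for (Map.Entry<String, String> e : new TreeMap<>(apis).entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> apis = new HashMap<>();
        apis.put("/help", "List the available API providers of the master.");
        apis.put("/conf", "List the conf setting of the master.");
        System.out.print(invalidUriMessage("master", apis));
    }
}
```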
### Why are the changes needed?
The response message for an invalid HTTP request did not help users find the correct HTTP path.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`HttpUtilsSuite#CELEBORN-829: Improve response message of invalid HTTP request`
Closes #1986 from SteNicholas/CELEBORN-829.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```
### Why are the changes needed?
The ports bound by `celeborn.metrics.master.prometheus.port` and `celeborn.metrics.worker.prometheus.port` not only serve Prometheus metrics but also provide some useful API services.
https://celeborn.apache.org/docs/latest/monitoring/#rest-api
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1919 from cxzl25/CELEBORN-983.
Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
The way `HttpRequestHandler` handles HTTP requests could be improved by separating the handling into base, master and worker.
### Why are the changes needed?
Improve how `HttpRequestHandler` handles HTTP requests for the base, master and worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #1977 from SteNicholas/http-request-handler.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
### Why are the changes needed?
Make it easy for administrators to find the address of the http service bindings.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Before this PR:
```
23/09/28 17:10:50,465 INFO [main] HttpServer: master: HttpServer started on port 9983.
```
After this PR:
```
23/09/28 17:28:29,797 INFO [main] HttpServer: master: HttpServer started on clb-3 with port 9983.
```
Closes #1947 from cxzl25/CELEBORN-1014.
Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1768 from AngersZhuuuu/CELEBORN-847.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1759 from AngersZhuuuu/CELEBORN-832.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Pass the exit kind to each component, which then acts according to the kind:
- GRACEFUL_SHUTDOWN: behaves the same as `graceful == true` in the original code.
- Others: clean up the LevelDB file.
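The dispatch described in the list above might look roughly like the sketch below. Only `GRACEFUL_SHUTDOWN` is named in this description. `SOME_OTHER_KIND`, the component class, and the in-memory map standing in for the LevelDB file are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of passing an exit kind down to a component.
class ExitKindSketch {
    // Only GRACEFUL_SHUTDOWN appears in the PR text; the other constant is a placeholder.
    enum ExitKind { GRACEFUL_SHUTDOWN, SOME_OTHER_KIND }

    // Stand-in for a component that persists state so it can recover after a restart.
    static class RecoverableComponent {
        // In-memory stand-in for the persisted LevelDB file.
        final Map<String, String> persistedState = new HashMap<>();

        void close(ExitKind kind) {
            if (kind == ExitKind.GRACEFUL_SHUTDOWN) {
                // Keep the persisted state so that a restarted process can recover it.
                return;
            }
            // Any other exit kind: the state will not be reused, so clean it up.
            persistedState.clear();
        }
    }

    public static void main(String[] args) {
        RecoverableComponent component = new RecoverableComponent();
        component.persistedState.put("shuffle-1", "committed");
        component.close(ExitKind.GRACEFUL_SHUTDOWN);
        System.out.println(component.persistedState.size());
        component.close(ExitKind.SOME_OTHER_KIND);
        System.out.println(component.persistedState.size());
    }
}
```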
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1748 from AngersZhuuuu/CELEBORN-819.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1742 from AngersZhuuuu/CELEBORN-820.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.
This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.
Before this PR, the unit tests in the `client/common/master/service/worker` modules leaked ports:
```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 127.0.0.1:12345 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:41563 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:42905 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:44419 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:45025 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:44799 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:39053 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:39029 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:39475 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:40153 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:33051 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:33449 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:34073 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:35347 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:35971 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 0.0.0.0:36799 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 192.168.1.151:40775 0.0.0.0:* LISTEN 1130743/java
tcp 0 0 192.168.1.151:44457 0.0.0.0:* LISTEN 1130743/java
```
After this PR:
```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
```
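The idea of a `shutdown(graceful)` method that releases listening ports on both paths can be sketched as follows. The PR adds the method to a Scala trait; this Java version is only an illustration, and `PortHoldingService` is a hypothetical stand-in rather than a Celeborn class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

// Hypothetical sketch: a service that owns a listening port and releases it on shutdown.
class ShutdownSketch {
    interface Service {
        void shutdown(boolean graceful);
    }

    static class PortHoldingService implements Service {
        private final ServerSocket socket;

        PortHoldingService() {
            try {
                // Port 0 asks the operating system for any free port.
                socket = new ServerSocket(0);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        boolean isListening() {
            return !socket.isClosed();
        }

        @Override
        public void shutdown(boolean graceful) {
            // A real service would drain in-flight work first when graceful is true.
            // This sketch only shows that both paths must release the port.
            try {
                socket.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        PortHoldingService service = new PortHoldingService();
        System.out.println("listening before shutdown: " + service.isListening());
        service.shutdown(false);
        System.out.println("listening after shutdown: " + service.isListening());
    }
}
```

A test harness that starts such services would call `shutdown(false)` in its teardown so that no listener outlives the test.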
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1727 from cfmcgrady/shutdown.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Unify all blacklist-related code and comments.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
`SimpleDateFormat` is not thread-safe; replace it with the thread-safe `FastDateFormat`.
### Why are the changes needed?
`FastDateFormat` is a fast and thread-safe version of `java.text.SimpleDateFormat`.
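The pattern of sharing one formatter instance across threads can be sketched as below. Note that this sketch does not use `FastDateFormat`: to stay dependency-free it substitutes the JDK's `java.time.format.DateTimeFormatter`, a different formatter that is also immutable and thread-safe. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch using DateTimeFormatter as a stand-in for FastDateFormat (not what the PR uses).
class ThreadSafeFormatSketch {
    // DateTimeFormatter is immutable, so a single shared instance is safe across threads.
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    static String format(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    // Format the same timestamp from several threads and collect the distinct results.
    static Set<String> formatConcurrently(long epochMillis, int threads) {
        Set<String> results = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> results.add(format(epochMillis)));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // With a thread-safe formatter, every thread produces the same string.
        System.out.println(formatConcurrently(0L, 8));
    }
}
```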
### Does this PR introduce _any_ user-facing change?
Yes, it's a bug fix.
### How was this patch tested?
Manual review.
Closes #1545 from pan3793/CELEBORN-636.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>