Commit Graph

122 Commits

Author SHA1 Message Date
Xianming Lei
d5b124d8cd [CELEBORN-1516] DynamicConfigServiceFactory should support singleton
### What changes were proposed in this pull request?
DynamicConfigServiceFactory supports singleton.
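
As a rough illustration of what a singleton factory can look like, here is a minimal sketch using double-checked locking. The names `ConfigServiceHolder`, `ConfigService` and `getOrCreate` are hypothetical stand-ins, not the actual Celeborn classes or the way this patch implements it.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a singleton factory; not Celeborn's actual code.
final class ConfigServiceHolder {
    // Placeholder type for the service being created (assumption).
    interface ConfigService {}

    // volatile so that a fully constructed instance is visible to all threads.
    private static volatile ConfigService instance;

    private ConfigServiceHolder() {}

    // Creates the service on first use and returns the same instance afterwards.
    static ConfigService getOrCreate(Supplier<ConfigService> creator) {
        ConfigService local = instance;
        if (local == null) {
            synchronized (ConfigServiceHolder.class) {
                local = instance;
                if (local == null) {
                    local = creator.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Repeated calls to `getOrCreate` return the instance built by the first caller, so later suppliers are never invoked.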

### Why are the changes needed?
Improve code.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #2635 from leixm/singleton.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-19 16:15:04 +08:00
Wang, Fei
0b8c9fdd4c [CELEBORN-1505] Align the celeborn server jackson dependency versions
### What changes were proposed in this pull request?

Now there are three different jackson versions in the server dependency list.

It is better to align them.

### Why are the changes needed?
To align the dependency versions and reduce the conflicts in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Pass the GA.

Closes #2620 from turboFei/align_jackson.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-15 11:00:23 +08:00
Fei Wang
09d3a3b05f [CELEBORN-1493] Check admin privileges for http mutative requests
### What changes were proposed in this pull request?

If authentication is enabled, check admin privileges for mutative HTTP requests.

For example:

```
POST /api/v1/workers/exclude
POST /api/v1/workers/events
POST /api/v1/workers/exit
```

### Why are the changes needed?

To meet security requirements.

### Does this PR introduce _any_ user-facing change?
Yes. After this PR, if HTTP authentication is enabled, admin privileges are checked for all mutative HTTP requests.

Before this PR, if an API is not defined and the method is `POST/PUT/DELETE/PATCH`, the response status code is `404`.
After this PR, if the admin privileges check failed, the response status code will be `403`.
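
The rule described above can be sketched as a small predicate. This is only an illustration under assumptions: the class and method names are hypothetical, and how the real server determines admin status is not shown in this message.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the admin check for mutative HTTP requests.
final class AdminCheckSketch {
    // The mutative methods named in the PR description.
    private static final List<String> MUTATIVE_METHODS =
        Arrays.asList("POST", "PUT", "DELETE", "PATCH");

    private AdminCheckSketch() {}

    // Returns true when the request should be answered with 403:
    // authentication is enabled, the method is mutative, and the caller is not an admin.
    static boolean shouldReject(String method, boolean authEnabled, boolean isAdmin) {
        boolean mutative = MUTATIVE_METHODS.contains(method.toUpperCase(Locale.ROOT));
        return authEnabled && mutative && !isAdmin;
    }
}
```

Because the check runs for every mutative method, a non-admin caller gets `403` even for a path that is not defined, where `404` was returned before.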

### How was this patch tested?
UT.

Closes #2601 from turboFei/admin_check.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-12 11:51:35 +08:00
Fei Wang
d698a69edc
[CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
### What changes were proposed in this pull request?

This PR is for [CIP-9 Refine the celeborn RESTful APIs](https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing).

We leverage [openapi-generator](https://github.com/OpenAPITools/openapi-generator) to generate the client and model code.

### Why are the changes needed?

Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
I also saw the need for REST API evolution in the [Apache Celeborn CLI Proposal](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).

### Does this PR introduce _any_ user-facing change?
This PR introduces a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.

### How was this patch tested?
UT.

Closes #2599 from turboFei/cip_9_openapi.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-11 10:57:00 +08:00
Fei Wang
f6916317ec
[CELEBORN-1318][FOLLOWUP] Transfer extraInfo for http authentication providers
### What changes were proposed in this pull request?
While implementing the plugin for Bearer token authentication, I found that at eBay the tokens are bound to the client IP.

So the client IP also needs to be transferred for token validation; I suspect this is a general case.

This PR is a follow-up to HTTP password/token authentication and extends the current interface API.

### Why are the changes needed?
To extend the token authentication use case for situations where more properties associated with the token are needed.
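
To make the idea concrete, here is a minimal sketch of a token check that also looks at extra information carried with the credential. All names (`TokenCredential`, `IpBoundTokenProvider`, the `clientIp` key) are hypothetical and chosen for illustration; the actual `TokenAuthenticationProvider` signature may differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real Celeborn authentication interface.
final class TokenAuthSketch {
    private TokenAuthSketch() {}

    // A token plus extra properties transferred with the request (e.g. client IP).
    static final class TokenCredential {
        final String token;
        final Map<String, String> extraInfo;

        TokenCredential(String token, Map<String, String> extraInfo) {
            this.token = token;
            this.extraInfo = Collections.unmodifiableMap(new HashMap<>(extraInfo));
        }
    }

    // Accepts a token only when it is presented from the IP it is bound to.
    static final class IpBoundTokenProvider {
        private final Map<String, String> tokenToIp;

        IpBoundTokenProvider(Map<String, String> tokenToIp) {
            this.tokenToIp = new HashMap<>(tokenToIp);
        }

        boolean authenticate(TokenCredential credential) {
            String boundIp = tokenToIp.get(credential.token);
            // "clientIp" is an assumed key name for the transferred client address.
            return boundIp != null && boundIp.equals(credential.extraInfo.get("clientIp"));
        }
    }
}
```

Passing a map of extra properties rather than a single IP argument keeps the interface open for other token attributes later.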

### Does this PR introduce _any_ user-facing change?

No, this interface `TokenAuthenticationProvider` has not been released.

### How was this patch tested?

Not needed.

Closes #2604 from turboFei/auth_properties.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-06 00:50:37 +08:00
Mridul Muralidharan
c90a1647af [CELEBORN-1489] Update Flink support with authentication support
### What changes were proposed in this pull request?
Fix authentication support for Apache Flink.

### Why are the changes needed?
Without these changes, Apache Flink applications fail when Celeborn cluster has authentication enabled.

### Does this PR introduce _any_ user-facing change?

Fixes authentication support for Apache Flink integration

### How was this patch tested?

This is forward port + adaptation of changes we did internally (against 0.4) when testing Apache Flink applications against Celeborn cluster with authentication (and TLS) enabled.
Integration test has been updated to additionally test for Flink applications with authentication enabled in Celeborn cluster.

Closes #2596 from mridulm/fix-flink-auth-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-04 10:17:56 +08:00
Fei Wang
02efb0b4f5 [CELEBORN-1476] Enhance the RESTful response error msg
### What changes were proposed in this pull request?
Partial backport https://github.com/apache/kyuubi/pull/2634

It aims to enhance the error message when an exception is thrown in a RESTful API method.

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Closes #2587 from turboFei/RestExceptionMapper.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-06-24 03:24:33 +08:00
SteNicholas
9855426790
[CELEBORN-1471] CelebornScalaObjectMapper supports configuring FAIL_ON_UNKNOWN_PROPERTIES to false
### What changes were proposed in this pull request?

`CelebornScalaObjectMapper` supports configuring `FAIL_ON_UNKNOWN_PROPERTIES` to false.

### Why are the changes needed?

`CelebornScalaObjectMapper` fails on unknown properties on the Celeborn server side. Therefore, `CelebornScalaObjectMapper` should support configuring `FAIL_ON_UNKNOWN_PROPERTIES` to false, so that Celeborn Master/Worker do not fail on unknown properties.

Backport: https://github.com/apache/kyuubi/pull/4691.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2582 from SteNicholas/CELEBORN-1471.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-20 19:25:21 +08:00
Fei Wang
5cea9cc7f2
[CELEBORN-1318] Support celeborn http authentication
### What changes were proposed in this pull request?
Support celeborn master/worker http authentication.

### Why are the changes needed?
Authentication is needed for celeborn admin APIs.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces authentication-related config items, but it does not break the current behavior.

### How was this patch tested?

Added UT for BASIC and Bearer authentication.

Closes #2440 from turboFei/http_auth.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-06-20 10:35:12 +08:00
Xianming Lei
999510b265 [CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API
### What changes were proposed in this pull request?

Introduce worker decommission metrics and corresponding REST API.

### Why are the changes needed?

In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.

If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`

Closes #2535 from leixm/issue_1444.

Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-06-08 11:10:31 +08:00
Fei Wang
493e0f10cf [CELEBORN-1317][FOLLOWUP] Fix threadDump UT stuck issue
### What changes were proposed in this pull request?

Try to fix the ApiWorkerResourceSuite::threadDump UT stuck issue by getting the thread dump programmatically.

Related code copied from apache/spark
https://github.com/apache/spark/blob/v3.5.1/core/src/main/scala/org/apache/spark/util/Utils.scala
https://github.com/apache/spark/blob/v3.5.1/core/src/main/scala/org/apache/spark/status/api/v1/api.scala

### Why are the changes needed?
I found that the UT for the threadDump API sometimes gets stuck.
For example: https://github.com/apache/celeborn/actions/runs/8462056188/job/23182806487?pr=2428
<img width="1291" alt="image" src="https://github.com/apache/celeborn/assets/6757692/f39d7bb9-6e31-4ce3-a573-1ff86f335318">

<img width="762" alt="image" src="https://github.com/apache/celeborn/assets/6757692/437592dd-fc9c-404d-a452-834fcf630bd1">

The threadDump API UT was newly introduced in [CELEBORN-1317](https://issues.apache.org/jira/browse/CELEBORN-1317).

Previously there was no UT covering it, and now it sometimes gets stuck.

Previously, getThreadDump leveraged processBuilder to get the thread info.

I suspect the process gets stuck for some unknown reason, so this PR gets the thread info programmatically instead.
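
For readers unfamiliar with the programmatic approach, the JDK exposes thread information through `ThreadMXBean`, so no external process is needed. The sketch below is a minimal illustration of that technique with a hypothetical class name; it is not the code copied from Spark in this patch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper showing an in-process thread dump via the JDK management API.
final class ThreadDumpSketch {
    private ThreadDumpSketch() {}

    // Renders the name, id, state and stack trace of every live thread.
    static String threadDump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false/false: skip locked monitor and synchronizer details for portability.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info == null) {
                continue;
            }
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" id=").append(info.getThreadId())
              .append(' ').append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Since everything runs inside the JVM, there is no child process that can hang, which is the failure mode suspected above.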

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

UT.

![image](https://github.com/apache/celeborn/assets/6757692/51aaa44e-0523-4b60-b6c8-f4e83c709497)

Closes #2429 from turboFei/thread_dump.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-27 15:12:50 +08:00
SteNicholas
db163bd793 [CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API
### What changes were proposed in this pull request?

Improve parameters, description and document of Celeborn REST API, including:

1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.

### Why are the changes needed?

The parameters, descriptions and documentation of the REST API need improvement after the HTTP server refinement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2495 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-09 17:41:13 +08:00
SteNicholas
8c65ddd017 [CELEBORN-1390] ServletContextHandler should allow null path info to avoid redirection
### What changes were proposed in this pull request?

`ServletContextHandler` allows null path info to avoid redirection via `setAllowNullPathInfo(true)`.

### Why are the changes needed?

`ServletContextHandler` does not allow null path info, so `celeborn.metrics.prometheus.path` and `celeborn.metrics.json.path` cannot be accessed without redirection. For example:

```
celebornceleborn-test:/data/service/celeborn$ curl http://localhost:9096/metrics/prometheus
celebornceleborn-test:/data/service/celeborn$ curl http://localhost:9096/metrics/prometheus/
metrics_WriteDataHardSplitCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataWriteFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataConnectionExceptionCount_Count{role="Worker"} 0 1713182689795
metrics_FetchChunkFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataCreateConnectionFailCount_Count{role="Worker"} 0 1713182689795
metrics_WriteDataSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_FetchChunkSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataFailCount_Count{role="Worker"} 0 1713182689795
metrics_RegionStartFailCount_Count{role="Worker"} 0 1713182689795
metrics_RegionFinishFailCount_Count{role="Worker"} 0 1713182689795
metrics_ActiveConnectionCount_Count{role="Worker"} 0 1713182689795
metrics_SlotsAllocated_Count{role="Worker"} 0 1713182689795
metrics_OpenStreamSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_WriteDataFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataFailNonCriticalCauseCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataTimeoutCount_Count{role="Worker"} 0 1713182689795
metrics_PushDataHandshakeFailCount_Count{role="Worker"} 0 1713182689795
metrics_OpenStreamFailCount_Count{role="Worker"} 0 1713182689795
```

`ServletContextHandler` should allow null path info to avoid redirection via `setAllowNullPathInfo(true)`. When set to true, `/context` is not redirected to `/context/`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `ApiMasterResourceSuite`
- `ApiWorkerResourceSuite`

```
celebornceleborn-test:/data/service/celeborn$ curl http://localhost:9096/metrics/prometheus
metrics_WriteDataHardSplitCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataWriteFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataConnectionExceptionCount_Count{role="Worker"} 0 1713182689795
metrics_FetchChunkFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataCreateConnectionFailCount_Count{role="Worker"} 0 1713182689795
metrics_WriteDataSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_FetchChunkSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataFailCount_Count{role="Worker"} 0 1713182689795
metrics_RegionStartFailCount_Count{role="Worker"} 0 1713182689795
metrics_RegionFinishFailCount_Count{role="Worker"} 0 1713182689795
metrics_ActiveConnectionCount_Count{role="Worker"} 0 1713182689795
metrics_SlotsAllocated_Count{role="Worker"} 0 1713182689795
metrics_OpenStreamSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_WriteDataFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataFailNonCriticalCauseCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataTimeoutCount_Count{role="Worker"} 0 1713182689795
metrics_PushDataHandshakeFailCount_Count{role="Worker"} 0 1713182689795
metrics_OpenStreamFailCount_Count{role="Worker"} 0 1713182689795
celebornceleborn-test:/data/service/celeborn$ curl http://localhost:9096/metrics/prometheus/
metrics_WriteDataHardSplitCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataWriteFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataConnectionExceptionCount_Count{role="Worker"} 0 1713182689795
metrics_FetchChunkFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataCreateConnectionFailCount_Count{role="Worker"} 0 1713182689795
metrics_WriteDataSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_FetchChunkSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataFailCount_Count{role="Worker"} 0 1713182689795
metrics_RegionStartFailCount_Count{role="Worker"} 0 1713182689795
metrics_RegionFinishFailCount_Count{role="Worker"} 0 1713182689795
metrics_ActiveConnectionCount_Count{role="Worker"} 0 1713182689795
metrics_SlotsAllocated_Count{role="Worker"} 0 1713182689795
metrics_OpenStreamSuccessCount_Count{role="Worker"} 0 1713182689795
metrics_WriteDataFailCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataFailNonCriticalCauseCount_Count{role="Worker"} 0 1713182689795
metrics_ReplicateDataTimeoutCount_Count{role="Worker"} 0 1713182689795
metrics_PushDataHandshakeFailCount_Count{role="Worker"} 0 1713182689795
metrics_OpenStreamFailCount_Count{role="Worker"} 0 1713182689795
```

Closes #2464 from SteNicholas/CELEBORN-1390.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-04-17 15:09:04 +08:00
SteNicholas
1d3558bd14 [CELEBORN-1385] HttpServer supports idle timeout configuration of Jetty
### What changes were proposed in this pull request?

Introduce `celeborn.master.http.idleTimeout` and `celeborn.worker.http.idleTimeout` to support idle timeout configuration of Jetty for `HttpServer`.

### Why are the changes needed?

`ServerConnector` supports HTTP idle timeout configuration via `jetty.http.idleTimeout`, whose default value is 30000 ms, i.e. `jetty.http.idleTimeout=30000`. `HttpServer` should also support configuring Jetty's idle timeout. The following log shows the idle timeout expiring:

```
2024-04-12 16:04:00,926 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(161) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=29999/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout check, elapsed: 29999 ms, remaining: 1 ms
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(161) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout check, elapsed: 30001 ms, remaining: -1 ms
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.IdleTimeout -IdleTimeout.java(168) -SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0} idle timeout expired
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.io.FillInterest -FillInterest.java(136) -onFail FillInterest6cc48840{AC.ReadCB2f88da0c{HttpConnection2f88da0c::SocketChannelEndPoint567d3f82{l=/127.0.0.1:9097,r=/127.0.0.1:35276,OPEN,fill=FI,flush=-,to=30001/30000}{io=1/1,kio=1,kro=1}->HttpConnection2f88da0c[p=HttpParser{s=START,0 of -1},g=HttpGenerator796c3666{s=START}]=>HttpChannelOverHttp63815646{s=HttpChannelState5c192497{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=5,c=false/false,a=IDLE,uri=null,age=0}}}
java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:171) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
    at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:113) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_162]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.http.HttpParser -HttpParser.java(1883) -close HttpParser{s=START,0 of -1}
2024-04-12 16:04:00,927 [DEBUG] [master-JettyScheduler-1] - org.eclipse.jetty.http.HttpParser -HttpParser.java(1912) -START --> CLOSE
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2455 from SteNicholas/CELEBORN-1385.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-04-14 12:40:57 +08:00
SteNicholas
f25972d003 [CELEBORN-1317][FOLLOWUP] HttpServer avoid Jetty's acceptor thread shrink for stopping
### What changes were proposed in this pull request?

`HttpServer` sets the idle timeout to 0 to avoid Jetty's acceptor thread shrinking while stopping.

### Why are the changes needed?

When Jetty's acceptor thread shrinks before the main thread sends a signal to it, `java.io.IOException: No such file or directory` can occur.

Jetty's acceptor thread waits for a new connection request and blocked by `accept(this.fd, newfd, isaa)` in [sun.nio.ch.ServerSocketChannelImpl#accept](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l241).

When `org.eclipse.jetty.server.Server.doStop` is called in the main thread, the thread reaches [this code](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/sun/nio/ch/ServerSocketChannelImpl.java#l280).

The server socket descriptor will be closed by `nd.preClose` in the main thread.
Then, `accept()` in the acceptor thread throws an exception due to "Bad file descriptor" on macOS.
After the exception is thrown, the acceptor thread continues to [fetch a task](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L783).
If the thread obtains the `SHRINK` task [here](https://github.com/eclipse/jetty.project/blob/jetty-9.4.18.v20190429/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java#L854), the thread is shrunk.
If the acceptor thread finishes before `NativeThread.signal` is called in the main thread, this issue happens.

Environment:

- Jetty: v9.4.52.v20230823
- JDK: Oracle JDK 1.8
- OS: Linux version 5.10.0 (gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, GNU ld (GNU Binutils for Debian) 2.35.2)

Backport https://github.com/apache/spark/pull/28437.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2450 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-04-09 16:58:22 +08:00
Aravind Patnam
f04ebccd4d
[CELEBORN-1368] Log celeborn config for debugging purposes
### What changes were proposed in this pull request?
Log celeborn config for debugging purposes.

### Why are the changes needed?
Help with debugging

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested the patch internally.

Closes #2442 from akpatnam25/CELEBORN-1368.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 15:11:35 +08:00
SteNicholas
df2cb1be9a [CELEBORN-1317][FOLLOWUP] ServerConnector supports celeborn.master.http.stopTimeout and celeborn.worker.http.stopTimeout
### What changes were proposed in this pull request?

`ServerConnector` supports `celeborn.master.http.stopTimeout` and `celeborn.worker.http.stopTimeout`.

### Why are the changes needed?

Jetty `Server` supports `celeborn.master.http.stopTimeout` and `celeborn.worker.http.stopTimeout`, but `ServerConnector` does not, so its stop timeout stays at the default of 5min.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2437 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-04-01 17:59:12 +08:00
SteNicholas
ff2bc92067 [CELEBORN-1317][FOLLOWUP] Update default value of celeborn.master.http.maxWorkerThreads and celeborn.worker.http.maxWorkerThreads via QueuedThreadPool
### What changes were proposed in this pull request?

Update default value of `celeborn.master.http.maxWorkerThreads` and `celeborn.worker.http.maxWorkerThreads` via `QueuedThreadPool`, of which default value is 200.

### Why are the changes needed?

`QueuedThreadPool` determines that the default minimum threads is 8, and the default maximum threads is 200 in [QueuedThreadPool#L121](48f6ab7289/jetty-core/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java (L1210)) and [QueuedThreadPool#L125](48f6ab7289/jetty-core/jetty-util/src/main/java/org/eclipse/jetty/util/thread/QueuedThreadPool.java (L125)).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2428 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-29 11:56:04 +08:00
SteNicholas
bd7c99f056
[CELEBORN-1317][FOLLOWUP] Remove Incubating from REST API Documentation
### What changes were proposed in this pull request?

Remove `Incubating` from REST API Documentation.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. The REST API Documentation should remove `Incubating`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2425 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-28 11:09:19 +08:00
Fei Wang
ceed216a39 [CELEBORN-1317][FOLLOWUP] Retry to setup mini cluster if the cause is BindException
### What changes were proposed in this pull request?
Fix the UT failure caused by the HTTP server port already being in use.

For Jetty HttpServer, a failure to bind the port surfaces as an IOException whose cause is a BindException, so we should retry in that case.

Before:
```
    case e: BindException => // retry to setup mini cluster
```

Now:
```
    case e: IOException
         if e.isInstanceOf[BindException] || Option(e.getCause).exists(
           _.isInstanceOf[BindException]) =>  // retry to setup mini cluster
```
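
The retry condition in the `Now:` snippet can be expressed in plain Java as follows. This is a sketch for illustration only: `BindRetrySketch`, `isBindFailure` and `retryOnBindFailure` are hypothetical names, and the real test code sets up the mini cluster in Scala.

```java
import java.io.IOException;
import java.net.BindException;
import java.util.concurrent.Callable;

// Hypothetical helper mirroring the retry condition for port-binding failures.
final class BindRetrySketch {
    private BindRetrySketch() {}

    // True for a BindException, or for an IOException whose direct cause is a BindException.
    static boolean isBindFailure(Throwable e) {
        return e instanceof BindException
            || (e instanceof IOException && e.getCause() instanceof BindException);
    }

    // Runs setup up to maxAttempts times, retrying only on bind failures.
    static <T> T retryOnBindFailure(int maxAttempts, Callable<T> setup) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return setup.call();
            } catch (IOException e) {
                if (!isBindFailure(e)) {
                    throw e; // unrelated I/O error: do not retry
                }
                last = e;
            }
        }
        throw last;
    }
}
```

Checking the cause matters because the bind error is wrapped in an IOException here, so matching on `BindException` alone misses it.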

### Why are the changes needed?

To fix the UT failure caused by the HTTP server port already being in use.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA will be triggered three times.

Closes #2424 from turboFei/set_connector_stop_timeout.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-28 10:28:47 +08:00
Fei Wang
adbc77cd4f [CELEBORN-1317] Refine celeborn http server and support swagger ui
### What changes were proposed in this pull request?

Before, there was no HTTP request spec such as query params, HTTP method and response media type.
And each API needed its own HttpEndpoint class.

In this PR, we refine the code for the HTTP service and provide Swagger UI.

Note: this PR does not change the original API request and response behavior, including the metrics APIs.

TODO:
1. define DTO
2. http request authentication

<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">

<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">

### Why are the changes needed?

To close CELEBORN-1317

### Does this PR introduce _any_ user-facing change?

No, the API is aligned with the previous behavior.

### How was this patch tested?
UT.

Closes #2371 from turboFei/jetty.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-27 23:18:18 +08:00
Angerszhuuuu
c71d1068cf [CELEBORN-1297] Change DB script column from user to name
### What changes were proposed in this pull request?
Change DB script column from user to name

### Why are the changes needed?
Change DB script column from user to name

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2340 from AngersZhuuuu/CELEBORN-1297.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-28 10:47:17 +08:00
SteNicholas
2dd0a1df4a [CELEBORN-1296] Introduce celeborn.dynamicConfig.store.fs.path config to configure the path of dynamic config file for fs store backend
### What changes were proposed in this pull request?

Introduce `celeborn.dynamicConfig.store.fs.path` config to configure the path of dynamic config file for fs store backend.

### Why are the changes needed?

`FsConfigServiceImpl` uses `celeborn.quota.configuration.path` to configure the path of dynamic config file for fs store backend at present. The path of dynamic config file should be introduced with `celeborn.dynamicConfig.store.fs.path` instead of quota configuration path.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2337 from SteNicholas/CELEBORN-1296.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-27 19:17:13 +08:00
jiaoqingbo
96344561e8 [CELEBORN-1285] Add check tenantConfig.getConfigs().isEmpty() in getTenantUserConfigFromCache
### What changes were proposed in this pull request?

 Add check tenantConfig.getConfigs().isEmpty() in getTenantUserConfigFromCache

### Why are the changes needed?

 Add check tenantConfig.getConfigs().isEmpty() in getTenantUserConfigFromCache

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #2324 from jiaoqingbo/1285.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-25 22:20:41 +08:00
SteNicholas
a1c9d01739 [CELEBORN-1056] Introduce Rest API of listing dynamic configuration
### What changes were proposed in this pull request?

Introduce Rest API of listing dynamic configuration `/listDynamicConfigs` to list the dynamic configs. The result of `/listDynamicConfigs` is as follows:

```
=========================== Dynamic Configuration ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          100000
celeborn.worker.flusher.buffer.size                                           64k
=========================== SYSTEM ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          200000
celeborn.worker.flusher.buffer.size                                           128k
=========================== TENANT ============================
=========================== Tenant: tenantId1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          300000
celeborn.worker.flusher.buffer.size                                           256k
=========================== TENANT_USER ============================
=========================== Tenant: tenantId1, Name: user1 ============================
celeborn.master.ha.ratis.raft.server.snapshot.auto.trigger.threshold          400000
celeborn.worker.flusher.buffer.size                                           512k
```

### Why are the changes needed?

Celeborn supports dynamic configuration with `ConfigService` at present. It's recommended to introduce a REST API for dynamic configuration management.

### Does this PR introduce _any_ user-facing change?

- Introduce Rest API of listing dynamic configuration: `/listDynamicConfigs?level=[system|tenant|tenant_user]&tenant=tenantId1&name=user1`.
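
A minimal sketch of how a client might assemble a request URI for this endpoint. The class `ListDynamicConfigsUri`, its `build` method, and the base address `http://localhost:9098` are assumptions made for illustration; only the path and the `level`, `tenant` and `name` parameters come from the description above. The sketch builds the URI and does not send a request.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Celeborn: builds the query URI for /listDynamicConfigs.
class ListDynamicConfigsUri {

  // level is one of "system", "tenant", "tenant_user"; any argument may be null to omit it.
  static URI build(String baseUrl, String level, String tenant, String name) {
    Map<String, String> params = new LinkedHashMap<>();
    if (level != null) params.put("level", level);
    if (tenant != null) params.put("tenant", tenant);
    if (name != null) params.put("name", name);

    // URL-encode each value and join the pairs with '&'.
    StringJoiner query = new StringJoiner("&");
    for (Map.Entry<String, String> e : params.entrySet()) {
      query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
    }
    String suffix = params.isEmpty() ? "" : "?" + query;
    return URI.create(baseUrl + "/listDynamicConfigs" + suffix);
  }

  public static void main(String[] args) {
    // The base address is a placeholder; use the HTTP host and port of your own deployment.
    System.out.println(build("http://localhost:9098", "tenant_user", "tenantId1", "user1"));
  }
}
```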

### How was this patch tested?

- `HttpUtilsSuite#CELEBORN-1056: Introduce Rest API of listing dynamic configuration`

Closes #2311 from SteNicholas/CELEBORN-1056.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-23 10:30:11 +08:00
Angerszhuuuu
0c952ca915 [CELEBORN-1239][FEATURE] Celeborn QuotaManager support use ConfigService and support default quota setting
### What changes were proposed in this pull request?
This PR does 3 things:
1. Remove the unnecessary conf QUOTA_MANAGER, since the quota manager is implemented with ConfigService and ConfigService already has a conf that indicates the implementation.
2. Move the quota manager to the Master side, since only the master uses it.
3. Support the quota manager using FsConfigService, and support a default system-level quota.

### Why are the changes needed?
1. For users who do not have a quota configured, we often want a default quota to apply.
2. The quota manager should support refresh.
3. QuotaManager should support integrating with ConfigService.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added ut

Closes #2298 from AngersZhuuuu/CELEBORN-1239.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-22 18:00:19 +08:00
SteNicholas
d3d9614588 [CELEBORN-1264][FOLLOWUP] Improve tenant user level dynamic configuration interface of ConfigService
### What changes were proposed in this pull request?

Improve tenant user level dynamic configuration interface of `ConfigService` including:

- Renames `getRawTenantUserConfig` to `getRawTenantUserConfigFromCache`.
- Renames `getTenantUserConfig` to `getTenantUserConfigFromCache`.

### Why are the changes needed?

The naming of the tenant user level dynamic configuration interfaces of `ConfigService` needs to be consistent with the other interfaces, whose names end with `FromCache`.

### Does this PR introduce _any_ user-facing change?

- Renames `getRawTenantUserConfig` to `getRawTenantUserConfigFromCache`.
- Renames `getTenantUserConfig` to `getTenantUserConfigFromCache`.

### How was this patch tested?

- `ConfigServiceSuiteJ`

Closes #2307 from SteNicholas/CELEBORN-1264.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-20 13:11:41 +08:00
SteNicholas
64b4338291 [CELEBORN-1052][FOLLOWUP] Improve the implementation of ConfigService
### What changes were proposed in this pull request?

Improve the implementation of `ConfigService` including:

- Removes `celeborn.dynamicConfig.enabled`.
- Changes `celeborn.dynamicConfig.store.backend` to optional.
- Renames `refreshAllCache` to `refreshCache` in `ConfigService`.
- Checks whether the dynamic config file exists and is a file in `FsConfigServiceImpl`.

### Why are the changes needed?

Whether dynamic config is enabled can be determined by whether `celeborn.dynamicConfig.store.backend` is provided, instead of by `celeborn.dynamicConfig.enabled`. The `refreshAllCache` interface can be renamed to `refreshCache` and simply throw an exception. Meanwhile, `FsConfigServiceImpl` should check whether the dynamic config file exists and is a file.
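
The two checks described here can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `DynamicConfigChecks` and its methods are hypothetical names, the configuration is modeled as a plain `Map`, the value `"FS"` is an assumed example, and treating a blank value as "not provided" is an assumption as well.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch, not Celeborn's actual code.
class DynamicConfigChecks {

  static final String BACKEND_KEY = "celeborn.dynamicConfig.store.backend";

  // Dynamic config counts as enabled only when a store backend is configured.
  // Assumption: a blank value is treated the same as a missing one.
  static boolean isDynamicConfigEnabled(Map<String, String> conf) {
    return Optional.ofNullable(conf.get(BACKEND_KEY))
        .map(String::trim)
        .filter(v -> !v.isEmpty())
        .isPresent();
  }

  // Rejects paths that are missing or that are not regular files (for example, directories).
  static Path requireConfigFile(Path path) throws IOException {
    if (!Files.exists(path)) {
      throw new IOException("Dynamic config file does not exist: " + path);
    }
    if (!Files.isRegularFile(path)) {
      throw new IOException("Dynamic config path is not a file: " + path);
    }
    return path;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(isDynamicConfigEnabled(Map.of(BACKEND_KEY, "FS")));
    Path tmp = Files.createTempFile("dynamicConfig", ".yaml");
    System.out.println(requireConfigFile(tmp));
    Files.delete(tmp);
  }
}
```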

### Does this PR introduce _any_ user-facing change?

- Renames `refreshAllCache` to `refreshCache` in `ConfigService`.

### How was this patch tested?

- `ConfigServiceSuiteJ`

Closes #2304 from SteNicholas/CELEBORN-1052.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-19 22:42:10 +08:00
Angerszhuuuu
e4f7ea8e01 [CELEBORN-1264] ConfigService supports TENANT_USER config level
### What changes were proposed in this pull request?
`ConfigService` supports user-level config.

### Why are the changes needed?
Support more config use cases; this can later be integrated with the quota manager.

### Does this PR introduce _any_ user-facing change?
With this PR, a user's settings from the config service have three levels:

- User
- Tenant
- System

The user identifier is constructed from the username and tenantId.
If there is no specific setting for the username, the lookup falls back to the tenant-level setting; if the tenant-level setting is also not set, it falls back to the system setting.
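
The fallback order can be illustrated with a small in-memory sketch. `LayeredConfigSketch` and all of its methods are hypothetical and are not the `ConfigService` API; the `tenantId/userName` key format is likewise an assumption chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a three-level lookup; not the actual ConfigService API.
class LayeredConfigSketch {
  private final Map<String, String> system = new HashMap<>();
  private final Map<String, Map<String, String>> tenants = new HashMap<>();
  private final Map<String, Map<String, String>> tenantUsers = new HashMap<>();

  // A user is identified by tenantId plus username (the separator is an assumption).
  private static String userKey(String tenantId, String userName) {
    return tenantId + "/" + userName;
  }

  void setSystem(String key, String value) {
    system.put(key, value);
  }

  void setTenant(String tenantId, String key, String value) {
    tenants.computeIfAbsent(tenantId, t -> new HashMap<>()).put(key, value);
  }

  void setTenantUser(String tenantId, String userName, String key, String value) {
    tenantUsers.computeIfAbsent(userKey(tenantId, userName), u -> new HashMap<>()).put(key, value);
  }

  // Looks up the user level first, then the tenant level, then the system level.
  Optional<String> get(String tenantId, String userName, String key) {
    String v = tenantUsers.getOrDefault(userKey(tenantId, userName), Map.of()).get(key);
    if (v == null) v = tenants.getOrDefault(tenantId, Map.of()).get(key);
    if (v == null) v = system.get(key);
    return Optional.ofNullable(v);
  }

  public static void main(String[] args) {
    LayeredConfigSketch conf = new LayeredConfigSketch();
    String key = "celeborn.worker.flusher.buffer.size";
    conf.setSystem(key, "128k");
    conf.setTenant("tenantId1", key, "256k");
    conf.setTenantUser("tenantId1", "user1", key, "512k");
    System.out.println(conf.get("tenantId1", "user1", key)); // user-level value
    System.out.println(conf.get("tenantId1", "user2", key)); // falls back to tenant
    System.out.println(conf.get("tenantId2", "user1", key)); // falls back to system
  }
}
```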

### How was this patch tested?
Added UT

Closes #2285 from AngersZhuuuu/CELEBORN-1264.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-18 16:22:22 +08:00
Shuang
d89dcf0e06 [CELEBORN-1054] Support db based dynamic config service
### What changes were proposed in this pull request?

Support database based store backend implementation for dynamic configuration management

### Why are the changes needed?

Currently Celeborn provides the `FsConfigServiceImpl` implementation of the dynamic config service, which is based on the file system. We could also support a database-based store backend implementation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- `ConfigServiceSuiteJ#testDbConfig`

Closes #2273 from RexXiong/CELEBORN-1054.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-05 13:23:25 +08:00
SteNicholas
aad3929018
[CELEBORN-1259] Improve the default gracePeriod of ThreadUtils#shutdown
### What changes were proposed in this pull request?

Introduce `ThreadUtils#shutdown(executor)` method to improve the default gracePeriod of `ThreadUtils#shutdown`.

### Why are the changes needed?

The default value of `gracePeriod` for `ThreadUtils#shutdown` is 30 seconds at present. Meanwhile, most callers of `ThreadUtils#shutdown` pass a `gracePeriod` of 800 milliseconds. Therefore, the default `gracePeriod` of `ThreadUtils#shutdown` could be changed to 800 milliseconds.
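
For readers unfamiliar with the pattern, a grace-period shutdown can be sketched in plain Java as below. This is not the body of `ThreadUtils#shutdown`; `ShutdownSketch` and its constant are hypothetical names, and only the 800 millisecond default is taken from the description above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; Celeborn's ThreadUtils may be implemented differently.
class ShutdownSketch {
  static final long DEFAULT_GRACE_PERIOD_MS = 800L;

  // Overload that applies the default grace period.
  static void shutdown(ExecutorService executor) {
    shutdown(executor, DEFAULT_GRACE_PERIOD_MS);
  }

  static void shutdown(ExecutorService executor, long gracePeriodMs) {
    executor.shutdown(); // stop accepting new tasks, let queued tasks finish
    try {
      if (!executor.awaitTermination(gracePeriodMs, TimeUnit.MILLISECONDS)) {
        executor.shutdownNow(); // interrupt tasks still running after the grace period
      }
    } catch (InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    pool.submit(() -> System.out.println("task ran"));
    shutdown(pool);
    System.out.println(pool.isShutdown());
  }
}
```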

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2276 from SteNicholas/CELEBORN-1259.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-01 18:13:36 +08:00
Shuang
e71d912d50 [CELEBORN-1245] Support Celeborn Master(Leader) to manage workers
### What changes were proposed in this pull request?
1. Support Celeborn Master (Leader) managing workers by sending events with the heartbeat.
2. Add a Worker Status to Worker so that we can know the status of each worker (for example, during decommission).
3. Add an HTTP interface on the master for handleWorkerEvent/getWorkerEvent.

### Why are the changes needed?
Currently, we only support managing the status of workers on the worker side. This PR lets the master manage the status of all workers. By sending events such as Decommission/Graceful/Exit with the heartbeat, workers can asynchronously execute commands from the master. Meanwhile, we cannot know the worker's status during decommission, so this PR adds a worker status that reports the exact status of the worker.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #2255 from RexXiong/CELEBORN-1245.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-01 09:44:59 +08:00
Angerszhuuuu
67e6cbfb51
[CELEBORN-1242] Unify celeborn thread name format
### What changes were proposed in this pull request?

Unify celeborn thread name format with the following pattern:

- client: `celeborn-client-[component]-[function]er`
- service: `[master|worker]-[component]-[function]er`
- other: `celeborn-[component]-[function]er`
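
A minimal sketch of how such names could be produced by a thread factory. `ThreadNameSketch`, the `Role` enum, the sample component and function names, and the numeric `-N` suffix per thread are all assumptions for illustration; only the three prefix patterns come from the list above.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; the names and the per-thread index suffix are assumptions.
class ThreadNameSketch {
  enum Role { CLIENT, MASTER, WORKER, OTHER }

  // Builds a prefix that follows the [component]-[function]er pattern for each role.
  static String prefix(Role role, String component, String function) {
    String suffix = component + "-" + function + "er";
    switch (role) {
      case CLIENT: return "celeborn-client-" + suffix;
      case MASTER: return "master-" + suffix;
      case WORKER: return "worker-" + suffix;
      default: return "celeborn-" + suffix;
    }
  }

  // Creates daemon threads named "<prefix>-<index>".
  static ThreadFactory daemonFactory(String prefix) {
    AtomicInteger counter = new AtomicInteger();
    return runnable -> {
      Thread t = new Thread(runnable, prefix + "-" + counter.getAndIncrement());
      t.setDaemon(true);
      return t;
    };
  }

  public static void main(String[] args) {
    String name = prefix(Role.WORKER, "storage", "flush");
    Thread t = daemonFactory(name).newThread(() -> {});
    System.out.println(t.getName());
  }
}
```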

### Why are the changes needed?

It's recommended to unify the Celeborn thread name format, especially on the client side for applications.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2248 from AngersZhuuuu/CELEBORN-1242.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-23 16:56:40 +08:00
Fei Wang
d46b6623b3
[CELEBORN-1228] Format the timestamp when recording worker failure
### What changes were proposed in this pull request?

Format the timestamp when recording worker failure information.

### Why are the changes needed?

Currently the long-typed timestamp is hard to read and confusing without consulting the source code.
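
As an illustration of the idea, an epoch-millisecond value can be rendered in a readable form as sketched below. The pattern `yyyy-MM-dd HH:mm:ss`, the use of `java.time`, and the class name `FailureTimestampSketch` are assumptions; the PR may use a different formatter or pattern.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch; the pattern and formatter choice are assumptions.
class FailureTimestampSketch {
  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

  // Converts epoch milliseconds into a readable timestamp in the given zone.
  static String format(long epochMillis, ZoneId zone) {
    return FORMAT.withZone(zone).format(Instant.ofEpochMilli(epochMillis));
  }

  public static void main(String[] args) {
    System.out.println(format(0L, ZoneOffset.UTC)); // prints 1970-01-01 00:00:00
    System.out.println(format(System.currentTimeMillis(), ZoneId.systemDefault()));
  }
}
```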

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2230 from turboFei/date_format.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-17 14:04:30 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

The `RunningApplicationCount` metric only monitors the count of running applications in the cluster for the master. Meanwhile, the `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore, a `RunningApplicationCount` metric and an `/applications` API could be introduced to record the running applications of the worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
sychen
7f653ce7d6 [CELEBORN-1190] Apply error prone patch and suppress some problems
### What changes were proposed in this pull request?
1. Fix the MissingOverride, DefaultCharset and UnnecessaryParentheses rules
2. Exclude generated sources, FutureReturnValueIgnored, TypeParameterUnusedInFormals, UnusedVariable

### Why are the changes needed?
```
./build/make-distribution.sh --release
```
We get a lot of WARNINGs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2177 from cxzl25/error_prone_patch.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-20 20:54:18 +08:00
qinrui
04a1e90207 [CELEBORN-1122] Metrics supports json format
### What changes were proposed in this pull request?
Some users do not collect monitoring metrics with Prometheus but with other systems. For them, metrics in JSON format are more user-friendly. This PR adds JSON format support for metrics.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Metrics supports JSON format

### How was this patch tested?
Cluster test.

Closes #2089 from suizhe007/CELEBORN-1122.

Authored-by: qinrui <qr7972@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-06 09:24:28 +08:00
SteNicholas
406cef8392 [CELEBORN-1052][FOLLOWUP] Introduce dynamic ConfigService at SystemLevel and TenantLevel
### What changes were proposed in this pull request?

Follow-up to #2100. Mainly moves the code from #2100 from the Scala package to the Java package. Meanwhile, `FsConfigServiceImpl#refresh` should directly return instead of refreshing configs.

### Why are the changes needed?

This PR follows up on the dynamic `ConfigService` at `SystemLevel` and `TenantLevel` introduced in #2100; dynamic configuration is a type of configuration that can be changed at runtime as needed. The implementation of `ConfigService` is Java code that was placed in a Scala package, so the spotless plugin did not format it well. As a result of moving the package, this pull request contains many code style changes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ConfigServiceSuiteJ`.

Closes #2125 from SteNicholas/CELEBORN-1052.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-12-04 19:03:59 +08:00
Shuang
ad57c8b91e
[CELEBORN-1052] Introduce dynamic ConfigService at SystemLevel and TenantLevel
### What changes were proposed in this pull request?
This PR introduces a dynamic ConfigService at SystemLevel and TenantLevel. Dynamic configuration is a type of configuration that can be changed at runtime as needed. It can be used at the system level or the tenant level. When applying dynamic configuration, the priority order is as follows: tenant level overrides system level, which in turn overrides static configuration (CelebornConf). This means that if a configuration is defined at the tenant level, it will be used instead of the system-level or static configuration (CelebornConf). If the tenant-level configuration is missing, the system-level configuration will be used. If the system-level configuration is also missing, CelebornConf will be used as the default value.

There are several other tasks related to this feature that will be implemented in the future.

- [ ]  [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051)
- [ ]  [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054)
- [ ]  [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056)

### Why are the changes needed?
The current server configuration (CelebornConf) is static: when the configuration changes, the service needs to be restarted. This PR introduces a dynamic configuration solution that the server side can use as needed. Tenant-level configuration is expected to be needed in the future (for example, tenant-level dynamic quota control), so this PR also supports dynamic tenant-level configuration and provides a default implementation based on the file system.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2100 from RexXiong/CELEBORN-1052.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-27 12:17:05 +08:00
SteNicholas
52eddc59f3
[CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support excluding a worker manually by worker id. The worker is added to the manually excluded workers.

### Why are the changes needed?

At present, the Celeborn shuffle client excludes workers on fetch and push failures. It's also necessary to exclude workers manually when maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
fwang12
655d5762ca [CELEBORN-1076] Using text/plain content type for prometheus metrics
### What changes were proposed in this pull request?
Refer to https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#basic-info

The HTTP content type should be `text/plain`.

### Why are the changes needed?

As described in https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#basic-info.

Advantages
```
Human-readable
Easy to assemble, especially for minimalistic cases (no nesting required)
Readable line by line (with the exception of type hints and docstrings)
```
### Does this PR introduce _any_ user-facing change?
The HTTP content type changes to `text/plain`.

### How was this patch tested?
<img width="910" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/6cc9b071-3149-48fb-9aab-66506a72be3f">

Closes #2014 from turboFei/metrics_plain.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-23 17:56:01 +08:00
xleoken
f6dcfaa37f [CELEBORN-1044] Enhance the check of parameter array length
### What changes were proposed in this pull request?

We can't get any response from `/conf` when the master is started with the default Celeborn conf.

![e8c649b733e0c8495bb6555dfb7c5e58_13063594_image-2023-10-17-11-37-15-261](https://github.com/apache/incubator-celeborn/assets/95013770/a6de4496-f53f-46ad-94b6-e02adaa6fbfc)

**Internal Exception**
```
empty.max
java.lang.UnsupportedOperationException: empty.max
	at scala.collection.TraversableOnce.max(TraversableOnce.scala:275)
	at scala.collection.TraversableOnce.max$(TraversableOnce.scala:273)
	at scala.collection.AbstractTraversable.max(Traversable.scala:108)
	at org.apache.celeborn.server.common.HttpService.getConf(HttpService.scala:36)
	at org.apache.celeborn.service.deploy.master.MasterSuite.$anonfun$new$1(MasterSuite.scala:46)
```
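
The failure mode is calling `max` on an empty collection. A Java analogue of a guarded version is sketched below; the actual fix lives in Scala code that is not shown here, so `ConfRenderSketch` and its `render` method are hypothetical, and the column layout is an assumption.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical Java analogue of guarding a max() over a possibly empty collection.
class ConfRenderSketch {

  // Renders "key value" lines with keys padded to the longest key.
  static String render(Map<String, String> conf) {
    // max() on an empty stream yields an empty OptionalInt instead of throwing,
    // so orElse(0) covers the case where no conf entries are set.
    int longest = conf.keySet().stream().mapToInt(String::length).max().orElse(0);
    int width = Math.max(longest, 1); // a format width of 0 is not valid
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> e : new TreeMap<>(conf).entrySet()) {
      sb.append(String.format("%-" + width + "s %s", e.getKey(), e.getValue())).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(render(Map.of())); // prints nothing and does not throw
    System.out.print(render(Map.of("celeborn.worker.flusher.buffer.size", "64k")));
  }
}
```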

### Why are the changes needed?

Bug.

### How was this patch tested?

Local

Closes #1995 from xleoken/patch5.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-17 20:52:36 +08:00
SteNicholas
f2d6cc7525 [CELEBORN-829] Improve response message of invalid HTTP request
### What changes were proposed in this pull request?

Improve the response message for invalid HTTP requests so that it lists the available API providers, as shown below:

- master

```
Invalid uri of the master. Available API providers include:
/applications        List all running application's ids of the cluster.
/conf                List the conf setting of the master.
/excludedWorkers     List all excluded workers of the master.
/help                List the available API providers of the master.
/hostnames           List all running application's LifecycleManager's hostnames of the cluster.
/listTopDiskUsedApps List the top disk usage application ids. It will return the top disk usage application ids for the cluster.
/lostWorkers         List all lost workers of the master.
/masterGroupInfo     List master group information of the service. It will list all master's LEADER, FOLLOWER information.
/shuffles            List all running shuffle keys of the service. It will return all running shuffle's key of the cluster.
/shutdownWorkers     List all shutdown workers of the master.
/threadDump          List the current thread dump of the master.
/workerInfo          List worker information of the service. It will list all registered workers 's information.
```

- worker

```
Invalid uri of the worker. Available API providers include:
/conf                      List the conf setting of the worker.
/exit                      Trigger this worker to exit. Legal types are 'DECOMMISSION‘, 'GRACEFUL' and 'IMMEDIATELY'
/help                      List the available API providers of the worker.
/isRegistered              Show if the worker is registered to the master success.
/isShutdown                Show if the worker is during the process of shutdown.
/listPartitionLocationInfo List all the living PartitionLocation information in that worker.
/listTopDiskUsedApps       List the top disk usage application ids. It only return application ids running in that worker.
/shuffles                  List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker.
/threadDump                List the current thread dump of the worker.
/unavailablePeers          List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/workerInfo                List the worker information of the worker.
```

### Why are the changes needed?

The response message for an invalid HTTP request does not help users find the correct HTTP path.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`HttpUtilsSuite#CELEBORN-829: Improve response message of invalid HTTP request`

Closes #1986 from SteNicholas/CELEBORN-829.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 16:37:51 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The ports bound by `celeborn.metrics.master.prometheus.port` and `celeborn.metrics.worker.prometheus.port` serve not only Prometheus metrics but also some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
SteNicholas
438cdf6747 [CELEBORN-973] Improve HttpRequestHandler handle HTTP request with base, master and worker
### What changes were proposed in this pull request?

The way `HttpRequestHandler` handles HTTP requests could be improved by separating the handling into base, master and worker.

### Why are the changes needed?

Improves how `HttpRequestHandler` handles HTTP requests for base, master and worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #1977 from SteNicholas/http-request-handler.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-12 15:08:15 +08:00
sychen
5310bcaf6b
[CELEBORN-313] Add rest endpoint to show master group info
### What changes were proposed in this pull request?

<img width="1347" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/43d10bff-6878-4591-9461-889494d797f9">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

```bash
./bin/celeborn-ratis sh -Draft.rpc.type=NETTY  group info   -peers clb-1:9872,clb-2:9873,clb-3:9874
```

```
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

```bash
curl http://clb-3:9983/masterGroupInfo
```

```
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

Closes #1946 from cxzl25/CELEBORN-313.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:08:31 +08:00
sychen
7e944c1a50
[CELEBORN-1014] Output log with bound address and port
### What changes were proposed in this pull request?

### Why are the changes needed?
Make it easy for administrators to find the address of the http service bindings.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

```
23/09/28 17:10:50,465 INFO [main] HttpServer: master: HttpServer started on port 9983.
```

PR
```
23/09/28 17:28:29,797 INFO [main] HttpServer: master: HttpServer started on clb-3 with port 9983.
```

Closes #1947 from cxzl25/CELEBORN-1014.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:07:12 +08:00
Angerszhuuuu
17de30009b [CELEBORN-847] Support use RESTful API to trigger worker exit and exitImmediately
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1768 from AngersZhuuuu/CELEBORN-847.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 20:04:26 +08:00
Angerszhuuuu
bacfb54447 [CELEBORN-832] Support use RESTful API to trigger worker decommission
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1759 from AngersZhuuuu/CELEBORN-832.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 15:40:14 +08:00
Angerszhuuuu
2ab88f773a [CELEBORN-819] Worker close should pass close status to support handle graceful shutdown and decommission
### What changes were proposed in this pull request?
Pass the exit kind to each component. Depending on the exit kind:

- GRACEFUL_SHUTDOWN: behaves as the original code did with graceful == true.
- Others: the LevelDB file is cleaned.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1748 from AngersZhuuuu/CELEBORN-819.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-25 14:54:01 +08:00
Angerszhuuuu
76201c92f8 [CELEBORN-820] Merge service shutdown and close method
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1742 from AngersZhuuuu/CELEBORN-820.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-22 21:04:29 +08:00
Fu Chen
7c6644b1a7
[CELEBORN-805] Immediate shutdown of server upon completion of unit test to prevent potential resource leakage
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.

This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.

Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports.

```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:12345         0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:41563           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:42905           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44419           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:45025           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39053           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39029           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39475           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:40153           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33051           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33449           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:34073           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35347           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35971           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:36799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:40775     0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:44457     0.0.0.0:*               LISTEN      1130743/java
```

After this PR:

```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1727 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:12:51 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
Cheng Pan
3c7d179e05
[CELEBORN-636] Replace SimpleDateFormat with FastDateFormat
### What changes were proposed in this pull request?

`SimpleDateFormat` is not thread-safe, replace it with a thread-safe `FastDateFormat`

### Why are the changes needed?

`FastDateFormat` is a fast and thread-safe version of `java.text.SimpleDateFormat`.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

Manual review.

Closes #1545 from pan3793/CELEBORN-636.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-06 12:59:32 +08:00
Angerszhuuuu
f574a4dafa
[CELEBORN-512][IMPROVEMENT] Sort timestamp and show in date format (#1416)
2023-04-11 19:56:48 +08:00
Angerszhuuuu
b4f8ab19bd
[CELEBORN-484][PERF] Master trigger LifecycleManager commit shutdown worker's partition location. (#1395)
* [CELEBORN-484][PERF] Master trigger LifecycleManager commit shutdown worker's  partition location.
2023-04-02 09:18:12 +08:00
Fei Wang
c609c0ebaa
[MINOR] Fix typo and remove unused code (#1381)
* fix typo

* remove unused
2023-03-25 23:20:33 +08:00
Keyong Zhou
3d6fba553b
[CELEBORN-454] Code refine for worker (#1371)
2023-03-22 10:39:14 +08:00
Angerszhuuuu
e61130d397
[CELEBORN-423][FOLLOWUP] Format http request (#1353)
* [CELEBORN-423][FOLLOWUP] Format http request
2023-03-15 16:30:23 +08:00
Angerszhuuuu
889e8ca644
[CELEBORN-423][FOLLOWUP] Format http request (#1351)
2023-03-15 14:40:05 +08:00
Angerszhuuuu
1f56a5e5d1
[CELEBORN-423] Format http request result (#1349)
2023-03-15 10:32:01 +08:00
Angerszhuuuu
3907d70212
[CELEBORN-421] Add shutdown and registered to http request (#1346)
* [CELEBORN-421] Add shutdown and registered to http request
2023-03-14 18:23:21 +08:00
Angerszhuuuu
7d7279a9bc
[CELEBORN-420] Add unavailablePeers to http request (#1345)
* [CELEBORN-420] Add unavailablePeers to http request
2023-03-14 17:23:45 +08:00
Angerszhuuuu
364acbc66a
[CELEBORN-407] Add conf setting to http request (#1337)
* [CELEBORN-407] Add conf setting to http request
2023-03-14 14:47:56 +08:00
Angerszhuuuu
3600ccc4e3
[CELEBORN-409] Add PartitionLocationInfo to worker's http request (#1335)
2023-03-13 17:02:28 +08:00
Angerszhuuuu
6f1ab70403
[CELEBORN-406] Add blacklist to http request to indicate blacklisted worker (#1334)
2023-03-13 16:44:46 +08:00
Angerszhuuuu
144a8cdb3f
[CELEBORN-408] Add lost worker infos to http request (#1333)
2023-03-13 15:27:41 +08:00
Ethan Feng
ee243f286d
[CELEBORN-4] Add metrics about top disk used apps. (#985)
2022-11-22 20:06:36 +08:00
AngersZhuuuu
a773c8e6db
[ISSUE-820][Refactor] Rename RssConf to CelebornConf (#826)
2022-10-20 20:13:13 +08:00
AngersZhuuuu
8344479df1
[ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class (#822)
* [ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class


Co-authored-by: Ethan Feng <ethan.aquarius.fmx@gmail.com>
2022-10-20 18:10:59 +08:00
Cheng Pan
96e969f46e
[BUILD] Extract project.version to Maven Property (#772)
2022-10-16 19:01:40 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723)
2022-10-08 08:05:57 +08:00