Commit Graph

34 Commits

Author SHA1 Message Date
Shuang
ad57c8b91e
[CELEBORN-1052] Introduce dynamic ConfigService at SystemLevel and TenantLevel
### What changes were proposed in this pull request?
This PR introduce dynamic ConfigService at SystemLevel and TenantLevel, Dynamic configuration is a type of configuration that can be changed at runtime as needed. It can be used at system level/tenant level. When applying dynamic configuration, the priority order is as follows: tenant level overrides system level, which in turn overrides static configuration(CelebornConf). This means that if a configuration is defined at the tenant level, it will be used instead of the system level or static configuration(CelebornConf). If the tenant-level configuration is missing,
the system-level configuration will be used. If the system-level configuration is also missing, CelebornConf
will be used as the default value.

There are several other tasks related to this feature that will be implemented in the future.

- [ ]  [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051)
- [ ]  [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054)
- [ ]  [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056)

### Why are the changes needed?
The current configuration of the server (CelebornConf) is static. When the configuration is changed, the service needs to be restarted. This PR introduces a dynamic configuration solution. The server side can use dynamic configuration as needed. At the same time, it is considered that the tenant level will be supported in the future (such as supporting tenant level dynamic quota control) configuration, so this time we will also consider supporting dynamic tenant-level configuration, and this PR will provide a default implementation based on the file system.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2100 from RexXiong/CELEBORN-1052.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-27 12:17:05 +08:00
SteNicholas
52eddc59f3
[CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support exclude worker manually given worker id. This worker is added into excluded workers manually.

### Why are the changes needed?

Celeborn supports to shuffle client-side fetch and push exclude workers on failure at present. It's necessary to exclude worker manually for maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
fwang12
655d5762ca [CELEBORN-1076] Using text/plain content type for prometheus metrics
### What changes were proposed in this pull request?
Refer https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#basic-info

The http content type is better be `text/plain`.

### Why are the changes needed?

As describe in https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md#basic-info.

Advantages
```
Human-readable
Easy to assemble, especially for minimalistic cases (no nesting required)
Readable line by line (with the exception of type hints and docstrings)
```
### Does this PR introduce _any_ user-facing change?
The http content type change.

### How was this patch tested?
<img width="910" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/6cc9b071-3149-48fb-9aab-66506a72be3f">

Closes #2014 from turboFei/metrics_plain.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-23 17:56:01 +08:00
xleoken
f6dcfaa37f [CELEBORN-1044] Enhance the check of parameter array length
### What changes were proposed in this pull request?

We can't get any response from /conf when the master started with default celeborn conf.

![e8c649b733e0c8495bb6555dfb7c5e58_13063594_image-2023-10-17-11-37-15-261](https://github.com/apache/incubator-celeborn/assets/95013770/a6de4496-f53f-46ad-94b6-e02adaa6fbfc)

**Internal Exception**
```
empty.max
java.lang.UnsupportedOperationException: empty.max
	at scala.collection.TraversableOnce.max(TraversableOnce.scala:275)
	at scala.collection.TraversableOnce.max$(TraversableOnce.scala:273)
	at scala.collection.AbstractTraversable.max(Traversable.scala:108)
	at org.apache.celeborn.server.common.HttpService.getConf(HttpService.scala:36)
	at org.apache.celeborn.service.deploy.master.MasterSuite.$anonfun$new$1(MasterSuite.scala:46)
```

### Why are the changes needed?

Bug.

### How was this patch tested?

Local

Closes #1995 from xleoken/patch5.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-17 20:52:36 +08:00
SteNicholas
f2d6cc7525 [CELEBORN-829] Improve response message of invalid HTTP request
### What changes were proposed in this pull request?

Improve response message of invalid HTTP request, which lists available API providers like as below:

- master

```
Invalid uri of the master. Available API providers include:
/applications        List all running application's ids of the cluster.
/conf                List the conf setting of the master.
/excludedWorkers     List all excluded workers of the master.
/help                List the available API providers of the master.
/hostnames           List all running application's LifecycleManager's hostnames of the cluster.
/listTopDiskUsedApps List the top disk usage application ids. It will return the top disk usage application ids for the cluster.
/lostWorkers         List all lost workers of the master.
/masterGroupInfo     List master group information of the service. It will list all master's LEADER, FOLLOWER information.
/shuffles            List all running shuffle keys of the service. It will return all running shuffle's key of the cluster.
/shutdownWorkers     List all shutdown workers of the master.
/threadDump          List the current thread dump of the master.
/workerInfo          List worker information of the service. It will list all registered workers 's information.
```

- worker

```
Invalid uri of the worker. Available API providers include:
/conf                      List the conf setting of the worker.
/exit                      Trigger this worker to exit. Legal types are 'DECOMMISSION‘, 'GRACEFUL' and 'IMMEDIATELY'
/help                      List the available API providers of the worker.
/isRegistered              Show if the worker is registered to the master success.
/isShutdown                Show if the worker is during the process of shutdown.
/listPartitionLocationInfo List all the living PartitionLocation information in that worker.
/listTopDiskUsedApps       List the top disk usage application ids. It only return application ids running in that worker.
/shuffles                  List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker.
/threadDump                List the current thread dump of the worker.
/unavailablePeers          List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/workerInfo                List the worker information of the worker.
```

### Why are the changes needed?

Response message of invalid HTTP request could not help users with correct HTTP path.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`HttpUtilsSuite#CELEBORN-829: Improve response message of invalid HTTP request`

Closes #1986 from SteNicholas/CELEBORN-829.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 16:37:51 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` bind port not only serve prometheus metrics, but also provide some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
SteNicholas
438cdf6747 [CELEBORN-973] Improve HttpRequestHandler handle HTTP request with base, master and worker
### What changes were proposed in this pull request?

The code that `HttpRequestHandler` handles HTTP request could be improved with handling HTTP request with base, master and worker.

### Why are the changes needed?

Improves `HttpRequestHandler` handle HTTP request with base, master and worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #1977 from SteNicholas/http-request-handler.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-12 15:08:15 +08:00
sychen
5310bcaf6b
[CELEBORN-313] Add rest endpoint to show master group info
### What changes were proposed in this pull request?

<img width="1347" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/43d10bff-6878-4591-9461-889494d797f9">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

```bash
./bin/celeborn-ratis sh -Draft.rpc.type=NETTY  group info   -peers clb-1:9872,clb-2:9873,clb-3:9874
```

```
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

```bash
curl http://clb-3:9983/masterGroupInfo
```

```
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

Closes #1946 from cxzl25/CELEBORN-313.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:08:31 +08:00
sychen
7e944c1a50
[CELEBORN-1014] Output log with bound address and port
### What changes were proposed in this pull request?

### Why are the changes needed?
Make it easy for administrators to find the address of the http service bindings.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

```
23/09/28 17:10:50,465 INFO [main] HttpServer: master: HttpServer started on port 9983.
```

PR
```
23/09/28 17:28:29,797 INFO [main] HttpServer: master: HttpServer started on clb-3 with port 9983.
```

Closes #1947 from cxzl25/CELEBORN-1014.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:07:12 +08:00
Angerszhuuuu
17de30009b [CELEBORN-847] Support use RESTful API to trigger worker exit and exitImmediately
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1768 from AngersZhuuuu/CELEBORN-847.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-15 20:04:26 +08:00
Angerszhuuuu
bacfb54447 [CELEBORN-832] Support use RESTful API to trigger worker decommission
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1759 from AngersZhuuuu/CELEBORN-832.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 15:40:14 +08:00
Angerszhuuuu
2ab88f773a [CELEBORN-819] Worker close should pass close status to support handle graceful shutdown and decommission
### What changes were proposed in this pull request?
Pass exit kind to each component, if the exit kind match:

- GRACEFUL_SHUTDOWN: Behavior as origin code's graceful == true
- Others: will clean the level db file.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1748 from AngersZhuuuu/CELEBORN-819.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-25 14:54:01 +08:00
Angerszhuuuu
76201c92f8 [CELEBORN-820] Merge service shutdown and close method
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1742 from AngersZhuuuu/CELEBORN-820.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-22 21:04:29 +08:00
Fu Chen
7c6644b1a7
[CELEBORN-805] Immediate shutdown of server upon completion of unit test to prevent potential resource leakage
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.

This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.

Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports.

```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:12345         0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:41563           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:42905           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44419           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:45025           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39053           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39029           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39475           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:40153           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33051           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33449           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:34073           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35347           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35971           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:36799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:40775     0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:44457     0.0.0.0:*               LISTEN      1130743/java
```

After this PR:

```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1727 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:12:51 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
Cheng Pan
3c7d179e05
[CELEBORN-636] Replace SimpleDateFormat with FastDateFormat
### What changes were proposed in this pull request?

`SimpleDateFormat` is not thread-safe, replace it with a thread-safe `FastDateFormat`

### Why are the changes needed?

`FastDateFormat` is a fast and thread-safe version of `java.text.SimpleDateFormat`.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

Manually review.

Closes #1545 from pan3793/CELEBORN-636.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-06 12:59:32 +08:00
Angerszhuuuu
f574a4dafa
[CELEBORN-512][IMPROVEMENT] Sort timestamp and show in date format (#1416) 2023-04-11 19:56:48 +08:00
Angerszhuuuu
b4f8ab19bd
[CELEBORN-484][PERF] Master trigger LifecycleManager commit shutdown worker's partition location. (#1395)
* [CELEBORN-484][PERF] Master trigger LifecycleManager commit shutdown worker's  partition location.
2023-04-02 09:18:12 +08:00
Fei Wang
c609c0ebaa
[MINOR] Fix typo and remove unused code (#1381)
* fix typo

* remove unused
2023-03-25 23:20:33 +08:00
Keyong Zhou
3d6fba553b
[CELEBORN-454] Code refine for worker (#1371) 2023-03-22 10:39:14 +08:00
Angerszhuuuu
e61130d397
[CELEBORN-423][FOLLOWUP] Format http request (#1353)
* [CELEBORN-423][FOLLOWUP] Format http request
2023-03-15 16:30:23 +08:00
Angerszhuuuu
889e8ca644
[CELEBORN-423][FOLLOWUP] Format http request (#1351) 2023-03-15 14:40:05 +08:00
Angerszhuuuu
1f56a5e5d1
[CELEBORN-423] Format http request result (#1349) 2023-03-15 10:32:01 +08:00
Angerszhuuuu
3907d70212
[CELEBORN-421] Add shutdown and registered to http request (#1346)
* [CELEBORN-421] Add shutdown and registered to http request
2023-03-14 18:23:21 +08:00
Angerszhuuuu
7d7279a9bc
[CELEBORN-420] Add unavailablePeers to http request (#1345)
* [CELEBORN-420] Add unavailablePeers to http request
2023-03-14 17:23:45 +08:00
Angerszhuuuu
364acbc66a
[CELEBORN-407] Add conf setting to http request (#1337)
* [CELEBORN-407] Add conf setting to http request
2023-03-14 14:47:56 +08:00
Angerszhuuuu
3600ccc4e3
[CELEBORN-409] Add PartitionLocationInfo to worker's http request (#1335) 2023-03-13 17:02:28 +08:00
Angerszhuuuu
6f1ab70403
[CELEBORN-406] Add blacklist to http request to indicate blacklisted worker (#1334) 2023-03-13 16:44:46 +08:00
Angerszhuuuu
144a8cdb3f
[CELEBORN-408] Add lost worker infos to http request (#1333) 2023-03-13 15:27:41 +08:00
Ethan Feng
ee243f286d
[CELEBORN-4] Add metrics about top disk used apps. (#985) 2022-11-22 20:06:36 +08:00
AngersZhuuuu
a773c8e6db
[ISSUE-820][Refactor] Rename RssConf to CelebornConf (#826) 2022-10-20 20:13:13 +08:00
AngersZhuuuu
8344479df1
[ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class (#822)
* [ISSUE-818][REFACTOR] Move existing RssConf.xxx conf method to RssConf class


Co-authored-by: Ethan Feng <ethan.aquarius.fmx@gmail.com>
2022-10-20 18:10:59 +08:00
Cheng Pan
96e969f46e
[BUILD] Extract project.version to Maven Property (#772) 2022-10-16 19:01:40 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00