### What changes were proposed in this pull request?
This PR introduces a dynamic ConfigService at the system level and the tenant level. Dynamic configuration is configuration that can be changed at runtime as needed, and it can be applied at either the system level or the tenant level. When dynamic configuration is applied, the priority order is: tenant level overrides system level, which in turn overrides static configuration (CelebornConf). If a configuration is defined at the tenant level, it is used instead of the system-level or static value. If the tenant-level configuration is missing,
the system-level configuration is used. If the system-level configuration is also missing, the CelebornConf
value is used as the default.
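
The priority order described above can be sketched as a layered lookup. This is only an illustration under stated assumptions: the class `LayeredConfig`, its methods, and the key `example.quota` are hypothetical and are not the ConfigService API added by this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the lookup order; not Celeborn's actual ConfigService.
final class LayeredConfig {
    private final Map<String, String> staticConf;                       // stands in for CelebornConf
    private final Map<String, String> systemConf = new HashMap<>();     // system-level dynamic values
    private final Map<String, Map<String, String>> tenantConf = new HashMap<>(); // per-tenant values

    LayeredConfig(Map<String, String> staticConf) {
        this.staticConf = new HashMap<>(staticConf);
    }

    void setSystem(String key, String value) {
        systemConf.put(key, value);
    }

    void setTenant(String tenant, String key, String value) {
        tenantConf.computeIfAbsent(tenant, t -> new HashMap<>()).put(key, value);
    }

    // Tenant level first, then system level, then the static value.
    Optional<String> get(String tenant, String key) {
        Map<String, String> forTenant = tenantConf.get(tenant);
        if (forTenant != null && forTenant.containsKey(key)) {
            return Optional.ofNullable(forTenant.get(key));
        }
        if (systemConf.containsKey(key)) {
            return Optional.ofNullable(systemConf.get(key));
        }
        return Optional.ofNullable(staticConf.get(key));
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("example.quota", "100"); // hypothetical key and value
        LayeredConfig conf = new LayeredConfig(base);

        conf.setSystem("example.quota", "200");
        conf.setTenant("tenantA", "example.quota", "300");

        System.out.println(conf.get("tenantA", "example.quota").orElse("<unset>")); // tenant value: 300
        System.out.println(conf.get("tenantB", "example.quota").orElse("<unset>")); // system value: 200
    }
}
```

Because the values are read at lookup time, changing a system-level or tenant-level entry takes effect on the next `get` call without a restart, which is the behavior the description asks for.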
There are several other tasks related to this feature that will be implemented in the future.
- [ ] [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051)
- [ ] [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054)
- [ ] [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056)
### Why are the changes needed?
The current server configuration (CelebornConf) is static: when a configuration is changed, the service needs to be restarted. This PR introduces a dynamic configuration solution that the server side can use as needed. Because tenant-level configuration is expected to be needed in the future (for example, tenant-level dynamic quota control), this PR also supports dynamic tenant-level configuration and provides a default implementation based on the file system.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #2100 from RexXiong/CELEBORN-1052.
Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Make Celeborn read configs from HADOOP_CONF_DIR.
2. Remove unnecessary Kerberos configs.
### Why are the changes needed?
To support HDFS with Kerberos.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
Closes #2082 from FMX/B1116.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Adding Kerberos support for HDFS storage type.
The following five parameters need to be configured:
| key | value |
| :--: | :--: |
| celeborn.storage.hdfs.kerberos.enabled | true |
| celeborn.storage.hdfs.kerberos.principal | user@REALM |
| celeborn.storage.hdfs.kerberos.keytab | /path/test.keytab |
| celeborn.hadoop.hadoop.security.authentication | kerberos |
| celeborn.hadoop.dfs.namenode.kerberos.principal | hdfs/_HOST@REALM |
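
Two of the keys above carry a `celeborn.hadoop.` prefix in front of an ordinary Hadoop key. The sketch below assumes that such keys are handed to the Hadoop configuration with the prefix removed; that behavior is an assumption here, not something this description states, and the class `HadoopKeyMapping` and the principal value are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes "celeborn.hadoop."-prefixed keys map to plain Hadoop keys.
final class HadoopKeyMapping {
    static final String PREFIX = "celeborn.hadoop.";

    // Returns only the entries under PREFIX, with the prefix removed from each key.
    static Map<String, String> toHadoopKeys(Map<String, String> celebornConf) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : celebornConf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                out.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("celeborn.storage.hdfs.kerberos.enabled", "true");
        // Placeholder principal, for illustration only.
        conf.put("celeborn.hadoop.dfs.namenode.kerberos.principal", "hdfs/_HOST@EXAMPLE.COM");

        // Only the prefixed key is forwarded, as dfs.namenode.kerberos.principal.
        System.out.println(toHadoopKeys(conf));
    }
}
```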
### Why are the changes needed?
Connecting to HDFS with Kerberos enabled requires support for keytab login.
### Does this PR introduce _any_ user-facing change?
Add 3 configurations.
celeborn.storage.hdfs.kerberos.enabled
celeborn.storage.hdfs.kerberos.principal
celeborn.storage.hdfs.kerberos.keytab
### How was this patch tested?
Test in Kerberos enabled HDFS cluster.
Closes #2072 from liujiayi771/hdfs-kerberos.
Authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Support `celeborn.storage.activeTypes` in the client.
2. The Master ignores slots for "UNKNOWN_DISK".
### Why are the changes needed?
Enable client applications to select which storage types to use.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
GA and cluster.
Closes #2045 from FMX/B1081.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```
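
For anyone updating an existing configuration file, the rename can be sketched as a key-by-key migration. This is a hypothetical helper for illustration, not part of the PR: the class `HttpKeyMigration` and the port value are made up, and only the four key pairs listed above are taken from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical migration helper; only the key names come from the PR description.
final class HttpKeyMigration {
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("celeborn.metrics.master.prometheus.host", "celeborn.master.http.host");
        RENAMES.put("celeborn.metrics.master.prometheus.port", "celeborn.master.http.port");
        RENAMES.put("celeborn.metrics.worker.prometheus.host", "celeborn.worker.http.host");
        RENAMES.put("celeborn.metrics.worker.prometheus.port", "celeborn.worker.http.port");
    }

    // Returns a copy in which each old key is removed and its value moved to the
    // new key, unless the new key is already set explicitly.
    static Properties migrate(Properties in) {
        Properties out = new Properties();
        out.putAll(in);
        for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
            String oldValue = out.getProperty(rename.getKey());
            if (oldValue != null) {
                out.remove(rename.getKey());
                if (out.getProperty(rename.getValue()) == null) {
                    out.setProperty(rename.getValue(), oldValue);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties old = new Properties();
        old.setProperty("celeborn.metrics.master.prometheus.port", "12345"); // made-up port
        Properties migrated = migrate(old);
        System.out.println(migrated.getProperty("celeborn.master.http.port")); // 12345
    }
}
```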
### Why are the changes needed?
The ports bound by `celeborn.metrics.master.prometheus.port` and `celeborn.metrics.worker.prometheus.port` not only serve Prometheus metrics but also provide some useful API services:
https://celeborn.apache.org/docs/latest/monitoring/#rest-api
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1919 from cxzl25/CELEBORN-983.
Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Change the `.version("0.3.2")` to `.version("0.3.1")`
### Why are the changes needed?
0.3.1 is not released yet.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1948 from pan3793/minor-version.
Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
If a Worker is lost, or is lost after a graceful shutdown, the Master retains the lostWorker/shutdownWorkers metadata permanently.
This metadata causes noisy messages in lifecycleManager, so it is better to delete it after a while.
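
One way to picture "delete them after a while" is a registry that records when each worker was lost and drops entries older than a retention period. The sketch below is an assumption-laden illustration: the class `LostWorkerRegistry`, its methods, and the retention values are hypothetical and do not reflect the Master's real data structures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of time-based cleanup; not the Master's actual metadata store.
final class LostWorkerRegistry {
    private final Map<String, Long> lostAtMillis = new ConcurrentHashMap<>();
    private final long retentionMillis;

    LostWorkerRegistry(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Records (or refreshes) the time at which a worker was observed as lost.
    void markLost(String workerId, long nowMillis) {
        lostAtMillis.put(workerId, nowMillis);
    }

    // Drops every entry that has been lost for at least retentionMillis.
    void expire(long nowMillis) {
        lostAtMillis.entrySet().removeIf(e -> nowMillis - e.getValue() >= retentionMillis);
    }

    boolean contains(String workerId) {
        return lostAtMillis.containsKey(workerId);
    }

    public static void main(String[] args) {
        LostWorkerRegistry registry = new LostWorkerRegistry(1_000L); // made-up retention
        registry.markLost("worker-1", 0L);
        registry.markLost("worker-2", 800L);
        registry.expire(1_200L);
        System.out.println(registry.contains("worker-1")); // false: lost 1200 ms ago
        System.out.println(registry.contains("worker-2")); // true: lost only 400 ms ago
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real implementation would more likely run `expire` from a scheduled task using the system clock.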
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT & E2E test
Closes #1916 from RexXiong/CELEBORN-468.
Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes #1795 from cfmcgrady/sbt-docs.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Add a config to limit the maximum number of workers when offering slots. The config can be set on both
the server side and the client side; Celeborn chooses the smaller positive value of the client and master configs.
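
The "smaller positive" rule can be sketched as follows. The method name, the class `MaxWorkers`, and the convention that a non-positive value means "no limit configured" are assumptions made for illustration; the description only states that the smaller positive value wins.

```java
// Hypothetical sketch of choosing the effective limit from two settings.
final class MaxWorkers {
    static final int UNLIMITED = -1; // assumed marker for "no limit configured"

    // Picks the smaller positive value; a non-positive value is treated as unset.
    static int effectiveMaxWorkers(int clientValue, int masterValue) {
        if (clientValue > 0 && masterValue > 0) {
            return Math.min(clientValue, masterValue);
        }
        if (clientValue > 0) {
            return clientValue;
        }
        if (masterValue > 0) {
            return masterValue;
        }
        return UNLIMITED;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxWorkers(100, 500)); // 100
        System.out.println(effectiveMaxWorkers(-1, 500));  // 500
        System.out.println(effectiveMaxWorkers(-1, -1));   // -1
    }
}
```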
### Why are the changes needed?
For large Celeborn clusters, users may want to limit the number of workers that
a shuffle can spread across. The reasons are:
1. One worker failure will not affect all applications.
2. One huge shuffle will not affect all applications.
3. It is more efficient to limit a shuffle to a restricted number of workers, say 100, than
to spread it across a large number of workers, say 1000, because the number of network connections
used for pushing data is `number of ShuffleClient` * `number of allocated Workers`.
The recommended number of Workers depends on the workload and Worker hardware,
and this can be configured per application, so it is relatively flexible.
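
To make the connection-count argument concrete, the product from reason 3 can be evaluated for the two worker counts mentioned above. The figure of 500 ShuffleClients is a made-up example value, and the class `PushConnections` is hypothetical.

```java
// Worked arithmetic for: connections = number of ShuffleClient * number of allocated Workers.
final class PushConnections {
    static long connections(long shuffleClients, long allocatedWorkers) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(shuffleClients, allocatedWorkers);
    }

    public static void main(String[] args) {
        long clients = 500; // made-up number of ShuffleClients
        System.out.println(connections(clients, 100));  // 50000
        System.out.println(connections(clients, 1000)); // 500000
    }
}
```

With the same number of clients, restricting the shuffle to 100 workers instead of 1000 cuts the connection count by a factor of ten.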
### Does this PR introduce _any_ user-facing change?
No, added a new configuration.
### How was this patch tested?
Added ITs and passes GA.
Closes #1790 from waitinfuture/152.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Make the Celeborn leader clean expired app dirs on HDFS when an application is lost.
### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, and a newly created worker will try to delete any app directories it does not know about.
This causes app directories that are still in use to be deleted unexpectedly.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT and cluster.
Closes #1678 from FMX/CELEBORN-764.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add new buffer size for HDFS file writers.
3. Workers support empty working dirs.
### Why are the changes needed?
Support the HDFS-only scenario.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT and cluster.
Closes #1619 from FMX/CELEBORN-568.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
* [ISSUE-332][FOLLOWUP] Add deps in worker's pom
* [Refactor] Modify package name of utils to keep consistency
* [Refactor] Modify package name of utils to keep consistency
* [REFACTOR] Remove unused isRegistered in controller
* [ISSUE-887][REFACTOR] Configuration type convert to Enum
* update
* update
* Update RssShuffleManager.java