celeborn

Author	SHA1	Message	Date
Shuang	ad57c8b91e	[CELEBORN-1052] Introduce dynamic ConfigService at SystemLevel and TenantLevel ### What changes were proposed in this pull request? This PR introduces a dynamic ConfigService at SystemLevel and TenantLevel. Dynamic configuration is configuration that can be changed at runtime as needed, and it can be applied at the system level or the tenant level. When dynamic configuration is applied, the priority order is as follows: tenant level overrides system level, which in turn overrides static configuration (CelebornConf). If a configuration is defined at the tenant level, it is used instead of the system-level or static configuration. If the tenant-level configuration is missing, the system-level configuration is used. If the system-level configuration is also missing, CelebornConf supplies the default value. Several other tasks related to this feature will be implemented in the future. - [ ] [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051) - [ ] [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054) - [ ] [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056) ### Why are the changes needed? The current server configuration (CelebornConf) is static, so the service must be restarted whenever the configuration changes. This PR introduces a dynamic configuration solution that the server side can use as needed. Because tenant-level configuration is expected to be supported in the future (for example, tenant-level dynamic quota control), this PR also supports dynamic tenant-level configuration and provides a default implementation based on the file system. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #2100 from RexXiong/CELEBORN-1052. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>	2023-11-27 12:17:05 +08:00
mingji	02cea042a0	[CELEBORN-1116] Read authentication configs from `HADOOP_CONF_DIR` ### What changes were proposed in this pull request? 1. Make Celeborn read configs from HADOOP_CONF_DIR. 2. Remove unnecessary Kerberos configs. ### Why are the changes needed? To support HDFS with Kerberos. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? GA and cluster. Closes #2082 from FMX/B1116. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Fu Chen <cfmcgrady@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-11-09 11:07:13 +08:00
joey.ljy	455cd40137	[CELEBORN-1111] Supporting connection to HDFS with Kerberos authentication enabled ### What changes were proposed in this pull request? Adding Kerberos support for HDFS storage type. The following five parameters need to be configured (key: value): celeborn.storage.hdfs.kerberos.enabled: true; celeborn.storage.hdfs.kerberos.principal: userREALM; celeborn.storage.hdfs.kerberos.keytab: /path/test.keytab; celeborn.hadoop.hadoop.security.authorization: kerberos; celeborn.hadoop.dfs.namenode.kerberos.principal: hdfs/_HOSTREALM ### Why are the changes needed? Connecting to HDFS with Kerberos enabled requires support for keytab login. ### Does this PR introduce _any_ user-facing change? Add 3 configurations: celeborn.storage.hdfs.kerberos.enabled, celeborn.storage.hdfs.kerberos.principal, celeborn.storage.hdfs.kerberos.keytab ### How was this patch tested? Test in Kerberos enabled HDFS cluster. Closes #2072 from liujiayi771/hdfs-kerberos. Authored-by: joey.ljy <joey.ljy@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-11-04 17:21:41 +08:00
mingji	5e77b851c9	[CELEBORN-1081] Client support `celeborn.storage.activeTypes` config ### What changes were proposed in this pull request? 1. Support `celeborn.storage.activeTypes` in the client. 2. Master ignores slots for "UNKNOWN_DISK". ### Why are the changes needed? Enables client applications to select which storage types to use. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? GA and cluster. Closes #2045 from FMX/B1081. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-11-03 20:03:11 +08:00
sychen	dd65e74f99	[CELEBORN-983] Rename PrometheusMetric configuration ### What changes were proposed in this pull request? Replace ```properties celeborn.metrics.master.prometheus.host celeborn.metrics.master.prometheus.port celeborn.metrics.worker.prometheus.host celeborn.metrics.worker.prometheus.port ``` With ```properties celeborn.master.http.host celeborn.master.http.port celeborn.worker.http.host celeborn.worker.http.port ``` ### Why are the changes needed? The ports bound by `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` not only serve Prometheus metrics but also provide some useful API services. https://celeborn.apache.org/docs/latest/monitoring/#rest-api ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1919 from cxzl25/CELEBORN-983. Lead-authored-by: sychen <sychen@ctrip.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-10-13 13:28:58 +08:00
Cheng Pan	ab68a4ae1b	[MINOR] Fix configuration version ### What changes were proposed in this pull request? Change the `.version("0.3.2")` to `.version("0.3.1")` ### Why are the changes needed? 0.3.1 is not released yet. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1948 from pan3793/minor-version. Lead-authored-by: Cheng Pan <chengpan@apache.org> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-09-28 19:58:06 +08:00
Shuang	615479c442	[CELEBORN-468] Timeout useless lostWorkers/shutdownWorkers meta ### What changes were proposed in this pull request? As title ### Why are the changes needed? If a Worker is lost, or is lost after a graceful shutdown, the Master retains its lostWorkers/shutdownWorkers meta permanently. This meta causes noisy messages in lifecycleManager, so it is better to delete it after a while. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT & E2E test Closes #1916 from RexXiong/CELEBORN-468. Authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-09-18 18:39:43 +08:00
Fu Chen	516bdc7e08	[CELEBORN-877][DOC] Document on SBT ### What changes were proposed in this pull request? As title ### Why are the changes needed? As title ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test Closes #1795 from cfmcgrady/sbt-docs. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-08-11 12:17:55 +08:00
zky.zhoukeyong	6ea1ee2ec4	[CELEBORN-152] Add config to limit max workers when offering slots ### What changes were proposed in this pull request? Add a config to limit the maximum number of workers when offering slots. The config can be set on both the server side and the client side, and Celeborn chooses the smaller positive value of the client and master configs. ### Why are the changes needed? For large Celeborn clusters, users may want to limit the number of workers that a shuffle can spread across. The reasons are: 1. One worker failure will not affect all applications 2. One huge shuffle will not affect all applications 3. It's more efficient to limit a shuffle to a restricted number of workers, say 100, than to spread it across a large number of workers, say 1000, because the number of network connections when pushing data is `number of ShuffleClient` * `number of allocated Workers` The recommended number of Workers depends on the workload and Worker hardware, and it can be configured per application, so it's relatively flexible. ### Does this PR introduce _any_ user-facing change? No, added a new configuration. ### How was this patch tested? Added ITs and passes GA. Closes #1790 from waitinfuture/152. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-08-07 10:13:53 +08:00
mingji	d0ecf83fec	[CELEBORN-764] Fix celeborn on HDFS might clean using app directories ### What changes were proposed in this pull request? Make the Celeborn leader clean expired app dirs on HDFS when an application is Lost. ### Why are the changes needed? If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, and a newly created worker will try to delete any app directories it does not know about. This causes app directories that are still in use to be deleted unexpectedly. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1678 from FMX/CELEBORN-764. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 23:11:50 +08:00
mingji	40760ede3a	[CELEBORN-568] Support storage type selection ### What changes were proposed in this pull request? 1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now. 2. Add a new buffer size for HDFS file writers. 3. Worker supports empty working dirs. ### Why are the changes needed? Support the HDFS-only scenario. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1619 from FMX/CELEBORN-568. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-27 18:07:08 +08:00
Angerszhuuuu	6d5dd50915	[CELEBORN-595][FOLLOWUP] Fix change version to 0.3.0. (#1522 )	2023-05-30 20:12:56 +08:00
Angerszhuuuu	62681ba85d	[CELEBORN-595] Rename and refactor the configuration doc. (#1501 )	2023-05-30 15:14:12 +08:00
Angerszhuuuu	d244f44518	[CELEBORN-593] Refine some RPC related default configurations (#1498 )	2023-05-19 18:23:12 +08:00
zhongqiangchen	cd92c423cd	[CELEBORN-475] Support extra tags for prometheus metrics (#1385 )	2023-03-28 21:22:28 +08:00
Keyong Zhou	dcedf7b0a9	[CELEBORN-348] Support fetchTime in load-aware slots assignment strategy (#1287 )	2023-03-02 18:31:50 +08:00
Angerszhuuuu	04427f2b16	[CELEBORN-247] Add metrics for each user's quota usage (#1182 )	2023-02-02 18:31:08 +08:00
Ethan Feng	ee243f286d	[CELEBORN-4] Add metrics about top disk used apps. (#985 )	2022-11-22 20:06:36 +08:00
Angerszhuuuu	87fcfa767f	[ISSUE-887][REFACTOR] Configuration type convert to Enum (#888 ) * [ISSUE-332][FOLLOWUP] Add deps in worker's pom * [Refactor] Modify package name of utils to keep consistence * [Refactor] Modify package name of utils to keep consistence * [REFACTOR] Remove unused isRegistered in controller * [ISSUE-887][REFACTOR] Configuration type convert to Enum * update * update * Update RssShuffleManager.java	2022-10-29 13:41:06 +08:00
Cheng Pan	d7be6006e7	Migrate network related conf to structured conf system (#875 ) * Migrate network related conf to structured conf system * migrate * fix * fix * worker * fix * nit * review * nit	2022-10-28 10:45:52 +08:00
Angerszhuuuu	d283cca4e1	[ISSUE-869][REFACTOR] Migrate partition size/sorter related conf to Celeborn ConfigEntity (#870 )	2022-10-27 16:49:55 +08:00
Angerszhuuuu	399236c880	[ISSUE-849][REFACTOR] Migrate master and common Celeborn Configuration System (#850 )	2022-10-26 17:09:27 +08:00
Cheng Pan	e71c0228aa	Migrate columnar shuffle configurations to ConfigEntry (#844 )	2022-10-25 14:26:11 +08:00
Cheng Pan	8d7d397e71	Fix Configuration page and polish naming (#838 ) * Fix Configuration page and polish naming * nit * nit * comment	2022-10-24 12:46:25 +08:00
nafiy	1a8a36e8fe	[ISSUE-812][Refactor] Migrate metrics system related configs to ConfigEntry (#821 )	2022-10-21 13:57:58 +08:00
Ethan Feng	5c761a8df3	[ISSUE-813][Refactor] Refactor flusher configurations. (#813 ) * Refactor flusher configurations. * Refactor flusher configurations. * Update. * remove brackets. * update docs. * rename. * update. * update docs. * update. * update. * update. * update. * update. * update. * update. * format. * update. * update.	2022-10-20 15:23:17 +08:00
Cheng Pan	cb07cf62c0	Auto generate configuration docs (#794 )	2022-10-19 10:50:22 +08:00
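
The priority order described in the CELEBORN-1052 entry (tenant level overrides system level, which overrides static CelebornConf) can be illustrated with a minimal sketch. This is not Celeborn's ConfigService code: the function `resolve_config`, the plain dictionaries standing in for each configuration layer, and the `example.*` keys are all hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional


def resolve_config(
    key: str,
    static_conf: Mapping[str, str],
    system_conf: Mapping[str, str],
    tenant_conf: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Look up `key` with tenant > system > static precedence.

    Hypothetical helper: each layer is modelled as a plain mapping.
    """
    # Highest-priority layer first; a missing tenant layer is treated as empty.
    layers = (tenant_conf or {}, system_conf, static_conf)
    for layer in layers:
        if key in layer:
            return layer[key]
    # The key is not defined at any level.
    return None


# Hypothetical keys and values, used only to exercise the lookup order.
static_conf = {"example.quota.size": "100", "example.flush.timeout": "30s"}
system_conf = {"example.quota.size": "200"}
tenant_conf = {"example.quota.size": "500"}

tenant_value = resolve_config("example.quota.size", static_conf, system_conf, tenant_conf)
system_value = resolve_config("example.quota.size", static_conf, system_conf)
static_value = resolve_config("example.flush.timeout", static_conf, system_conf, tenant_conf)
```

In this sketch a tenant that defines the key sees its own value, a tenant without an override falls back to the system-level value, and a key defined only in the static layer is served from there.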

27 Commits
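
The CELEBORN-152 entry states that Celeborn chooses the smaller positive value of the client-side and master-side worker limits. A rough sketch of that selection rule follows. The function name `effective_max_workers` is hypothetical, and treating a non-positive value as "no limit configured" (returning `None` when neither side sets a limit) is an assumption of this sketch, not something the commit message specifies.

```python
from typing import Optional


def effective_max_workers(client_limit: int, master_limit: int) -> Optional[int]:
    """Pick the smaller positive limit of the two sides.

    Assumption: a value <= 0 means that side did not configure a limit.
    Returns None when neither side configured one.
    """
    positives = [limit for limit in (client_limit, master_limit) if limit > 0]
    return min(positives) if positives else None


both_set = effective_max_workers(100, 1000)   # both sides set a limit
only_master = effective_max_workers(-1, 50)   # only the master sets a limit
neither = effective_max_workers(-1, 0)        # no limit on either side
```

With both limits set, the smaller one wins, which matches the motivation in the entry: fewer allocated workers means fewer connections, since the connection count is `number of ShuffleClient` * `number of allocated Workers`.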