Commit Graph

45 Commits

Author SHA1 Message Date
Xianming Lei
1d44e5fbfe [CELEBORN-1487][PHASE1] CongestionController support control traffic by user/worker traffic speed
### What changes were proposed in this pull request?
Introduce support for controlling traffic by user/worker traffic speed.

### Why are the changes needed?
Currently, Celeborn only supports quota management based on disk file bytes/count. This quota management cannot cope with sudden increases in traffic, which can harm the cluster.
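
As a rough illustration of speed-based control, the sketch below tracks the bytes received within a sliding time window and flags congestion once the average speed exceeds a threshold. It is not the `CongestionController` implementation; the class name `TrafficSpeedWindow`, its methods, and the window and threshold values are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not Celeborn's CongestionController.
class TrafficSpeedWindow {
    private final long windowMillis;
    private final long maxBytesPerSecond;
    // Each sample is {timestampMillis, bytes}.
    private final Deque<long[]> samples = new ArrayDeque<>();
    private long bytesInWindow = 0L;

    TrafficSpeedWindow(long windowMillis, long maxBytesPerSecond) {
        this.windowMillis = windowMillis;
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    // Record bytes received at the given time.
    void record(long nowMillis, long bytes) {
        samples.addLast(new long[] {nowMillis, bytes});
        bytesInWindow += bytes;
    }

    // Drop samples that have fallen out of the window.
    private void evict(long nowMillis) {
        while (!samples.isEmpty() && nowMillis - samples.peekFirst()[0] >= windowMillis) {
            bytesInWindow -= samples.removeFirst()[1];
        }
    }

    // Average speed over the window, in bytes per second.
    long bytesPerSecond(long nowMillis) {
        evict(nowMillis);
        return bytesInWindow * 1000L / windowMillis;
    }

    // True when the average speed is above the configured limit.
    boolean isCongested(long nowMillis) {
        return bytesPerSecond(nowMillis) > maxBytesPerSecond;
    }

    public static void main(String[] args) {
        TrafficSpeedWindow user = new TrafficSpeedWindow(1000L, 1000L);
        user.record(0L, 500L);
        user.record(100L, 1000L);
        System.out.println("congested: " + user.isCongested(100L));
    }
}
```

One such window per user and one per worker would allow both kinds of limits to be checked independently, which is one possible reading of "user/worker traffic speed".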

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
UTs.

Closes #2797 from leixm/issue_1487_1.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-12 10:17:33 +08:00
SteNicholas
8bd5ac0b99 [MINOR] Add navigation for REST API document
### What changes were proposed in this pull request?

Add navigation for `REST API` document.

### Why are the changes needed?

The `REST API` document has no navigation, so it is better to add navigation that guides readers through the REST API.

<img width="1438" alt="image" src="https://github.com/user-attachments/assets/b5b3a14a-38d4-4769-bffb-3acd571d5dbb">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2775 from SteNicholas/navigate-rest-api.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-08 20:20:37 +08:00
Wang, Fei
8b7c2b3f12 [CELEBORN-1477][FOLLOWUP] Fix api v1 response issue
### What changes were proposed in this pull request?

1. Fix the following API responses:

- master GET /api/v1/masters
- master GET /api/v1/applications/top_disk_usages
- master&worker /api/v1/thread_dump

2. Fix typo in migration guide

3. Refine the API annotation: METHOD -> PATH

4. Enhance the `RestExceptionMapper`
### Why are the changes needed?

For `/api/v1/masters`, the `id` field is not well formatted.
```
{
"groupId": "c5196f6d-2c34-3ed3-8b8a-47bede733167",
"leader": {
"id": "<ByteString4960c29e size=1 contents=\"0\">",
"address": "...:9872"
},
...
}
```

For `/api/v1/applications/top_disk_usages`, it threw an NPE; we should filter out the null items.
```
24/07/18 21:52:38,506 WARN [master-JettyThreadPool-40] RestExceptionMapper: Error occurs on accessing REST API.
java.lang.NullPointerException
	at org.apache.celeborn.service.deploy.master.http.api.v1.ApplicationResource.$anonfun$topDiskUsedApplications$2(ApplicationResource.scala:78)
```
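
The null filtering mentioned above can be sketched as follows. This is a minimal illustration, not the code in `ApplicationResource.scala`; the `AppUsage` type and the `topN` helper are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical sketch of dropping null entries before ranking applications.
class TopDiskUsageSketch {
    // Hypothetical record of one application's disk usage.
    static final class AppUsage {
        final String appId;
        final long bytes;

        AppUsage(String appId, long bytes) {
            this.appId = appId;
            this.bytes = bytes;
        }
    }

    // Returns the n largest usages, skipping null items instead of failing on them.
    static List<AppUsage> topN(List<AppUsage> usages, int n) {
        return usages.stream()
            .filter(Objects::nonNull)
            .sorted(Comparator.comparingLong((AppUsage u) -> u.bytes).reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<AppUsage> usages =
            Arrays.asList(new AppUsage("app-1", 10L), null, new AppUsage("app-2", 30L));
        for (AppUsage u : topN(usages, 2)) {
            System.out.println(u.appId + " " + u.bytes);
        }
    }
}
```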

For `api/v1/thread_dump`, it seems we need to add `Produces(Array(MediaType.APPLICATION_JSON))`:
```
Caused by: javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:65)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)
	at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1116)
	at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:649)
	at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:380)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:426)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:264)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
	... 36 more
Caused by: org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyWriter not found for media type=text/html, type=class scala.collection.immutable.Map$Map1, genericType=class scala.collection.immutable.Map$Map1.
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:224)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)
	at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:85)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:61)
	... 51 more
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Integration testing.

For `api/v1/masters`:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/c0908d05-aebc-435a-8446-038dd18fb7cd">

For master `api/v1/applications/top_disk_usages`:
<img width="559" alt="image" src="https://github.com/user-attachments/assets/50860735-9975-449a-9f77-24d8eafd2018">

For `api/v1/thread_dump`:
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/9844de22-45c6-46ba-9260-c8a7d28c2e1d">

Closes #2637 from turboFei/fix_id_info.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2024-07-22 19:02:36 -07:00
Fei Wang
9a4f0465fe [CELEBORN-1477][CIP-9][FOLLOWUP] User guide for /api/v1 migration
### What changes were proposed in this pull request?

Provide the user migration guide for /api/v1.

### Why are the changes needed?

For migration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Verified the api mapping in swagger.

Closes #2618 from turboFei/cip_9_migrate.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-07-16 11:55:00 +08:00
Fei Wang
bd3f8236d0 [CELEBORN-1317][FOLLOWUP] Fix media type annotations for form urlencoded APIs
### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/celeborn/pull/2495 that fixes the media types.

### Why are the changes needed?

The media types shown in the swagger UI are not correct.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Before:
<img width="1439" alt="image" src="https://github.com/apache/celeborn/assets/6757692/f287c02b-791c-4677-93b7-ac9c5e4ee34f">
After:
<img width="1341" alt="image" src="https://github.com/apache/celeborn/assets/6757692/13e5d310-7c97-4872-9496-f9b12113b7ab">

Closes #2616 from turboFei/form_app.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-11 23:51:50 +08:00
SteNicholas
db163bd793 [CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API
### What changes were proposed in this pull request?

Improve parameters, description and document of Celeborn REST API, including:

1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.

### Why are the changes needed?

The parameters, description and document of the REST API need improvement after the HTTP server refinement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2495 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-09 17:41:13 +08:00
SteNicholas
1cd231f5e0
[CELEBORN-1412] celeborn.client.rpc.*.askTimeout should fallback to celeborn.rpc.askTimeout
### What changes were proposed in this pull request?

`celeborn.client.rpc.*.askTimeout` should fallback to `celeborn.rpc.askTimeout`.

### Why are the changes needed?

The config option series `celeborn.client.rpc.*.askTimeout` should fall back to `celeborn.rpc.askTimeout` instead of `celeborn.<module>.io.connectionTimeout`. The series includes `celeborn.client.rpc.getReducerFileGroup.askTimeout`, `celeborn.client.rpc.registerShuffle.askTimeout` and `celeborn.client.rpc.requestPartition.askTimeout`.
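
The fallback described above can be sketched with a plain map lookup. This is only an illustration of the lookup order, not how `CelebornConf` is implemented; the `resolve` helper and the default value passed to it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a config key that falls back to a shared key.
class AskTimeoutFallbackSketch {
    static final String FALLBACK_KEY = "celeborn.rpc.askTimeout";

    // Order: the specific key, then the fallback key, then the supplied default.
    static String resolve(Map<String, String> conf, String key, String defaultValue) {
        String value = conf.get(key);
        if (value != null) {
            return value;
        }
        return conf.getOrDefault(FALLBACK_KEY, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(FALLBACK_KEY, "30s");
        // The default "60s" is a placeholder for this sketch only.
        System.out.println(
            resolve(conf, "celeborn.client.rpc.registerShuffle.askTimeout", "60s"));
    }
}
```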

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2492 from SteNicholas/CELEBORN-1412.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-07 13:47:22 +08:00
xinyuwang1
7b1645ff6a [CELEBORN-1369] Support for disable fallback to Spark's default shuffle
### What changes were proposed in this pull request?
An option to disable fallback is provided.

### Why are the changes needed?
It's dangerous to fall back to external shuffle when applications run on both online and offline nodes because online services could be impacted due to a shortage of disk capacity.

### Does this PR introduce _any_ user-facing change?
Yes, fallback to Spark's default shuffle can be disabled by setting `celeborn.client.spark.shuffle.fallback.enabled=false`

### How was this patch tested?
manual test

Closes #2444 from littlexyw/fallback_disable.

Lead-authored-by: xinyuwang1 <xinyuwang1@xiaohongshu.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-05-03 14:32:28 +08:00
SteNicholas
82022a9427
[CELEBORN-1362] Remove unnecessary configuration celeborn.client.flink.inputGate.minMemory and celeborn.client.flink.resultPartition.minMemory
### What changes were proposed in this pull request?

Remove unnecessary configuration `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory`.

### Why are the changes needed?

`celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory` are configured as min memory reserved at present. Meanwhile, `celeborn.client.flink.inputGate.memory` should be at least `networkBufferSize * MIN_BUFFERS_PER_GATE` bytes, and `celeborn.client.flink.resultPartition.memory` should be at least `networkBufferSize * MIN_BUFFERS_PER_PARTITION` bytes. Therefore, `celeborn.client.flink.inputGate.minMemory` and `celeborn.client.flink.resultPartition.minMemory` are unnecessary configuration for `celeborn.client.flink.inputGate.memory` and `celeborn.client.flink.resultPartition.memory`.
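
The lower bound described above amounts to a simple product check. The sketch below assumes a hypothetical `MIN_BUFFERS_PER_GATE` value of 8 and a hypothetical `checkInputGateMemory` helper; neither the constant's value nor the helper is taken from the Celeborn source.

```java
// Hypothetical sketch of validating the input gate memory against its lower bound.
class FlinkMemoryCheckSketch {
    // Placeholder value for illustration; the real constant may differ.
    static final int MIN_BUFFERS_PER_GATE = 8;

    // Minimum memory in bytes: networkBufferSize * MIN_BUFFERS_PER_GATE.
    static long minGateMemory(long networkBufferSize) {
        return networkBufferSize * MIN_BUFFERS_PER_GATE;
    }

    // Rejects a configured memory size that is below the minimum.
    static void checkInputGateMemory(long configuredBytes, long networkBufferSize) {
        long min = minGateMemory(networkBufferSize);
        if (configuredBytes < min) {
            throw new IllegalArgumentException(
                "inputGate memory must be at least " + min + " bytes, got " + configuredBytes);
        }
    }

    public static void main(String[] args) {
        long networkBufferSize = 32L * 1024L;
        System.out.println("minimum bytes: " + minGateMemory(networkBufferSize));
        checkInputGateMemory(32L * 1024L * 1024L, networkBufferSize);
    }
}
```

With a check like this in place, a separate `minMemory` option adds no information, which matches the reasoning for removing it.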

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`PluginSideConfSuiteJ#testCoalesce`

Closes #2433 from SteNicholas/CELEBORN-1362.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-01 11:15:14 +08:00
lvshuang.xjs
9497d557e6
[CELEBORN-1345] Add a limit to the master's estimated partition size
### What changes were proposed in this pull request?
Currently, the Celeborn master calculates the estimatedPartitionSize based on the fileInfo committed by the application. This estimate is then used to allocate slots across all workers. However, this partition size may be too large or too small for Celeborn. For example, if an application commits a single file of 1TB to only one worker, using that partition size could result in all other workers having no available slots or only very small slots. To improve this, it would be better to implement a cap on the master's estimated partition size to prevent such imbalances.
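
A cap of this kind is essentially a clamp. The sketch below is a minimal illustration under assumed bounds; the `clamp` helper and the 8 MiB / 1 GiB limits are hypothetical and not taken from the patch.

```java
// Hypothetical sketch of bounding the estimated partition size.
class EstimatedPartitionSizeSketch {
    // Keeps the estimate within [minSize, maxSize].
    static long clamp(long estimate, long minSize, long maxSize) {
        if (minSize > maxSize) {
            throw new IllegalArgumentException("minSize must not exceed maxSize");
        }
        return Math.max(minSize, Math.min(maxSize, estimate));
    }

    public static void main(String[] args) {
        long min = 8L * 1024L * 1024L;       // placeholder lower bound: 8 MiB
        long max = 1024L * 1024L * 1024L;    // placeholder upper bound: 1 GiB
        long oneTiB = 1024L * max;           // the 1TB single-file case from the description
        System.out.println(clamp(oneTiB, min, max));
    }
}
```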

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2412 from RexXiong/CELEBORN-1345.

Lead-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-25 14:40:47 +08:00
sychen
91f6378682 [CELEBORN-1336] Remove client partition split pool
### What changes were proposed in this pull request?

### Why are the changes needed?
`CELEBORN-1320` uses `ReviveManager` to batch-process SOFT_SPLIT event RPCs, so `partitionSplitPool` is no longer used and the configuration item `celeborn.client.push.splitPartition.threads` is meaningless.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2396 from cxzl25/CELEBORN-1336.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-18 21:48:59 +08:00
SteNicholas
dee4afc580
[CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The `LostWorkers` metric is named differently from other `MasterSource` metrics such as `WorkerCount` and `ExcludedWorkerCount`, so it could be renamed to `LostWorkerCount` to align the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
SteNicholas
aecae8161b [CELEBORN-1239][FOLLOWUP] Deprecate celeborn.quota.configuration.path config
### What changes were proposed in this pull request?

Deprecate the `celeborn.quota.configuration.path` config. Use `celeborn.dynamicConfig.store.fs.path` instead.

### Why are the changes needed?

`DefaultQuotaManager` was removed in #2298, which makes `celeborn.quota.configuration.path` useless. `celeborn.quota.configuration.path` can therefore be deprecated in favor of `celeborn.dynamicConfig.store.fs.path` for configuring quota.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2339 from SteNicholas/CELEBORN-1239.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-27 22:58:25 +08:00
SteNicholas
b9bdea3c72
[CELEBORN-1280] Change default value of celeborn.worker.graceful.shutdown.recoverDbBackend to ROCKSDB
### What changes were proposed in this pull request?

Change the default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` from `LEVELDB` to `ROCKSDB`.

### Why are the changes needed?

Because the LevelDB support will be removed, the default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` could be changed to ROCKSDB instead of LEVELDB for preparation of LevelDB deprecation.

Backport:
 [[SPARK-45351][CORE] Change spark.shuffle.service.db.backend default value to ROCKSDB](https://github.com/apache/spark/pull/43142)
 [[SPARK-45413][CORE] Add warning for prepare drop LevelDB support](https://github.com/apache/spark/pull/43217)

### Does this PR introduce _any_ user-facing change?

The default value of `celeborn.worker.graceful.shutdown.recoverDbBackend` is changed from `LEVELDB` to `ROCKSDB`.

### How was this patch tested?

No.

Closes #2320 from SteNicholas/CELEBORN-1280.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Nicholas Jiang <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-23 14:53:24 +08:00
Angerszhuuuu
92704c7d06 [CELEBORN-1051] Add isDynamic property for CelebornConf
### What changes were proposed in this pull request?
Since we support ConfigService, many configurations can be dynamic. This PR adds an `isDynamic` property for CelebornConf.

### Why are the changes needed?
Make the configuration doc clearer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #2308 from AngersZhuuuu/CELEBORN-1051.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-20 14:20:44 +08:00
Angerszhuuuu
5c54388bc2
[CELEBORN-1252] Fix resource consumption of worker does not update when update interval is greater than heartbeat interval
### What changes were proposed in this pull request?

The resource consumption of a worker is not updated when the update interval of resource consumption is greater than the heartbeat interval.

<img width="1741" alt="Screenshot 2024-01-24 14 49 50" src="https://github.com/apache/incubator-celeborn/assets/46485123/21cfd412-c69e-4955-8bc8-155ee470697d">

This pull request introduces below changes:

1. Avoid the master repeatedly adding a gauge for the same user.
2. For the worker, user resource consumption can be read directly from the worker's snapshot, so no update interval is needed.
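
The first change above, registering a gauge only once per user, can be sketched with an atomic put-if-absent. This is an illustration only, not the master's metrics code; `UserGaugeRegistrySketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch of a registry that adds at most one gauge per user.
class UserGaugeRegistrySketch {
    private final ConcurrentHashMap<String, LongSupplier> gauges = new ConcurrentHashMap<>();

    // Returns true only when the gauge was newly registered for this user.
    boolean registerIfAbsent(String user, LongSupplier gauge) {
        return gauges.putIfAbsent(user, gauge) == null;
    }

    // Reads the current value of the user's gauge.
    long read(String user) {
        return gauges.get(user).getAsLong();
    }

    int size() {
        return gauges.size();
    }

    public static void main(String[] args) {
        UserGaugeRegistrySketch registry = new UserGaugeRegistrySketch();
        registry.registerIfAbsent("user-a", () -> 42L);
        // A second registration for the same user is ignored.
        registry.registerIfAbsent("user-a", () -> 7L);
        System.out.println(registry.size() + " gauge(s), value " + registry.read("user-a"));
    }
}
```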

### Why are the changes needed?

No.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2260 from AngersZhuuuu/CELEBORN-1252.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-01-25 20:28:19 +08:00
xianminglei
b90fb1fdb2 [CELEBORN-1237][METRICS] Refactor metrics name
### What changes were proposed in this pull request?
Refactor metrics name.

### Why are the changes needed?
Easier to understand the meaning of metrics

### Does this PR introduce _any_ user-facing change?
METRICS.md
migration.md
monitoring.md

### How was this patch tested?
Existing UTs.

Closes #2240 from leixm/metrics_name.

Authored-by: xianminglei <xianming.lei@shopee.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-18 18:15:43 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
mingji
5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1. Support `celeborn.storage.activeTypes` in the client.
2. The master will ignore slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable client application to select storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00
mingji
69defcad7f [CELEBORN-1021] Celeborn support arbitary Ratis configs and client rpc timeout
### What changes were proposed in this pull request?
1. To support arbitrary Ratis configs
2. To support Ratis client rpc timeout

### Why are the changes needed?
After some digging, I found that Celeborn never changed the default config of the Ratis client's timeout.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1969 from FMX/CELEBORN-1021.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-18 10:26:11 +08:00
sychen
a8ac18f2e8 [CELEBORN-299] Deprecate celeborn.worker.storage.baseDir.prefix and celeborn.worker.storage.baseDir.number
### What changes were proposed in this pull request?

<img width="1460" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/ac3b29be-7c39-4c18-b71d-0e243797273e">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.prefix' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.number' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
```
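
The warnings above follow a simple pattern: look up each configured key in a table of deprecated keys and report its replacement. The sketch below reproduces that pattern with the two keys and the message text from the log; the class and method names are hypothetical and this is not the `CelebornConf` implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reporting deprecated configuration keys.
class DeprecatedConfigSketch {
    // Deprecated key -> replacement key, as named in the warnings above.
    static final Map<String, String> DEPRECATED = new LinkedHashMap<>();

    static {
        DEPRECATED.put("celeborn.worker.storage.baseDir.prefix", "celeborn.worker.storage.dirs");
        DEPRECATED.put("celeborn.worker.storage.baseDir.number", "celeborn.worker.storage.dirs");
    }

    // Builds one warning message per deprecated key found in the configuration.
    static List<String> warnings(Map<String, String> conf) {
        List<String> result = new ArrayList<>();
        for (String key : conf.keySet()) {
            String replacement = DEPRECATED.get(key);
            if (replacement != null) {
                result.add(String.format(
                    "The configuration key '%s' has been deprecated in v0.4.0 and may be "
                        + "removed in the future. Please use %s",
                    key, replacement));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("celeborn.worker.storage.baseDir.prefix", "/mnt/disk");
        warnings(conf).forEach(System.out::println);
    }
}
```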

Closes #1993 from cxzl25/CELEBORN-299.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 19:10:13 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```
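
For users migrating existing settings, the rename above can be sketched as a key-mapping step. The helper below is hypothetical and not part of Celeborn; it assumes that an explicitly set new key should win over a value carried over from an old key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of rewriting the old Prometheus keys to the new HTTP keys.
class HttpConfigRenameSketch {
    static final Map<String, String> RENAMED = new LinkedHashMap<>();

    static {
        RENAMED.put("celeborn.metrics.master.prometheus.host", "celeborn.master.http.host");
        RENAMED.put("celeborn.metrics.master.prometheus.port", "celeborn.master.http.port");
        RENAMED.put("celeborn.metrics.worker.prometheus.host", "celeborn.worker.http.host");
        RENAMED.put("celeborn.metrics.worker.prometheus.port", "celeborn.worker.http.port");
    }

    static Map<String, String> migrate(Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>();
        // Pass 1: copy every key that is not one of the old names.
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (!RENAMED.containsKey(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        // Pass 2: carry old values over unless the new key is already set.
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String newKey = RENAMED.get(e.getKey());
            if (newKey != null) {
                out.putIfAbsent(newKey, e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("celeborn.metrics.master.prometheus.port", "9098");
        System.out.println(migrate(conf));
    }
}
```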

### Why are the changes needed?
The ports bound by `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` serve not only Prometheus metrics but also some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
Cheng Pan
84ef527181
[CELEBORN-1007][FOLLOWUP][DOCS] Update Migration Guide
### What changes were proposed in this pull request?

Mention metrics name change in Migration Guide

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1939

### Does this PR introduce _any_ user-facing change?

Yes, docs updated.

### How was this patch tested?

Review.

Closes #1950 from pan3793/CELEBORN-1007-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 21:08:11 +08:00
Cheng Pan
e4a60d15e4
[CELEBORN-909][FOLLOWUP][DOCS] Restore titles in migration guide
### What changes were proposed in this pull request?

Restore titles in migration guide

### Why are the changes needed?

Make title in migration guide consistent.

### Does this PR introduce _any_ user-facing change?

Yes, docs changed.

### How was this patch tested?

Pass GA.

Closes #1949 from pan3793/CELEBORN-909-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:04:53 +08:00
Cheng Pan
ab68a4ae1b
[MINOR] Fix configuration version
### What changes were proposed in this pull request?

Change the `.version("0.3.2")` to `.version("0.3.1")`

### Why are the changes needed?

0.3.1 is not released yet.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1948 from pan3793/minor-version.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:58:06 +08:00
sychen
42f08ca21a [CELEBORN-985] Change default value of numConnectionsPerPeer to 1
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1943 from cxzl25/CELEBORN-985.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-27 22:50:23 +08:00
jiaoqingbo
107f3df8ba [CELEBORN-979] Reduce default disk Check Interval
### What changes were proposed in this pull request?

Reduce default disk Check Interval

### Why are the changes needed?

Since https://github.com/apache/incubator-celeborn/pull/1909 added check logic for the DiskInfo status to the `PushDataHandler#checkDiskFull` method, the default disk check interval should be reduced.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #1915 from jiaoqingbo/979.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 14:54:22 +08:00
zwangsheng
80948e89ae [CELEBORN-909][DOC] Mention celeborn.worker.directMemoryRatioToResume default value changed in main/0.4
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
After #1829, the default value of `celeborn.worker.directMemoryRatioToResume` changed from `0.5` to `0.7`.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
No

Closes #1836 from zwangsheng/CELEBORN-909.

Lead-authored-by: zwangsheng <2213335496@qq.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-24 21:08:38 +08:00
Fu Chen
516bdc7e08
[CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
Angerszhuuuu
5cb73ed3b4 [CELEBORN-851] Mention Celeborn 0.4 server requires 0.3 or above clients
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1770 from AngersZhuuuu/CELEBORN-851.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 18:07:44 +08:00
Angerszhuuuu
0db2150731 [CELEBORN-808] Remove unnecessary RssShuffleManager in 0.4.0
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1731 from AngersZhuuuu/CELEBORN-808.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-27 17:47:44 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Cheng Pan
26aaba14d4 [CELEBORN-637][FOLLOWUP] Mention configurations change in migration guide
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

mention configuration behavior change in migration guide

### Does this PR introduce _any_ user-facing change?

Yes, the migration guide is updated

### How was this patch tested?

review

Closes #1673 from pan3793/CELEBORN-637-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 14:26:43 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This PR is an integral component of #1639. It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

To distinguish data replication from the existing master/worker terminology, refactor it to 'primary/replica' for improved clarity and inclusivity in the codebase.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
zhongqiang.czq
374d735ae5
[CELEBORN-724] Fix the compatibility of HeartbeatFromApplicationResponse with lower versions

### What changes were proposed in this pull request?
The master side checks the `reply` field of HeartbeatFromApplication. If `reply` is true, it replies with HeartbeatFromApplicationResponse; otherwise it replies with OneWayMessageResponse.

The `reply` field defaults to false for versions `0.2.1` and earlier, so the master remains compatible with older client versions.
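
The check described above can be sketched as a single branch on the flag. The enum and the `chooseResponse` helper below are hypothetical stand-ins for the real message classes and do not reproduce the master's RPC handling.

```java
// Hypothetical sketch of choosing the response type from the reply flag.
class HeartbeatReplySketch {
    enum ResponseType {
        HEARTBEAT_FROM_APPLICATION_RESPONSE,
        ONE_WAY_MESSAGE_RESPONSE
    }

    // Older clients never set the flag, so they keep receiving the one-way response.
    static ResponseType chooseResponse(boolean reply) {
        return reply
            ? ResponseType.HEARTBEAT_FROM_APPLICATION_RESPONSE
            : ResponseType.ONE_WAY_MESSAGE_RESPONSE;
    }

    public static void main(String[] args) {
        System.out.println("old client: " + chooseResponse(false));
        System.out.println("new client: " + chooseResponse(true));
    }
}
```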

### Why are the changes needed?
In version `0.2.1` and earlier, the response to HeartbeatFromApplication is `OneWayMessageResponse`, but from `0.3.0` on, it is changed to `HeartbeatFromApplicationResponse`.
If the client side is at version `0.2.1` and the server side is at version `0.3.0`, a compatibility issue occurs.
The following compatibility errors are printed.

``` java
java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType
	at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362]
	at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```
``` java
Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$
	at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362]
	at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?]
	at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
The PR was tested manually as follows:
1. The server side is deployed from the latest branch-0.3 code.
2. The Spark client is deployed at version 0.2.1, then spark-sql runs 3 TPC-DS queries (query1.sql, query2.sql, and query3.sql, with a data size of 1T). Finally, verify that the queries succeed and that the compatibility errors above are not printed.
3. The Spark client is deployed at version 0.3.0, then spark-sql runs the same 3 TPC-DS queries (query1.sql, query2.sql, and query3.sql, with a data size of 1T). Finally, verify that the queries succeed and that the compatibility errors above are not printed.

This patch had conflicts when merged, resolved by
Committer: Cheng Pan <chengpan@apache.org>

Closes #1635 from zhongqiangczq/heartbeat2.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 16:04:18 +08:00
Angerszhuuuu
a2b215bd47 [CELEBORN-718] Support override Hadoop Conf by Celeborn Conf with celeborn.hadoop. prefix
### What changes were proposed in this pull request?
The Hadoop configuration generated by Celeborn should respect the Celeborn conf.

### Why are the changes needed?

On the Spark client side, write the key as `spark.celeborn.hadoop.xxx.xx`.
On the server side, write the key as `celeborn.hadoop.xxx.xxx`.
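
As a rough sketch of the server-side behavior, the snippet below collects every `celeborn.hadoop.`-prefixed entry and strips the prefix to obtain the Hadoop key. The class `HadoopConfOverrideSketch` and its method are hypothetical, and the assumption that the remainder of the key is passed to Hadoop unchanged is mine rather than something stated in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not the actual Celeborn implementation.
final class HadoopConfOverrideSketch {
    static final String PREFIX = "celeborn.hadoop.";

    /** Returns the Hadoop overrides found in a Celeborn conf, keyed without the prefix. */
    static Map<String, String> hadoopOverrides(Map<String, String> celebornConf) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : celebornConf.entrySet()) {
            String key = entry.getKey();
            // Skip unrelated keys and a bare prefix with nothing after it.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                overrides.put(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("celeborn.hadoop.fs.defaultFS", "hdfs://example:8020"); // example value only
        conf.put("celeborn.master.port", "9097");
        System.out.println(hadoopOverrides(conf));
    }
}
```

Under this assumption, `celeborn.hadoop.fs.defaultFS` would reach Hadoop as `fs.defaultFS`, while unrelated Celeborn keys are ignored.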

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1629 from AngersZhuuuu/CELEBORN-719.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-27 17:00:47 +08:00
Cheng Pan
2b82194ce0 [CELEBORN-715] Change master URL schema from rss to celeborn
### What changes were proposed in this pull request?

Change Celeborn Master URL from `rss://<host>:<port>` to `celeborn://<host>:<port>`
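
For illustration, a small helper that rewrites an old-style URL to the new form might look like this. `MasterUrlSketch` and `migrate` are hypothetical names and are not part of Celeborn; the sketch only shows the prefix change and that the result parses with `java.net.URI`.

```java
import java.net.URI;

// Hypothetical migration helper; not part of Celeborn.
final class MasterUrlSketch {
    static final String OLD_PREFIX = "rss://";
    static final String NEW_PREFIX = "celeborn://";

    /** Rewrites rss://host:port to celeborn://host:port and leaves other URLs untouched. */
    static String migrate(String url) {
        return url.startsWith(OLD_PREFIX)
            ? NEW_PREFIX + url.substring(OLD_PREFIX.length())
            : url;
    }

    public static void main(String[] args) {
        URI uri = URI.create(migrate("rss://clb-1:9097"));
        System.out.println(uri.getScheme() + " " + uri.getHost() + " " + uri.getPort());
    }
}
```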

### Why are the changes needed?

Respect the project name.

### Does this PR introduce _any_ user-facing change?

Yes, migration guide is updated accordingly.

### How was this patch tested?

Pass GA.

Closes #1624 from pan3793/CELEBORN-715.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-26 22:30:20 +08:00
Cheng Pan
ac84d64d51 [CELEBORN-707][MASTER] Remove env CELEBORN_MASTER_HOST and CELEBORN_MASTER_PORT
### What changes were proposed in this pull request?

Remove environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT`, and make `CELEBORN_LOCAL_HOSTNAME` take effect on both master and worker.

### Why are the changes needed?

There are many different ways to configure the master/worker host and port, which makes things complex and inconsistent.

After this change,

#### master

1. CLI args `--host` and `--port` take the highest priority
2. then look up env `CELEBORN_LOCAL_HOSTNAME`
3. the next step depends on whether HA is enabled
  3.1. when HA is disabled, look up configurations `celeborn.master.host` and `celeborn.master.port`
  3.2. when HA is enabled, each node needs to know the whole cluster info:
     ```
     celeborn.master.ha.node.1.host clb-1
     celeborn.master.ha.node.1.port 9097
     celeborn.master.ha.node.2.host clb-2
     celeborn.master.ha.node.2.port 9097
     celeborn.master.ha.node.3.host clb-3
     celeborn.master.ha.node.3.port 9097
     ```
     In addition, `celeborn.master.ha.node.id=1` can be used to indicate the node id; otherwise, the master will try to bind each host to match the node id.

#### worker

1. CLI args `--host` and `--port` take the highest priority
2. then look up env `CELEBORN_LOCAL_HOSTNAME`

Things are simpler than in the master case because a worker is not required to know about the others.
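
The host lookup order above, for the case where HA is disabled, can be sketched as below. `HostResolutionSketch` and its methods are hypothetical and only model the priority order. The HA branch is left out, and what happens when a worker finds neither a CLI arg nor the env variable is not described here, so the sketch returns an empty result in that case.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the lookup order; not the actual Celeborn code.
final class HostResolutionSketch {
    static final String ENV_LOCAL_HOSTNAME = "CELEBORN_LOCAL_HOSTNAME";
    static final String CONF_MASTER_HOST = "celeborn.master.host";

    /** Master with HA disabled: CLI arg, then env, then the configuration entry. */
    static Optional<String> resolveMasterHost(
            String cliHost, Map<String, String> env, Map<String, String> conf) {
        Optional<String> fromCliOrEnv = resolveWorkerHost(cliHost, env);
        if (fromCliOrEnv.isPresent()) {
            return fromCliOrEnv;
        }
        return Optional.ofNullable(conf.get(CONF_MASTER_HOST));
    }

    /** Worker: CLI arg, then env. Any further fallback is outside this sketch. */
    static Optional<String> resolveWorkerHost(String cliHost, Map<String, String> env) {
        if (cliHost != null) {
            return Optional.of(cliHost);
        }
        return Optional.ofNullable(env.get(ENV_LOCAL_HOSTNAME));
    }

    public static void main(String[] args) {
        // Reads the real process environment; passes no CLI host and an empty conf.
        System.out.println(resolveMasterHost(null, System.getenv(), java.util.Collections.emptyMap()));
    }
}
```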

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

UT.

Closes #1616 from pan3793/CELEBORN-707.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-25 16:00:59 +08:00
zky.zhoukeyong
5f4f6d953f [CELEBORN-702][DOC] Extend doc about migration from 0.2.1 to 0.3.0
### What changes were proposed in this pull request?
Extend doc about migration from 0.2.1 to 0.3.0. Added the following contents:

<img width="1084" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/7a9d172c-09ba-48b6-9f5c-73a8b13d035f">

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.

Closes #1612 from waitinfuture/702.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-20 20:45:58 +08:00
Cheng Pan
e22379c3ab [CELEBORN-638] Migrate configurations celeborn.ha.master.* to celeborn.master.ha.*
### What changes were proposed in this pull request?

It was discussed during the last meeting, but abandoned due to the complication.

### Why are the changes needed?

Make the configuration unified.

### Does this PR introduce _any_ user-facing change?

Yes, but the legacy configurations still take effect.
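
One way to picture the fallback is the lookup below. It is a sketch under two assumptions that this PR description does not confirm: the new key wins when both are set, and each legacy key maps to its new key by a plain prefix swap. `LegacyKeySketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical fallback lookup; precedence and key mapping are assumptions.
final class LegacyKeySketch {
    static final String NEW_PREFIX = "celeborn.master.ha.";
    static final String LEGACY_PREFIX = "celeborn.ha.master.";

    /** Looks up the new key first, then the legacy spelling; returns null if neither is set. */
    static String get(Map<String, String> conf, String newKey) {
        String value = conf.get(newKey);
        if (value != null || !newKey.startsWith(NEW_PREFIX)) {
            return value;
        }
        return conf.get(LEGACY_PREFIX + newKey.substring(NEW_PREFIX.length()));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("celeborn.ha.master.node.id", "1"); // legacy spelling only
        System.out.println(get(conf, "celeborn.master.ha.node.id"));
    }
}
```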

### How was this patch tested?

New UTs.

Closes #1549 from pan3793/CELEBORN-638.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-16 18:18:26 +08:00
Angerszhuuuu
1ba6dee324 [CELEBORN-680][DOC] Refresh celeborn configurations in doc
### What changes were proposed in this pull request?
Refresh celeborn configurations in doc

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1592 from AngersZhuuuu/CELEBORN-680.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:59:38 +08:00
Angerszhuuuu
791d72d45f [CELEBORN-590] Remove hadoop prefix of WORKER_WORKING_DIR (#1494)
2023-05-17 17:57:27 +08:00
Cheng Pan
fb7b311c89 [CELEBORN-499] Move version specific resource to main repo (#1429)
* [CELEBORN-499] Move version specific resource to main repo

* license
2023-04-14 16:20:51 +08:00