60 Commits

Author SHA1 Message Date
sychen
185890381b [CELEBORN-2135] Rename Blaze to Auron
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
<img width="1370" height="1100" alt="image" src="https://github.com/user-attachments/assets/dce7f5b4-a166-4547-bc08-4a8162f129d7" />

Closes #3457 from cxzl25/CELEBORN-2135.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-29 18:55:45 +08:00
sychen
3c9fe9897c [MINOR] Fix doc about PushMergedData split
### What changes were proposed in this pull request?

### Why are the changes needed?

[CELEBORN-1721][CIP-12] Support HARD_SPLIT in PushMergedData

https://issues.apache.org/jira/browse/CELEBORN-1721

<img width="775" height="149" alt="image" src="https://github.com/user-attachments/assets/deb7a741-5d72-403c-8405-77f837c25f59" />

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="675" height="108" alt="image" src="https://github.com/user-attachments/assets/b33bead1-6f26-42d7-8ef3-7fd6df3b334e" />

Closes #3442 from cxzl25/doc_PushMergedData_split.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-22 20:50:05 +08:00
SteNicholas
75446a05d3 [CELEBORN-2093] Support Flink 2.1
### What changes were proposed in this pull request?

Support Flink 2.1.

### Why are the changes needed?

Flink 2.1 has already been released; see [Release notes - Flink 2.1](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3404 from SteNicholas/CELEBORN-2093.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-04 14:12:55 +08:00
SteNicholas
cfb4438ade [CELEBORN-2057] Bump ap-loader version from 3.0-9 to 4.0-10
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-9 to 4.0-10.

### Why are the changes needed?

`ap-loader` has already released v4.0-10; see the release note [Loader for 4.0 (v10): Heatmaps and Native memory profiling](https://github.com/jvm-profiling-tools/ap-loader/releases/tag/4.0-10). The version used by `JVMProfiler` should be bumped from 3.0-9 to 4.0-10.

Backport https://github.com/apache/spark/pull/51257.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #3359 from SteNicholas/CELEBORN-2057.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-10 16:18:28 +08:00
Fei Wang
b44730771d [CELEBORN-1413][FOLLOWUP] Bump spark 4.0 version to 4.0.0
### What changes were proposed in this pull request?
Bump spark 4.0 version to 4.0.0.

### Why are the changes needed?
Spark 4.0.0 is ready.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #3282 from turboFei/spark_4.0.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-05-28 17:56:08 +08:00
sychen
14d721212c [MINOR][DOC] Correct configuration values in slotsallocation
### What changes were proposed in this pull request?

### Why are the changes needed?
The default value of `celeborn.master.slot.assign.loadAware.fetchTimeWeight` is 1, but the slotsallocation document lists it as 0.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #3287 from cxzl25/minor_doc_slot.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-26 23:45:34 -07:00
Fei Wang
637c42338e [CELEBORN-2010][FOLLOWUP] Fix svn staging dir
### What changes were proposed in this pull request?

Use `tmp` subfolder for svn staging dir.

### Why are the changes needed?
Refer:
81c3d91f75/build/release/release.sh (L67)
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Local.

Closes #3278 from turboFei/release_guide_follow.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-25 00:29:23 -07:00
Fei Wang
81c3d91f75 [CELEBORN-2010][INFRA] Add release guide
### What changes were proposed in this pull request?

Add release guide and fix several issues during 0.6.0 release.

### Why are the changes needed?
Add docs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested locally.

Closes #3271 from turboFei/release_guide.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Fei Wang <fwang12@ebay.com>
2025-05-24 17:55:10 -07:00
SteNicholas
2b8f3520f9 [CELEBORN-1925] Support Flink 2.0
### What changes were proposed in this pull request?

Support Flink 2.0. The major changes of Flink 2.0 include:

- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.

### Why are the changes needed?

Flink 2.0 has been released; see [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3179 from SteNicholas/CELEBORN-1925.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-04-07 15:23:20 +08:00
KenGeng
2097fcdfea [CELEBORN-1870] Fix typos in 'Developer' documents
### What changes were proposed in this pull request?
Fix typo in 'Developer' documents.

### Why are the changes needed?
Improve the accuracy of the doc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only doc changed. No test.

Closes #3108 from bgeng777/CELEBORN-1870.

Authored-by: KenGeng <samuelgeng7@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-02-19 14:31:14 +08:00
SteNicholas
30e46eee28 [CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
### What changes were proposed in this pull request?

Bump ap-loader version from 3.0-8 to 3.0-9.

### Why are the changes needed?

ap-loader has already released v3.0-9, so the version used by `JVMProfiler` should be bumped from 3.0-8.

Backport:

1. https://github.com/apache/spark/pull/46402
2. https://github.com/apache/spark/pull/49440

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3072 from SteNicholas/CELEBORN-1842.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-01-21 12:22:00 +08:00
codenohup
a57238024e [CELEBORN-1801] Remove outdated Flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove outdated Flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
xiyu.zk
4b60dae0f0 [CELEBORN-1789][DOC] Document on Java Columnar Shuffle
### What changes were proposed in this pull request?
Introduction to Celeborn's Java Columnar Shuffle

### Why are the changes needed?
Introduction to Celeborn's Java Columnar Shuffle

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

Closes #3010 from kerwin-zk/CELEBORN-1789.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-24 11:40:18 +08:00
Wang, Fei
59163c2a23 [CELEBORN-1745] Remove application top disk usage code
### What changes were proposed in this pull request?
Remove the code for application top disk usage on both the master and worker sides.

Prefer the Prometheus expression below to find the applications with the highest disk usage.
```
topk(50, sum by (applicationId) (metrics_diskBytesWritten_Value{role="worker", applicationId!=""}))
```
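
As a rough illustration of what the expression above computes, the following Python sketch sums the bytes written per application and keeps the largest totals. The sample data and the function name `top_apps_by_disk_usage` are hypothetical and not part of Celeborn; in practice Prometheus evaluates the query itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_apps_by_disk_usage(
    samples: Iterable[Tuple[str, str, float]], k: int = 50
) -> List[Tuple[str, float]]:
    """Mimic topk(k, sum by (applicationId) (...)) over (worker, app_id, bytes) samples."""
    totals: Dict[str, float] = defaultdict(float)
    for _worker, app_id, bytes_written in samples:
        if app_id == "":  # mirrors the applicationId!="" matcher
            continue
        totals[app_id] += bytes_written
    # Sort by total descending; ties are broken by application id for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical samples: (worker, applicationId, diskBytesWritten)
samples = [
    ("worker-1", "app-a", 100.0),
    ("worker-2", "app-a", 50.0),
    ("worker-1", "app-b", 120.0),
    ("worker-2", "", 999.0),
]
print(top_apps_by_disk_usage(samples, k=2))
```

The per-worker series are collapsed into one total per application before ranking, which is why the removed master/worker code is not required to answer this question.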

### Why are the changes needed?
To address comments: https://github.com/apache/celeborn/pull/2947#issuecomment-2499564978

> Due to the application dimension resource consumption, this feature should be included in the deprecated features. Maybe you can remove the codes for application top disk usage.

### Does this PR introduce _any_ user-facing change?

Yes, the app top disk usage API is removed.

### How was this patch tested?
GA.

Closes #2949 from turboFei/remove_app_top_usage.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 10:55:34 +08:00
SteNicholas
9083dd401c [CELEBORN-1504][FOLLOWUP] Document adds Flink 1.16 support
### What changes were proposed in this pull request?

1. Document adds Flink 1.16 support including `README.md`, `deploy.md`.
2. Update description of `celeborn.client.shuffle.compression.codec` to change the supported Flink version for ZSTD.

### Why are the changes needed?

#2619 added support for Flink 1.16, so the documentation should be updated accordingly. Meanwhile, zstd is supported for the Flink shuffle client since Flink 1.16.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2904 from SteNicholas/CELEBORN-1504.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-13 21:47:29 +08:00
SteNicholas
101c755e0b [CELEBORN-1635] Introduce Blaze support document
### What changes were proposed in this pull request?

Introduce Blaze support document.

### Why are the changes needed?

[Blaze](https://github.com/kwai/blaze) supports Celeborn as a remote shuffle service. It's recommended to add a Blaze support document that introduces Blaze usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2787 from SteNicholas/CELEBORN-1635.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-10-09 11:48:05 +08:00
SteNicholas
5c32ba790e [CELEBORN-1486][FOLLOWUP] Update document link of Get Started With Velox and Get Started With ClickHouse in glutensupport.md
### What changes were proposed in this pull request?

Update document link of `Get Started With Velox` and `Get Started With ClickHouse` in `glutensupport.md`. Meanwhile, replace `gluten-celeborn-package-xx-SNAPSHOT.jar` with `(The bundled Gluten Jar. Make sure -Pceleborn is specified when it is built.)`, which refers to https://github.com/apache/incubator-gluten/pull/6692.

### Why are the changes needed?

The document links of `Get Started With Velox` and `Get Started With ClickHouse` are no longer accessible because their URLs have changed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2762 from SteNicholas/CELEBORN-1486.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-26 11:18:40 +08:00
Sanskar Modi
b3f86f9acc [CELEBORN-1297][FOLLOWUP] Fix DB config service SQL file
### What changes were proposed in this pull request?

Fix the unique key to reflect the correct column names.

### Why are the changes needed?

Running the current DB scripts gives the error below because the `user` column was renamed to `name` (https://github.com/apache/celeborn/pull/2340) but the unique key was not updated accordingly.

```
mysql> CREATE TABLE IF NOT EXISTS celeborn_cluster_tenant_config
    -> (
    ->     id           int          NOT NULL AUTO_INCREMENT,
    ->     cluster_id   int          NOT NULL,
    ->     tenant_id    varchar(255) NOT NULL,
    ->     level        varchar(255) NOT NULL COMMENT 'config level, valid level is TENANT,USER',
    ->     name         varchar(255) DEFAULT NULL COMMENT 'tenant sub user',
    ->     config_key   varchar(255) NOT NULL,
    ->     config_value varchar(255) NOT NULL,
    ->     type         varchar(255) DEFAULT NULL COMMENT 'conf categories, such as quota',
    ->     gmt_create   timestamp NOT NULL,
    ->     gmt_modify   timestamp NOT NULL,
    ->     PRIMARY KEY (id),
    ->     UNIQUE KEY `index_unique_tenant_config_key` (`cluster_id`, `tenant_id`, `user`, `config_key`)
    -> );
ERROR 1072 (42000): Key column 'user' doesn't exist in table
```
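
The failure mode above (a key that references a column the table no longer has) can be caught before the script reaches the database. The following Python sketch is only an illustration under that assumption: the helper `missing_key_columns` is hypothetical, and the column list is copied by hand from the `CREATE TABLE` statement rather than parsed from the SQL file.

```python
from typing import List, Sequence


def missing_key_columns(
    table_columns: Sequence[str], key_columns: Sequence[str]
) -> List[str]:
    """Return the key columns that are not defined on the table."""
    known = set(table_columns)
    return [col for col in key_columns if col not in known]


# Columns of celeborn_cluster_tenant_config, copied from the statement above.
columns = [
    "id", "cluster_id", "tenant_id", "level", "name",
    "config_key", "config_value", "type", "gmt_create", "gmt_modify",
]

# The unique key as written before the fix still references `user`.
print(missing_key_columns(columns, ["cluster_id", "tenant_id", "user", "config_key"]))
# The unique key after the fix references `name`.
print(missing_key_columns(columns, ["cluster_id", "tenant_id", "name", "config_key"]))
```

A non-empty result corresponds to the column that MySQL reports in `ERROR 1072`.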

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

Tested in local DB
```
mysql> CREATE TABLE IF NOT EXISTS celeborn_cluster_tenant_config
    -> (
    ->     id           int          NOT NULL AUTO_INCREMENT,
    ->     cluster_id   int          NOT NULL,
    ->     tenant_id    varchar(255) NOT NULL,
    ->     level        varchar(255) NOT NULL COMMENT 'config level, valid level is TENANT,USER',
    ->     name         varchar(255) DEFAULT NULL COMMENT 'tenant sub user',
    ->     config_key   varchar(255) NOT NULL,
    ->     config_value varchar(255) NOT NULL,
    ->     type         varchar(255) DEFAULT NULL COMMENT 'conf categories, such as quota',
    ->     gmt_create   timestamp NOT NULL,
    ->     gmt_modify   timestamp NOT NULL,
    ->     PRIMARY KEY (id),
    ->     UNIQUE KEY `index_unique_tenant_config_key` (`cluster_id`, `tenant_id`, `name`, `config_key`)
    -> );
Query OK, 0 rows affected (0.01 sec)
```

Closes #2740 from s0nskar/fix-db-script.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-18 07:46:13 +08:00
Sanskar Modi
a0b04d0036 [CELEBORN-1550] Add support of providing custom dynamic store backend implementation
### What changes were proposed in this pull request?

Add support for providing a custom dynamic config store backend implementation; users can now pass their own implementation for the dynamic config store backend.

This change also keeps backward compatibility by supporting short names for backends such as "FS" and "DB".
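
One way to picture the combination of short names and custom implementations is a lookup that falls back to the configured value itself. The Python sketch below is a simplified illustration, not Celeborn's actual code: the class names in `SHORT_NAMES` and the function `resolve_backend` are hypothetical, and exact-match lookup of the short name is an assumption.

```python
from typing import Dict

# Hypothetical mapping from short names to built-in implementation class names.
SHORT_NAMES: Dict[str, str] = {
    "FS": "example.config.FsConfigStore",
    "DB": "example.config.DbConfigStore",
}


def resolve_backend(value: str) -> str:
    """Return the class name to load for a configured backend value."""
    name = value.strip()
    if not name:
        raise ValueError("backend must not be empty")
    # Short names stay valid; anything else is treated as a custom class name.
    return SHORT_NAMES.get(name, name)


print(resolve_backend("FS"))
print(resolve_backend("com.acme.MyConfigStore"))
```

Keeping the short names in the lookup table is what lets existing "FS" and "DB" configurations continue to work unchanged.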

### Why are the changes needed?

Currently Celeborn only supports file- and DB-based backends, while there can be other ways of managing these configs.

### Does this PR introduce _any_ user-facing change?

No, user-facing behaviour stays the same.

### How was this patch tested?

Existing UTs verify that this change works for the "FS" and "DB" implementations.

Closes #2670 from s0nskar/dynamic_config.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-08-12 15:04:43 +08:00
Weijie Guo
a759efb6dd [CELEBORN-1543] Support Flink 1.20
1.20 was the last non-bug-fix release before Flink 2.0; you can find all the main upgrade features in this [release note](https://nightlies.apache.org/flink/flink-docs-release-1.20/release-notes/flink-1.20/). I think the most important feature related to Celeborn is that we expose some interfaces to support Flink hybrid shuffle integration with Celeborn ([FLIP-459](https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn)). Supporting hybrid shuffle on the Celeborn side is also a follow-up to this PR.

Incompatible changes in 1.20:
- 1.20 uses the enum `CompressionCodec` instead of `String` to construct `BufferDecompressor` and `BufferCompressor`.
- 1.20 introduces a new method (`notifyPartitionRecoveryStarted`) to `JobShuffleContext` in a non-compatible way.

I've already done the adaptation in this PR.

Closes #2662 from reswqa/support-120.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-09 17:05:58 +08:00
Mridul Muralidharan
17f89c553e [CELEBORN-1504] Support for Apache Flink 1.16
### What changes were proposed in this pull request?

Add support for Apache Flink 1.16 in Celeborn.

### Why are the changes needed?

User requests for Apache Flink 1.16.
This implementation is a synthesis of the 1.15 and 1.17 support which already exists in Apache Celeborn.

### Does this PR introduce _any_ user-facing change?

Yes, supports Apache Flink 1.16

### How was this patch tested?

Tests for 1.16 added, which are based on 1.15 and 1.17

Closes #2619 from mridulm/flink-1.16-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-15 10:44:16 +08:00
SteNicholas
5a39fc743b [CELEBORN-1486] Introduce ClickHouse Backend in Gluten Support document
### What changes were proposed in this pull request?

Introduce `ClickHouse Backend` in `Gluten Support` document. Meanwhile, fix the profile via `-Pceleborn` to compile gluten module.

### Why are the changes needed?

Gluten with ClickHouse backend supports Celeborn as remote shuffle service at present. Gluten Support document should introduce ClickHouse Backend to guide user usage of Gluten with ClickHouse backend.

Backport https://github.com/apache/incubator-gluten/pull/6282.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2594 from SteNicholas/CELEBORN-1486.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2024-07-01 16:38:49 +08:00
Yi Chen
c20536e5c5 [CELEBORN-1425][HELM] Add helm chart unit tests to ensure manifests are rendered as expected
### What changes were proposed in this pull request?

Add helm chart unit tests.

### Why are the changes needed?

Unit tests ensure that resource manifests are rendered as expected with various configurations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Detailed information about how to run helm chart unit tests can be found here [helm-unittest/helm-unittest](https://github.com/helm-unittest/helm-unittest). First, you need to install helm unit test plugin:

```shell
helm plugin install https://github.com/helm-unittest/helm-unittest.git
```

Then, run the helm chart unit tests as follows:

```shell
$ helm unittest charts/celeborn  --file "tests/**/*_test.yaml" --strict --debug
load_plugins.go:110: [info] file (/Users/chenyi/Library/helm/plugins/helm-acr/completion.yaml) not provided by plugin. No plugin auto-completion possible

### Chart [ celeborn ] charts/celeborn

 PASS  Test Celeborn configmap  charts/celeborn/tests/configmap_test.yaml
 PASS  Test Celeborn master pod monitor charts/celeborn/tests/master/podmonitor_test.yaml
 PASS  Test Celeborn master priority class      charts/celeborn/tests/master/priorityclass_test.yaml
 PASS  Test Celeborn master service     charts/celeborn/tests/master/service_test.yaml
 PASS  Test Celeborn master statefulset charts/celeborn/tests/master/statefulset_test.yaml
 PASS  Test Celeborn worker pod monitor charts/celeborn/tests/worker/podmonitor_test.yaml
 PASS  Test Celeborn worker priority class      charts/celeborn/tests/worker/priorityclass_test.yaml
 PASS  Test Celeborn worker service     charts/celeborn/tests/worker/service_test.yaml
 PASS  Test Celeborn worker statefulset charts/celeborn/tests/worker/statefulset_test.yaml

Charts:      1 passed, 1 total
Test Suites: 9 passed, 9 total
Tests:       48 passed, 48 total
Snapshot:    0 passed, 0 total
Time:        183.011375ms

```

Closes #2511 from ChenYi015/helm-unittest.

Authored-by: Yi Chen <github@chenyicn.net>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-05-15 19:17:30 +08:00
SteNicholas
c9b878a2f5 [INFRA] Remove incubator/incubating for graduation
### What changes were proposed in this pull request?

Remove incubator/incubating for graduation including:

- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.

Fix #2415.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there are a few action items we need to complete.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2421 from SteNicholas/infra-graduation.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-27 13:54:47 +08:00
SteNicholas
73cf1562f7 [CELEBORN-1299] Introduce JVM profiling in Celeborn Worker using async-profiler
### What changes were proposed in this pull request?

Introduce JVM profiling `JVMProfiler` in Celeborn Worker using async-profiler to capture CPU and memory profiles.

### Why are the changes needed?

[async-profiler](https://github.com/async-profiler) is a sampling profiler for any JDK based on the HotSpot JVM that does not suffer from the safepoint bias problem. It has low overhead and doesn’t rely on JVMTI. It avoids the safepoint bias problem by using the `AsyncGetCallTrace` API provided by HotSpot JVM to profile the Java code paths, and Linux’s perf_events to profile the native code paths. It features HotSpot-specific APIs to collect stack traces and to track memory allocations.
The feature introduces a profiler plugin that does not add any overhead unless enabled and can be configured to accept profiler arguments as a configuration parameter. It should support turning profiling on/off and include the jar/binaries needed for profiling.

Backport [[SPARK-46094] Support Executor JVM Profiling](https://github.com/apache/spark/pull/44021).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Worker cluster test.

Closes #2409 from SteNicholas/CELEBORN-1299.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-25 14:05:50 +08:00
SteNicholas
8fbcbead48 [CELEBORN-1341][FOLLOWUP] Improve Celeborn document
### What changes were proposed in this pull request?

Improve the Celeborn documentation by fixing typos, formatting, invalid links and out-of-sync default values. Meanwhile, keep the public interfaces in `shuffleclient.md` consistent with `ShuffleClient`.

### Why are the changes needed?

The Celeborn documentation currently contains typos, formatting problems, invalid links and out-of-sync default values. Meanwhile, the public interfaces in `shuffleclient.md` are inconsistent with `ShuffleClient`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2410 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-22 16:34:25 +08:00
SteNicholas
a371f934cf [CELEBORN-1341] Improve Celeborn document
### What changes were proposed in this pull request?

Improve Celeborn document to fix typos, table formats and wrong description of document. Meanwhile, `deploy.md` adds the document of MapReduce client deployment.

### Why are the changes needed?

There are some typos and format fixes in Celeborn document at present. Meanwhile, the `deploy.md` does not contain the deployment of MapReduce client, which is inconsistent with `README.md` for Flink configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2407 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-20 15:02:05 +08:00
SteNicholas
adaa96fc60 [CELEBORN-1310][FLINK] Support Flink 1.19
### What changes were proposed in this pull request?

Support Flink 1.19.

### Why are the changes needed?

Flink 1.19.0 has been released: [Announcing the Release of Apache Flink 1.19](https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19).

The main changes include:

- `org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel` constructor parameter changes:
   - `consumedSubpartitionIndex` changes to `consumedSubpartitionIndexSet`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
   - adds `partitionRequestListenerTimeout`: [[FLINK-25055][network] Support listen and notify mechanism for partition request](https://github.com/apache/flink/pull/23565).
- `org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor removes parameters `subpartitionIndexRange`, `tieredStorageConsumerClient`, `nettyService` and `tieredStorageConsumerSpecs`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
- Change the default config file to `config.yaml` in `flink-dist`: [[FLINK-33577][dist] Change the default config file to config.yaml in flink-dist](https://github.com/apache/flink/pull/24177).
- `org.apache.flink.configuration.RestartStrategyOptions` uses `org.apache.commons.compress.utils.Sets` of `commons-compress` dependency: [[FLINK-33865][runtime] Adding an ITCase to ensure exponential-delay.attempts-before-reset-backoff works well](https://github.com/apache/flink/pull/23942).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test:

- Flink batch job submission

```
$ ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 2e9fb659991a9c29d376151783bdf6de
Program execution finished
Job with JobID 2e9fb659991a9c29d376151783bdf6de has finished.
Job Runtime: 1912 ms
```

- Flink batch job execution

![image](https://github.com/apache/incubator-celeborn/assets/10048174/18b60861-cafc-4df3-b94d-93307e728be2)

- Celeborn master log
```

24/03/18 20:52:47,513 INFO [celeborn-dispatcher-42] Master: Offer slots successfully for 1 reducers of 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 on 1 workers.
```

- Celeborn worker log
```
24/03/18 20:52:47,704 INFO [celeborn-dispatcher-1] StorageManager: created file at /Users/nicholas/Software/Celeborn/apache-celeborn-0.5.0-SNAPSHOT/shuffle/celeborn-worker/shuffle_data/1710766312631-2e9fb659991a9c29d376151783bdf6de/0/0-0-0
24/03/18 20:52:47,707 INFO [celeborn-dispatcher-1] Controller: Reserved 1 primary location and 0 replica location for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,874 INFO [celeborn-dispatcher-2] Controller: Start commitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,890 INFO [worker-rpc-async-replier] Controller: CommitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
```

Closes #2399 from SteNicholas/CELEBORN-1310.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-20 11:51:23 +08:00
ForVic
15b0f16f74 [MINOR] Fix typo in developer docs - overview
### What changes were proposed in this pull request?
To fix a typo.

### Why are the changes needed?
To maintain the quality of Celeborn documentation.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
N/A

Closes #2397 from ForVic/forvic/fix_typo.

Authored-by: ForVic <victor.lakers0@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-17 16:15:16 +08:00
SteNicholas
36f2e03138 [MINOR] Fix style and Gluten link in Developers Doc
### What changes were proposed in this pull request?

Fix style and Gluten link in Developers Doc.

### Why are the changes needed?

- `slotsallocation.md` has the following wrong style:

<img width="1434" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/97fb53ed-473d-4f3d-8231-1fb613df9132">

- Gluten is an Apache incubating project, so the link to the Gluten project should be [Gluten](https://github.com/apache/incubator-gluten).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2375 from SteNicholas/developers-doc.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 12:07:01 +08:00
SteNicholas
93d2b9f47f [CELEBORN-1298][FOLLOWUP] Support Spark2.4 with Scala2.12
### What changes were proposed in this pull request?

Support Spark2.4 with Scala2.12 in `sbt.md`. Meanwhile, the CI workflow adds the test for Spark2.4 and Scala2.12.

Follow up #2344.

### Why are the changes needed?

Spark2.4 with Scala2.12 is supported.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2345 from SteNicholas/CELEBORN-1298.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-29 21:48:51 +08:00
mingji
eed4f924b2 [CELEBORN-1295] Add tm to Celeborn's website
### What changes were proposed in this pull request?
Add trademark symbol.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2338 from FMX/B1295.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-28 14:00:15 +08:00
SteNicholas
e7a89c7e13 [CELEBORN-1286] Introduce configuration.md to document dynamic config and config service
### What changes were proposed in this pull request?

Introduce `configuration.md` to document dynamic config and config service.

### Why are the changes needed?

`DynamicConfig` and `ConfigService` have already been supported in #2100, which should be documented to introduce the feature.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2336 from SteNicholas/CELEBORN-1286.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-28 11:49:28 +08:00
sychen
b94fea8e17 [CELEBORN-1207] SBT http repository documentation
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2201 from cxzl25/CELEBORN-1207.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: cxzl25 <3898450+cxzl25@users.noreply.github.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-01-02 22:12:28 +08:00
Fu Chen
41df4ebbea [CELEBORN-1156][BUILD] SBT publish support
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

Yes, the user can publish shade clients via SBT

### How was this patch tested?

```shell
docker run -d -p 8081:8081 sonatype/nexus3
```

```shell
export SONATYPE_SNAPSHOTS_URL=http://192.168.3.46:8081/repository/maven-snapshots/
export SONATYPE_RELEASES_URL=http://192.168.3.46:8081/repository/maven-releases/
export ASF_USERNAME=admin
export ASF_PASSWORD=123456
```

- Publish the shade client for Spark 3.5:
```shell
./build/sbt -Pspark-3.4 celeborn-client-spark-3-shaded/publish
```

<img width="1673" alt="截屏2023-12-08 下午10 22 07" src="https://github.com/apache/incubator-celeborn/assets/8537877/1e87e7e2-cf3b-4bc0-8272-0f5b03ee65bf">

- Publish the shade client for Flink 1.18:

```shell
./build/sbt -Pflink-1.18 celeborn-client-flink-1_18-shaded/publish
```
<img width="1676" alt="Screenshot 2023-12-08 10.25.28 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/62d0c3c4-e105-4e8a-8d8d-e78650a2eb09">

- Publish the shade client for MapReduce:
```shell
./build/sbt -Pmr celeborn-client-mr-shaded/publish
```
<img width="1672" alt="Screenshot 2023-12-08 10.25.47 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/563d5ad5-fa6d-46fc-9465-8279ef96385a">

Closes #2129 from cfmcgrady/sbt-publish.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-15 11:22:35 +08:00
mingji
113311df3e [CELEBORN-1081][FOLLOWUP] Remove UNKNOWN_DISK and allocate all slots to disk
### What changes were proposed in this pull request?
1. Remove UNKNOWN_DISK from StorageInfo.
2. Enable load-aware slots allocation when there is HDFS.

### Why are the changes needed?
To support the application's config about available storage types.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
GA and Cluster.

Closes #2098 from FMX/B1081-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-28 11:26:00 +08:00
jiaoqingbo
39153c8c2d [MINOR] Updated sbt.md documentation to be consistent with description
### What changes were proposed in this pull request?

Add the `--release` parameter to the documented command so that it creates a Celeborn distribution like those distributed by the Celeborn Downloads page.

### Why are the changes needed?

Without the `--release` parameter, the created Celeborn distribution differs from the one on the Celeborn Downloads page and lacks the client-related packages.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #2080 from jiaoqingbo/minor-sbt.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 21:07:43 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00
sychen
e437228dc8 [CELEBORN-1104][DOC] Fix SBT documentation incorrect command
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2062 from cxzl25/CELEBORN-1104.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-01 17:00:09 +08:00
SteNicholas
f61fe17551 [CELEBORN-987][FOLLOWUP][DOC] README#Build and sbt#System Requirements should extend to Scala 2.13 and Spark 3.5
### What changes were proposed in this pull request?

`README#Build` and `sbt#System Requirements` extend to Scala 2.13.

### Why are the changes needed?

`README#Build` and `sbt#System Requirements` should extend to Scala 2.13 to align with the SBT CI test results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

SBT CI tests.

Closes #1987 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-14 09:54:22 +08:00
onebox-li
a47f6169d8 [MINOR] Fix some typos
### What changes were proposed in this pull request?
Fix some typos

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
-

Closes #1983 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:34:07 +08:00
mingji
95c9ccfc3e [CELEBORN-1010] Update docs about spark.shuffle.service.enabled
### What changes were proposed in this pull request?
Clarify how a Spark config interacts with Celeborn.

### Why are the changes needed?
After some tests, I found that Spark 3.1 and newer can work with Celeborn with `spark.shuffle.service.enabled=true`.

Since Spark 3.1, `ExternalShuffleBlockResolver` no longer checks the shuffle manager's type.
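
The version boundary described above can be sketched as a small shell check. This is only an illustration: `ess_compatible` is a hypothetical helper name, not part of Celeborn or Spark, and it assumes the version string has at least a `major.minor` form.

```shell
# Hypothetical helper (not part of Celeborn or Spark): prints "true" when the
# given Spark version is 3.1 or newer, i.e. when, per the description above,
# spark.shuffle.service.enabled=true can be combined with Celeborn.
# Assumes the argument looks like "major.minor" or "major.minor.patch".
ess_compatible() {
  major=${1%%.*}      # text before the first dot
  rest=${1#*.}        # text after the first dot
  minor=${rest%%.*}   # text between the first and second dot
  if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }; then
    echo true
  else
    echo false
  fi
}
```

For example, `ess_compatible 3.0.3` reports `false`, while `ess_compatible 3.4.1` reports `true`.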

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
I tested two scenarios about this PR.
1. Check whether Spark can release the executors in time.
2. Check data correctness by running TPC-DS.
All checks are good.

Closes #1955 from FMX/CELEBORN-1010.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-08 09:15:42 +08:00
jiaoqingbo
f1713dacaf [MINOR] Fix incorrect default resume ratio in trafficcontrol doc

### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

Since 0.3.1, Celeborn changed the default value of `celeborn.worker.directMemoryRatioToResume` from `0.5` to `0.7`.

The doc should be updated to reflect the new default.
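
As a rough sketch of pinning this value explicitly instead of relying on the version-dependent default, the helper below emits a config line for the setting. The helper name `resume_ratio_line`, the whitespace-separated `key value` layout, and the restriction of the ratio to the open interval (0, 1) are assumptions made for illustration, not behavior confirmed by this commit.

```shell
# Hypothetical helper: prints a "key value" line for the resume ratio.
# Defaults to 0.7, the default mentioned above for Celeborn 0.3.1 and later.
# The (0, 1) range check is an assumption, not taken from Celeborn itself.
resume_ratio_line() {
  ratio=${1:-0.7}
  # Accept only plain decimal numbers strictly between 0 and 1.
  if awk -v r="$ratio" 'BEGIN { exit !(r ~ /^[0-9]*\.?[0-9]+$/ && r + 0 > 0 && r + 0 < 1) }'; then
    echo "celeborn.worker.directMemoryRatioToResume $ratio"
  else
    echo "invalid ratio: $ratio" >&2
    return 1
  fi
}
```

Calling `resume_ratio_line 0.5` would restore the pre-0.3.1 default; the output could then be appended to the worker's configuration file.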

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #1931 from jiaoqingbo/ratiofix.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-21 11:18:48 +08:00
sychen
bb50618780 [CELEBORN-997][DOC] Fix Rolling upgrade broken link
### What changes were proposed in this pull request?
https://celeborn.apache.org/docs/latest/developers/overview/

> For more details, please refer to Rolling upgrade

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1927 from cxzl25/CELEBORN-997.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-09-20 16:44:42 +08:00
zhouyifan279
dc5bdfadcc [CELEBORN-923][DOC] docs/developers/overview.md has a broken link
### What changes were proposed in this pull request?
Fix a broken link in docs/developers/overview.md.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Locally tested.

Closes #1845 from zhouyifan279/upgrade-page-link.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-28 12:07:43 +08:00
Fu Chen
efc334a6aa [CELEBORN-877][FOLLOWUP][DOC] Expand 'note' blocks by default in the docs sbt.md
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1806 from cfmcgrady/sbt-docs-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-11 21:54:24 +08:00
Fu Chen
516bdc7e08 [CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
Kerwin Zhang
4fb3f31a2d [CELEBORN-870][FOLLOWUP][DOC] Document on usage together with Gluten (#1793)
2023-08-08 10:37:13 +08:00
xiyu.zk
35fe63e4a9 [CELEBORN-870][DOC] Document on usage together with Gluten
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1784 from kerwin-zk/gluten_celeborn.

Lead-authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Co-authored-by: Kerwin Zhang <xiyu.zk@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-04 11:32:13 +08:00
zky.zhoukeyong
3ee0674058 [CELEBORN-869][FOLLOWUP][DOC] Document on Integrating Celeborn
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1788 from waitinfuture/869-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-02 18:17:17 +08:00