30 Commits

Author SHA1 Message Date
codenohup
a57238024e
[CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove the outdated Flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
SteNicholas
9083dd401c [CELEBORN-1504][FOLLOWUP] Document adds Flink 1.16 support
### What changes were proposed in this pull request?

1. Document Flink 1.16 support in `README.md` and `deploy.md`.
2. Update the description of `celeborn.client.shuffle.compression.codec` to change the supported Flink version for ZSTD.

### Why are the changes needed?

#2619 added support for Flink 1.16, so the documentation should be updated to reflect it. Meanwhile, ZSTD is supported for the Flink shuffle client since Flink 1.16.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2904 from SteNicholas/CELEBORN-1504.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-13 21:47:29 +08:00
Weijie Guo
f2e9043028 [CELEBORN-1687] Highlight flink session cluster issue in doc
### What changes were proposed in this pull request?

If we use the Celeborn shuffle service, we can't submit both batch and streaming jobs to the same Flink session cluster. This should be highlighted in the doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No need.

Closes #2879 from reswqa/session-doc.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-06 10:52:34 +08:00
szt
ec67366b7a
[CELEBORN-1684] Fix ambiguous client jar expression of document
### What changes were proposed in this pull request?
When users deploy using the release binary as outlined in the documentation, the instructions for copying the client JAR can be unclear.

### Why are the changes needed?
No

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
![image](https://github.com/user-attachments/assets/a4e7c415-8f0e-44bd-8d18-18462896e27c)

Closes #2877 from zaynt4606/md.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-05 13:48:22 +08:00
Weijie Guo
41fdb8ade1
[CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc
### What changes were proposed in this pull request?

Add Flink hybrid shuffle doc

### Why are the changes needed?
We need the doc for the new hybrid shuffle mode.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

No need.

Closes #2867 from reswqa/add-hs-doc.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-01 13:37:14 +08:00
SteNicholas
bae230937d [CELEBORN-1058][FOLLOWUP] Update name of master service from MasterSys to Master in startup document
### What changes were proposed in this pull request?

Update name of master service from `MasterSys` to `Master` in startup document to follow up https://github.com/apache/celeborn/pull/2003/files#r1365454256.

### Why are the changes needed?

#2003 has already changed the names of the master and worker services, so the names in the startup logs of the document should be updated as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2772 from SteNicholas/CELEBORN-1058.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-30 09:21:50 +08:00
Bowen Liang
f226424b9a [CELEBORN-1555] Replace deprecated config celeborn.storage.activeTypes in docs and tests
### What changes were proposed in this pull request?

Replace the deprecated config `celeborn.storage.activeTypes` with `celeborn.storage.availableTypes` in docs and tests, guiding newcomers to use the new config name.

### Why are the changes needed?
The config `celeborn.storage.activeTypes` has been deprecated in 0.4.0 release.
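
As a rough illustration of this replacement, the sketch below renames the deprecated key in the text of a configuration file. The helper name `replace_deprecated_keys`, the whitespace-separated `key value` layout, and the sample value are assumptions for illustration; only the two config names come from this commit.

```python
# Hypothetical helper, not part of Celeborn: rename deprecated keys in conf text.
DEPRECATED_KEYS = {
    "celeborn.storage.activeTypes": "celeborn.storage.availableTypes",
}


def replace_deprecated_keys(conf_text):
    """Return conf_text with deprecated keys renamed; other lines are unchanged."""
    out = []
    for line in conf_text.splitlines():
        # Assumption: each setting is written as "<key> <value>" on one line.
        parts = line.split(None, 1)
        if parts and parts[0] in DEPRECATED_KEYS:
            value = parts[1] if len(parts) > 1 else ""
            line = (DEPRECATED_KEYS[parts[0]] + " " + value).rstrip()
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(replace_deprecated_keys("celeborn.storage.activeTypes HDFS"))
```

Comment lines and unrelated keys pass through untouched, because only an exact match on the first token triggers a rename.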

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No feature changed.

Closes #2675 from bowenliang123/avai-types.

Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-26 14:36:01 +08:00
zhaohehuhu
d14afcddfe [CELEBORN-1566] Update docs about using HDFS
### What changes were proposed in this pull request?
Update the client deployment doc to include the parameter `spark.celeborn.storage.activeTypes`.
### Why are the changes needed?

Just provide a hint for users, otherwise they may miss this param.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Yes

Closes #2683 from zhaohehuhu/dev-0815.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-26 08:22:34 +08:00
SteNicholas
627ee8c6ef
[CELEBORN-1402][FOLLOWUP] Correct document of setting spark.executor.userClassPathFirst to false
### What changes were proposed in this pull request?

Correct document of setting `spark.executor.userClassPathFirst` to false.

### Why are the changes needed?

The document sets `spark.executor.userClassPathFirst` to false via `spark.executor.userClassPathFirst=false`, which is the wrong setting.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2574 from SteNicholas/CELEBORN-1402.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 20:08:22 +08:00
zhaohehuhu
fa9af57a4a
[CELEBORN-1465] Update docs remove unused node id
### What changes were proposed in this pull request?

The parameter `celeborn.master.ha.node.id` no longer needs to be set for master HA.

### Why are the changes needed?

Remove the parameter from the HA section.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #2573 from zhaohehuhu/dev-0618.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 18:25:11 +08:00
SteNicholas
c394fd448d
[CELEBORN-1460] MRAppMasterWithCeleborn supports setting mapreduce.celeborn.master.endpoints via environment variable CELEBORN_MASTER_ENDPOINTS
### What changes were proposed in this pull request?

`MRAppMasterWithCeleborn` sets `mapreduce.celeborn.master.endpoints` via environment variable `CELEBORN_MASTER_ENDPOINTS`.

### Why are the changes needed?

At present, `MRAppMasterWithCeleborn` sets `mapreduce.celeborn.master.endpoints` via `${HADOOP_CONF_DIR}/mapred-site.xml` or `-Dmapreduce.celeborn.master.endpoints`. Neither way works for integration with `RMProxy`, which could provide `MRAppMasterWithCeleborn` with master endpoints via the `environments` of `TaskAttemptImpl`. `MRAppMasterWithCeleborn` should therefore support setting `mapreduce.celeborn.master.endpoints` via the environment variable `CELEBORN_MASTER_ENDPOINTS`.
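
The lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual `MRAppMasterWithCeleborn` code: the function name `resolve_master_endpoints` is hypothetical, and the precedence (an explicitly configured value wins over the environment variable) is an assumption that this commit message does not confirm.

```python
import os

CONF_KEY = "mapreduce.celeborn.master.endpoints"
ENV_VAR = "CELEBORN_MASTER_ENDPOINTS"


def resolve_master_endpoints(job_conf, environ=None):
    """Return the master endpoints from the job conf, else the environment, else None."""
    environ = os.environ if environ is None else environ
    # Assumption: a value set in the job conf takes precedence over the env var.
    return job_conf.get(CONF_KEY) or environ.get(ENV_VAR) or None


if __name__ == "__main__":
    print(resolve_master_endpoints({}, {ENV_VAR: "localhost:9099"}))
```

Passing the environment as a parameter keeps the sketch testable without touching the real process environment.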

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTest`

Closes #2558 from SteNicholas/CELEBORN-1460.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-17 20:23:46 +08:00
SteNicholas
cd5609971f
[CELEBORN-1434] Support MRAppMasterWithCeleborn to disable job recovery and job reduce slow start by default
### What changes were proposed in this pull request?

`MRAppMasterWithCeleborn` disables `yarn.app.mapreduce.am.job.recovery.enable` and sets `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.

### Why are the changes needed?

MapReduce does not set the flag which indicates whether to keep containers across application attempts in `ApplicationSubmissionContext`. Meanwhile, reduces must be scheduled only after all maps are completed. Therefore, `MRAppMasterWithCeleborn` could disable `yarn.app.mapreduce.am.job.recovery.enable` and set `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.
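
A minimal sketch of applying these two defaults, assuming the job configuration is a plain key/value mapping. The helper name `apply_celeborn_defaults` is hypothetical, and whether an explicit user setting is kept or overridden is an assumption here (the sketch keeps it).

```python
# Defaults named in this commit message; values are strings as in Hadoop configs.
CELEBORN_MR_DEFAULTS = {
    "yarn.app.mapreduce.am.job.recovery.enable": "false",
    "mapreduce.job.reduce.slowstart.completedmaps": "1",
}


def apply_celeborn_defaults(job_conf):
    """Return a new dict with the Celeborn defaults filled in for unset keys."""
    merged = dict(CELEBORN_MR_DEFAULTS)
    # Assumption: values already present in job_conf win over the defaults.
    merged.update(job_conf)
    return merged


if __name__ == "__main__":
    for key, value in sorted(apply_celeborn_defaults({}).items()):
        print(key, value)
```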

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTest`

Closes #2525 from SteNicholas/CELEBORN-1434.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-22 15:32:41 +08:00
SteNicholas
9908035ba8 [CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar
### What changes were proposed in this pull request?

`SparkShuffleManager` prints a warning log for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in a user jar via `--jars` or `spark.jars`.

### Why are the changes needed?

When `spark.executor.userClassPathFirst` is enabled with `ShuffleManager` defined in a user jar, the `ClassLoader` of `handle` is `ChildFirstURLClassLoader`, which differs from that of `CelebornShuffleHandle`, whose `ClassLoader` is `AppClassLoader` in `SparkShuffleManager#getWriter/getReader`. The local test log is as follows:

```
./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
```
```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader
```

This causes `SparkShuffleManager` to fall back to the vanilla Spark `SortShuffleManager` for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in a user jar before https://github.com/apache/spark/pull/43627. After [SPARK-45762](https://issues.apache.org/jira/browse/SPARK-45762), the `ClassLoader`s of `handle` and `CelebornShuffleHandle` are both `ChildFirstURLClassLoader`.

```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader
```

Therefore, `SparkShuffleManager` should print a warning log as a reminder when `spark.executor.userClassPathFirst=true` is used with `ShuffleManager` defined in a user jar.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2482 from SteNicholas/CELEBORN-1402.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-17 11:03:15 +08:00
SteNicholas
a371f934cf
[CELEBORN-1341] Improve Celeborn document
### What changes were proposed in this pull request?

Improve Celeborn document to fix typos, table formats and wrong description of document. Meanwhile, `deploy.md` adds the document of MapReduce client deployment.

### Why are the changes needed?

The Celeborn documentation currently contains some typos and formatting problems. Meanwhile, `deploy.md` does not cover the deployment of the MapReduce client, which is inconsistent with `README.md` for Flink configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2407 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-20 15:02:05 +08:00
SteNicholas
ce5386397d
[CELEBORN-1134][FOLLOWUP] Add execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING to Flink Configuration of Deploy Flink client
### What changes were proposed in this pull request?

Add `execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING` to `Flink Configuration` of `Deploy Flink client` in `deploy.md`

### Why are the changes needed?

#2106 added validation that `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`. The `Flink Configuration` section of `Deploy Flink client` should also include this configuration.
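
For readers unfamiliar with this kind of check, the sketch below shows one way such a validation could look. It is not the code from #2106: the function name `check_batch_shuffle_mode` is hypothetical, and treating an unset key as a failure is a simplifying assumption of this sketch.

```python
KEY = "execution.batch-shuffle-mode"
REQUIRED_MODE = "ALL_EXCHANGES_BLOCKING"


def check_batch_shuffle_mode(flink_conf):
    """Raise ValueError unless the batch shuffle mode is explicitly the required one."""
    mode = flink_conf.get(KEY)
    # Assumption: a missing key is rejected so the setting must be stated explicitly.
    if mode != REQUIRED_MODE:
        raise ValueError("%s must be %s, but was %r" % (KEY, REQUIRED_MODE, mode))
    return True


if __name__ == "__main__":
    print(check_batch_shuffle_mode({KEY: REQUIRED_MODE}))
```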

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2355 from SteNicholas/CELEBORN-1134.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-04 15:57:40 +08:00
SteNicholas
383a102ab4
[CELEBORN-1260] Improve Spark Configuration of Deploy Spark client for deployment document
### What changes were proposed in this pull request?

Improve `Spark Configuration` of `Deploy Spark client` in `deploy.md`.

Fix #2270.

### Why are the changes needed?

It's recommended to improve the Spark Configuration of Deploy Spark client for deployment document with Spark Dynamic Resource Allocation support.

```
# Support Spark Dynamic Resource Allocation
# Required Spark version >= 3.5.0
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
# Required Spark version >= 3.4.0, highly recommended to disable
spark.dynamicAllocation.shuffleTracking.enabled false
```
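
The version requirements in the snippet above can be expressed as a small selection function. This is only a sketch: the name `dra_settings` is hypothetical, and it assumes a plain numeric version string such as `3.5.0` (suffixes like `-SNAPSHOT` are not handled).

```python
def dra_settings(spark_version):
    """Return the dynamic-allocation-related settings that apply to a Spark version."""
    # Assumption: the version is purely numeric, e.g. "3.4" or "3.5.0".
    nums = [int(p) for p in spark_version.split(".")[:3]]
    nums += [0] * (3 - len(nums))  # pad "3.4" to (3, 4, 0) for a fair comparison
    version = tuple(nums)

    settings = {}
    if version >= (3, 5, 0):
        settings["spark.shuffle.sort.io.plugin.class"] = (
            "org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO"
        )
    if version >= (3, 4, 0):
        settings["spark.dynamicAllocation.shuffleTracking.enabled"] = "false"
    return settings


if __name__ == "__main__":
    for key, value in sorted(dra_settings("3.5.0").items()):
        print(key, value)
```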

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2278 from SteNicholas/CELEBORN-1260.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-02 14:31:36 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

The `RunningApplicationCount` metric currently monitors only the count of running applications in the cluster for the master. Meanwhile, the `/listTopDiskUsedApps` API lists the application ids with the top disk usage for master and worker. Therefore, a `RunningApplicationCount` metric and an `/applications` API could be introduced to record running applications of the worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
mingji
02cea042a0 [CELEBORN-1116] Read authentication configs from HADOOP_CONF_DIR
### What changes were proposed in this pull request?
1. Make Celeborn read configs from HADOOP_CONF_DIR.
2. Remove unnecessary Kerberos configs.

### Why are the changes needed?
To support HDFS with Kerberos.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #2082 from FMX/B1116.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-09 11:07:13 +08:00
mingji
95c9ccfc3e [CELEBORN-1010] Update docs about spark.shuffle.service.enabled
### What changes were proposed in this pull request?
To clarify a spark config to work with Celeborn.

### Why are the changes needed?
After some tests, I found that Spark 3.1 and newer can work with Celeborn with `spark.shuffle.service.enabled=true`.

`ExternalShuffleBlockResolver` has not checked the shuffle manager's type since Spark 3.1.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
I tested two scenarios about this PR.
1. Check whether Spark can release the executors in time.
2. Check data correctness by running TPC-DS.
All checks are good.

Closes #1955 from FMX/CELEBORN-1010.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-08 09:15:42 +08:00
mingji
2ee6e305f1
[CELEBORN-941] fix incorrect deploy doc
### What changes were proposed in this pull request?
Fix the incorrect deploy doc about using HDFS only.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Just docs.

Closes #1874 from FMX/CELEBORN-941.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-31 18:54:27 +08:00
SteNicholas
baaddb8ee8 [CELEBORN-822][DOC] Introduce a quick start guide for running Apache Flink with Apache Celeborn
### What changes were proposed in this pull request?

Introduce a quick start guide for running Apache Flink with Apache Celeborn to help Flink users to run with Celeborn.

### Why are the changes needed?

There is no quick start guide for running Apache Flink with Apache Celeborn.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

None.

Closes #1868 from SteNicholas/CELEBORN-822.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-30 21:38:03 +08:00
Fu Chen
516bdc7e08
[CELEBORN-877][DOC] Document on SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

Closes #1795 from cfmcgrady/sbt-docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-08-11 12:17:55 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, filenames, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
mingji
40760ede3a [CELEBORN-568] Support storage type selection
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add new buffer size for HDFS file writers.
3. Worker support empty working dirs.

### Why are the changes needed?
Support HDFS only scenario.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1619 from FMX/CELEBORN-568.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-27 18:07:08 +08:00
Cheng Pan
e22379c3ab [CELEBORN-638] Migrate configurations celeborn.ha.master.* to celeborn.master.ha.*
### What changes were proposed in this pull request?

It was discussed during the last meeting, but abandoned due to the complication.

### Why are the changes needed?

Make the configuration unified.

### Does this PR introduce _any_ user-facing change?

Yes, but the legacy configurations still take effect.
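
A rough sketch of how a lookup can honor both prefixes while the legacy one still takes effect. The function name `get_master_ha_conf` is hypothetical, the preference for the new key over the legacy key is an assumption, and the `node.id` suffix is used only as a sample key.

```python
NEW_PREFIX = "celeborn.master.ha."
LEGACY_PREFIX = "celeborn.ha.master."


def get_master_ha_conf(conf, suffix, default=None):
    """Look up a master HA setting under the new prefix, then the legacy one."""
    new_key = NEW_PREFIX + suffix
    # Assumption: when both keys are present, the new-style key wins.
    if new_key in conf:
        return conf[new_key]
    return conf.get(LEGACY_PREFIX + suffix, default)


if __name__ == "__main__":
    print(get_master_ha_conf({"celeborn.ha.master.node.id": "1"}, "node.id"))
```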

### How was this patch tested?

New UTs.

Closes #1549 from pan3793/CELEBORN-638.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-16 18:18:26 +08:00
Angerszhuuuu
1ba6dee324 [CELEBORN-680][DOC] Refresh celeborn configurations in doc
### What changes were proposed in this pull request?
Refresh celeborn configurations in doc

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1592 from AngersZhuuuu/CELEBORN-680.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:59:38 +08:00
Ethan Feng
91b757555e
[CELEBORN-570] Update docs about monitor and deployment. (#1478)
2023-05-08 17:07:42 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size
2023-04-14 20:45:25 +08:00
Cheng Pan
fb7b311c89
[CELEBORN-499] Move version specific resource to main repo (#1429)
* [CELEBORN-499] Move version specific resource to main repo

* license
2023-04-14 16:20:51 +08:00