Commit Graph

109 Commits

Author SHA1 Message Date
mingji
a24164c0fa [CELEBORN-2026] Skip build tez client
### What changes were proposed in this pull request?
Skip building the Tez client when releasing 0.6.0.

### Why are the changes needed?
The Tez client has not been fully verified, it will need some time before it is ready.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
NO.

Closes #3312 from FMX/b2026.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-06-04 19:52:48 +08:00
Fei Wang
b44730771d [CELEBORN-1413][FOLLOWUP] Bump spark 4.0 version to 4.0.0
### What changes were proposed in this pull request?
Bump spark 4.0 version to 4.0.0.

### Why are the changes needed?
Spark 4.0.0 is ready.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #3282 from turboFei/spark_4.0.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-05-28 17:56:08 +08:00
Nicolas Fraison
54732c7b38 Update celeborn conf to add S3 in default and doc for policy
### What changes were proposed in this pull request?

Add S3 type in evict and create policies
Add S3 type in list of default evict and create policy

### Why are the changes needed?

To align with other types

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #3218 from ashangit/nfraison/doc_s3.

Authored-by: Nicolas Fraison <nfraison@yahoo.fr>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-08 16:52:44 +08:00
veli.yang
7d0ba7f9b8 [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
### What changes were proposed in this pull request?

- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.

### Why are the changes needed?

- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Deployment integration testing has been completed in the local environment.

Closes #3157 from shouwangyw/optimize/mpu-oss.

Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-08 15:10:33 +08:00
SteNicholas
2b8f3520f9 [CELEBORN-1925] Support Flink 2.0
### What changes were proposed in this pull request?

Support Flink 2.0. The major changes of Flink 2.0 include:

- https://github.com/apache/flink/pull/25406: Bump target Java version to 11 and drop support for Java 8.
- https://github.com/apache/flink/pull/25551: Replace `InputGateDeploymentDescriptor#getConsumedSubpartitionIndexRange` with `InputGateDeploymentDescriptor#getConsumedSubpartitionRange(index)`.
- https://github.com/apache/flink/pull/25314: Replace `NettyShuffleEnvironmentOptions#NETWORK_EXCLUSIVE_BUFFERS_REQUEST_TIMEOUT_MILLISECONDS` with `NettyShuffleEnvironmentOptions#NETWORK_BUFFERS_REQUEST_TIMEOUT`.
- https://github.com/apache/flink/pull/25731: Introduce `InputGate#resumeGateConsumption`.

### Why are the changes needed?

Flink 2.0 is released which refers to [Release notes - Flink 2.0](https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-2.0).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3179 from SteNicholas/CELEBORN-1925.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-04-07 15:23:20 +08:00
codenohup
a57238024e
[CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
### What changes were proposed in this pull request?
Remove out-of-dated flink 1.14 and 1.15.

For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9

### Why are the changes needed?
Reduce maintenance burden.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Changes can be covered by existing tests.

Closes #3029 from codenohup/remove-flink14and15.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-12-30 15:33:44 +08:00
hongguangwei
d0d8edfe22 [CELEBORN-1737] Support build tez client package
### What changes were proposed in this pull request?
Add Tez packaging script.

### Why are the changes needed?
To support build tez client.

### Does this PR introduce _any_ user-facing change?
Yes, enable Celeborn with tez support.

### How was this patch tested?
Cluster test.

Closes #3028 from GH-Gloway/1737.

Lead-authored-by: hongguangwei <hongguangwei@bytedance.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-30 11:01:19 +08:00
mingji
43e1b8a246
[MINOR] Update DingTalk group link
### What changes were proposed in this pull request?
Update Dingtalk group link to latest.

### Why are the changes needed?
The old Dingtalk is outdated.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2948 from FMX/b01.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-26 14:27:41 +08:00
SteNicholas
9083dd401c [CELEBORN-1504][FOLLOWUP] Document adds Flink 1.16 support
### What changes were proposed in this pull request?

1. Document adds Flink 1.16 support including `README.md`, `deploy.md`.
2. Update description of `celeborn.client.shuffle.compression.codec` to change the supported Flink version for ZSTD.

### Why are the changes needed?

#2619 has supported Flink 1.16, which should update the document for the support. Meanwhile, since Flink version 1.16, zstd is supported for Flink shuffle client.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2904 from SteNicholas/CELEBORN-1504.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-13 21:47:29 +08:00
Weijie Guo
f2e9043028 [CELEBORN-1687] Highlight flink session cluster issue in doc
### What changes were proposed in this pull request?

If we use celeborn shuffle service, we can't submit both batch and streaming to the same flink session cluster. This should be highlight in doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No need.

Closes #2879 from reswqa/session-doc.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-06 10:52:34 +08:00
Weijie Guo
41fdb8ade1
[CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc
### What changes were proposed in this pull request?

Add Flink hybrid shuffle doc

### Why are the changes needed?
We need the doc for the new hybrid shuffle mode.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

no neeed.

Closes #2867 from reswqa/add-hs-doc.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-01 13:37:14 +08:00
SteNicholas
3726aefecf [CELEBORN-1659][FOLLOWUP] Dockerfile should support copying CLI jars
### What changes were proposed in this pull request?

Dockerfile should support copying CLI jars.

### Why are the changes needed?

CLI jars are generated from `make-distribution.sh`. Therefore, Dockerfile could copy CLI jars to `/opt/celeborn/` directory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2823 from SteNicholas/CELEBORN-1659.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-10-21 11:37:41 +08:00
SteNicholas
bae230937d [CELEBORN-1058][FOLLOWUP] Update name of master service from MasterSys to Master in startup document
### What changes were proposed in this pull request?

Update name of master service from `MasterSys` to `Master` in startup document to follow up https://github.com/apache/celeborn/pull/2003/files#r1365454256.

### Why are the changes needed?

#2003 has already changed the name of master and worker service, which should also update the name in startup logs of document.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2772 from SteNicholas/CELEBORN-1058.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-30 09:21:50 +08:00
SteNicholas
8a38235f53 [CELEBORN-1437][FOLLOWUP] Update metrics document link of README.md
### What changes were proposed in this pull request?

Update metrics document link of `README.md`.

### Why are the changes needed?

`METRICS.md` has already been merged into `monitoring.md`, which should update the link in `README.md`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2761 from SteNicholas/CELEBORN-1437.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-25 15:34:06 +08:00
SteNicholas
cd916040da [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5
### What changes were proposed in this pull request?

Introduce `spark-3.5-columnar-shuffle` module to support columnar shuffle for Spark 3.5.

### Why are the changes needed?

#1850 does not support columnar shuffle for Spark 3.5, in which version building `spark-3-columnar-shuffle` module has compilation error. The compilation error is caused by https://github.com/apache/spark/pull/40784, which incompatible changes move `InternalType` from `AtomicType` to `PhysicalDataType`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2710 from SteNicholas/CELEBORN-912.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2024-09-05 14:26:54 +08:00
Bowen Liang
f226424b9a [CLEBORN-1555] Replace deprecated config celeborn.storage.activeTypes in docs and tests
### What changes were proposed in this pull request?

Replace the deprecated config `celeborn.storage.activeTypes` with `celeborn.storage.availableTypes` in docs and tests, guiding the new comers to use the new config names.

### Why are the changes needed?
The config `celeborn.storage.activeTypes` has been deprecated in 0.4.0 release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No feature changed.

Closes #2675 from bowenliang123/avai-types.

Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-26 14:36:01 +08:00
zhaohehuhu
d14afcddfe [CELEBORN-1566] Update docs about using HDFS
### What changes were proposed in this pull request?
update client deployment doc to include the param (spark.celeborn.storage.activeTypes)
### Why are the changes needed?

Just provide a hint for users, otherwise they may miss this param.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Yes

Closes #2683 from zhaohehuhu/dev-0815.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-26 08:22:34 +08:00
Weijie Guo
a759efb6dd [CELEBORN-1543] Support Flink 1.20
1.20 was the last non-bug-fix release before Flink 2.0, you can found all main upgrade features in this [release note](https://nightlies.apache.org/flink/flink-docs-release-1.20/release-notes/flink-1.20/). I think the most important feature related to Celeborn is we expose some interface to support Flink hybrid shuffle integration with Celeborn([FLIP-459](https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn)). This(supporting hybrid shuffle in Celeborn side) is also a follow-up stuff to this PR.

incompatible changes in 1.20:
- 1.20 use enum `CompressionCodec` instead of `String` to construct `BufferDecompressor` and `BufferCompressor`.
- 1.20 introduce a new method(`notifyPartitionRecoveryStarted`) to `JobShuffleContext` in a non-compatible way.

I've already done the adaptation in this PR.

Closes #2662 from reswqa/support-120.

Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-09 17:05:58 +08:00
Wang, Fei
0e05bc6cf9 [CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
### What changes were proposed in this pull request?

As title, merge these two similar user guides.

### Why are the changes needed?
To close CELEBORN-1437

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Preview https://github.com/turboFei/incubator-celeborn/blob/metrics_merge/docs/monitoring.md#setup-prometheus-dashboard

Closes #2623 from turboFei/metrics_merge.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-07-16 13:41:46 +08:00
Mridul Muralidharan
17f89c553e [CELEBORN-1504] Support for Apache Flink 1.16
### What changes were proposed in this pull request?

Add support for Apache Flink 1.16 in Celeborn.

### Why are the changes needed?

User requests for Apache Flink 1.16.
This implementation is a synthesis of 1.15 and 1.17 support which already exists in Apache Celeborn

### Does this PR introduce _any_ user-facing change?

Yes, supports Apache Flink 1.16

### How was this patch tested?

Tests for 1.16 added, which are based on 1.15 and 1.17

Closes #2619 from mridulm/flink-1.16-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-15 10:44:16 +08:00
SteNicholas
627ee8c6ef
[CELEBORN-1402][FOLLOWUP] Correct document of setting spark.executor.userClassPathFirst to false
### What changes were proposed in this pull request?

Correct document of setting `spark.executor.userClassPathFirst` to false.

### Why are the changes needed?

Document sets `spark.executor.userClassPathFirst` to false via `spark.executor.userClassPathFirst=false`, which is wrong setting.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2574 from SteNicholas/CELEBORN-1402.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 20:08:22 +08:00
zhaohehuhu
fa9af57a4a
[CELEBORN-1465] Update docs remove unused node id
### What changes were proposed in this pull request?

The param(celeborn.master.ha.node.id) is not required to set for master HA anymore.

### Why are the changes needed?

remove the param from HA section
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #2573 from zhaohehuhu/dev-0618.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-18 18:25:11 +08:00
SteNicholas
c394fd448d
[CELEBORN-1460] MRAppMasterWithCeleborn supports setting mapreduce.celeborn.master.endpoints via environment variable CELEBORN_MASTER_ENDPOINTS
### What changes were proposed in this pull request?

`MRAppMasterWithCeleborn` sets `mapreduce.celeborn.master.endpoints` via environment variable `CELEBORN_MASTER_ENDPOINTS`.

### Why are the changes needed?

`MRAppMasterWithCeleborn` sets `mapreduce.celeborn.master.endpoints` via `${HADOOP_CONF_DIR}/mapred-site.xml` or `-Dmapreduce.celeborn.master.endpoints` at present. It could not set `mapreduce.celeborn.master.endpoints` by above way for integration with `RMProxy` which could provide `MRAppMasterWithCeleborn` with master endpoints via `environments` of `TaskAttemptImpl`. It's recommended that `MRAppMasterWithCeleborn` supports setting `mapreduce.celeborn.master.endpoints` via environment variable `CELEBORN_MASTER_ENDPOINTS`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTest`

Closes #2558 from SteNicholas/CELEBORN-1460.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-06-17 20:23:46 +08:00
SteNicholas
dd87419044
[CELEBORN-1380][FOLLOWUP] leveldbjni uses org.openlabtesting.leveldbjni to support linux aarch64 platform for leveldb via aarch64 profile
### What changes were proposed in this pull request?

Dependency leveldbjni uses `org.openlabtesting.leveldbjni` to support linux aarch64 platform for leveldb via `aarch64` profile.

Follow up #2476.

### Why are the changes needed?

Celeborn worker could not start on arm arch devices if db backend is `LevelDB`, which should support leveldbjni on the aarch64 platform.

aarch64 uses `org.openlabtesting.leveldbjni:leveldbjni-all.1.8`, and other platforms use `org.fusesource.leveldbjni:leveldbjni-all.1.8`. Meanwhile, because some hadoop dependencies packages are also depend on `org.fusesource.leveldbjni:leveldbjni-all`, but hadoop merge the similar change on trunk, details see
[HADOOP-16614](https://issues.apache.org/jira/browse/HADOOP-16614), therefore it should exclude the dependency of `org.fusesource.leveldbjni` for these hadoop packages related.

In addtion, `org.openlabtesting.leveldbjni` requires glibc version 3.4.21. Otherwise, there will be the following potential runtime risks:

```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fad3630b12a, pid=62, tid=0x00007f93394ef700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x8412a]
#
# Core dump written. Default location: /data/service/celeborn/core or core.62
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f9308001000):  JavaThread "leveldb" [_thread_in_native, id=878, stack(0x00007f9338cf0000,0x00007f93394f0000)]

siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR), si_addr: 0x00007f97380d2220
```

Backport:

- https://github.com/apache/spark/pull/26636
- https://github.com/apache/spark/pull/31036

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2530 from SteNicholas/CELEBORN-1380.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-27 14:07:02 +08:00
SteNicholas
cd5609971f
[CELEBORN-1434] Support MRAppMasterWithCeleborn to disable job recovery and job reduce slow start by default
### What changes were proposed in this pull request?

`MRAppMasterWithCeleborn` disables `yarn.app.mapreduce.am.job.recovery.enable` and sets `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.

### Why are the changes needed?

MapReduce does not set the flag which indicates whether to keep containers across application attempts in ApplicationSubmissionContext. Meanwhile, make sure reduces are scheduled only after all map are completed. Therefore, `MRAppMasterWithCeleborn` could disable `yarn.app.mapreduce.am.job.recovery.enable` and set `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTest`

Closes #2525 from SteNicholas/CELEBORN-1434.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-05-22 15:32:41 +08:00
SteNicholas
9908035ba8 [CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar
### What changes were proposed in this pull request?

`SparkShuffleManager` print warning log for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in user jar via `--jar` or `spark.jars`.

### Why are the changes needed?

When `spark.executor.userClassPathFirst` is enabled with ShuffleManager defined in user jar, the `ClassLoader` of `handle` is `ChildFirstURLClassLoader`, which is different from `CelebornShuffleHandle` of which the `ClassLoader` is `AppClassLoader` in `SparkShuffleManager#getWriter/getReader`. The local test log is as follows:

```
./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
```
```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader
```

It causes that `SparkShuffleManager` fallback to vanilla Spark `SortShuffleManager` for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in user jar before https://github.com/apache/spark/pull/43627. After [SPARK-45762](https://issues.apache.org/jira/browse/SPARK-45762), the `ClassLoader` of `handle` and `CelebornShuffleHandle` are both `ChildFirstURLClassLoader`.

```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader
```

Therefore, `SparkShuffleManager` should print warning log to remind for `spark.executor.userClassPathFirst=true` with `ShuffleManager` defined in user jar.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2482 from SteNicholas/CELEBORN-1402.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-17 11:03:15 +08:00
zky.zhoukeyong
7af3126c7e Support Spark3.5 with JDK21
### What changes were proposed in this pull request?
Compile Spark-3.5 with
`./build/make-distribution.sh -Pspark-3.5 -Pjdk-21`
or
`./build/make-distribution.sh --sbt-enabled -Pspark-3.5 -Pjdk-21`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manual tests

Closes #2385 from waitinfuture/1327.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-27 18:42:16 +08:00
SteNicholas
c9b878a2f5
[INFRA] Remove incubator/incubating for graduation
### What changes were proposed in this pull request?

Remove incubator/incubating for graduation including:

- Remove `incubator`/`Incubating`.
- Remove `DISCLAIMER` and corresponding link.
- Update Release scripts and template.

Fix #2415.

### Why are the changes needed?

The ASF board has approved a resolution to graduate Celeborn into a full Top Level Project. To transition from the Apache Incubator to a new TLP, there's a few action items we need to do to complete the transition.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2421 from SteNicholas/infra-graduation.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-27 13:54:47 +08:00
SteNicholas
a371f934cf
[CELEBORN-1341] Improve Celeborn document
### What changes were proposed in this pull request?

Improve Celeborn document to fix typos, table formats and wrong description of document. Meanwhile, `deploy.md` adds the document of MapReduce client deployment.

### Why are the changes needed?

There are some typos and format fixes in Celeborn document at present. Meanwhile, the `deploy.md` does not contain the deployment of MapReduce client, which is inconsistent with `README.md` for Flink configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2407 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-20 15:02:05 +08:00
SteNicholas
adaa96fc60 [CELEBORN-1310][FLINK] Support Flink 1.19
### What changes were proposed in this pull request?

Support Flink 1.19.

### Why are the changes needed?

Flink 1.19.0 is announced to release: [Announcing the Release of Apache Flink 1.19] (https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19).

The main changes includes:

- `org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel` constructor change parameters:
   - `consumedSubpartitionIndex` changes to `consumedSubpartitionIndexSet`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
   - adds `partitionRequestListenerTimeout`: [[FLINK-25055][network] Support listen and notify mechanism for partition request](https://github.com/apache/flink/pull/23565).
- `org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor removes parameters `subpartitionIndexRange`, `tieredStorageConsumerClient`, `nettyService` and `tieredStorageConsumerSpecs`: [[FLINK-33743][runtime] Support consuming multiple subpartitions on a single channel](https://github.com/apache/flink/pull/23927).
- Change the default config file to `config.yaml` in `flink-dist`: [[FLINK-33577][dist] Change the default config file to config.yaml in flink-dist](https://github.com/apache/flink/pull/24177).
- `org.apache.flink.configuration.RestartStrategyOptions` uses `org.apache.commons.compress.utils.Sets` of `commons-compress` dependency: [[FLINK-33865][runtime] Adding an ITCase to ensure exponential-delay.attempts-before-reset-backoff works well](https://github.com/apache/flink/pull/23942).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test:

- Flink batch job submission

```
$ ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID 2e9fb659991a9c29d376151783bdf6de
Program execution finished
Job with JobID 2e9fb659991a9c29d376151783bdf6de has finished.
Job Runtime: 1912 ms
```

- Flink batch job execution

![image](https://github.com/apache/incubator-celeborn/assets/10048174/18b60861-cafc-4df3-b94d-93307e728be2)

- Celeborn master log
```

24/03/18 20:52:47,513 INFO [celeborn-dispatcher-42] Master: Offer slots successfully for 1 reducers of 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 on 1 workers.
```

- Celeborn worker log
```
24/03/18 20:52:47,704 INFO [celeborn-dispatcher-1] StorageManager: created file at /Users/nicholas/Software/Celeborn/apache-celeborn-0.5.0-SNAPSHOT/shuffle/celeborn-worker/shuffle_data/1710766312631-2e9fb659991a9c29d376151783bdf6de/0/0-0-0
24/03/18 20:52:47,707 INFO [celeborn-dispatcher-1] Controller: Reserved 1 primary location and 0 replica location for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,874 INFO [celeborn-dispatcher-2] Controller: Start commitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0
24/03/18 20:52:47,890 INFO [worker-rpc-async-replier] Controller: CommitFiles for 1710766312631-2e9fb659991a9c29d376151783bdf6de-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
```

Closes #2399 from SteNicholas/CELEBORN-1310.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-20 11:51:23 +08:00
Keyong Zhou
af8c159c9f [CELEBORN-1298] Support Spark2.4 with Scala2.12
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
GA

Closes #2344 from waitinfuture/1298-1.

Lead-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-29 17:58:54 +08:00
SteNicholas
383a102ab4
[CELEBORN-1260] Improve Spark Configuration of Deploy Spark client for deployment document
### What changes were proposed in this pull request?

Improve `Spark Configuration` of `Deploy Spark client` in `deploy.md`.

Fix #2270.

### Why are the changes needed?

It's recommended to improve the Spark Configuration of Deploy Spark client for deployment document with Spark Dynamic Resource Allocation support.

```
# Support Spark Dynamic Resource Allocation
# Required Spark version >= 3.5.0
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
# Required Spark version >= 3.4.0, highly recommended to disable
spark.dynamicAllocation.shuffleTracking.enabled false
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2278 from SteNicholas/CELEBORN-1260.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-02 14:31:36 +08:00
Albin Zeng
66fc0058b4 [CELEBORN-1213] Add pronunciation of Celeborn in README.md
### What changes were proposed in this pull request?

Add pronunciation of celeborn.

### Why are the changes needed?

New users have different interpretations of how to pronounce "Celeborn."  See [CELEBORN-1213](https://issues.apache.org/jira/browse/CELEBORN-1213).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Document preview.

Closes #2211 from albin3/main.

Lead-authored-by: Albin Zeng <binwei.zeng3@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-05 13:53:29 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

`RunningApplicationCount` metrics only monitors the count of running applications in the cluster for master. Meanwhile, `/listTopDiskUsedApps` API lists the top disk usage application ids for master and worker. Therefore `RunningApplicationCount` metric and `/applications` API could be introduced to record running applications of worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
SteNicholas
465b0938f7
[CELEBORN-1134] Celeborn Flink client should validate whether execution.batch-shuffle-mode is ALL_EXCHANGES_BLOCKING
### What changes were proposed in this pull request?

Celeborn Flink client validates whether `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`.

### Why are the changes needed?

The config option `execution.batch-shuffle-mode` of Flink is `ALL_EXCHANGES_BLOCKING` by default. Celeborn Flink client should validate whether `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`. If `execution.batch-shuffle-mode` is set as `ALL_EXCHANGES_PIPELINED`, there is `NullPointerException` for `ReducePartitionCommitHandler#handleGetReducerFileGroup`, which exception is as follows:

```
023-11-16 14:40:55,984 ERROR org.apache.celeborn.common.rpc.netty.Inbox                    - Ignoring error
java.lang.NullPointerException: Cannot invoke "java.util.Set.add(Object)" because the return value of "java.util.concurrent.ConcurrentHashMap.get(Object)" is null
	at org.apache.celeborn.client.commit.ReducePartitionCommitHandler.handleGetReducerFileGroup(ReducePartitionCommitHandler.scala:307)
	at org.apache.celeborn.client.CommitManager.handleGetReducerFileGroup(CommitManager.scala:266)
	at org.apache.celeborn.client.LifecycleManager.org$apache$celeborn$client$LifecycleManager$$handleGetReducerFileGroup(LifecycleManager.scala:559)
	at org.apache.celeborn.client.LifecycleManager$$anonfun$receiveAndReply$1.applyOrElse(LifecycleManager.scala:297)
	at org.apache.celeborn.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.celeborn.common.rpc.netty.Inbox.safelyCall(Inbox.scala:222)
	at org.apache.celeborn.common.rpc.netty.Inbox.process(Inbox.scala:110)
	at org.apache.celeborn.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:227)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`RemoteShuffleServiceFactorySuitJ#testInvalidShuffleServiceConfig`.

Closes #2106 from SteNicholas/CELEBORN-1134.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-17 20:32:04 +08:00
Luke Yan
c7c2f6a35a [CELEBORN-858] Generate patch to each Spark 3.x minor version
### What changes were proposed in this pull request?

Add the following patch files in directory `incubator-celeborn/tree/spark3-patch/assets/spark-patch` :

1. Celeborn_Dynamic_Allocation_spark3_0.patch
2. Celeborn_Dynamic_Allocation_spark3_1.patch
3. Celeborn_Dynamic_Allocation_spark3_2.patch
4. Celeborn_Dynamic_Allocation_spark3_3.patch

Delete a patch at the same time:

1. Celeborn_Dynamic_Allocation_spark3.patch

Modified `Support Spark Dynamic Allocation` in incubator-celeborn/README.md :

![image](https://github.com/apache/incubator-celeborn/assets/108530647/61e2e69b-d3f5-4d11-a20b-374622936443)

### Why are the changes needed?

Convenient for customers to apply patches in Spark 3.X for `Support Spark Dynamic Allocation`

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

yes. All patch files can be applied to the corresponding version of spark source code through `git apply`  without any code conflicts.

Closes #2085 from lukeyan2023/spark3-patch.

Authored-by: Luke Yan <108530647+lukeyan2023@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-10 15:35:54 +08:00
mingji
02cea042a0 [CELEBORN-1116] Read authentication configs from HADOOP_CONF_DIR
### What changes were proposed in this pull request?
1. Make Celeborn read configs from HADOOP_COND_DIR.
2. Remove unnecessary Kerberos configs.

### Why are the changes needed?
To support HDFS with Kerberos.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #2082 from FMX/B1116.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-09 11:07:13 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00
SteNicholas
f61fe17551 [CELEBORN-987][FOLLOWUP][DOC] README#Build and sbt#System Requirements should extend to Scala 2.13 and Spark 3.5
### What changes were proposed in this pull request?

`README#Build` and `sbt#System Requirements` extends to Scala 2.13.

### Why are the changes needed?

`README#Build` and `sbt#System Requirements`should extend to Scala 2.13 to align the SBT CI test results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

SBT CI tests.

Closes #1987 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-14 09:54:22 +08:00
SteNicholas
c97628c510 [CELEBORN-987][DOC] README#Build should extend to Java8/11/17
### What changes were proposed in this pull request?

`README#Build` extends to Java8/11/17. Meanwhile, the profile of maven adds `jdk-17`.

### Why are the changes needed?

`README#Build` should extend to Java8/11/17. Meanwhile, the profile of maven should add jdk-17.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local maven compile.

Closes #1985 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-12 21:58:32 +08:00
Bowen Song
a734b8cb79 [CELEBORN-1020] Remove outdated info in README.md file
### What changes were proposed in this pull request?
The description about restart a Celeborn cluster is outdated, remove this part in README file

Closes #1957 from zgzzbws/edit-doc.

Authored-by: Bowen Song <song_bowen_work@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 00:11:47 +08:00
mingji
95c9ccfc3e [CELEBORN-1010] Update docs about spark.shuffle.service.enabled
### What changes were proposed in this pull request?
To clarify a spark config to work with Celeborn.

### Why are the changes needed?
After some tests, I found that Spark 3.1 and newer can work with Celeborn with `spark.shuffle.service.enabled=true`.

ExternalShuffleBlockResolver won't check the shuffle manager's type since Spark 3.1 and newer.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
I tested two scenarios about this PR.
1. Check whether Spark can release the executors in time.
2. Check data correctness by running TPC-DS.
All checks are good.

Closes #1955 from FMX/CELEBORN-1010.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-08 09:15:42 +08:00
zhouyifan279
333db39713 [CELEBORN-954] Add documentation about reliable shuffle data storage
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
Yes. A new config was added in [README.md ](https://github.com/apache/incubator-celeborn/blob/main/README.md#spark-configuration).

### How was this patch tested?

Closes #1938 from zhouyifan279/reliable-storage-doc.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-27 00:39:14 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a cluster with a 4x 16 CPU 64G Mem 4ESSD cluster.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s VS MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
mingji
2ee6e305f1
[CELEBORN-941] fix incorrect deploy doc
### What changes were proposed in this pull request?
Fix the incorrect deploy doc about using HDFS only.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Just docs.

Closes #1874 from FMX/CELEBORN-941.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-31 18:54:27 +08:00
liangbowen
1bf93991bc [CELEBORN-893][DOC] Fix Spark patch list text in Readme
### What changes were proposed in this pull request?

- Fix the text of Spark patch list

### Why are the changes needed?

Before:
<img width="909" alt="image" src="https://github.com/apache/incubator-celeborn/assets/1935105/1d402df1-3a68-4810-8f84-8ab61a38314c">

After:
<img width="908" alt="image" src="https://github.com/apache/incubator-celeborn/assets/1935105/2c733568-a08a-4951-bd5a-f4a444a28833">

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Screenshots attached.

Closes #1810 from bowenliang123/readme-patch.

Authored-by: liangbowen <liangbowen@gf.com.cn>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-14 14:54:58 +08:00
e
f78a7d349f [CELEBORN-794] Fix link of CONFIGURATIONS in README
### What changes were proposed in this pull request?

Modify CONFIGURATIONS to point to the correct address

### Why are the changes needed?

CONFIGURATIONS in README.md points to an invalid address

![image](https://github.com/apache/incubator-celeborn/assets/14961757/538294ee-3432-4e1e-a45e-4dc1983d50e8)
![image](https://github.com/apache/incubator-celeborn/assets/14961757/d4681603-5317-46ae-a2f5-e58fa72c706c)

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NO

Closes #1714 from jiaoqingbo/CELEBORN-794.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 18:08:09 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories.
This will cause using app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
zhongqiang.czq
a0f4be67a9 [CELEBORN-765][DOC] Disable partitionSplit in Flink engine related co…
…nfigurations

### What changes were proposed in this pull request?
In Doc Readme, setting partitionSplit to false should be added in Flink engine related configurations.

### Why are the changes needed?
Currently, Mappartition split is not supported, but shuffle partition split is enabled by default, so error will be thrown when flink task's shuffle data size exceeds 1G(by Default).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manually

Closes #1679 from zhongqiangczq/readme.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-07-05 18:04:10 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename remain rss related class name and filenames etc...

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00