### What changes were proposed in this pull request?
Remove support for the outdated Flink 1.14 and 1.15.
For more information, please see the discussion thread: https://lists.apache.org/thread/njho00zmkjx5qspcrbrkogy8s4zzmwv9
### Why are the changes needed?
Reduce maintenance burden.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Changes can be covered by existing tests.
Closes #3029 from codenohup/remove-flink14and15.
Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
1. Document Flink 1.16 support in `README.md` and `deploy.md`.
2. Update description of `celeborn.client.shuffle.compression.codec` to change the supported Flink version for ZSTD.
### Why are the changes needed?
#2619 added support for Flink 1.16, so the documentation should be updated to reflect it. In addition, ZSTD is supported for the Flink shuffle client starting with Flink 1.16.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2904 from SteNicholas/CELEBORN-1504.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
When the Celeborn shuffle service is used, batch and streaming jobs cannot both be submitted to the same Flink session cluster. This should be highlighted in the doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No need.
Closes #2879 from reswqa/session-doc.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
When users deploy using the release binary as outlined in the documentation, the instructions for copying the client JAR can be unclear.
### Why are the changes needed?
No
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Closes #2877 from zaynt4606/md.
Authored-by: szt <zaynt4606@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Add Flink hybrid shuffle doc
### Why are the changes needed?
We need the doc for the new hybrid shuffle mode.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
No need.
Closes #2867 from reswqa/add-hs-doc.
Authored-by: Weijie Guo <reswqa@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Update name of master service from `MasterSys` to `Master` in startup document to follow up https://github.com/apache/celeborn/pull/2003/files#r1365454256.
### Why are the changes needed?
#2003 already changed the names of the master and worker services, so the names in the startup logs shown in the document should be updated as well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2772 from SteNicholas/CELEBORN-1058.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Replace the deprecated config `celeborn.storage.activeTypes` with `celeborn.storage.availableTypes` in docs and tests, guiding newcomers to use the new config name.
### Why are the changes needed?
The config `celeborn.storage.activeTypes` has been deprecated in 0.4.0 release.
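As a rough illustration of the rename, the sketch below rewrites the deprecated key in a properties-style config file. The helper name `migrate_storage_types` and the sample file content are hypothetical; only the two config names come from this PR.

```shell
# Hypothetical helper: print a config file with the deprecated key renamed.
# Only the two key names are taken from the PR; the rest is illustrative.
migrate_storage_types() {
  sed 's/celeborn\.storage\.activeTypes/celeborn.storage.availableTypes/g' "$1"
}

# Sample input (hypothetical file content).
conf_file="$(mktemp)"
printf '%s\n' 'celeborn.storage.activeTypes HDD,SSD' > "$conf_file"

migrated="$(migrate_storage_types "$conf_file")"
echo "$migrated"
rm -f "$conf_file"
```

Because the pattern matches the key as a substring, a client-side key carrying a `spark.` prefix would be rewritten in the same pass.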
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No feature changed.
Closes #2675 from bowenliang123/avai-types.
Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Update the client deployment doc to include the parameter `spark.celeborn.storage.activeTypes`.
### Why are the changes needed?
Just provide a hint for users, otherwise they may miss this param.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Yes
Closes #2683 from zhaohehuhu/dev-0815.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Correct the documentation for setting `spark.executor.userClassPathFirst` to false.
### Why are the changes needed?
The document sets `spark.executor.userClassPathFirst` to false via `spark.executor.userClassPathFirst=false`, which is an incorrect setting.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2574 from SteNicholas/CELEBORN-1402.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The parameter `celeborn.master.ha.node.id` no longer needs to be set for master HA.
### Why are the changes needed?
Remove the parameter from the HA section.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes #2573 from zhaohehuhu/dev-0618.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`MRAppMasterWithCeleborn` sets `mapreduce.celeborn.master.endpoints` via environment variable `CELEBORN_MASTER_ENDPOINTS`.
### Why are the changes needed?
At present, `MRAppMasterWithCeleborn` can only set `mapreduce.celeborn.master.endpoints` via `${HADOOP_CONF_DIR}/mapred-site.xml` or `-Dmapreduce.celeborn.master.endpoints`. Neither approach works for integration with `RMProxy`, which could provide `MRAppMasterWithCeleborn` with the master endpoints via the `environments` of `TaskAttemptImpl`. `MRAppMasterWithCeleborn` should therefore also support setting `mapreduce.celeborn.master.endpoints` via the environment variable `CELEBORN_MASTER_ENDPOINTS`.
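A minimal sketch of the environment-variable route follows. The endpoint `localhost:9099` is a placeholder, and how the variable actually reaches the application master's environment depends on the deployment; the snippet only shows that an exported value is visible to child processes.

```shell
# Placeholder endpoint; replace with the real Celeborn master host:port list.
export CELEBORN_MASTER_ENDPOINTS="localhost:9099"

# Child processes inherit the exported variable.
inherited="$(sh -c 'printf "%s" "$CELEBORN_MASTER_ENDPOINTS"')"
echo "endpoints seen by child process: $inherited"
```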
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`WordCountTest`
Closes #2558 from SteNicholas/CELEBORN-1460.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`MRAppMasterWithCeleborn` disables `yarn.app.mapreduce.am.job.recovery.enable` and sets `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.
### Why are the changes needed?
MapReduce does not set the flag in `ApplicationSubmissionContext` that indicates whether to keep containers across application attempts. In addition, reduces should be scheduled only after all maps have completed. Therefore, `MRAppMasterWithCeleborn` disables `yarn.app.mapreduce.am.job.recovery.enable` and sets `mapreduce.job.reduce.slowstart.completedmaps` to 1 by default.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`WordCountTest`
Closes #2525 from SteNicholas/CELEBORN-1434.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
`SparkShuffleManager` prints a warning log when `spark.executor.userClassPathFirst=true` is used with a `ShuffleManager` defined in a user jar via `--jars` or `spark.jars`.
### Why are the changes needed?
When `spark.executor.userClassPathFirst` is enabled with the `ShuffleManager` defined in a user jar, the `ClassLoader` of `handle` is `ChildFirstURLClassLoader`, whereas the `ClassLoader` of `CelebornShuffleHandle` in `SparkShuffleManager#getWriter/getReader` is `AppClassLoader`. The local test commands and the resulting log are as follows:
```
./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
```
```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader
```
As a result, before https://github.com/apache/spark/pull/43627, `SparkShuffleManager` falls back to the vanilla Spark `SortShuffleManager` when `spark.executor.userClassPathFirst=true` and the `ShuffleManager` is defined in a user jar. After [SPARK-45762](https://issues.apache.org/jira/browse/SPARK-45762), the `ClassLoader` of both `handle` and `CelebornShuffleHandle` is `ChildFirstURLClassLoader`.
```
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader
```
Therefore, `SparkShuffleManager` should print a warning log as a reminder when `spark.executor.userClassPathFirst=true` and the `ShuffleManager` is defined in a user jar.
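The risky combination can also be spotted before submitting a job. The sketch below is a hypothetical pre-flight check, not part of this PR: it takes `key=value` pairs (one per line) and reports `WARN` when `spark.executor.userClassPathFirst=true` appears together with a Celeborn client jar in `spark.jars`. The helper name and the `celeborn-client` file-name match are assumptions.

```shell
# Hypothetical pre-flight check over "key=value" lines.
# Prints WARN when userClassPathFirst=true is combined with a Celeborn
# client jar listed in spark.jars, and OK otherwise.
check_user_classpath_first() {
  if printf '%s\n' "$1" | grep -q '^spark\.executor\.userClassPathFirst=true$' \
     && printf '%s\n' "$1" | grep -q '^spark\.jars=.*celeborn-client'; then
    echo "WARN"
  else
    echo "OK"
  fi
}

conf='spark.executor.userClassPathFirst=true
spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar'

result="$(check_user_classpath_first "$conf")"
echo "$result"
```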
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #2482 from SteNicholas/CELEBORN-1402.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Improve the Celeborn documentation by fixing typos, table formats, and incorrect descriptions. In addition, `deploy.md` now documents MapReduce client deployment.
### Why are the changes needed?
The Celeborn documentation currently contains some typos and formatting problems. In addition, `deploy.md` does not cover deployment of the MapReduce client, which is inconsistent with `README.md` for Flink configuration.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2407 from SteNicholas/CELEBORN-1341.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add `execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING` to `Flink Configuration` of `Deploy Flink client` in `deploy.md`
### Why are the changes needed?
#2106 added validation that `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`. The `Flink Configuration` section of `Deploy Flink client` should therefore include this configuration as well.
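For illustration, a deployment script could verify the setting before starting a job. The helper name `has_blocking_shuffle_mode` and the temporary config file below are hypothetical; only the key and value come from this PR.

```shell
# Hypothetical check: succeed only if the given Flink config file sets
# execution.batch-shuffle-mode to ALL_EXCHANGES_BLOCKING.
has_blocking_shuffle_mode() {
  grep -Eq '^[[:space:]]*execution\.batch-shuffle-mode:[[:space:]]*ALL_EXCHANGES_BLOCKING[[:space:]]*$' "$1"
}

# Sample config file (hypothetical content).
flink_conf="$(mktemp)"
printf '%s\n' 'execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING' > "$flink_conf"

if has_blocking_shuffle_mode "$flink_conf"; then
  check_result="ok"
else
  check_result="missing"
fi
echo "$check_result"
rm -f "$flink_conf"
```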
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2355 from SteNicholas/CELEBORN-1134.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Improve `Spark Configuration` of `Deploy Spark client` in `deploy.md`.
Fix #2270.
### Why are the changes needed?
The `Spark Configuration` section of `Deploy Spark client` in the deployment document should cover Spark Dynamic Resource Allocation support:
```
# Support Spark Dynamic Resource Allocation
# Required Spark version >= 3.5.0
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO
# Required Spark version >= 3.4.0, highly recommended to disable
spark.dynamicAllocation.shuffleTracking.enabled false
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2278 from SteNicholas/CELEBORN-1260.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.
### Why are the changes needed?
The existing `RunningApplicationCount` metric only monitors the number of running applications in the cluster for the master, while the `/listTopDiskUsedApps` API lists the application ids with the highest disk usage for both master and worker. A `RunningApplicationCount` metric and an `/applications` API can therefore be introduced to record the running applications of a worker.
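A sketch of how the new endpoint might be queried is shown below. The host `worker-host` and port `9096` are placeholders rather than values taken from this PR, and the helper `applications_url` is hypothetical; only the `/applications` path comes from the description above.

```shell
# Hypothetical helper that builds the URL of the /applications endpoint.
# Host and port are placeholders; use the worker's actual HTTP address.
applications_url() {
  printf 'http://%s:%s/applications' "$1" "$2"
}

url="$(applications_url "worker-host" "9096")"
echo "$url"

# Against a live worker, the endpoint could then be queried with, for example:
#   curl -s "$url"
```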
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2172 from SteNicholas/CELEBORN-1189.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
1. Make Celeborn read configs from `HADOOP_CONF_DIR`.
2. Remove unnecessary Kerberos configs.
### Why are the changes needed?
To support HDFS with Kerberos.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster.
Closes #2082 from FMX/B1116.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Clarify a Spark config needed to work with Celeborn.
### Why are the changes needed?
After some tests, I found that Spark 3.1 and newer can work with Celeborn with `spark.shuffle.service.enabled=true`.
Since Spark 3.1, `ExternalShuffleBlockResolver` no longer checks the shuffle manager's type.
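The version condition can be sketched as a small check that a launcher script might run before enabling the external shuffle service alongside Celeborn. The helper name `spark_at_least_3_1` is hypothetical, and the sketch assumes a plain `major.minor.patch` version string.

```shell
# Hypothetical helper: succeed if a Spark version string (major.minor[.patch])
# is 3.1 or newer.
spark_at_least_3_1() {
  major="${1%%.*}"    # text before the first dot
  rest="${1#*.}"      # text after the first dot
  minor="${rest%%.*}" # text between the first and second dot
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }
}

for v in 3.0.3 3.1.2 3.5.0; do
  if spark_at_least_3_1 "$v"; then
    echo "$v: spark.shuffle.service.enabled=true is expected to work with Celeborn"
  else
    echo "$v: keep spark.shuffle.service.enabled=false"
  fi
done
```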
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
I tested two scenarios about this PR.
1. Check whether Spark can release the executors in time.
2. Check data correctness by running TPC-DS.
All checks are good.
Closes #1955 from FMX/CELEBORN-1010.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the incorrect deploy doc about using HDFS only.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
Just docs.
Closes #1874 from FMX/CELEBORN-941.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce a quick start guide for running Apache Flink with Apache Celeborn to help Flink users to run with Celeborn.
### Why are the changes needed?
There is no quick start guide for running Apache Flink with Apache Celeborn.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
None.
Closes #1868 from SteNicholas/CELEBORN-822.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes #1795 from cfmcgrady/sbt-docs.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1664 from AngersZhuuuu/CELEBORN-751.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Provide a new `SparkShuffleManager` to replace `RssShuffleManager` in the future.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1667 from AngersZhuuuu/CELEBORN-754.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add new buffer size for HDFS file writers.
3. Workers support empty working dirs.
### Why are the changes needed?
Support the HDFS-only scenario.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT and cluster.
Closes #1619 from FMX/CELEBORN-568.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
It was discussed during the last meeting, but abandoned due to the complication.
### Why are the changes needed?
Make the configuration unified.
### Does this PR introduce _any_ user-facing change?
Yes, but the legacy configurations still take effect.
### How was this patch tested?
New UTs.
Closes #1549 from pan3793/CELEBORN-638.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Refresh Celeborn configurations in the doc.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1592 from AngersZhuuuu/CELEBORN-680.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>