74 Commits

Author SHA1 Message Date
Luke Yan
c7c2f6a35a [CELEBORN-858] Generate patch to each Spark 3.x minor version
### What changes were proposed in this pull request?

Add the following patch files to the directory `incubator-celeborn/tree/spark3-patch/assets/spark-patch`:

1. Celeborn_Dynamic_Allocation_spark3_0.patch
2. Celeborn_Dynamic_Allocation_spark3_1.patch
3. Celeborn_Dynamic_Allocation_spark3_2.patch
4. Celeborn_Dynamic_Allocation_spark3_3.patch

Delete a patch at the same time:

1. Celeborn_Dynamic_Allocation_spark3.patch

Modify the `Support Spark Dynamic Allocation` section in incubator-celeborn/README.md:

![image](https://github.com/apache/incubator-celeborn/assets/108530647/61e2e69b-d3f5-4d11-a20b-374622936443)

### Why are the changes needed?

This makes it convenient for customers to apply the `Support Spark Dynamic Allocation` patch to each Spark 3.x minor version.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Yes. All patch files can be applied to the source code of the corresponding Spark version with `git apply`, without any conflicts.
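
As a minimal sketch of how a build script might pick the right file, the helper below maps a Spark version string to one of the four patch names listed above. The name `patch_for` is hypothetical and not part of Celeborn; only minor versions 3.0 to 3.3 are covered, since those are the files this PR adds.

```python
import re

# Patch files added by this PR, keyed by Spark (major, minor) version.
PATCHES = {
    (3, minor): f"Celeborn_Dynamic_Allocation_spark3_{minor}.patch"
    for minor in range(4)
}


def patch_for(spark_version: str) -> str:
    """Return the patch file name for a Spark version such as '3.2.1'.

    Hypothetical helper for illustration; raises ValueError for versions
    that have no patch file in this PR.
    """
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"unrecognized Spark version: {spark_version!r}")
    key = (int(match.group(1)), int(match.group(2)))
    if key not in PATCHES:
        raise ValueError(f"no patch file for Spark {key[0]}.{key[1]}")
    return PATCHES[key]


print(patch_for("3.2.1"))  # Celeborn_Dynamic_Allocation_spark3_2.patch
```

The returned file would then be applied with `git apply` from the root of the matching Spark source tree, as described above.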

Closes #2085 from lukeyan2023/spark3-patch.

Authored-by: Luke Yan <108530647+lukeyan2023@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-10 15:35:54 +08:00
mingji
02cea042a0 [CELEBORN-1116] Read authentication configs from HADOOP_CONF_DIR
### What changes were proposed in this pull request?
1. Make Celeborn read configs from HADOOP_CONF_DIR.
2. Remove unnecessary Kerberos configs.

### Why are the changes needed?
To support HDFS with Kerberos.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #2082 from FMX/B1116.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-09 11:07:13 +08:00
sychen
efa22a4936 [CELEBORN-1105][FLINK] Support Flink 1.18
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
cd flink-1.18.0
./bin/start-cluster.sh
./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
```

```java
Caused by: java.lang.NoSuchMethodError: org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.<init>(Ljava/lang/String;ILorg/apache/flink/runtime/jobgraph/IntermediateDataSetID;Lorg/apache/flink/runtime/io/network/partition/ResultPartitionType;Lorg/apache/flink/runtime/executiongraph/IndexRange;ILorg/apache/flink/runtime/io/network/partition/PartitionProducerStateProvider;Lorg/apache/flink/util/function/SupplierWithException;Lorg/apache/flink/runtime/io/network/buffer/BufferDecompressor;Lorg/apache/flink/core/memory/MemorySegmentProvider;ILorg/apache/flink/runtime/throughput/ThroughputCalculator;Lorg/apache/flink/runtime/throughput/BufferDebloater;)V
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate$FakedRemoteInputChannel.<init>(RemoteShuffleInputGate.java:225)
	at org.apache.celeborn.plugin.flink.RemoteShuffleInputGate.getChannel(RemoteShuffleInputGate.java:179)
	at org.apache.flink.runtime.io.network.partition.consumer.InputGate.setChannelStateWriter(InputGate.java:90)
	at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setChannelStateWriter(InputGateWithMetrics.java:120)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.injectChannelStateWriterIntoChannels(StreamTask.java:524)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.<init>(StreamTask.java:496)
```

Flink 1.18.0 release
https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/

Interface `org.apache.flink.runtime.io.network.buffer.Buffer` adds `setRecycler` method.
[[FLINK-32549](https://issues.apache.org/jira/browse/FLINK-32549)][network] Tiered storage memory manager supports ownership transfer for buffers

`org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate` constructor adds parameters.
[[FLINK-31638](https://issues.apache.org/jira/browse/FLINK-31638)][network] Introduce the TieredStorageConsumerClient to SingleInputGate
[[FLINK-31642](https://issues.apache.org/jira/browse/FLINK-31642)][network] Introduce the MemoryTierConsumerAgent to TieredStorageConsumerClient

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```bash
flink-1.18.0 ./bin/flink run examples/streaming/WordCount.jar --execution-mode BATCH
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
Job has been submitted with JobID d7fc5f0ca018a54e9453c4d35f7c598a
Program execution finished
Job with JobID d7fc5f0ca018a54e9453c4d35f7c598a has finished.
Job Runtime: 1635 ms
```

<img width="1297" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6a5266bf-2386-4386-b98b-a60d2570fa99">

Closes #2063 from cxzl25/CELEBORN-1105.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-06 15:53:39 +08:00
SteNicholas
f61fe17551 [CELEBORN-987][FOLLOWUP][DOC] README#Build and sbt#System Requirements should extend to Scala 2.13 and Spark 3.5
### What changes were proposed in this pull request?

`README#Build` and `sbt#System Requirements` extend to Scala 2.13.

### Why are the changes needed?

`README#Build` and `sbt#System Requirements` should extend to Scala 2.13 to align with the SBT CI test results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

SBT CI tests.

Closes #1987 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-14 09:54:22 +08:00
SteNicholas
c97628c510 [CELEBORN-987][DOC] README#Build should extend to Java8/11/17
### What changes were proposed in this pull request?

`README#Build` extends to Java 8/11/17. Meanwhile, a `jdk-17` Maven profile is added.

### Why are the changes needed?

`README#Build` should extend to Java 8/11/17. Meanwhile, the Maven profiles should include `jdk-17`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local maven compile.

Closes #1985 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-12 21:58:32 +08:00
Bowen Song
a734b8cb79 [CELEBORN-1020] Remove outdated info in README.md file
### What changes were proposed in this pull request?
The description of how to restart a Celeborn cluster is outdated; remove this part from the README file.

Closes #1957 from zgzzbws/edit-doc.

Authored-by: Bowen Song <song_bowen_work@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 00:11:47 +08:00
mingji
95c9ccfc3e [CELEBORN-1010] Update docs about spark.shuffle.service.enabled
### What changes were proposed in this pull request?
Clarify how the Spark config `spark.shuffle.service.enabled` works with Celeborn.

### Why are the changes needed?
After some tests, I found that Spark 3.1 and newer can work with Celeborn with `spark.shuffle.service.enabled=true`.

Since Spark 3.1, ExternalShuffleBlockResolver no longer checks the shuffle manager's type.
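
The version rule above can be written as a small check. This is only a sketch: `external_shuffle_service_ok` is a hypothetical name, and it encodes nothing beyond the statement that Spark 3.1 and newer work with `spark.shuffle.service.enabled=true`.

```python
def external_shuffle_service_ok(spark_version: str) -> bool:
    """Return True if the Spark version is 3.1 or newer."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 1)


for version in ("2.4.8", "3.0.3", "3.1.2", "3.4.1"):
    print(version, external_shuffle_service_ok(version))
```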

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
I tested two scenarios for this PR.
1. Check whether Spark can release the executors in time.
2. Check data correctness by running TPC-DS.
All checks are good.

Closes #1955 from FMX/CELEBORN-1010.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-08 09:15:42 +08:00
zhouyifan279
333db39713 [CELEBORN-954] Add documentation about reliable shuffle data storage
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
Yes. A new config was added in [README.md](https://github.com/apache/incubator-celeborn/blob/main/README.md#spark-configuration).

### How was this patch tested?

Closes #1938 from zhouyifan279/reliable-storage-doc.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-27 00:39:14 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a 4x 16 CPU, 64G Mem, 4 ESSD cluster.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s vs MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)
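
For a rough sense of scale, the reported run times can be turned into relative savings. The helper below is an illustrative calculation only; `time_saved_percent` is a hypothetical name, and the inputs are the four figures quoted above.

```python
def time_saved_percent(baseline: float, candidate: float) -> float:
    """Percentage of the baseline run time saved by the candidate."""
    return (baseline - candidate) / baseline * 100.0


# Figures reported above; units only need to match within each pair.
terasort = time_saved_percent(baseline=89, candidate=81)   # minutes
wordcount = time_saved_percent(baseline=38, candidate=35)  # seconds

print(f"Terasort:  {terasort:.1f}% less time")   # 9.0%
print(f"Wordcount: {wordcount:.1f}% less time")  # 7.9%
```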

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
mingji
2ee6e305f1
[CELEBORN-941] fix incorrect deploy doc
### What changes were proposed in this pull request?
Fix the incorrect deploy doc about using HDFS only.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Just docs.

Closes #1874 from FMX/CELEBORN-941.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-08-31 18:54:27 +08:00
liangbowen
1bf93991bc [CELEBORN-893][DOC] Fix Spark patch list text in Readme
### What changes were proposed in this pull request?

- Fix the text of Spark patch list

### Why are the changes needed?

Before:
<img width="909" alt="image" src="https://github.com/apache/incubator-celeborn/assets/1935105/1d402df1-3a68-4810-8f84-8ab61a38314c">

After:
<img width="908" alt="image" src="https://github.com/apache/incubator-celeborn/assets/1935105/2c733568-a08a-4951-bd5a-f4a444a28833">

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Screenshots attached.

Closes #1810 from bowenliang123/readme-patch.

Authored-by: liangbowen <liangbowen@gf.com.cn>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-14 14:54:58 +08:00
e
f78a7d349f [CELEBORN-794] Fix link of CONFIGURATIONS in README
### What changes were proposed in this pull request?

Modify CONFIGURATIONS to point to the correct address

### Why are the changes needed?

CONFIGURATIONS in README.md points to an invalid address

![image](https://github.com/apache/incubator-celeborn/assets/14961757/538294ee-3432-4e1e-a45e-4dc1983d50e8)
![image](https://github.com/apache/incubator-celeborn/assets/14961757/d4681603-5317-46ae-a2f5-e58fa72c706c)

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
NO

Closes #1714 from jiaoqingbo/CELEBORN-794.

Authored-by: e <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 18:08:09 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make the Celeborn leader clean expired app dirs on HDFS when an application is lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, and a newly created worker will try to delete any app directories it does not know about.
This causes app directories that are still in use to be deleted unexpectedly.
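
To illustrate the failure mode, the sketch below contrasts the two cleanup rules. It is not Celeborn's code: the function names and the plain sets standing in for HDFS directories and application state are assumptions made for the example.

```python
def worker_side_cleanup(hdfs_app_dirs: set, known_to_worker: set) -> set:
    """Problematic rule: delete every app dir this worker does not know."""
    return hdfs_app_dirs - known_to_worker


def leader_side_cleanup(hdfs_app_dirs: set, lost_apps: set) -> set:
    """Proposed rule: delete only dirs of apps the leader has marked lost."""
    return hdfs_app_dirs & lost_apps


hdfs_app_dirs = {"app-1", "app-2", "app-3"}

# A freshly started worker has seen no applications yet, so every
# directory looks unknown and would be removed, including live ones.
print(sorted(worker_side_cleanup(hdfs_app_dirs, known_to_worker=set())))

# The leader knows that only app-1 is lost, so only that dir is removed.
print(sorted(leader_side_cleanup(hdfs_app_dirs, lost_apps={"app-1"})))
```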

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
zhongqiang.czq
a0f4be67a9 [CELEBORN-765][DOC] Disable partitionSplit in Flink engine related configurations

### What changes were proposed in this pull request?
In the README, setting partitionSplit to false should be added to the Flink engine related configurations.

### Why are the changes needed?
Currently, MapPartition split is not supported, but shuffle partition split is enabled by default, so an error is thrown when a Flink task's shuffle data size exceeds the split threshold (1G by default).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manually

Closes #1679 from zhongqiangczq/readme.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-07-05 18:04:10 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Angerszhuuuu
6e35745736
[CELEBORN-753] Rename spark patch file name to make it more clear
### What changes were proposed in this pull request?
Rename the Spark patch file to make its name clearer.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1666 from AngersZhuuuu/CELEBORN-753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-30 11:41:12 +08:00
Angerszhuuuu
bd7c2ea35a [CELEBORN-746][BUILD] Rename project files from rss-xx to celeborn-xx
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1660 from AngersZhuuuu/CELEBORN-746.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-29 16:30:02 +08:00
mingji
40760ede3a [CELEBORN-568] Support storage type selection
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add new buffer size for HDFS file writers.
3. Workers support empty working dirs.

### Why are the changes needed?
Support the HDFS-only scenario.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1619 from FMX/CELEBORN-568.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-27 18:07:08 +08:00
Cheng Pan
e22379c3ab [CELEBORN-638] Migrate configurations celeborn.ha.master.* to celeborn.master.ha.*
### What changes were proposed in this pull request?

This was discussed during the last meeting but abandoned due to the complexity.

### Why are the changes needed?

Make the configuration unified.

### Does this PR introduce _any_ user-facing change?

Yes, but the legacy configurations still take effect.
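
One common way to keep legacy keys working after a rename is a fallback lookup, sketched below. This is an illustration under assumptions, not Celeborn's implementation: the `resolve` helper and the sample key suffix `example.key` are hypothetical; only the two prefixes come from this PR.

```python
NEW_PREFIX = "celeborn.master.ha."
LEGACY_PREFIX = "celeborn.ha.master."


def resolve(conf: dict, key: str, default=None):
    """Look up a celeborn.master.ha.* key, falling back to its legacy name."""
    if not key.startswith(NEW_PREFIX):
        raise ValueError(f"expected a key under {NEW_PREFIX}*")
    if key in conf:
        return conf[key]
    legacy_key = LEGACY_PREFIX + key[len(NEW_PREFIX):]
    return conf.get(legacy_key, default)


# 'example.key' is a made-up suffix used only for this demonstration.
old_style = {"celeborn.ha.master.example.key": "legacy-value"}
both = {**old_style, "celeborn.master.ha.example.key": "new-value"}

print(resolve(old_style, "celeborn.master.ha.example.key"))  # legacy-value
print(resolve(both, "celeborn.master.ha.example.key"))       # new-value
```

In this sketch the new key wins when both are set, so existing deployments keep working while new ones can adopt the unified name.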

### How was this patch tested?

New UTs.

Closes #1549 from pan3793/CELEBORN-638.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-16 18:18:26 +08:00
Angerszhuuuu
1ba6dee324 [CELEBORN-680][DOC] Refresh celeborn configurations in doc
### What changes were proposed in this pull request?
Refresh celeborn configurations in doc

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1592 from AngersZhuuuu/CELEBORN-680.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-15 13:59:38 +08:00
Ethan Feng
5600728149
[CELEBORN-619][CORE][SHUFFLE] Support enable DRA with Apache Celeborn
### What changes were proposed in this pull request?

Adapt the Spark DRA patch for Spark 3.4

### Why are the changes needed?

To support enabling DRA w/ Celeborn on Spark 3.4

### Does this PR introduce _any_ user-facing change?

Yes, this PR provides a DRA patch for Spark 3.4

### How was this patch tested?

Compiled with Spark 3.4

Closes #1529 from FMX/CELEBORN-619.

Authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: Ethan Feng <ethanfeng@apache.org>
2023-06-05 09:50:05 +08:00
Cheng Pan
ef8e556202
[CELEBORN-604][SPARK] Support Spark 3.4 (#1509) 2023-05-24 23:10:13 +08:00
minseok
6e166662f1
[CELEBORN-598] Fix Typos in README 2023-05-21 19:36:38 +08:00
Ethan Feng
7015d2463a
[CELEBORN-583] Merge pooled memory allocators. (#1490) 2023-05-18 10:37:30 +08:00
Ethan Feng
91b757555e
[CELEBORN-570] Update docs about monitor and deployment. (#1478) 2023-05-08 17:07:42 +08:00
Ethan Feng
58aa0ba48f
[CELEBORN-566] Refine docs to eliminate misleading configs. (#1473) 2023-05-03 17:25:59 +08:00
Ethan Feng
537fc94df2
[CELEBORN-549] Update readme about deploy flink client. (#1454) 2023-04-24 21:03:53 +08:00
Ethan Feng
8584f1049f
Add DingTalk Group info. (#1453) 2023-04-24 10:11:24 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size 2023-04-14 20:45:25 +08:00
Ethan Feng
599bdbeb72
[CELEBORN-420] Add hosts template and docs about start-all scripts. (#1354) 2023-03-16 11:33:32 +08:00
zhongqiangchen
4fb5b3d547
[CELEBORN-298] Fix the wrong configuration name in readme and conf.template (#1234) 2023-02-14 13:38:03 +08:00
Keyong Zhou
a67a275609
Update README.md 2023-02-01 10:46:55 +08:00
Cheng Pan
0c29c5dd57
[CELEBORN-180][BUILD][FOLLOWUP] Update CI workflow and docs (#1134) 2023-01-03 17:58:51 +08:00
Ethan Feng
65cb36c002
[CELEBORN-83][FOLLOWUP] Fix various bugs when using HDFS as storage. (#1065) 2022-12-15 15:20:29 +08:00
Ethan Feng
98864889c6
[CELEBORN-5] Update README for jira and slack. (#972) 2022-11-15 18:42:36 +08:00
Gabriel
0b78cbfee0
[COMMUNITY] Update README (#971) 2022-11-15 16:10:02 +08:00
leesf
3699683a3b
Fix and migrate some configs (#927) 2022-11-07 09:41:38 +08:00
Cheng Pan
873eeeb1ed
[BUILD] Add apache- prefix in release tarball name (#854) 2022-10-25 22:39:48 +08:00
Cheng Pan
8d7d397e71
Fix Configuration page and polish naming (#838)
* Fix Configuration page and polish naming

* nit

* nit

* comment
2022-10-24 12:46:25 +08:00
Cheng Pan
ea67f4e060
Introduce categories to ConfigEntry and migrate configurations (#775) 2022-10-17 16:56:54 +08:00
Cheng Pan
5829bda21a
Rework and migrate HA configuration system (#763) 2022-10-13 22:35:01 +08:00
Cheng Pan
f01a696313
Migrate and refactor configuration for master endpoints (#752) 2022-10-11 21:33:21 +08:00
dxheming
7ef4144ced
[DOC] Modify build cmd (#758) 2022-10-11 14:23:01 +08:00
Keyong Zhou
645339b024
Update README.md 2022-10-10 11:57:29 +08:00
Ethan Feng
59474c2f11
[INFRA]Update scripts and templates for new name. (#724) 2022-10-09 14:56:06 +08:00
Cheng Pan
ab16b4f101
[INFRA] Rename modules w/ celeborn prefix (#723) 2022-10-08 08:05:57 +08:00
Keyong Zhou
a2d2379153
[DOC] Replace RSS with Celeborn in docs (#715) 2022-10-06 10:37:46 +08:00
Keyong Zhou
fe3b5988f2
[REFACTOR] Change package name to org.apache.celeborn (#710) 2022-10-02 18:10:29 +08:00
Kerwin Zhang
10cfdec18f
[DOC] Update the calculation method of the worker's slot count (#702) 2022-09-30 16:02:59 +08:00