132 Commits

Author SHA1 Message Date
zky.zhoukeyong
6a5e3ed794 [CELEBORN-812] Cleanup SendBufferPool if idle for long
### What changes were proposed in this pull request?
Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than
`celeborn.client.push.sendbufferpool.expireTimeout`.

### Why are the changes needed?
Before this PR, the SendBufferPool cached send buffers and push tasks forever. If they are large
and are never reused, they waste memory and increase GC pressure.
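
A minimal sketch of the idle-expiry idea, not the actual `SendBufferPool` code. The class name, method names, and the explicit `nowMs` parameters are hypothetical (the clock is passed in only to keep the example deterministic), and `expireTimeoutMs` stands in for `celeborn.client.push.sendbufferpool.expireTimeout`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: a buffer pool that drops its cache after an idle period. */
class IdleExpiringBufferPool {
  private final Deque<byte[]> buffers = new ArrayDeque<>();
  // Assumption: plays the role of celeborn.client.push.sendbufferpool.expireTimeout.
  private final long expireTimeoutMs;
  private long lastAccessMs;

  IdleExpiringBufferPool(long expireTimeoutMs, long nowMs) {
    this.expireTimeoutMs = expireTimeoutMs;
    this.lastAccessMs = nowMs;
  }

  /** Reuses the oldest pooled buffer if it is large enough; otherwise allocates a new one. */
  synchronized byte[] acquire(int size, long nowMs) {
    lastAccessMs = nowMs;
    byte[] candidate = buffers.pollFirst();
    return (candidate != null && candidate.length >= size) ? candidate : new byte[size];
  }

  /** Returns a buffer to the pool and marks the pool as recently used. */
  synchronized void release(byte[] buffer, long nowMs) {
    lastAccessMs = nowMs;
    buffers.addLast(buffer);
  }

  /** Meant to be called periodically; clears the cache once the pool has been idle too long. */
  synchronized boolean cleanupIfIdle(long nowMs) {
    if (nowMs - lastAccessMs > expireTimeoutMs && !buffers.isEmpty()) {
      buffers.clear();
      return true;
    }
    return false;
  }

  synchronized int size() {
    return buffers.size();
  }
}
```

In a real client, `cleanupIfIdle` would presumably be driven by a scheduled background task using the system clock.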

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual tests.

Closes #1735 from waitinfuture/812-1.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-20 00:34:55 +08:00
Angerszhuuuu
5471a6afe5
[CELEBORN-804] ShuffleClient should cleanup shuffle infos when trigger unregisterShuffle
### What changes were proposed in this pull request?

After discussion, we confirmed that `shuffleManager.unregisterShuffle()` is triggered by Spark on both the driver and the executors. In this PR:

  1. Add a shuffle client on both the driver and executor sides in ShuffleManager.
  2. ShuffleClient calls `cleanupShuffle()` when `unregisterShuffle` is triggered.

This replaces https://github.com/apache/incubator-celeborn/pull/1719

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1726 from AngersZhuuuu/CELEBORN-804.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-19 20:50:18 +08:00
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested.

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
Angerszhuuuu
c8ad39d9bd [CELEBORN-809] Directly use isDriver passed from SparkEnv
### What changes were proposed in this pull request?
As title
<img width="1051" alt="Screenshot 2023-07-19 1:01:25 PM" src="https://github.com/apache/incubator-celeborn/assets/46485123/26d506b2-bab9-43f5-9bbe-58d22a761bab">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1732 from AngersZhuuuu/CELEBORN-809.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-19 15:20:01 +08:00
Cheng Pan
0db919403e Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…"
This reverts commit e56a8a8bed.
2023-07-19 15:08:45 +08:00
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to cleanup client

### What changes were proposed in this pull request?
Add a heartbeat from the client to the lifecycle manager. In this PR, the heartbeat request contains the client's local shuffle ids;
the lifecycle manager checks them against its local set and returns the ids it doesn't know. Upon receiving the response,
the client calls ```unregisterShuffle``` for cleanup.
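
The exchange described above amounts to a set difference. The following is a sketch under assumptions, not the actual RPC code: the class and method names are hypothetical, shuffle ids are modeled as plain integers, and removing an id from the client's set stands in for calling `unregisterShuffle`.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the heartbeat exchange; names are illustrative only. */
class ShuffleHeartbeatSketch {
  /** Manager side: returns the ids from the request that the manager does not know. */
  static Set<Integer> unknownShuffleIds(Set<Integer> clientIds, Set<Integer> managerIds) {
    Set<Integer> unknown = new HashSet<>(clientIds);
    unknown.removeAll(managerIds);
    return unknown;
  }

  /** Client side: drops local state for every id listed in the response. */
  static void applyResponse(Set<Integer> clientIds, Set<Integer> unknownIds) {
    // Assumption: in a real client this would call unregisterShuffle for each id.
    clientIds.removeAll(unknownIds);
  }
}
```

With client ids {1, 2, 3} and manager ids {2, 3, 4}, the response contains only 1, and the client is left with {2, 3}.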

### Why are the changes needed?
Before this PR, the client-side ```unregisterShuffle``` was never called. When running TPCDS 3T with Spark Thrift Server
without DRA, I found that the executor's heap contained 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousand:
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, e.g. reporting client metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
Cheng Pan
1ec4f4a9f5 [CELEBORN-801] Warn when local shuffle reader is enabled
### What changes were proposed in this pull request?

Warn when local shuffle reader is enabled.

```
Detected spark.sql.adaptive.localShuffleReader.enabled (default is true) is enabled,
it's highly recommended to disable it when use Celeborn as Remote Shuffle Service to
avoid performance degradation.
```

### Why are the changes needed?

When the local shuffle reader is enabled, a reduce task may read shuffle data by map id, which does not match the Celeborn shuffle data clustering model and causes extremely bad shuffle read performance.

### Does this PR introduce _any_ user-facing change?

Yes, users will see a warning message in the driver log when `spark.sql.adaptive.localShuffleReader.enabled` is true.
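
The check described above can be sketched roughly as follows. This is an illustration only: the class name, method name, and shortened message are hypothetical, and the Spark configuration is modeled as a plain `Map` rather than a `SparkConf`.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the startup check; not the actual Celeborn code. */
class LocalShuffleReaderCheck {
  static final String KEY = "spark.sql.adaptive.localShuffleReader.enabled";

  /**
   * Returns a warning if the local shuffle reader is enabled.
   * A missing key is treated as enabled, because the message above states the default is true.
   */
  static Optional<String> warningFor(Map<String, String> sparkConf) {
    boolean enabled = Boolean.parseBoolean(sparkConf.getOrDefault(KEY, "true"));
    if (!enabled) {
      return Optional.empty();
    }
    return Optional.of("Detected " + KEY + " is enabled; disabling it is recommended with Celeborn.");
  }
}
```

Setting `spark.sql.adaptive.localShuffleReader.enabled=false` in the job configuration makes the check return no warning.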

### How was this patch tested?

Review.

Closes #1721 from pan3793/CELEBORN-801.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:43:50 +08:00
zky.zhoukeyong
10a1def512 [CELEBORN-802] Reuse DataPusher#idleQueue by pooling to avoid too many byte[] objects
### What changes were proposed in this pull request?
Reuse ```DataPusher#idleQueue``` by pooling in ```SendBufferPool``` to avoid too many ```byte[]```
objects in ```PushTask```.

### Why are the changes needed?
I'm testing 3T TPCDS. Before this PR, containers were killed because of OOM and GC time was about 9.6h. For the executors still alive, I dumped the memory and saw about 20,000 PushTask objects and 23356 ```64k``` byte[] objects, around 1.7G in total:
![image](https://github.com/apache/incubator-celeborn/assets/948245/7b4ee4fa-7860-4ddb-b862-181a91748092)

After this PR, no container is killed because of OOM and GC time is about 8.6h. I also dumped an executor and found
3584 PushTask objects and 5783 ```64K``` byte[] objects, around 361M in total:
![image](https://github.com/apache/incubator-celeborn/assets/948245/981e8f70-52f8-4bb1-9f67-9a8b4f398392)

Also, total execution time drops from ```3313.8s``` before this PR to ```3229.5s``` after it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and Manual test.

Closes #1722 from waitinfuture/802.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 16:35:14 +08:00
zky.zhoukeyong
a7bbbd05c4 [CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf
### What changes were proposed in this pull request?
1. Decrease writeTime metric sampling frequency to improve perf
2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4
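
One way to lower the sampling frequency of a timing metric is to time only every N-th call and scale the result. The sketch below illustrates that general technique under assumptions; it is not taken from the Celeborn code, and the class name, the `sampleInterval` parameter, and the estimation formula are all hypothetical.

```java
/** Hypothetical sketch: time only every N-th write to reduce System.nanoTime() overhead. */
class SampledWriteTimer {
  private final int sampleInterval;
  private long calls;
  private long samples;
  private long sampledNanos;

  SampledWriteTimer(int sampleInterval) {
    if (sampleInterval < 1) {
      throw new IllegalArgumentException("sampleInterval must be >= 1");
    }
    this.sampleInterval = sampleInterval;
  }

  /** Runs the write; calls 0, N, 2N, ... are timed, all others run untimed. */
  void record(Runnable write) {
    boolean sampled = (calls % sampleInterval) == 0;
    calls++;
    if (!sampled) {
      write.run();
      return;
    }
    long start = System.nanoTime();
    write.run();
    sampledNanos += System.nanoTime() - start;
    samples++;
  }

  long calls() {
    return calls;
  }

  long samples() {
    return samples;
  }

  /** Estimates total write time by scaling the mean sampled duration to all calls. */
  long estimatedTotalNanos() {
    return samples == 0 ? 0L : (long) ((double) sampledNanos / samples * calls);
  }
}
```

The trade-off is accuracy for overhead: a larger interval means fewer clock reads per record but a noisier estimate.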

### Why are the changes needed?
The following test cases were used:

- case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data
- case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data

End-to-end time of the shuffle write stage:

|Case|Sort pusher before|Sort pusher after|Hash pusher before|Hash pusher after|
|----|----|----|----|----|
|case 1|4.4min|4.1min|4.4min|3.9min|
|case 2|9.1min|8.4min|9.7min|8.5min|

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1718 from waitinfuture/797.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 20:51:50 +08:00
无迹
e1337972e8 [CELEBORN-792] SparkShuffleManager.getWriter use wrong appUniqueId for Spark2

### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA and manual test.

Closes #1717 from shujiewu/CELEBORN-792.

Authored-by: 无迹 <peter.wsj@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-14 17:17:48 +08:00
Fu Chen
90ba9f3e87 [CELEBORN-783][FOLLOWUP] Private member updates and cleanup in SortBasedPusher
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1699#discussion_r1259137323

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1704 from cfmcgrady/insert-record-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 23:08:42 +08:00
Fu Chen
e47ec10cef [CELEBORN-783] Revise the conditions for the SortBasedPusher#insertRecord method
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

[comment](7adf1fca41 (r121138008))

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT

Closes #1699 from cfmcgrady/insert-record.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 11:36:29 +08:00
Fu Chen
2bd1d86d41
[CELEBORN-775] Fix executorCores calculation in SparkShuffleManager for Spark local mode
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

```shell
$ bin/spark-shell --master local[2]
23/07/06 16:11:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/06 16:11:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[2], app id = local-1688631101733).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sparkContext.getConf.get("spark.executor.cores")
java.util.NoSuchElementException: spark.executor.cores
  at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.SparkConf.get(SparkConf.scala:245)
  ... 47 elided

scala>
```
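
The session above shows that `spark.executor.cores` is not set in local mode, so reading it unconditionally throws. One possible fallback is sketched below under assumptions: it is not the actual `SparkShuffleManager` logic, the class and method names are hypothetical, the configuration is modeled as a plain `Map`, and returning 1 for non-local masters is a simplification chosen for the sketch.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: derive executor cores when spark.executor.cores is absent. */
class ExecutorCoresSketch {
  // Matches local[N], local[*], local[N,F] and local[*,F].
  private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)(?:,\\s*\\d+)?\\]");

  static int executorCores(Map<String, String> conf) {
    String explicit = conf.get("spark.executor.cores");
    if (explicit != null) {
      return Integer.parseInt(explicit);
    }
    String master = conf.getOrDefault("spark.master", "");
    if ("local".equals(master)) {
      return 1; // plain "local" runs with a single worker thread
    }
    Matcher m = LOCAL_N.matcher(master);
    if (m.matches()) {
      String n = m.group(1);
      return "*".equals(n) ? Runtime.getRuntime().availableProcessors() : Integer.parseInt(n);
    }
    return 1; // assumption for this sketch only
  }
}
```

For the `local[2]` master used in the session, this returns 2 instead of throwing `NoSuchElementException`.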

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CelebornPipelineSortSuite should cover this change

Closes #1685 from cfmcgrady/local-core-number.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 16:29:59 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Angerszhuuuu
4c67325a3d
[CELEBORN-720][SPARK] Correct metric peakExecutionMemory of SortBasedShuffleWriter
### What changes were proposed in this pull request?
Currently SortBasedShuffleWriter does not update peakMemoryUsedBytes; this PR adds support for it.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1632 from AngersZhuuuu/CELEBORN-720.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 18:40:06 +08:00
Fu Chen
4b8f126d54 [CELEBORN-716][BUILD] Correct the to name when renaming the Netty native library
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Before this PR, `liborg_apache_celeborn_shaded_netty_transport_native_epoll_aarch_64.so` could not be loaded correctly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested

```shell
> tar zxf celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar
> find * -name "*.so"
META-INF/native/liborg_apache_celeborn_shaded_netty_transport_native_epoll_aarch_64.so
META-INF/native/liborg_apache_celeborn_shaded_netty_transport_native_epoll_x86_64.so
```

Closes #1625 from cfmcgrady/typo.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-26 21:57:06 +08:00
Fu Chen
1b3ec61690 [CELEBORN-711][TEST] Rework PushDataTimeoutTest
### What changes were proposed in this pull request?

1. Separated push data timeout tests and push merge data timeout tests in `PushDataTimeoutTest`
2. Updated the test result assertions
3. Reworked `pushdata timeout will add to blacklist`

### Why are the changes needed?

Ensure that the timeout behavior is correctly implemented.

https://github.com/apache/incubator-celeborn/pull/1613#discussion_r1236423721

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1620 from cfmcgrady/push-timeout-test.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-26 13:45:27 +08:00
zky.zhoukeyong
6b82ecdfa0 [CELEBORN-712] Make appUniqueId a member of ShuffleClientImpl and refactor code
### What changes were proposed in this pull request?
Make appUniqueId a member of ShuffleClientImpl and remove applicationId from RPC messages on the client side, so it won't cause compatibility issues.

### Why are the changes needed?
Currently a Celeborn client is bound to a single application id, so there's no need to pass applicationId around in many client-side RPC messages.

### Does this PR introduce _any_ user-facing change?
In some logs the application id will not be printed, which should not be a problem.

### How was this patch tested?
UTs.

Closes #1621 from waitinfuture/appid.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-25 21:37:16 +08:00
Fu Chen
18f2be0fbe
[CELEBORN-693][SPARK] Align the incWriterTime in the hash-based shuffle writer with the sort-based shuffle
### What changes were proposed in this pull request?

As title.

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1585#issuecomment-1589164128

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

tested locally.

Closes #1604 from cfmcgrady/hash-based-writer-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-19 15:42:01 +08:00
sychen
e734ceb558 [MINOR] Cleanup code
### What changes were proposed in this pull request?
1. Use `<arg>-Ywarn-unused-import</arg>` to find and remove some unused imports.
The flag cannot be kept enabled at this stage
because we have the following code:
```
// Can Remove this if celeborn don't support scala211 in future
import org.apache.celeborn.common.util.FunctionConverter._
```
2. Fix non-exhaustive Scala pattern matches to avoid `scala.MatchError`
3. Fix some Scala compilation warnings

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1600 from cxzl25/cleanup_code.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-19 11:31:51 +08:00
Fu Chen
b9c9c00697 [CELEBORN-683][SPARK][PERF] Avoid calling CelebornConf.get multi-time when columnar shuffle write is enabled.

### What changes were proposed in this pull request?

as title.

### Why are the changes needed?

Flame graph and stage duration before:

![Screenshot 2023-06-15 4:49:04 PM](https://github.com/apache/incubator-celeborn/assets/8537877/6fe7f7f6-fd36-42ec-a6a1-9a4943022dc8)

![Screenshot 2023-06-15 4:57:53 PM](https://github.com/apache/incubator-celeborn/assets/8537877/077f6c22-4dc9-497a-affe-ddba9200fe28)

Flame graph and stage duration after:

![Screenshot 2023-06-15 4:37:45 PM](https://github.com/apache/incubator-celeborn/assets/8537877/d6ae7aa6-95c7-490e-a0ae-c110e6a83e5a)

![Screenshot 2023-06-15 4:58:12 PM](https://github.com/apache/incubator-celeborn/assets/8537877/e8dd5c3b-94d9-47d7-a644-4897acef43ad)

### Does this PR introduce _any_ user-facing change?

No, only perf improvement.

### How was this patch tested?

tested locally.

Closes #1595 from cfmcgrady/columnar-conf.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-15 17:52:23 +08:00
Fu Chen
86cbf7a359
[CELEBORN-673][SPARK][PERF] Improve the perf of sort-based shuffle write
### What changes were proposed in this pull request?

1. `SQLShuffleWriteMetricsReporter#incWriteTime` is a performance killer, so stop calling it on every record insertion
2. Simplify the `incWriteTime` logic for handling large records, and include the time required for memory copying

### Why are the changes needed?

Flame graph and stage duration before:

![Screenshot 2023-06-13 3:30:53 PM](https://github.com/apache/incubator-celeborn/assets/8537877/5fb0a242-82d1-4348-aeaa-4af75a012308)

![Screenshot 2023-06-13 3:31:26 PM](https://github.com/apache/incubator-celeborn/assets/8537877/3ded2f16-1c17-4120-8d10-31ea7b5182a2)

Flame graph and stage duration after:

![Screenshot 2023-06-13 3:33:08 PM](https://github.com/apache/incubator-celeborn/assets/8537877/fbe45cf2-4d23-4d6c-a476-64338e1610f1)

![Screenshot 2023-06-13 3:33:59 PM](https://github.com/apache/incubator-celeborn/assets/8537877/9129d771-ad36-42e9-86b7-e454d2f8e0b0)

### Does this PR introduce _any_ user-facing change?

No, only perf improvement

### How was this patch tested?

tested locally.

Closes #1585 from cfmcgrady/shuffle-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-13 19:07:04 +08:00
Fu Chen
79806b27ca [CELEBORN-664][SPARK][PERF] Improve the perf of columnar shuffle write
### What changes were proposed in this pull request?

Per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, use `while` loops in performance-sensitive code.

Flame graph and shuffle write time before:

![Screenshot 2023-06-12 4:18:24 PM](https://github.com/apache/incubator-celeborn/assets/8537877/59d94e05-71b5-4474-bebe-66df554ccc48)

![Screenshot 2023-06-12 4:19:56 PM](https://github.com/apache/incubator-celeborn/assets/8537877/e24bb8b2-5b16-431b-92ae-cb8216e69d16)

Flame graph and shuffle write time after:

![Screenshot 2023-06-12 4:18:38 PM](https://github.com/apache/incubator-celeborn/assets/8537877/18a84774-2197-487d-aa51-b33445619210)

![Screenshot 2023-06-12 4:21:39 PM](https://github.com/apache/incubator-celeborn/assets/8537877/26d95e5a-6e68-46b7-8c8c-49eb2d2e252f)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1577 from cfmcgrady/columnar-perf.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-12 18:46:00 +08:00
Fu Chen
cc716506f9 [CELEBORN-659][SPARK][TEST] Refine RssShuffleWriterSuiteJ
### What changes were proposed in this pull request?

1. Renamed `RssShuffleWriterSuiteJ` to `CelebornShuffleWriterSuiteBase`, which now serves as an abstract base class.
2. Added two new classes, `HashBasedShuffleWriterSuiteJ` and `SortBasedShuffleWriterSuiteJ`. They extend `CelebornShuffleWriterSuiteBase` and provide suites for testing the hash-based and sort-based shuffle writers.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1570 from cfmcgrady/sort-based-writer-suite.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-12 13:48:52 +08:00
Cheng Pan
76533d7324
[CELEBORN-650][TEST] Upgrade scalatest and unify mockito version
### What changes were proposed in this pull request?

This PR upgrades

- `mockito` from 1.10.19 and 3.6.0 to 4.11.0
- `scalatest` from 3.2.3 to 3.2.16
- `mockito-scalatest` from 1.16.37 to 1.17.14

### Why are the changes needed?

Housekeeping, making test dependencies up-to-date and unified.

### Does this PR introduce _any_ user-facing change?

No, it only affects tests.

### How was this patch tested?

Pass GA.

Closes #1562 from pan3793/CELEBORN-650.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-09 10:04:14 +08:00
Cheng Pan
6b64b1de9c
[CELEBORN-648][SPARK] Improve perf of SendBufferPool and logs about memory
### What changes were proposed in this pull request?

- Replace index-based item access with an iterator for LinkedList.
- Always try to remove a buffer if SendBufferPool does not have a matching candidate; this reduces the total number of buffers in the worst case from `capacity+N-1` to `capacity`.
- Polish some logs and code.
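
The first two points can be illustrated with a small sketch. It is not the actual `SendBufferPool` implementation: the class name, the exact-size matching rule, and the choice to evict the oldest buffer are assumptions. Iterating matters because `LinkedList.get(i)` walks the list from one end on every call, so an index-based scan is quadratic, while a single iterator pass is linear.

```java
import java.util.Iterator;
import java.util.LinkedList;

/** Hypothetical sketch of a bounded pool that scans with an iterator and evicts on a miss. */
class BoundedBufferPoolSketch {
  private final LinkedList<byte[]> buffers = new LinkedList<>();
  private final int capacity;

  BoundedBufferPoolSketch(int capacity) {
    this.capacity = capacity;
  }

  synchronized byte[] acquire(int size) {
    // One linear pass with an iterator instead of repeated buffers.get(i) calls.
    Iterator<byte[]> it = buffers.iterator();
    while (it.hasNext()) {
      byte[] candidate = it.next();
      if (candidate.length == size) {
        it.remove();
        return candidate;
      }
    }
    // No matching candidate: evict one pooled buffer so unusable buffers do not pile up.
    if (!buffers.isEmpty()) {
      buffers.removeFirst();
    }
    return new byte[size];
  }

  /** Keeps the buffer only while the pool is below capacity. */
  synchronized void release(byte[] buffer) {
    if (buffers.size() < capacity) {
      buffers.addLast(buffer);
    }
  }

  synchronized int size() {
    return buffers.size();
  }
}
```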

### Why are the changes needed?

Improve performance and logs, reduce memory consumption.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1560 from pan3793/CELEBORN-648.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-09 09:45:27 +08:00
Cheng Pan
0636e3ca40
[CELEBORN-654][SPARK] SortBasedShuffleWriter does not require mapStatusRecords in Spark 3
### What changes were proposed in this pull request?

`mapStatusRecords` is required in Spark 2 for constructing `MapStatus` when AQE is enabled, but not in Spark 3, so remove it to save memory and compute resources.

This PR also simplifies the `for loop` code.

### Why are the changes needed?

Remove unnecessary variables to save resources and clean up code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1564 from pan3793/CELEBORN-654.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-09 09:43:08 +08:00
Cheng Pan
1ae8eb7145 [CELEBORN-655][SPARK] Rename newAppId to appUniqueId
### What changes were proposed in this pull request?

Rename variable `newAppId` to `appUniqueId` in Spark client.

### Why are the changes needed?

Make the variable name intuitive.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1565 from pan3793/CELEBORN-655.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-08 22:14:20 +08:00
Cheng Pan
5bc37f1286
[CELEBORN-637] Remove support for rss.* configuration alias
### What changes were proposed in this pull request?

Remove support for `rss.*` configuration alias

### Why are the changes needed?

The legacy `rss.*` configuration alias was added when Celeborn entered the Apache Incubator, to simplify users' migration from RSS to Celeborn.

Many configuration changes have happened since Celeborn 0.2, and the `rss.*` configuration alias has become less helpful, so remove it to clean up the code.

### Does this PR introduce _any_ user-facing change?

Yes, but this is expected; the `rss.*` compatibility has never been documented.

### How was this patch tested?

Pass GA.

Closes #1547 from pan3793/CELEBORN-637.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-07 22:28:36 +08:00
xiyu.zk
82bdea7085 [CELEBORN-620] Fix columnar shuffle codegen exception
### What changes were proposed in this pull request?
Fix columnar shuffle codegen exception. This is a refactoring of #1523.

Closes #1543 from kerwin-zk/issue-620.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-06-05 12:05:06 +08:00
Angerszhuuuu
4df4775524
[CELEBORN-632][DOC] Add spark name space to spark specify properties (#1538)
2023-06-02 21:48:56 +08:00
Ethan Feng
d33916e571
[CELEBORN-625] Add a config to enable/disable UnsafeRow fast write. (#1532)
2023-06-01 20:55:45 +08:00
Angerszhuuuu
cf308aa057
[CLEBORN-595] Refine code frame of CelebornConf (#1525)
2023-06-01 10:37:58 +08:00
Angerszhuuuu
62681ba85d
[CELEBORN-595] Rename and refactor the configuration doc. (#1501)
2023-05-30 15:14:12 +08:00
Cheng Pan
ef8e556202
[CELEBORN-604][SPARK] Support Spark 3.4 (#1509)
2023-05-24 23:10:13 +08:00
Angerszhuuuu
a22c61e479
[CELEBORN-582] Celeborn should handle InterruptedException during kill task properly (#1486)
2023-05-17 18:17:41 +08:00
Angerszhuuuu
783d4e5dc5
[CELEBORN-551] Remove unnecessary ShuffleClient.get() (#1456)
2023-05-04 20:47:45 +08:00
cxzl25
13f772e0c0
[CELEBORN-525] Fix wrong parameter celeborn.push.buffer.size
2023-04-14 20:45:25 +08:00
Kerwin Zhang
27a1f369cf
[CELEBORN-472] Support using Celeborn in the scenario of switching multiple SparkContexts in the same process (#1379)
2023-03-27 16:10:34 +08:00
Keyong Zhou
7adf1fca41
[CELEBORN-295] Optimize data push (#1232)
* [CELEBORN-295] Add double buffer for sort pusher
2023-02-28 10:35:55 +08:00
jiaoqingbo
bd9e0ddc1f
[CELEBORN-304] Missing setIfMissing celeborn.$module.io.serverThreads (#1238)
2023-02-15 15:49:08 +08:00
Angerszhuuuu
c410392284
[CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler (#1197)
* [CELEBORN-265] Integration with Spark3.0 cast class exception of ShuffleHandler
2023-02-02 11:52:51 +08:00
Keyong Zhou
e47f1e33b0
[CELEBORN-55][FOLLOWUP] Code refine (#1175)
2023-01-20 16:22:47 +08:00
zy.jordan
c5be79ee3d
[CELEBORN-55][FEATURE] Split maxReqsInFlight limitation into level of target worker (#1102)
2023-01-20 10:18:45 +08:00
jxysoft
41b1fa46d3
[CELEBORN-185][SPARK] Can't release shuffle data if rss fallback to nss (#1133)
Co-authored-by: xianyao.jiang <xianyao.jiang@antfin.com>
2023-01-03 14:28:09 +08:00
nafiy
ddab27a1d7
[CELEBORN-145][REFACTOR] Add reason in CheckQuotaResponse (#1093)
* [CELEBORN-145][REFACTOR] Add reason in CheckQuotaResponse
2022-12-15 18:16:34 +08:00
Cheng Pan
ec371c0026
[CELEBORN-132] ShuffleClient should not implement Cloneable (#1077)
2022-12-14 10:04:39 +08:00
Angerszhuuuu
dac2ba6b40
[CELEBORN-114][REFACTOR] Keep same log code in spark2/spark3 of quota exceed (#1058)
2022-12-09 12:13:01 +08:00
nafiy
529bb22781
[ISSUE-958][REFACTOR] Add and modify log of fallback policy (#965)
2022-11-14 20:16:33 +08:00