Commit Graph

375 Commits

Author SHA1 Message Date
SteNicholas
3097ffe33b [CELEBORN-678][FOLLOWUP] MapperAttempts for a shuffle should reply MAP_ENDED when mapper has already been ended from speculative task
### What changes were proposed in this pull request?

MapperAttempts for a shuffle replies `MAP_ENDED` when the mapper has already ended while receiving push data or push merged data from a speculative task.

Follow up #1591.

### Why are the changes needed?

When the mapper has already ended while receiving push data or push merged data from a speculative task, `PushDataHandler` should trigger MapEnd instead of StageEnd for the worker. Meanwhile, `ShuffleClientImpl` should handle `STAGE_ENDED` as MapEnd; otherwise other tasks of the stage cannot send shuffle data, which leads to data loss.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2190 from SteNicholas/CELEBORN-678.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-27 20:40:40 +08:00
zwangsheng
6c2fdf7477 [CELEBORN-1188][TEST] Using JUnit function instead of java assert
### What changes were proposed in this pull request?
Use JUnit assertion functions instead of Java `assert`.

### Why are the changes needed?
When a Java `assert` fails, it throws an `AssertionError`, which makes it hard to see the difference between the expected and actual values.

![Screenshot 2023-12-20 10 34 52](https://github.com/apache/incubator-celeborn/assets/52876270/b36421a5-64e1-4717-a6d4-3b08db403293)

With JUnit assertions, the difference is shown clearly.

![Screenshot 2023-12-20 11 17 21](https://github.com/apache/incubator-celeborn/assets/52876270/ce39fa20-e9ab-4419-a4ca-62c4157e4b2c)

### Does this PR introduce _any_ user-facing change?
NO, only test changed

### How was this patch tested?
Run CI

Closes #2173 from zwangsheng/CELEBORN-1188.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-20 21:20:38 +08:00
sychen
7f653ce7d6 [CELEBORN-1190] Apply error prone patch and suppress some problems
### What changes were proposed in this pull request?
1. Fix the MissingOverride, DefaultCharset, and UnnecessaryParentheses rules
2. Exclude generated sources, FutureReturnValueIgnored, TypeParameterUnusedInFormals, UnusedVariable

### Why are the changes needed?
```
./build/make-distribution.sh --release
```
We get a lot of WARNINGs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2177 from cxzl25/error_prone_patch.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-20 20:54:18 +08:00
SteNicholas
35aa54bfe3 [MINOR] Update log level of CommitFiles success for CommitHandler from error to info
### What changes were proposed in this pull request?

Update log level of CommitFiles success for `CommitHandler` from error to info.

### Why are the changes needed?

The log level of sending CommitFiles success for `CommitHandler` should not be error.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2174 from SteNicholas/commit-files-log.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-20 15:13:38 +08:00
mingji
4dacf72a6d [CELEBORN-1150] support io encryption for spark
### What changes were proposed in this pull request?
1. To support io encryption for spark.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and manually test on a cluster.

Closes #2135 from FMX/B1150.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-19 11:44:05 +08:00
zky.zhoukeyong
4b7702e49c [CELEBORN-1181] Filter out null endpoint workers in destroySlotsWithRetry
### What changes were proposed in this pull request?
To avoid NPE in `val future = workerInfo.endpoint.ask[DestroyWorkerSlotsResponse](destroy)`
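
The guard can be sketched as follows. This is an illustration in Java rather than the project's Scala, and `WorkerInfoSketch` is a hypothetical stand-in, not Celeborn's real `WorkerInfo`: workers whose endpoint reference is still null are skipped before any request is sent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NullEndpointFilter {
  // Hypothetical stand-in for a worker descriptor; not Celeborn's real WorkerInfo.
  static final class WorkerInfoSketch {
    final String host;
    final Object endpoint; // null until a connection to the worker has been set up

    WorkerInfoSketch(String host, Object endpoint) {
      this.host = host;
      this.endpoint = endpoint;
    }
  }

  // Keep only workers that can actually be asked; dereferencing a null
  // endpoint is what would raise the NullPointerException.
  static List<WorkerInfoSketch> withEndpoint(List<WorkerInfoSketch> workers) {
    List<WorkerInfoSketch> usable = new ArrayList<>();
    for (WorkerInfoSketch w : workers) {
      if (w.endpoint != null) {
        usable.add(w);
      }
    }
    return usable;
  }

  public static void main(String[] args) {
    List<WorkerInfoSketch> workers = Arrays.asList(
        new WorkerInfoSketch("worker-1", new Object()),
        new WorkerInfoSketch("worker-2", null));
    for (WorkerInfoSketch w : withEndpoint(workers)) {
      System.out.println("destroy slots on " + w.host);
    }
  }
}
```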

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test

Closes #2166 from waitinfuture/1181.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-17 20:17:20 +08:00
zky.zhoukeyong
e361788e48 [CELEBORN-1178] Destroy fail reserved slots in LifecycleManager#reserveSlotsWithRetry
### What changes were proposed in this pull request?
While testing the main branch, I encountered the following scenario.
I ran `sbin/stop-worker.sh` nearly simultaneously on 3 out of 6 workers, expecting the 3 workers
to shut down soon because graceful shutdown was enabled. However, only the first worker I stopped
shut down in 15s as expected; the other two did not shut down until the shutdown timeout.

After digging into it, I found that `LifecycleManager#reserveSlotsWithRetry` reserves for the same location
twice:
1. At T1, only worker1 has shut down; pushes receive HARD_SPLIT and go to revive
2. At T2, LifecycleManager handles revive requests in batch and tries to reallocate the locs to other workers
3. At T3, reserving on worker3 succeeds because it has not shut down yet, but reserving on worker2 fails because it has shut down
4. At T4, LifecycleManager re-allocates the failed slots to other workers except worker1 and worker2. However, by this time worker3 has also shut down, so reserving on worker3 fails
5. At T5, it re-allocates the slots that failed on worker3. However, `getFailedPartitionLocations` will return slots allocated to worker3 in step 3, and increment the epoch to 2. At this time, worker3 has slots of epoch 1, but they will never be pushed to because newer epoch 3 is generated at the same time
6. Since the epoch 2 locs in worker3 will never be pushed to, it never gets a chance to return HARD_SPLIT; as a result it cannot shut down quickly and has to wait until the timeout.

This PR fixes this by destroying the slots that failed to be reserved during `reserveSlotsWithRetry`.

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Before:
![image](https://github.com/apache/incubator-celeborn/assets/948245/50c55524-d37f-494e-a5aa-fba682438cda)
After:
![image](https://github.com/apache/incubator-celeborn/assets/948245/8c90a869-b388-46f3-a86b-a37fd0f4ce0f)

Closes #2163 from waitinfuture/1178.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-17 14:28:04 +08:00
zky.zhoukeyong
309153a99b [CELEBORN-1175] Add UT for commit files
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes UTs.

Closes #2162 from waitinfuture/1175-2.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 01:36:29 +08:00
zky.zhoukeyong
01feb93abb [CELEBORN-1167] Avoid calling parmap when destroy slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When LifecycleManager calls destroy slots

This PR fixes the fourth one. In more detail, this PR eliminates `parmap` when destroying slots, and also replaces `askSync` with `ask`.
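
The difference between a parallel map over blocking calls and non-blocking requests can be sketched as below. This is an illustration only, written in Java with `CompletableFuture`; the `ask` method here is a hypothetical placeholder and not Celeborn's RPC API. All requests are sent first, and the caller waits once for the whole batch instead of parking one thread per worker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AsyncAskSketch {
  // Hypothetical non-blocking request: returns a future immediately,
  // similar in spirit to `ask` (as opposed to a blocking `askSync`).
  static CompletableFuture<String> ask(String worker) {
    return CompletableFuture.supplyAsync(() -> "destroyed slots on " + worker);
  }

  // Send every request first, then wait once for all of them.
  static List<String> askAll(List<String> workers, long timeoutMs) {
    List<CompletableFuture<String>> futures = new ArrayList<>();
    for (String worker : workers) {
      futures.add(ask(worker));
    }
    try {
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
          .get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException | TimeoutException e) {
      throw new IllegalStateException("requests failed or timed out", e);
    }
    List<String> replies = new ArrayList<>();
    for (CompletableFuture<String> f : futures) {
      replies.add(f.join()); // already completed, so this does not block
    }
    return replies;
  }

  public static void main(String[] args) {
    for (String reply : askAll(Arrays.asList("worker-1", "worker-2"), 5_000L)) {
      System.out.println(reply);
    }
  }
}
```

In this sketch the number of threads is bounded by the shared pool behind `supplyAsync`, regardless of how many workers are contacted.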

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2156 from waitinfuture/1167.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 18:30:31 +08:00
zky.zhoukeyong
b4bbe4b151 [CELEBORN-1171] Add UT for LifecycleManager's async setup endpoints
### What changes were proposed in this pull request?
as title

### Why are the changes needed?
as title

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes GA

Closes #2159 from waitinfuture/1171.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 11:00:13 +08:00
zky.zhoukeyong
ea0fff057f [CELEBORN-1166] Avoid calling parmap when setup endpoint
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the third one. In more detail, this PR eliminates `parmap` when setting up connections to workers, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2154 from waitinfuture/1166.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 17:07:28 +08:00
zky.zhoukeyong
4303be3231 [CELEBORN-1165] Avoid calling parmap when reserve slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the second one. In more detail, this PR eliminates `parmap` when reserving slots, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2152 from waitinfuture/1165-1.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 16:37:20 +08:00
zky.zhoukeyong
92bebd305d [CELEBORN-1160] Avoid calling parmap when commit files
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the first one. In more detail, this PR eliminates `parmap` when committing files, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2145 from waitinfuture/1160.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 14:36:48 +08:00
wangshengjie
8516df4beb [CELEBORN-1151] Request slots when register shuffle should filter the workers excluded by application
### What changes were proposed in this pull request?
When requesting slots, filter out workers excluded by the application.

### Why are the changes needed?
If a worker is alive but cannot serve requests, register shuffle removes the worker from the application client's exclude list, and the next shuffle may reserve slots on this worker. This causes the application to revive unexpectedly.

### Does this PR introduce _any_ user-facing change?
Yes, request slots will filter workers excluded by application

### How was this patch tested?
UT

Closes #2131 from wangshengjie123/fix-request-slots-blacklist.

Authored-by: wangshengjie <wangshengjie3@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-12 10:02:18 +08:00
Erik.fang
87b64391ea [CELEBORN-1152] fix GetShuffleId RPC NPE for empty shuffle
### What changes were proposed in this pull request?

In [celeborn-955](https://github.com/apache/incubator-celeborn/pull/1924), the GetShuffleId RPC was introduced to generate a Celeborn shuffle id from an app shuffle id to support Spark stage rerun.
The GetShuffleId RPC assumes that the shuffle write operation always happens before the shuffle read operation, but this is not true for empty shuffle data in Celeborn, which causes the GetShuffleId RPC to throw an NPE and fail the job.
This PR fixes this bug.

### Why are the changes needed?
to avoid spark job failure with empty shuffle data

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
a new test case is included for empty shuffle data

Closes #2136 from ErikFang/fix-GetShuffleId-RPC-NPE-for-empty-shuffle.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-11 20:13:26 +08:00
exmy
8a15396cb6 [CELEBORN-1145] Separate clientPushBufferMaxSize from CelebornInputStreamImpl
### What changes were proposed in this pull request?
The `clientPushBufferMaxSize` config is also used by `CelebornInputStreamImpl`, but it is a push-side config and should not be used on the fetch side. This PR introduces a fetch config to replace it.

### Why are the changes needed?

As above

### Does this PR introduce _any_ user-facing change?

Yes, a new config `celeborn.client.fetch.buffer.size` is introduced.

### How was this patch tested?

Pass CI

Closes #2118 from exmy/celeborn-1145.

Authored-by: exmy <xumovens@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-30 18:56:03 +08:00
jiaoqingbo
98cb941c84 [CELEBORN-1142] clear shuffleIdCache in shutdown method of ShuffleClientImpl
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #2114 from jiaoqingbo/1142.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-28 20:54:24 +08:00
Erik.fang
aee41555c6 [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure
### What changes were proposed in this pull request?
Currently, Celeborn uses replication to handle shuffle data loss for the Celeborn shuffle reader. This PR implements an alternative solution based on Spark stage resubmission.

Design doc:
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?
Spark stage resubmission uses less resources compared with replication, and some Celeborn users are also asking for it

### Does this PR introduce _any_ user-facing change?
A new config, `celeborn.client.fetch.throwsFetchFailure`, is introduced to enable this feature.

### How was this patch tested?
two UTs are attached, and we also tested it in Ant Group's Dev spark cluster

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-26 16:47:58 +08:00
jiaoqingbo
820c17ad7d [CELEBORN-1140] Use try-with-resources to avoid FSDataInputStream not being closed
### What changes were proposed in this pull request?

As Title
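
The pattern named in the title can be sketched with standard-library streams. `TrackingStream` is a hypothetical class used only to make the close call observable; it stands in for an `FSDataInputStream`, which is not used here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class TryWithResourcesSketch {
  // Hypothetical stream that records whether close() was called.
  static final class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(byte[] data) {
      super(data);
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  // The stream is closed when the try block exits, whether read() returns
  // normally or throws.
  static int readFirstByte(InputStream in) {
    try (InputStream stream = in) {
      return stream.read();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    TrackingStream stream = new TrackingStream(new byte[] {42});
    int first = readFirstByte(stream);
    System.out.println("first byte: " + first + ", closed: " + stream.closed);
  }
}
```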

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #2113 from jiaoqingbo/1140.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-24 17:55:32 +08:00
jiaoqingbo
6f328382b3 [CELEBORN-1138] Fix log error in createReaderWithRetry method
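### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA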

Closes #2111 from jiaoqingbo/1138.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-23 20:11:32 +08:00
吴祥平
758018f512 [CELEBORN-1129] More easy to dedicate createReaderWithRetry error
### What changes were proposed in this pull request?
Attach `lastException` to the `CelebornIOException` thrown when `createReaderWithRetry` encounters an error.
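
A minimal sketch of the idea, assuming a generic retry loop: the last failure is kept and passed as the cause of the final exception, so its stack trace appears wherever the final exception is reported. `createWithRetry` and the plain `IOException` are placeholders and do not reproduce the real `createReaderWithRetry` or `CelebornIOException`.

```java
import java.io.IOException;
import java.util.function.Supplier;

final class RetryWithCauseSketch {
  // Hypothetical retry helper: remembers the most recent failure and chains
  // it into the exception thrown once all attempts are used up.
  static String createWithRetry(Supplier<String> attempt, int maxRetries) throws IOException {
    Exception lastException = null;
    for (int i = 0; i < maxRetries; i++) {
      try {
        return attempt.get();
      } catch (RuntimeException e) {
        lastException = e; // remember why this attempt failed
      }
    }
    throw new IOException(
        "create reader failed after " + maxRetries + " attempts", lastException);
  }

  public static void main(String[] args) {
    try {
      createWithRetry(() -> {
        throw new IllegalStateException("connection refused");
      }, 3);
    } catch (IOException e) {
      System.out.println(e.getMessage() + ", caused by: " + e.getCause());
    }
  }
}
```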

### Why are the changes needed?
Currently we have to find the specific executor to locate the detailed error message.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

Closes #2103 from wxplovecc/easy-to-dedicate-error.

Authored-by: 吴祥平 <wxp4532@ly.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-15 22:30:39 +08:00
zky.zhoukeyong
12d6052239 [CELEBORN-1130] LifecycleManager#requestWorkerReserveSlots should check null for endpoint
### What changes were proposed in this pull request?
When I `kill -9` a Worker process, the Master does not exclude the worker until the heartbeat times out.
During this time, the Master still allocates slots on this Worker, causing an NPE when registering a shuffle:
```
Caused by: java.lang.NullPointerException
	at org.apache.celeborn.client.LifecycleManager.requestWorkerReserveSlots(LifecycleManager.scala:1246) ~[celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar:?]
	at org.apache.celeborn.client.LifecycleManager.$anonfun$reserveSlots$2(LifecycleManager.scala:864) ~[celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar:?]
	at org.apache.celeborn.common.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:301) ~[celeborn-client-spark-3-shaded_2.12-0.4.0-SNAPSHOT.jar:?]
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) ~[?:1.8.0_372]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_372]
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_372]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_372]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ~[?:1.8.0_372]
```

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test and passes GA

Closes #2104 from waitinfuture/1130.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-15 22:12:38 +08:00
liangyongyuan
69e14fd341 [CELEBORN-1128] Fix incorrect method reference in ConcurrentHashMap.contains
### What changes were proposed in this pull request?
`ConcurrentHashMap.contains` means `containsValue`, not `containsKey`. The current codebase misuses the `contains` method of the `ConcurrentHashMap` class.
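
The pitfall is easy to reproduce with the JDK alone. In the sketch below the map contents are made up for illustration; only the three `ConcurrentHashMap` methods are the point.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ContainsSketch {
  // Illustrative data only: one key mapped to one value.
  static ConcurrentHashMap<String, String> sample() {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    map.put("shuffle-0", "worker-1");
    return map;
  }

  public static void main(String[] args) {
    ConcurrentHashMap<String, String> map = sample();
    // contains(Object) is a legacy method that checks values, like containsValue.
    System.out.println(map.contains("shuffle-0"));    // false: "shuffle-0" is a key
    System.out.println(map.contains("worker-1"));     // true: "worker-1" is a value
    System.out.println(map.containsKey("shuffle-0")); // true: the intended key lookup
  }
}
```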

### Why are the changes needed?
ConcurrentHashMap.contains misuse

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

Closes #2102 from lyy-pineapple/hashMap.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-15 19:48:39 +08:00
SteNicholas
65fb07e694 [CELEBORN-1124] Exclude workers of shuffle manager remove worker of connect exception primary or replica
### What changes were proposed in this pull request?

Exclude workers of shuffle manager remove worker of connect exception primary or replica.

### Why are the changes needed?

Exclude workers of shuffle manager should not always remove worker of connect exception replica.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2091 from SteNicholas/CELEBORN-1124.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-13 17:31:44 +08:00
SteNicholas
eb1be3fbf8 [CELEBORN-1120] ShuffleClientImpl should close batchReviveRequestScheduler of ReviveManager
### What changes were proposed in this pull request?

`ShuffleClientImpl` closes `batchReviveRequestScheduler` of `ReviveManager`.

### Why are the changes needed?

After shuffle client is closed, `ReviveManager` still schedules invoker to `ShuffleClientImpl#reviveBatch`, which causes the `NullPointerException`. Therefore, `ShuffleClientImpl` should close `batchReviveRequestScheduler` of `ReviveManager` to avoid `NullPointerException`.

```
23/11/08 18:09:25,819 [batch-revive-scheduler] ERROR ShuffleClientImpl: Exception raised while reviving for shuffle 0 partitionIds 1988, epochs 0,.
java.lang.NullPointerException
	at org.apache.celeborn.client.ShuffleClientImpl.reviveBatch(ShuffleClientImpl.java:705)
	at org.apache.celeborn.client.ReviveManager.lambda$new$1(ReviveManager.java:94)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
23/11/08 18:09:25,844 [celeborn-retry-sender-6] ERROR ShuffleClientImpl: Push data to xx.xx.xx.xx:9092 failed for shuffle 0 map 216 attempt 0 partition 1988 batch 2623, remain revive times 4.
org.apache.celeborn.common.exception.CelebornIOException: PUSH_DATA_CONNECTION_EXCEPTION_PRIMARY then revive but REVIVE_FAILED, revive status 12(REVIVE_FAILED), old location: PartitionLocation[
  id-epoch:1988-0
  host-rpcPort-pushPort-fetchPort-replicatePort:xx.xx.xx.xx-9091-9092-9093-9094
  mode:PRIMARY
  peer:(empty)
  storage hint:StorageInfo{type=MEMORY, mountPoint='/tmp/storage', finalResult=false, filePath=}
  mapIdBitMap:null]
	at org.apache.celeborn.client.ShuffleClientImpl.submitRetryPushData(ShuffleClientImpl.java:261)
	at org.apache.celeborn.client.ShuffleClientImpl.access$600(ShuffleClientImpl.java:62)
	at org.apache.celeborn.client.ShuffleClientImpl$3.lambda$onFailure$1(ShuffleClientImpl.java:1045)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
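
The fix described above can be sketched with a plain `ScheduledExecutorService`. `Manager` and `Client` are hypothetical classes loosely modelled on the description of `ReviveManager` and `ShuffleClientImpl`; the real classes may be structured differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ReviveSchedulerSketch {
  // Hypothetical manager that owns a periodic task.
  static final class Manager implements AutoCloseable {
    final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    Manager(Runnable batchRevive, long intervalMs) {
      scheduler.scheduleWithFixedDelay(batchRevive, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
      // Cancels pending runs so the task cannot fire after the client is gone.
      scheduler.shutdownNow();
    }
  }

  // Hypothetical client: shutting it down also closes the manager it owns.
  static final class Client {
    final Manager manager = new Manager(() -> { /* revive a batch */ }, 100L);

    void shutdown() {
      manager.close();
    }
  }

  public static void main(String[] args) {
    Client client = new Client();
    client.shutdown();
    System.out.println("scheduler shut down: " + client.manager.scheduler.isShutdown());
  }
}
```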

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2084 from SteNicholas/CELEBORN-1120.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-10 11:44:47 +08:00
Shuang
931880a82d [CELEBORN-1112] Inform celeborn application is shutdown, then celeborn cluster can release resource immediately
### What changes were proposed in this pull request?
Unregister the application from the Celeborn master after the application stops, so that the master expires the related shuffle resources immediately, which saves resources.

### Why are the changes needed?
Currently the Celeborn master expires an application's shuffle resources only when the application timeout check fires,
which greatly delays the release of resources and wastes them in the meantime.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
PASS GA

Closes #2075 from RexXiong/CELEBORN-1112.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 20:46:51 +08:00
xiyu.zk
ffbbe257fb [CELEBORN-1109] Cache RegisterShuffleResponse to improve the processing speed of LifecycleManager
### What changes were proposed in this pull request?
Cache RegisterShuffleResponse to improve the processing speed of LifecycleManager

### Why are the changes needed?
During the processing of the registerShuffle request, constructing the RegisterShuffleResponse instance and serialization can indeed consume a significant amount of time.  When there are a large number of registerShuffle requests that need to be processed by the LifecycleManager simultaneously, the response time of the LifecycleManager will be delayed. Therefore, caching is needed to improve the processing performance of the LifecycleManager.
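
A rough sketch of the caching idea, assuming the cache is keyed by shuffle id and stores already-serialized bytes. Both assumptions, and every name below, are illustrative; the actual implementation in `LifecycleManager` may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ResponseCacheSketch {
  final ConcurrentHashMap<Integer, byte[]> cache = new ConcurrentHashMap<>();
  final AtomicInteger builds = new AtomicInteger();

  // Stand-in for the expensive part: building and serializing a response.
  byte[] buildAndSerialize(int shuffleId) {
    builds.incrementAndGet();
    return ("locations-for-shuffle-" + shuffleId).getBytes(StandardCharsets.UTF_8);
  }

  // The first request for a shuffle pays the cost; later requests for the
  // same shuffle reuse the cached bytes.
  byte[] response(int shuffleId) {
    return cache.computeIfAbsent(shuffleId, this::buildAndSerialize);
  }

  public static void main(String[] args) {
    ResponseCacheSketch sketch = new ResponseCacheSketch();
    for (int i = 0; i < 1000; i++) {
      sketch.response(0);
    }
    System.out.println("responses built: " + sketch.builds.get());
  }
}
```

`computeIfAbsent` keeps concurrent requests for the same shuffle from building the response more than once, which is the situation the description above is concerned with.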

![image](https://github.com/apache/incubator-celeborn/assets/107825064/06d3cb3c-156a-46c7-a08d-fefa18b26e40)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2070 from kerwin-zk/issue-1109.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-07 18:05:22 +08:00
sychen
4465a9229b [CELEBORN-1048][FOLLOWUP] MR module compile
### What changes were proposed in this pull request?
Let the MR module compile successfully.

### Why are the changes needed?
#2000 added parameters in the `ShuffleClient#readPartition` method, resulting in MR module compilation failure.

MR CI is still missing.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
local test
```bash
./build/make-distribution.sh -Pmr
```

Closes #2069 from cxzl25/CELEBORN-1048-FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-04 20:21:47 +08:00
mingji
5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1. Support `celeborn.storage.activeTypes` in Client.
2. Master will ignore slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable client application to select storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00
onebox-li
7b185a2562 [CELEBORN-1058] Support specifying the number of dispatcher threads for each role
### What changes were proposed in this pull request?
Support specifying the number of dispatcher threads for each role, especially shuffle client side. For shuffle client, there is only RpcEndpointVerifier endpoint which handles not many requests, one thread is enough. The rpc env of other roles has only two endpoints at most, using a shared event loop is reasonable. I am not sure if there is a need to add rpc requests to shuffle client. So add specific parameters to specify the dispatcher threads here.

And change the dispatcher thread pool name in order to distinguish it from spark's.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, add params celeborn.\<role>.rpc.dispatcher.threads

### How was this patch tested?
Manual test and UT

Closes #2003 from onebox-li/my_dev.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-03 10:35:54 +08:00
TongWei1105
0583cdb5a8 [CELEBORN-1048] Align fetchWaitTime metrics to spark implementation
### What changes were proposed in this pull request?
Align fetchWaitTime metrics to spark implementation

### Why are the changes needed?
In our production environment, there are variations in the fetchWaitTime metric for the same stage of the same job.

ON YARN ESS:
![image](https://github.com/apache/incubator-celeborn/assets/68682646/601a8315-1317-48dc-b9a6-7ea651d5122d)
ON CELEBORN
![image](https://github.com/apache/incubator-celeborn/assets/68682646/e00ed60f-3789-4330-a7ed-fdd5754acf1d)
Then, based on the implementation of Spark's ShuffleBlockFetcherIterator, I adjusted the fetchWaitTime metrics code.

The metric now looks more reasonable:
![image](https://github.com/apache/incubator-celeborn/assets/68682646/ce5e46e4-8ed2-422e-b54b-cd594aad73dd)

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
yes, tested in our production environment

Closes #2000 from TongWei1105/CELEBORN-1048.

Lead-authored-by: TongWei1105 <vvtwow@gmail.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-11-02 15:27:30 +08:00
onebox-li
cd8acf89c9 [CELEBORN-1059] Fix callback not update if push worker excluded during retry
### What changes were proposed in this pull request?
When retry push data and revive succeed in ShuffleClientImpl#submitRetryPushData, if new location is excluded, the callback's `lastest` location has not been updated when wrappedCallback.onFailure is called in ShuffleClientImpl#isPushTargetWorkerExcluded. Therefore there may be problems with subsequent revive.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Closes #2005 from onebox-li/improve-push-exclude.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-01 10:23:50 +08:00
sychen
e02cde0a22 [CELEBORN-1098] Logging worker address with worker failure log
### What changes were proposed in this pull request?

### Why are the changes needed?
At present, we cannot tell from the log which worker's request timed out.

```java
23/10/30 15:44:51,963 [CommitFiles-ForkJoinPool-162-worker-1] ERROR ReducePartitionCommitHandler: AskSync CommitFiles for 0 failed (attempt 1/4).
org.apache.celeborn.common.rpc.RpcTimeoutException: Futures timed out after [60000 milliseconds]. This timeout is controlled by celeborn.rpc.askTimeout
	at org.apache.celeborn.common.rpc.RpcTimeout.org$apache$celeborn$common$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:46)
	at org.apache.celeborn.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:61)
	at org.apache.celeborn.common.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:57)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.celeborn.common.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:89)
	at org.apache.celeborn.common.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:73)
	at org.apache.celeborn.client.commit.CommitHandler.requestCommitFilesWithRetry(CommitHandler.scala:417)
	at org.apache.celeborn.client.commit.CommitHandler.commitFiles(CommitHandler.scala:279)
	at org.apache.celeborn.client.CommitManager$$anon$1$$anon$2.$anonfun$run$2(CommitManager.scala:151)
	at org.apache.celeborn.client.CommitManager$$anon$1$$anon$2.$anonfun$run$2$adapted(CommitManager.scala:122)
	at org.apache.celeborn.common.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:293)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
	at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:225)
	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
	... 19 more
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2054 from cxzl25/CELEBORN-1098.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-31 21:30:07 +08:00
onebox-li
f6cc377c15 [CELEBORN-1099] Check register when handleGetReducerFileGroup
### What changes were proposed in this pull request?
In the Spark case, when a stage's outputPartitioning is already satisfied and no shuffle exchange is needed, there is no shuffle write procedure and therefore no `RegisterShuffle` either. Currently the original reduce stage throws an NPE when LifecycleManager runs `handleGetReducerFileGroup`.
```
ERROR [dispatcher-event-loop-11] Inbox: Ignoring error
java.lang.NullPointerException: null
    at org.apache.celeborn.client.commit.ReducePartitionCommitHandler.handleGetReducerFileGroup(ReducePartitionCommitHandler.scala:307)
    at org.apache.celeborn.client.CommitManager.handleGetReducerFileGroup(CommitManager.scala:266)
    at org.apache.celeborn.client.LifecycleManager.org$apache$celeborn$client$LifecycleManager$$handleGetReducerFileGroup(LifecycleManager.scala:556)
    at org.apache.celeborn.client.LifecycleManager$$anonfun$receiveAndReply$1.applyOrElse(LifecycleManager.scala:298)
    at org.apache.celeborn.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.celeborn.common.rpc.netty.Inbox.safelyCall(Inbox.scala:222)
    at org.apache.celeborn.common.rpc.netty.Inbox.process(Inbox.scala:110)
    at org.apache.celeborn.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
An example that reproduces it:
`select count(*) as cnt from tableA;`
where tableA is empty.

This PR fixes the code to handle this normal situation. The Flink client keeps the old behavior.
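
A minimal sketch of the guard described above, assuming a registry keyed by shuffle id. The class, field and method names are hypothetical and are not Celeborn's real API; the point is only that a shuffle that never registered yields an empty result instead of an NPE.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names are illustrative, not Celeborn's real classes.
class FileGroupLookupDemo {
  // shuffleId -> (partitionId -> file locations); filled only when a shuffle registers.
  private final Map<Integer, Map<Integer, Set<String>>> registered = new ConcurrentHashMap<>();

  void register(int shuffleId, Map<Integer, Set<String>> fileGroups) {
    registered.put(shuffleId, fileGroups);
  }

  // Returns an empty file group instead of dereferencing a missing entry.
  Map<Integer, Set<String>> getReducerFileGroup(int shuffleId) {
    Map<Integer, Set<String>> groups = registered.get(shuffleId);
    if (groups == null) {
      // The shuffle never registered (no shuffle write happened): nothing to read.
      return Collections.emptyMap();
    }
    return groups;
  }

  public static void main(String[] args) {
    FileGroupLookupDemo demo = new FileGroupLookupDemo();
    System.out.println(demo.getReducerFileGroup(42).isEmpty());
  }
}
```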

### Why are the changes needed?
Fix the probable NPE.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2056 from onebox-li/fix-empty-shuffle-npe.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-31 21:25:55 +08:00
xiyu.zk
2ce8d6fd95 [CELEBORN-1102] Optimize the performance of getAllPrimaryLocationsWithMinEpoch
### What changes were proposed in this pull request?
Optimize the performance of getAllPrimaryLocationsWithMinEpoch

### Why are the changes needed?
#### Before optimization:
![image](https://github.com/apache/incubator-celeborn/assets/107825064/0ccbf503-99b7-45db-a8bd-6539e854d011)

#### After optimization:
![image](https://github.com/apache/incubator-celeborn/assets/107825064/0cb54276-a089-44dc-9b75-6649537515f2)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2058 from kerwin-zk/issue-1102.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 20:37:17 +08:00
SteNicholas
3092644168 [CELEBORN-1095] Support configuration of fastest available XXHashFactory instance for checksum of Lz4Decompressor
### What changes were proposed in this pull request?

`CelebornConf` adds `celeborn.client.shuffle.decompression.lz4.xxhash.instance` to configure the fastest available `XXHashFactory` instance for the checksum of `Lz4Decompressor`. Fix #2043.

### Why are the changes needed?

`Lz4Decompressor` creates the checksum with `XXHashFactory#fastestInstance`, which returns the fastest available `XXHashFactory` instance and uses `nativeInstance` by default. The `XXHashFactory` instance used for the checksum of `Lz4Decompressor` should be configurable instead of depending on whether the class loader is the system class loader. The method is as follows:
```
/**
 * Returns the fastest available {@link XXHashFactory} instance. If the class
 * loader is the system class loader and if the
 * {@link #nativeInstance() native instance} loads successfully, then the
 * {@link #nativeInstance() native instance} is returned, otherwise the
 * {@link #fastestJavaInstance() fastest Java instance} is returned.
 * <p>
 * Please read {@link #nativeInstance() javadocs of nativeInstance()} before
 * using this method.
 *
 * @return the fastest available {@link XXHashFactory} instance.
 */
public static XXHashFactory fastestInstance() {
  if (Native.isLoaded()
      || Native.class.getClassLoader() == ClassLoader.getSystemClassLoader()) {
    try {
      return nativeInstance();
    } catch (Throwable t) {
      return fastestJavaInstance();
    }
  } else {
    return fastestJavaInstance();
  }
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `CelebornConfSuite`
- `ConfigurationSuite`

Closes #2050 from SteNicholas/CELEBORN-1095.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 14:57:31 +08:00
SteNicholas
df40a28959 [CELEBORN-1032][FOLLOWUP] Use scheduleWithFixedDelay instead of scheduleAtFixedRate in threads pool of master and worker
### What changes were proposed in this pull request?

Use `scheduleWithFixedDelay` instead of `scheduleAtFixedRate` in thread pool of Celeborn Master and Worker.
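
To illustrate the difference, here is a small sketch using only `java.util.concurrent`. `scheduleWithFixedDelay` waits for the given delay after each run finishes, while `scheduleAtFixedRate` tries to keep a fixed period between starts and can run back to back when a task is slow. The class name and the timings are made up for the demo and are not taken from the Celeborn code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative demo only; not part of Celeborn.
class FixedDelayDemo {
  // Runs a slow task `runs` times with scheduleWithFixedDelay and returns the start times in ms.
  static List<Long> startTimesMillis(int runs, long taskMillis, long delayMillis) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    List<Long> starts = Collections.synchronizedList(new ArrayList<Long>());
    CountDownLatch done = new CountDownLatch(runs);
    long origin = System.nanoTime();
    try {
      pool.scheduleWithFixedDelay(() -> {
        if (done.getCount() == 0) {
          return; // enough samples collected
        }
        starts.add(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - origin));
        try {
          Thread.sleep(taskMillis); // simulate a slow periodic task
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        done.countDown();
      }, 0, delayMillis, TimeUnit.MILLISECONDS);
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow();
    }
    return new ArrayList<Long>(starts);
  }

  public static void main(String[] args) {
    // Consecutive starts are separated by roughly task time plus delay.
    System.out.println(startTimesMillis(3, 50, 50));
  }
}
```

With a fixed delay, a slow run can never cause the next run to start immediately, which is usually what periodic housekeeping tasks want.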

### Why are the changes needed?

Follow up #1970.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2048 from SteNicholas/CELEBORN-1032.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-27 11:20:28 +08:00
SteNicholas
49ea881037
[MINOR] Remove unnecessary increment index of Master#timeoutDeadWorkers
### What changes were proposed in this pull request?

Remove unnecessary increment index of `Master#timeoutDeadWorkers`.

### Why are the changes needed?

Increment index of `Master#timeoutDeadWorkers` is unnecessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2027 from SteNicholas/timeout-dead-workers.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 22:18:39 +08:00
sychen
34e6c19192 [CELEBORN-1042] Calculate duration using nanotime in CelebornInputStream
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1994 from cxzl25/CELEBORN-1042.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 19:17:03 +08:00
onebox-li
6b3c108f6e [CELEBORN-1040] Adjust local read logs and refine createReader
### What changes were proposed in this pull request?
Adjust the local reader logs. Previously, local read stats were logged on every stream close, whether or not the stream actually contained local reads.
Also refine the CelebornInputStreamImpl#createReader code to be clearer.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Adjust local read logs.

### How was this patch tested?
Manual test.

Closes #1988 from onebox-li/local-dev.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 20:59:38 +08:00
SteNicholas
56276e910f [CELEBORN-1024] Thread factory should set UncaughtExceptionHandler to handle uncaught exception
### What changes were proposed in this pull request?

`batchHandleChangePartitionExecutors` does not handle uncaught exceptions in `ChangePartitionRequest`, so the uncaught exception of the thread is not available for troubleshooting. The thread factory should set an `UncaughtExceptionHandler` to handle uncaught exceptions.

### Why are the changes needed?

Thread factory sets `UncaughtExceptionHandler` to handle uncaught exception in `ThreadUtils`.
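
A minimal sketch of the idea using plain JDK classes. The factory and helper names below are hypothetical and do not reproduce the actual `ThreadUtils` code. Note that the handler only fires for exceptions that escape `Thread#run`, e.g. tasks passed to `execute`; tasks passed to `ExecutorService#submit` capture the exception in the returned `Future` instead.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative demo only; names are assumptions, not Celeborn's ThreadUtils API.
class HandlerThreadFactoryDemo {
  // Builds daemon threads named "<prefix>-<n>" that report uncaught exceptions to the handler.
  static ThreadFactory namedFactory(String prefix, Thread.UncaughtExceptionHandler handler) {
    AtomicInteger counter = new AtomicInteger();
    return runnable -> {
      Thread t = new Thread(runnable, prefix + "-" + counter.getAndIncrement());
      t.setDaemon(true);
      t.setUncaughtExceptionHandler(handler);
      return t;
    };
  }

  // Runs a task that throws and returns what the handler observed.
  static String runFailingTask(String prefix) {
    AtomicReference<String> seen = new AtomicReference<>();
    ThreadFactory factory = namedFactory(
        prefix, (thread, error) -> seen.set(thread.getName() + ": " + error.getMessage()));
    Thread t = factory.newThread(() -> {
      throw new IllegalStateException("boom");
    });
    t.start();
    try {
      t.join(); // the handler runs on the failing thread before it terminates
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return seen.get();
  }

  public static void main(String[] args) {
    System.out.println(runFailingTask("demo"));
  }
}
```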

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #1962 from SteNicholas/CELEBORN-1024.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-09 20:56:40 +08:00
Fu Chen
b2412d0774 [CELEBORN-1022][TEST] Update log level from FATAL to ERROR for console output in unit tests
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

1. This is developer-friendly for debugging unit tests in IntelliJ IDEA. For example, Netty's memory leak reports are logged at the error level and won't cause unit tests to be marked as fatal.

```
23/10/09 09:57:26,422 ERROR [fetch-server-52-2] ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
	io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.lang.Thread.run(Thread.java:750)
```

2. This won't increase console output or affect the stability of CI.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1958 from cfmcgrady/ut-console-log-level.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 15:56:05 +08:00
Fu Chen
c4135dc1b1 [CELEBORN-980] Asynchronously delete original files to fix ReusedExchange bug
### What changes were proposed in this pull request?

The `ReusedExchange` operator has the potential to generate different types of fetch requests, including both non-range and range requests. Currently, an issue arises due to the synchronous deletion of the original file by the Celeborn worker upon completion of sorting. This issue leads to the failure of non-range requests following a range request for the same partition.

The snippet to reproduce this bug:
```scala
  val sparkConf = new SparkConf().setAppName("celeborn-test").setMaster("local[2]")
    .set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    .set(s"spark.${CelebornConf.MASTER_ENDPOINTS.key}", masterInfo._1.rpcEnv.address.toString)
    .set("spark.sql.autoBroadcastJoinThreshold", "-1")
    .set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "100")
    .set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100")
  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()
  spark.range(0, 1000, 1, 10)
    .selectExpr("id as k1", "id as v1")
    .createOrReplaceTempView("ta")
  spark.range(0, 1000, 1, 10)
    .selectExpr("id % 1 as k21", "id % 1 as k22", "id as v2")
    .createOrReplaceTempView("tb")
  spark.range(140)
    .select(
      col("id").cast("long").as("k3"),
      concat(col("id").cast("string"), lit("a")).as("v3"))
    .createOrReplaceTempView("tc")

  spark.sql(
    """
      |SELECT *
      |FROM ta
      |LEFT JOIN tb ON ta.k1 = tb.k21
      |LEFT JOIN tc ON tb.k22 = tc.k3
      |""".stripMargin)
    .createOrReplaceTempView("v1")

  spark.sql(
    """
      |SELECT * FROM v1 WHERE v3 IS NOT NULL
      |UNION
      |SELECT * FROM v1
      |""".stripMargin)
    .collect()
```

This PR proposes a solution to address this problem. It introduces an asynchronous thread for the removal of the original file. Once the sorted file is generated for a given partition, both non-range and range fetch requests will fetch only the sorted file.
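
The asynchronous removal can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the worker's actual sorter code: deletion is handed to a dedicated cleaner thread so that the request path never blocks on, or races with, the removal of the original file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative demo only; class and method names are assumptions.
class AsyncOriginalFileCleanerDemo {
  // Single daemon thread dedicated to deleting original (unsorted) files.
  private final ExecutorService cleaner = Executors.newSingleThreadExecutor(runnable -> {
    Thread t = new Thread(runnable, "original-file-cleaner");
    t.setDaemon(true);
    return t;
  });

  // Schedules the deletion and returns immediately; the future completes with true if a file was removed.
  CompletableFuture<Boolean> deleteLater(Path originalFile) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return Files.deleteIfExists(originalFile);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }, cleaner);
  }

  // Creates a temp file standing in for the original file, deletes it asynchronously, and verifies it is gone.
  static boolean demo() {
    try {
      Path original = Files.createTempFile("partition-original", ".data");
      AsyncOriginalFileCleanerDemo d = new AsyncOriginalFileCleanerDemo();
      boolean deleted = d.deleteLater(original).join();
      return deleted && !Files.exists(original);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```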

The activity diagram of `openStream`:

![openStream](https://github.com/apache/incubator-celeborn/assets/8537877/633cc5b8-e673-45a0-860e-e1f7e50c8965)

### Does this PR introduce _any_ user-facing change?

No, only bug fix

### How was this patch tested?

UT

Closes #1932 from cfmcgrady/fix-partition-sort-bug-v4.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 11:04:41 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a 4x 16 CPU 64G Mem 4ESSD cluster.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s VS MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
sychen
38a68163e0 [CELEBORN-957] Simplify nano time duration calculation
### What changes were proposed in this pull request?
use `TimeUnit.NANOSECONDS.toMillis` instead of `/1000_000`
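
For illustration, the two forms give the same result for non-negative durations; the `TimeUnit` form just states the unit conversion explicitly. The helper name below is made up for the example.

```java
import java.util.concurrent.TimeUnit;

// Illustrative demo only.
class NanoDurationDemo {
  // Converts a System.nanoTime() interval to whole milliseconds (truncating).
  static long elapsedMillis(long startNanos, long endNanos) {
    return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    long end = System.nanoTime();
    System.out.println(elapsedMillis(start, end) == (end - start) / 1000_000);
  }
}
```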

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1888 from cxzl25/CELEBORN-957.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-08 19:03:37 +08:00
zhongqiang.czq
b1e3d661e6 [CELEBORN-627][FLINK][FOLLOWUP] Support split partitions
### What changes were proposed in this pull request?
Fix duplicated sending of commitFiles for MapPartition, and fix BufferStreamEnd not being sent when partition split is enabled for MapPartition.

### Why are the changes needed?
After enabling partition split for MapPartition, two errors occur.
- ERROR1: The worker doesn't send the stream end to the client because of a thread synchronization problem. After the idle timeout, the client closes the channel and throws the exception **"xx is lost, notify related stream xx"**
```java
2023-09-06T04:40:47.7549935Z 23/09/06 04:40:47,753 WARN [Keyed Aggregation -> Map -> Sink: Unnamed (5/8)#0] Task: Keyed Aggregation -> Map -> Sink: Unnamed (5/8)#0 (c1cade728ddb3a32e0bf72acb1d87588_c27dcf7b54ef6bfd6cff02ca8870b681_4_0) switched from RUNNING to FAILED with failure cause:
2023-09-06T04:40:47.7550644Z java.io.IOException: Client localhost/127.0.0.1:38485 is lost, notify related stream 256654410004
2023-09-06T04:40:47.7551219Z 	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.errorReceived(RemoteBufferStreamReader.java:142)
2023-09-06T04:40:47.7551886Z 	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.lambda$new$0(RemoteBufferStreamReader.java:77)
2023-09-06T04:40:47.7552576Z 	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.processMessageInternal(ReadClientHandler.java:57)
2023-09-06T04:40:47.7553250Z 	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.lambda$channelInactive$0(ReadClientHandler.java:119)
2023-09-06T04:40:47.7553806Z 	at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
2023-09-06T04:40:47.7554564Z 	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.channelInactive(ReadClientHandler.java:110)
2023-09-06T04:40:47.7555270Z 	at org.apache.celeborn.common.network.server.TransportRequestHandler.channelInactive(TransportRequestHandler.java:71)
2023-09-06T04:40:47.7556005Z 	at org.apache.celeborn.common.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:136)
2023-09-06T04:40:47.7556710Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
2023-09-06T04:40:47.7557370Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7558172Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
2023-09-06T04:40:47.7558803Z 	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
2023-09-06T04:40:47.7559368Z 	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
2023-09-06T04:40:47.7559954Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
2023-09-06T04:40:47.7560589Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7561222Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
2023-09-06T04:40:47.7561829Z 	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
2023-09-06T04:40:47.7562620Z 	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.channelInactive(TransportFrameDecoderWithBufferSupplier.java:206)
2023-09-06T04:40:47.7563506Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
2023-09-06T04:40:47.7564207Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7564829Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
2023-09-06T04:40:47.7565417Z 	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
2023-09-06T04:40:47.7566014Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
2023-09-06T04:40:47.7566654Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7567317Z 	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
2023-09-06T04:40:47.7567813Z 	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
2023-09-06T04:40:47.7568297Z 	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
2023-09-06T04:40:47.7568830Z 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
2023-09-06T04:40:47.7569402Z 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
2023-09-06T04:40:47.7569894Z 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
2023-09-06T04:40:47.7570356Z 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
2023-09-06T04:40:47.7570841Z 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
2023-09-06T04:40:47.7571319Z 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
2023-09-06T04:40:47.7571721Z 	at java.lang.Thread.run(Thread.java:750)
```
- ERROR2: The client sends duplicated commitFiles to the worker. Because of inconsistent unHandledPartitions, both batchCommit and finalCommit send commitFiles.
``` java
2023-09-06T04:36:48.3146773Z 23/09/06 04:36:48,314 WARN [Worker-CommitFiles-1] Controller: Get Partition Location for 1693975002919-61094c8156f918062a5fae12d551bc90-0 0-1 but didn't exist.
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

Closes #1881 from zhongqiangczq/fix-split-test.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-09-06 22:33:56 +08:00
zky.zhoukeyong
a42ec85a6e [CELEBORN-943][PERF] Pre-create CelebornInputStreams in CelebornShuffleReader
### What changes were proposed in this pull request?
This PR fixes the performance degradation, caused by RPC latency, that occurs when Spark's coalescePartitions takes effect.

### Why are the changes needed?
I encountered a performance degradation when testing  tpcds 10T q10:
||Time|
|---|---|
|ESS|14s|
|Celeborn| 24s|

After digging into it I found out that q10 triggers partition coalescence:
![image](https://github.com/apache/incubator-celeborn/assets/948245/0b4745da-8d57-4661-a35d-683d97f56e1d)

As I configured `spark.sql.adaptive.coalescePartitions.initialPartitionNum` to 1000, `CelebornShuffleReader`
will call `shuffleClient.readPartition` sequentially 1000 times, causing the delay.

This PR optimizes by calling `shuffleClient.readPartition` in parallel. After this PR q10 time becomes 14s.
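
The idea can be sketched as below, assuming a hypothetical `opener` function standing in for `shuffleClient.readPartition`. The thread count plays the role of `celeborn.client.streamCreatorPool.threads`; everything else is illustrative and not the actual `CelebornShuffleReader` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Illustrative demo only; `opener` is a hypothetical stand-in for the per-partition RPC.
class ParallelStreamCreationDemo {
  // Opens streams for partitions [start, end) in parallel and returns them in partition order.
  static <T> List<T> createAll(int start, int end, int threads, IntFunction<T> opener) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<T>> futures = new ArrayList<>();
      for (int p = start; p < end; p++) {
        final int partitionId = p;
        Callable<T> task = () -> opener.apply(partitionId);
        futures.add(pool.submit(task));
      }
      List<T> streams = new ArrayList<>();
      for (Future<T> future : futures) {
        streams.add(future.get()); // preserves submission (partition) order
      }
      return streams;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(createAll(0, 5, 3, p -> "stream-" + p));
  }
}
```

Because the calls are latency-bound rather than CPU-bound, the total wait drops from roughly the sum of all RPC latencies to about that sum divided by the pool size.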

### Does this PR introduce _any_ user-facing change?
No, but introduced a new client side configuration `celeborn.client.streamCreatorPool.threads`
which defaults to 32.

### How was this patch tested?
TPCDS 1T and passes GA.

Closes #1876 from waitinfuture/943.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-04 21:46:11 +08:00
zhongqiang.czq
b66eaff880 [CELEBORN-627][FLINK] Support split partitions
### What changes were proposed in this pull request?
In MapPartition, data is split into regions.

1. Unlike ReducePartition, whose partition split can occur while pushing data, MapPartition must keep its data ordering, so PartitionSplit is only done when sending PushDataHandShake or RegionStart messages (as shown in the following image). That is, a partition split only happens at the beginning of a region, never inside a region.
> Notice: the client side may think that it failed to push a HandShake or RegionStart message while the worker side still received the message normally. After the client's revive succeeds, it doesn't push any messages to the old partition, so the worker holding the old partition will create an empty file. After committing files, the worker will return empty commit ids. That is, empty files are filtered out after committing files and ReduceTask will not read any empty files.

![image](https://github.com/apache/incubator-celeborn/assets/96606293/468fd660-afbc-42c1-b111-6643f5c1e944)

2. PushData/RegionFinish don't care about the following cases:
 - Diskfull
 - ExceedPartitionSplitThreshold
 - Worker ShuttingDown

So if one of the above three conditions occurs, PushData and RegionFinish can still proceed as normal. Workers should consider the ShuttingDown case and try their best to wait for all regions to finish before shutting down.

If PushData or RegionFinish fails, for example because of a network timeout, the MapTask fails and another MapTask attempt is started.

![image](https://github.com/apache/incubator-celeborn/assets/96606293/db9f9166-2085-4be1-b09e-cf73b469c55b)

3. How does shuffle read support partition split?
ReduceTask should get the split partitions in order and open the streams in partition epoch order.
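
A rough sketch of the read-side ordering, assuming each split of a partition carries an epoch number. The `Split` class here is a hypothetical stand-in and not Celeborn's `PartitionLocation`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative demo only; Split is a made-up stand-in for a partition split.
class SplitReadOrderDemo {
  static final class Split {
    final int partitionId;
    final int epoch;

    Split(int partitionId, int epoch) {
      this.partitionId = partitionId;
      this.epoch = epoch;
    }
  }

  // Returns the splits of one partition in the order their streams should be opened.
  static List<Split> inReadOrder(List<Split> splits) {
    List<Split> ordered = new ArrayList<>(splits);
    ordered.sort(Comparator.comparingInt((Split s) -> s.epoch));
    return ordered;
  }

  public static void main(String[] args) {
    List<Split> splits = Arrays.asList(new Split(3, 2), new Split(3, 0), new Split(3, 1));
    for (Split s : inReadOrder(splits)) {
      System.out.println("open partition " + s.partitionId + " epoch " + s.epoch);
    }
  }
}
```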

### Why are the changes needed?
PartitionSplit is currently not supported by MapPartition.
There is still a risk that a partition file is too large to store on a worker disk.
To avoid this risk, this PR introduces partition split in shuffle read and shuffle write.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and manual TPCDS test

Closes #1550 from FMX/CELEBORN-627.

Lead-authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-09-01 19:25:51 +08:00
mingji
505ba804c7 [CELEBORN-752] Support read local shuffle file for spark
### What changes were proposed in this pull request?
For Spark clusters, support reading local shuffle files if Celeborn is co-deployed with YARN node managers. This PR helps to reduce the number of active connections.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster. The performance is identical whether or not you enable the local reader, but the active connection number may vary according to your connections per peer.
<img width="951" alt="Screenshot 2023-08-16 20 20 14" src="https://github.com/apache/incubator-celeborn/assets/4150993/9106e731-28fc-4e78-9c05-ae6a269d249a">
The active connection number changed from 3745 to 2894. This PR will help to improve cluster stability.

Closes #1812 from FMX/CELEBORN-752.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-30 18:52:18 +08:00
SteNicholas
4625484d2c [CELEBORN-830] Check available workers in CelebornShuffleFallbackPolicyRunner
### What changes were proposed in this pull request?

`CelebornShuffleFallbackPolicyRunner` should not only check quota, but also check whether the cluster has available workers. If there are no available workers, fall back to external shuffle.
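
The combined check can be sketched as a single predicate. The method and parameter names are hypothetical; in the real runner, both values would presumably come from the Celeborn master.

```java
// Illustrative demo only; not the actual CelebornShuffleFallbackPolicyRunner code.
class FallbackPolicyDemo {
  // Falls back to external shuffle when quota is exhausted or no worker is available.
  static boolean shouldFallback(boolean quotaAvailable, int availableWorkers) {
    return !quotaAvailable || availableWorkers <= 0;
  }

  public static void main(String[] args) {
    System.out.println(shouldFallback(true, 0));
  }
}
```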

### Why are the changes needed?

`CelebornShuffleFallbackPolicyRunner` adds a check for available workers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `SparkShuffleManagerSuite#testClusterNotAvailableWithAvailableWorkers`

Closes #1814 from SteNicholas/CELEBORN-830.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-08-29 16:56:56 +08:00