1491 Commits

Author SHA1 Message Date
mingji
7e05d64d04 [CELEBORN-1205] Disable Maven local caches to improve SBT building stability
### What changes were proposed in this pull request?
To eliminate build failure when using SBT.

### Why are the changes needed?
If the maven local cache is enabled, SBT can't find the correct dependencies.
If the maven local cache is disabled, SBT can find the correct dependencies.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
GA

Closes #2199 from FMX/b1205.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-02 21:47:08 +08:00
zwangsheng
d1b30b2827 [CELEBORN-1208][WORKER] Unify parse uniqueId to WorkerInfo
### What changes were proposed in this pull request?

Unify parse uniqueId to WorkerInfo

### Why are the changes needed?

Keep parse uniqueId behavior consistent and avoid multiple changes

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #2202 from zwangsheng/CELEBORN-1208.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-02 21:27:58 +08:00
mingji
7be05b430b [CELEBORN-1133] Refactor fileinfo
### What changes were proposed in this pull request?
Rename FileWriter to PartitionLocationDataWriter, add storageManager, delete fileinfo, and flusher in the constructor.

FileInfo(userIdentifier,partitionSplitEnabled,fileMeta)
– NonMemoryFileInfo(streams,filePath,storageType,bytesFlushed)
– MemoryFileInfo(length,buffer)

FileMeta
– reduceFileMeta(chunkOffsets,sorted)
– mapFileMeta(bufferSize,numSubPartitions)

### Why are the changes needed?
1. To make the concepts clearer.
2. To support memory storage and HDFS slot management.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster test with worker kill.

Closes #2130 from FMX/b1133.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-02 21:26:10 +08:00
Cheng Pan
ecd577e5d3
[CELEBORN-1204] Update NOTICE year 2024
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

It's 2024 now, and this patch is expected to be applied to all active branches (those planned for release).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Review

Closes #2198 from pan3793/CELEBORN-1204.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-01-02 15:55:52 +08:00
Cheng Pan
77e468161d [CELEBORN-891] Remove pipeline feature for sort based writer
### What changes were proposed in this pull request?

Remove pipeline feature for sort based writer

### Why are the changes needed?

The pipeline feature was added as part of CELEBORN-295 for performance. Later, an unresolvable issue that would crash the JVM was identified in https://github.com/apache/incubator-celeborn/pull/1807, and after discussion we decided to delete this feature.

### Does this PR introduce _any_ user-facing change?

No. The pipeline feature is disabled by default, so nothing changes for users who use the default settings.

### How was this patch tested?

Pass GA.

Closes #2196 from pan3793/CELEBORN-891.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-01-01 10:42:17 +08:00
Fu Chen
9619958cb1 [CELEBORN-1203] Add LicenseAndNoticeMergeStrategy to resolve inner project LICENSE/NOTICE conflict for shaded client packaging
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2197 from cfmcgrady/license.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2024-01-01 00:48:34 +08:00
mingji
b4b86848e3
[CELEBORN-1202] LICENSE mentions third-party components under other open source licenses
### What changes were proposed in this pull request?

`LICENSE` mentions third-party components under other open source licenses like Apache Spark etc.

### Why are the changes needed?

`LICENSE` mentions 1 3rd party file from Guava. However, the `NOTICE` lists both Apache Spark and Apache Flink. `LICENSE` should mention all third-party components under other open source licenses.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2193 from SteNicholas/CELEBORN-1202.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-29 11:35:50 +08:00
Fu Chen
55df09c14c
[CELEBORN-1199][FOLLOWUP] Disabled the plugin AddMetaInfLicenseFiles for shaded clients
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/2189#issuecomment-1870940496

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested

Closes #2195 from cfmcgrady/license-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-29 10:02:22 +08:00
Fu Chen
6691568242 [CELEBORN-1199] Add LICENSE and NOTICE files for service related sub-projects
### What changes were proposed in this pull request?

As title
### Why are the changes needed?

This PR introduces LICENSE and NOTICE files for service related sub-projects

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```shell
$ ./build/sbt celeborn-worker/package
$ jar tf worker/target/scala-2.12/celeborn-worker_2.12-0.5.0-SNAPSHOT.jar | grep -i 'license\|notice'
META-INF/LICENSE
META-INF/NOTICE
```

Closes #2189 from cfmcgrady/license.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-28 01:07:51 +08:00
SteNicholas
3097ffe33b [CELEBORN-678][FOLLOWUP] MapperAttempts for a shuffle should reply MAP_ENDED when mapper has already been ended from speculative task
### What changes were proposed in this pull request?

MapperAttempts for a shuffle replies `MAP_ENDED` when push data or push merged data arrives from a speculative task whose mapper has already ended.

Follow up #1591.

### Why are the changes needed?

When the mapper has already ended and the worker receives push data or push merged data from a speculative task, `PushDataHandler` should trigger MapEnd instead of StageEnd. Meanwhile, `ShuffleClientImpl` should handle `STAGE_ENDED` as MapEnd; otherwise, other tasks of the stage could not send shuffle data, which causes data loss.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2190 from SteNicholas/CELEBORN-678.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-27 20:40:40 +08:00
liangyongyuan
f8eb1605a1 [CELEBORN-1036][FOLLOWUP] When inflightBatchesPerAddress clear, totalInflightReqs should reset
### What changes were proposed in this pull request?
When `inflightBatchesPerAddress` is cleared in `InFlightRequestTracker.cleanup`, `totalInflightReqs` should also be reset to avoid getting stuck when exiting.

### Why are the changes needed?
`inflightBatchesPerAddress` has been cleared and is empty, but `totalInflightReqs` is still greater than 0.
![image](https://github.com/apache/incubator-celeborn/assets/46274164/28223f1e-ac9b-4e0b-a26d-9b529af6bca1)

This occurred during the first attempt of the task, where the request for map end failed, but the driver marked that the map has already ended.
![image](https://github.com/apache/incubator-celeborn/assets/46274164/7f43d808-2f9b-4775-b04f-30afe4d31e5a)
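
The reset described above can be pictured with a minimal Java sketch. The class below is a hypothetical stand-in, not Celeborn's `InFlightRequestTracker`: only the names `inflightBatchesPerAddress`, `totalInflightReqs`, and `cleanup` come from the description, while the `AtomicInteger` counter, the `addBatch` helper, and the host strings are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an in-flight request tracker (not the Celeborn class).
final class InFlightTrackerCleanupSketch {
  private final Map<String, Set<Integer>> inflightBatchesPerAddress = new ConcurrentHashMap<>();
  private final AtomicInteger totalInflightReqs = new AtomicInteger();

  // Records a batch for an address and counts it once.
  void addBatch(int batchId, String hostAndPushPort) {
    Set<Integer> batches =
        inflightBatchesPerAddress.computeIfAbsent(hostAndPushPort, k -> ConcurrentHashMap.newKeySet());
    if (batches.add(batchId)) {
      totalInflightReqs.incrementAndGet();
    }
  }

  // Clears the per-address map AND resets the counter, so a waiter that
  // polls the counter cannot spin on a stale positive value.
  void cleanup() {
    inflightBatchesPerAddress.clear();
    totalInflightReqs.set(0);
  }

  int totalInflightReqs() {
    return totalInflightReqs.get();
  }

  public static void main(String[] args) {
    InFlightTrackerCleanupSketch tracker = new InFlightTrackerCleanupSketch();
    tracker.addBatch(1, "host-a:9092");
    tracker.addBatch(2, "host-b:9092");
    tracker.cleanup();
    System.out.println(tracker.totalInflightReqs()); // prints 0
  }
}
```

Without the `set(0)` line, the map would be empty while the counter stayed at 2, which is the inconsistent state the screenshots above illustrate.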

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Through existing UTs.

Closes #2191 from lyy-pineapple/celebron-1036.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-27 16:10:12 +08:00
sychen
587c8b55f8
[CELEBORN-1195] Use batch rack resolve when restore meta from file
### What changes were proposed in this pull request?
```java
org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile
```

### Why are the changes needed?
When the number of workers is large, resolving racks one by one becomes slow.

YARN-9332. RackResolver tool should accept multiple hosts
https://issues.apache.org/jira/browse/YARN-9332

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2185 from cxzl25/CELEBORN-1195.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 11:29:28 +08:00
SteNicholas
e7e39a51be
[CELEBORN-1189] Introduce RunningApplicationCount metric and /applications API to record running applications of worker
### What changes were proposed in this pull request?

Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.

### Why are the changes needed?

The `RunningApplicationCount` metric only monitors the count of running applications in the cluster for the master. Meanwhile, the `/listTopDiskUsedApps` API lists the application ids with the top disk usage for master and worker. Therefore, a `RunningApplicationCount` metric and an `/applications` API could be introduced to record the running applications of a worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2172 from SteNicholas/CELEBORN-1189.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-27 09:51:16 +08:00
sychen
5746eb36ae
[CELEBORN-1198] Keep debug info when use SBT build
### What changes were proposed in this pull request?
Add "-g" to javac compile parameters when using SBT build

> -g
Generates all debugging information, including local variables. By default, only line number and source file information is generated.

https://docs.oracle.com/en/java/javase/17/docs/specs/man/javac.html

### Why are the changes needed?
`maven-compiler-plugin` defaults to debug=true, `plexus-compiler-javac` will add the parameter `-g`.

SBT does not have this behavior by default, which leads to some differences between the jars of maven and sbt builds, although the code logic is the same.

https://maven.apache.org/plugins/maven-compiler-plugin/compile-mojo.html#debug

736da68adf/src/main/java/org/apache/maven/plugin/compiler/AbstractCompilerMojo.java (L734)

6ae79d7f2f/plexus-compilers/plexus-compiler-javac/src/main/java/org/codehaus/plexus/compiler/javac/JavacCompiler.java (L279-L285)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```bash
./build/sbt celeborn-worker/package
```

#### Current
`String paramString`

<img width="1450" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/9582402c-93e1-4dc2-b094-0f23c30390a9">

#### PR
<img width="1278" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/82ac3c3d-b3ad-4c94-a73f-09e88371911d">

Closes #2188 from cxzl25/CELEBORN-1198.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-26 15:24:39 +08:00
sychen
2bf7062db3
[CELEBORN-1197] Avoid using the sleep command with the s suffix in bash scripts
### What changes were proposed in this pull request?
Remove the s suffix of sleep

### Why are the changes needed?
On macOS:
```bash
./sbin/start-all.sh
```

```
usage: sleep seconds
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2187 from cxzl25/CELEBORN-1197.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-25 16:57:20 +08:00
SteNicholas
276ab979a4
[CELEBORN-1187][FOLLOWUP] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

Follow up #2171.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2186 from SteNicholas/CELEBORN-1187.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 18:09:39 +08:00
SteNicholas
277f7ced57
[CELEBORN-1187] Unify the size and file count of active shuffle metrics for master and worker
### What changes were proposed in this pull request?

Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.

### Why are the changes needed?

`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2171 from SteNicholas/CELEBORN-1187.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 17:07:39 +08:00
Angerszhuuuu
f751df50ba [CELEBORN-1192][BUG] Celeborn wait task timeout error message should show correct corresponding batch and target host and port
### What changes were proposed in this pull request?
Celeborn wait task timeout error message should show correct corresponding batch and target host and port

### Why are the changes needed?
The current error log is confusing: it does not reveal which target hostAndPushPort has the problem.

### Does this PR introduce _any_ user-facing change?
The log is refactored to help debugging.

### How was this patch tested?

Closes #2183 from AngersZhuuuu/CELEBORN-1192.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-22 16:53:30 +08:00
jiaoqingbo
704fae0e2d
[CELEBORN-1196] Slots allocator will increment disk index repeatedly
### What changes were proposed in this pull request?

Fix a bug about incorrect diskIndex calculation

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #2181 from jiaoqingbo/diskIndex.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-22 14:13:17 +08:00
liangyongyuan
08e7b5962b [CELEBORN-1193] ResettableSlidingWindowReservoir should reset full to false
### What changes were proposed in this pull request?
When ResettableSlidingWindowReservoir is reset, it should reset `full` to `false` and `index` to `0`.

### Why are the changes needed?
The ResettableSlidingWindowReservoir class, after invoking the reset operation, resets the data to zero, but fails to reset the 'index' and 'full' variables. Consequently, when retrieving a snapshot in the next operation, it is possible to obtain a considerable amount of zeros. This issue extends to the inaccurate calculation of metrics such as average and minimum values.
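
A minimal Java sketch of the behavior described above, under stated assumptions: the class is a hypothetical simplified reservoir, not Celeborn's `ResettableSlidingWindowReservoir`, and only the field names `index` and `full` come from the description. It shows why `reset` has to clear both fields, since otherwise a snapshot taken after the reset would still include zeroed slots.

```java
import java.util.Arrays;

// Hypothetical simplified sliding-window reservoir (not the Celeborn class).
final class ResettableWindowSketch {
  private final long[] measurements;
  private int index = 0;        // next slot to write
  private boolean full = false; // true once the window has wrapped around

  ResettableWindowSketch(int size) {
    measurements = new long[size];
  }

  void update(long value) {
    measurements[index] = value;
    index = (index + 1) % measurements.length;
    if (index == 0) {
      full = true;
    }
  }

  // Returns only the slots that actually hold samples.
  long[] snapshot() {
    int count = full ? measurements.length : index;
    return Arrays.copyOf(measurements, count);
  }

  // Zeroes the data and also resets index and full; skipping the last two
  // would make later snapshots report stale zeros as samples.
  void reset() {
    Arrays.fill(measurements, 0L);
    index = 0;
    full = false;
  }

  public static void main(String[] args) {
    ResettableWindowSketch window = new ResettableWindowSketch(3);
    for (long v = 5; v <= 8; v++) {
      window.update(v);
    }
    window.reset();
    window.update(9);
    System.out.println(window.snapshot().length); // prints 1
  }
}
```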

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs.

Closes #2182 from lyy-pineapple/slide-bug.

Lead-authored-by: liangyongyuan <2081248500@qq.com>
Co-authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-21 19:53:47 +08:00
Fu Chen
173950bca2 [CELEBORN-1194] Add sbt-pgp plugin for publishing signed artifacts
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2184 from cfmcgrady/sbt-pgp-plugin.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-21 19:27:09 +08:00
liangyongyuan
4304be1a60 [CELEBORN-1172][SPARK] Support dynamic switch shuffle push write mode based on partition number
### What changes were proposed in this pull request?
Dynamically determine the writing mode in Spark based on the number of partitions.

### Why are the changes needed?
Enhance the flexibility of shuffle writes to improve performance.
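
As a rough illustration of choosing a write mode from the partition count, here is a Java sketch. Everything in it is an assumption made for illustration: the enum, the method name, the threshold value of 2000, and the rule itself (sort-based above the threshold, hash-based otherwise). The description above only states that the mode is decided from the number of partitions.

```java
// Hypothetical sketch: pick a shuffle write mode from the partition count.
final class DynamicWriteModeSketch {
  enum ShuffleWriteMode { HASH, SORT }

  // Assumed rule: many partitions -> SORT, few partitions -> HASH.
  static ShuffleWriteMode choose(int numPartitions, int partitionThreshold) {
    return numPartitions > partitionThreshold ? ShuffleWriteMode.SORT : ShuffleWriteMode.HASH;
  }

  public static void main(String[] args) {
    int threshold = 2000; // illustrative value only
    System.out.println(choose(100, threshold));  // prints HASH
    System.out.println(choose(5000, threshold)); // prints SORT
  }
}
```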

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs.

Closes #2160 from lyy-pineapple/dynamic-write-mode.

Lead-authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-21 16:58:51 +08:00
Fu Chen
8a34c376cb [CELEBORN-1191][FOLLOWUP] Migrate the release script from Maven to SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

1. extract `RELEASE_VERSION` from version.sbt instead of pom.xml
2. enable sbt when making binary distribution package

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2179 from cfmcgrady/celeborn-1191-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-20 23:31:55 +08:00
pengqli
a808c252ba
[CELEBORN-1184] Update the snakeyaml version from 1.33 to 2.2
### What changes were proposed in this pull request?
Update the snakeyaml version from 1.33 to 2.2 reducing direct CVE vulnerabilities.

### Why are the changes needed?
The current snakeyaml version has the following CVE vulnerability, see
https://scout.docker.com/vulnerabilities/id/CVE-2022-1471

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, then ran tests locally.

Closes #2170 from dev-lpq/snakeyaml_version.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-20 21:23:22 +08:00
zwangsheng
6c2fdf7477
[CELEBORN-1188][TEST] Using JUnit function instead of java assert
### What changes were proposed in this pull request?
Using Junit function instead of java assert.

### Why are the changes needed?
When a Java `assert` fails, it throws an `AssertionError`, which makes it hard to find the diff.

![Screenshot 2023-12-20 10 34 52](https://github.com/apache/incubator-celeborn/assets/52876270/b36421a5-64e1-4717-a6d4-3b08db403293)

Instead, when we use JUnit assertions, we can clearly see the diff.

![Screenshot 2023-12-20 11 17 21](https://github.com/apache/incubator-celeborn/assets/52876270/ce39fa20-e9ab-4419-a4ca-62c4157e4b2c)

### Does this PR introduce _any_ user-facing change?
NO, only test changed

### How was this patch tested?
Run CI

Closes #2173 from zwangsheng/CELEBORN-1188.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-20 21:20:38 +08:00
sychen
7f653ce7d6 [CELEBORN-1190] Apply error prone patch and suppress some problems
### What changes were proposed in this pull request?
1. Fix MissingOverride, DefaultCharset, UnnecessaryParentheses Rule
2. Exclude generated sources, FutureReturnValueIgnored, TypeParameterUnusedInFormals, UnusedVariable

### Why are the changes needed?
```
./build/make-distribution.sh --release
```
We get a lot of WARNINGs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2177 from cxzl25/error_prone_patch.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-20 20:54:18 +08:00
Fu Chen
eba1efbb04 [CELEBORN-1191] Migrate the release script from Maven to SBT
### What changes were proposed in this pull request?

1. migrated the release script from Maven to SBT.
2. new clients added for publishing
- `celeborn-client-spark-3-shaded_2.13`
- `celeborn-client-mr-shaded_2.12`

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2178 from cfmcgrady/release-sbt.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-20 20:20:33 +08:00
SteNicholas
089a0f8686
[CELEBORN-1036][FOLLOWUP] totalInflightReqs should decrement when batchIdSet contains the batchId to avoid duplicate caller of removeBatch
### What changes were proposed in this pull request?

`totalInflightReqs` decrements when `batchIdSet` contains the `batchId` to avoid duplicate caller of `removeBatch` in `InFlightRequestTracker`.

### Why are the changes needed?

Callers of `InFlightRequestTracker#removeBatch` may invoke it more than once for the same batch, which can make `totalInflightReqs` negative. `totalInflightReqs` should decrement only when `batchIdSet` contains the `batchId`; if `batchIdSet` does not contain the `batchId`, no decrement is needed.

```
23/12/05 20:05:01 [Executor task launch worker for task 17.0 in stage 10.0 (TID 206)] ERROR InFlightRequestTracker: After waiting for 1200000 ms, there are still -1 batches in flight for hostAndPushPort [], which exceeds the current limit 0.
23/12/05 20:05:01 [Executor task launch worker for task 17.0 in stage 10.0 (TID 206)] WARN InFlightRequestTracker: Clear InFlightRequestTracker
23/12/05 20:05:01 [Executor task launch worker for task 17.0 in stage 10.0 (TID 206)] ERROR Executor: Exception in task 17.0 in stage 10.0 (TID 206)
org.apache.celeborn.common.exception.CelebornIOException: Waiting timeout for task 4-17-0 while limiting zero in-flight requests
	at org.apache.celeborn.client.ShuffleClientImpl.limitZeroInFlight(ShuffleClientImpl.java:598)
	at org.apache.celeborn.client.ShuffleClientImpl.prepareForMergeData(ShuffleClientImpl.java:1175)
	at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.close(HashBasedShuffleWriter.java:455)
	at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.write(HashBasedShuffleWriter.java:210)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:100)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:589)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1545)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:594)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
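
The rule stated above (decrement only if the batch id is still tracked) can be sketched in Java as follows. This is a hypothetical simplified tracker, not Celeborn's `InFlightRequestTracker`: the names `removeBatch`, `batchIdSet`, and `totalInflightReqs` come from the description, while the data structures and the `addBatch` helper are assumptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of an idempotent removeBatch (not the Celeborn class).
final class RemoveBatchSketch {
  private final Map<String, Set<Integer>> inflightBatchesPerAddress = new ConcurrentHashMap<>();
  private final AtomicInteger totalInflightReqs = new AtomicInteger();

  void addBatch(int batchId, String hostAndPushPort) {
    Set<Integer> batchIdSet =
        inflightBatchesPerAddress.computeIfAbsent(hostAndPushPort, k -> ConcurrentHashMap.newKeySet());
    if (batchIdSet.add(batchId)) {
      totalInflightReqs.incrementAndGet();
    }
  }

  void removeBatch(int batchId, String hostAndPushPort) {
    Set<Integer> batchIdSet = inflightBatchesPerAddress.get(hostAndPushPort);
    // Set.remove returns true only if the id was present, so a duplicate
    // call for the same batch does not decrement the counter again.
    if (batchIdSet != null && batchIdSet.remove(batchId)) {
      totalInflightReqs.decrementAndGet();
    }
  }

  int totalInflightReqs() {
    return totalInflightReqs.get();
  }

  public static void main(String[] args) {
    RemoveBatchSketch tracker = new RemoveBatchSketch();
    tracker.addBatch(7, "host-a:9092");
    tracker.removeBatch(7, "host-a:9092");
    tracker.removeBatch(7, "host-a:9092"); // duplicate caller
    System.out.println(tracker.totalInflightReqs()); // prints 0
  }
}
```

With an unconditional decrement, the second `removeBatch` call would leave the counter at -1, matching the "-1 batches in flight" message in the log above.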

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2134 from SteNicholas/CELEBORN-1036.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-20 18:11:07 +08:00
SteNicholas
35aa54bfe3 [MINOR] Update log level of CommitFiles success for CommitHandler from error to info
### What changes were proposed in this pull request?

Update log level of CommitFiles success for `CommitHandler` from error to info.

### Why are the changes needed?

The log level of sending CommitFiles success for `CommitHandler` should not be error.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2174 from SteNicholas/commit-files-log.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-20 15:13:38 +08:00
mingji
4dacf72a6d
[CELEBORN-1150] support io encryption for spark
### What changes were proposed in this pull request?
1. To support io encryption for spark.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and manually test on a cluster.

Closes #2135 from FMX/B1150.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-12-19 11:44:05 +08:00
Fu Chen
7a58b91c2a Bump 0.5.0-SNAPSHOT
2023-12-18 12:14:04 +08:00
Chandni Singh
b09febdd8c [CELEBORN-1176] Server side support for Sasl Auth
### What changes were proposed in this pull request?

This adds the server side Sasl authentication support in the transport layer. Most of this code is taken from Apache Spark.

### Why are the changes needed?

The changes are needed for adding authentication to Celeborn. See [CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs.

Closes #2164 from otterc/CELEBORN-1176.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-18 11:27:28 +08:00
zky.zhoukeyong
4b7702e49c [CELEBORN-1181] Filter out null endpoint workers in destroySlotsWithRetry
### What changes were proposed in this pull request?
To avoid NPE in `val future = workerInfo.endpoint.ask[DestroyWorkerSlotsResponse](destroy)`
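
A guard of this kind can be sketched in Java as below. The `Worker` type and the `withEndpoint` helper are hypothetical names introduced for illustration; the real code is the Scala call quoted above, and the sketch only shows the idea of dropping workers whose endpoint is `null` before any request is sent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: skip workers without an endpoint before sending requests.
final class NullEndpointFilterSketch {
  static final class Worker {
    final String host;
    final Object endpoint; // null when no RPC endpoint is available

    Worker(String host, Object endpoint) {
      this.host = host;
      this.endpoint = endpoint;
    }
  }

  // Keeps only workers whose endpoint can safely be dereferenced.
  static List<Worker> withEndpoint(List<Worker> workers) {
    List<Worker> usable = new ArrayList<>();
    for (Worker worker : workers) {
      if (worker.endpoint != null) {
        usable.add(worker);
      }
    }
    return usable;
  }

  public static void main(String[] args) {
    List<Worker> workers =
        Arrays.asList(new Worker("w1", new Object()), new Worker("w2", null));
    System.out.println(withEndpoint(workers).size()); // prints 1
  }
}
```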

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test

Closes #2166 from waitinfuture/1181.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-17 20:17:20 +08:00
zky.zhoukeyong
e361788e48 [CELEBORN-1178] Destroy fail reserved slots in LifecycleManager#reserveSlotsWithRetry
### What changes were proposed in this pull request?
While testing the main branch, I encountered the following scenario.
I ran `sbin/stop-worker.sh` nearly simultaneously on 3 out of 6 workers, expecting the 3 workers
to shut down soon because I had enabled graceful shutdown. However, only the first worker I stopped
shut down within 15s as expected; the other two did not shut down until the shutdown timeout.

After digging into it, I found that `LifecycleManager#reserveSlotsWithRetry` reserves the same location
twice:
1. At T1, only worker1 has shut down; pushes receive HARD_SPLIT and go to revive
2. At T2, LifecycleManager handles the revive requests in batch and tries to reallocate the locs to other workers
3. At T3, reserving on worker3 succeeds because it has not shut down yet, but reserving on worker2 fails because it has shut down
4. At T4, LifecycleManager re-allocates the failed slots to other workers except worker1 and worker2. However, by this time worker3 has also shut down, so reserving on worker3 fails
5. At T5, it re-allocates the slots that failed on worker3. However, `getFailedPartitionLocations` will return slots allocated to worker3 in step 3, and increment the epoch to 2. At this time, worker3 has slots of epoch 1, but they will never be pushed to because newer epoch 3 is generated at the same time
6. Since the epoch 2 locs in worker3 will never be pushed to, they never get a chance to return HARD_SPLIT; as a result, worker3 can't shut down fast and waits until the timeout.

This PR fixes this by destroying failed-to-be-reserved slots in the process of `reserveSlotsWithRetry`.

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Before:
![image](https://github.com/apache/incubator-celeborn/assets/948245/50c55524-d37f-494e-a5aa-fba682438cda)
After:
![image](https://github.com/apache/incubator-celeborn/assets/948245/8c90a869-b388-46f3-a86b-a37fd0f4ce0f)

Closes #2163 from waitinfuture/1178.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-17 14:28:04 +08:00
pengqli
1037fbf921 [CELEBORN-1173] Upgrade netty version from 4.1.93.Final to 4.1.101.Final
### What changes were proposed in this pull request?
upgrade netty all version from 4.1.93.Final to 4.1.101.Final reducing direct CVE vulnerabilities

### Why are the changes needed?
The current netty version has the following CVE vulnerabilities, see
https://scout.docker.com/vulnerabilities/id/CVE-2023-4586
https://scout.docker.com/vulnerabilities/id/CVE-2023-44487
https://scout.docker.com/vulnerabilities/id/GHSA-xpw8-rcwv-8f8p

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, then ran tests locally.

Closes #2150 from dev-lpq/update_netty_all_version.

Lead-authored-by: pengqli <pengqli@cisco.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 14:03:37 +08:00
pengqli
0860553e18 [CELEBORN-1163] Upgrade protobuf from 3.19.2 to 3.21.7
### What changes were proposed in this pull request?
upgrade protobuf from 3.19.2 to 3.21.7 reducing direct CVE vulnerabilities

### Why are the changes needed?

The current protobuf version has the following CVE vulnerabilities, see
https://scout.docker.com/vulnerabilities/id/CVE-2022-3510
https://scout.docker.com/vulnerabilities/id/CVE-2022-3509
https://scout.docker.com/vulnerabilities/id/CVE-2021-22570
https://scout.docker.com/vulnerabilities/id/CVE-2021-22569

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Ran `./build/make-distribution.sh` to package, then ran tests locally.

Closes #2142 from dev-lpq/upgrade_protobuf-java_version.

Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 13:58:36 +08:00
Chandni Singh
600bd53616 [CELEBORN-1180] Changed the version of Sasl Auth related config to 0.5
### What changes were proposed in this pull request?
Changes the version of the config to 0.5 given that 0.4 will be released soon.

### Why are the changes needed?
See above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

Closes #2165 from otterc/CELEBORN-1180.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 13:45:46 +08:00
zky.zhoukeyong
309153a99b [CELEBORN-1175] Add UT for commit files
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
As title.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes UTs.

Closes #2162 from waitinfuture/1175-2.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-16 01:36:29 +08:00
zky.zhoukeyong
01feb93abb [CELEBORN-1167] Avoid calling parmap when destroy slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When LifecycleManager calls destroy slots

This PR fixes the fourth one. In more detail, this PR eliminates `parmap` when destroying slots, and also replaces `askSync` with `ask`.
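
The idea of replacing a blocking call per thread with a non-blocking `ask` can be sketched roughly as follows. This is an illustration only, not the code from this PR: `WorkerEndpoint`, `DestroySlotsSketch`, and `destroyAll` are hypothetical names, and `CompletableFuture` stands in for whatever future type the RPC layer actually returns.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical stand-in for an RPC endpoint; not Celeborn's actual API.
interface WorkerEndpoint {
  // Non-blocking request: returns immediately with a future for the reply.
  CompletableFuture<Boolean> ask(String request);
}

class DestroySlotsSketch {
  // Sends one request per worker without dedicating a thread to each call,
  // then waits once for all replies and counts the acknowledgements.
  static long destroyAll(List<WorkerEndpoint> workers, String request) {
    List<CompletableFuture<Boolean>> replies = workers.stream()
        .map(w -> w.ask(request))
        .collect(Collectors.toList());
    // Single wait point for the whole batch.
    CompletableFuture.allOf(replies.toArray(new CompletableFuture[0])).join();
    // All futures are complete here, so join() returns without blocking.
    return replies.stream().filter(f -> f.join()).count();
  }
}
```

With this shape, the number of threads no longer grows with the number of workers, since the caller only holds futures while the requests are in flight.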

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2156 from waitinfuture/1167.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 18:30:31 +08:00
Fu Chen
41df4ebbea [CELEBORN-1156][BUILD] SBT publish support
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

Yes, the user can publish shade clients via SBT

### How was this patch tested?

```shell
docker run -d -p 8081:8081 sonatype/nexus3
```

```shell
export SONATYPE_SNAPSHOTS_URL=http://192.168.3.46:8081/repository/maven-snapshots/
export SONATYPE_RELEASES_URL=http://192.168.3.46:8081/repository/maven-releases/
export ASF_USERNAME=admin
export ASF_PASSWORD=123456
```

- Publish the shade client for Spark 3.5:
```shell
./build/sbt -Pspark-3.4 celeborn-client-spark-3-shaded/publish
```

<img width="1673" alt="Screenshot 2023-12-08 10 22 07 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/1e87e7e2-cf3b-4bc0-8272-0f5b03ee65bf">

- Publish the shade client for Flink 1.18:

```shell
$ ./build/sbt -Pflink-1.18 celeborn-client-flink-1_18-shaded/publish
```
<img width="1676" alt="Screenshot 2023-12-08 10 25 28 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/62d0c3c4-e105-4e8a-8d8d-e78650a2eb09">

- Publish the shade client for MapReduce:
```shell
$ ./build/sbt -Pmr celeborn-client-mr-shaded/publish
```
<img width="1672" alt="Screenshot 2023-12-08 10 25 47 PM" src="https://github.com/apache/incubator-celeborn/assets/8537877/563d5ad5-fa6d-46fc-9465-8279ef96385a">

Closes #2129 from cfmcgrady/sbt-publish.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-12-15 11:22:35 +08:00
zky.zhoukeyong
b4bbe4b151 [CELEBORN-1171] Add UT for LifecycleManager's async setup endpoints
### What changes were proposed in this pull request?
as title

### Why are the changes needed?
as title

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes GA

Closes #2159 from waitinfuture/1171.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-15 11:00:13 +08:00
sychen
1567fec194 [CELEBORN-1169] Bump Spark from 3.4.1 to 3.4.2
### What changes were proposed in this pull request?

### Why are the changes needed?
[Spark 3.4.2 released](https://spark.apache.org/news/spark-3-4-2-released.html)
November 30, 2023

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2157 from cxzl25/CELEBORN-1169.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 23:06:01 +08:00
Chandni Singh
a03ce6c165 [CELEBORN-1157] Add client-side support for Sasl Authentication in the transport layer
### What changes were proposed in this pull request?
This adds the client side Sasl authentication support in the transport layer. Most of this code is taken from Apache Spark.

### Why are the changes needed?
The changes are needed for adding authentication to Celeborn. See [CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011).

### Does this PR introduce _any_ user-facing change?
Added a configuration for Sasl request timeout

### How was this patch tested?
Will be adding `CelebornSaslSuiteJ.java` (https://github.com/apache/incubator-celeborn/pull/2105) that tests the end-to-end Sasl flow.

Closes #2139 from otterc/CELEBORN-1157.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 22:52:49 +08:00
sychen
2504b50dd2 [CELEBORN-1170] Upgrade snappy-java from 1.1.8.2 to 1.1.10.5
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/2143

snappy-java 1.1.8.2 has the following CVE vulnerabilities, see
https://scout.docker.com/vulnerabilities/id/CVE-2023-43642
https://scout.docker.com/vulnerabilities/id/CVE-2023-34455

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2158 from cxzl25/CELEBORN-1170.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-14 22:28:32 +08:00
SteNicholas
850d3199ef [CELEBORN-1164] Introduce FetchChunkFailCount metric to expose the count of fetching chunk failed in current worker
### What changes were proposed in this pull request?

Introduce the `FetchChunkFailCount` metric to expose the number of failed chunk fetches in the current worker.

### Why are the changes needed?

Metrics for the number of failed PushData and PushMergedData requests in the current worker are already supported. It would be better to also support a `FetchChunkFailCount` metric that exposes the number of failed chunk fetches in the current worker.
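
A failure counter of this kind could look roughly like the sketch below. It is a minimal illustration under assumptions: `WorkerMetricsSketch` and `onFetchChunkFailed` are hypothetical names, and Celeborn's real metrics source classes are not shown here.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical minimal counter holder; not Celeborn's actual metrics API.
class WorkerMetricsSketch {
  // LongAdder keeps increments cheap when many threads report failures.
  private final LongAdder fetchChunkFailCount = new LongAdder();

  // Assumed to be called from the failure path of a chunk fetch.
  void onFetchChunkFailed() {
    fetchChunkFailCount.increment();
  }

  // Value that a metrics reporter would expose as FetchChunkFailCount.
  long fetchChunkFailCount() {
    return fetchChunkFailCount.sum();
  }
}
```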

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal test.

Closes #2151 from SteNicholas/CELEBORN-1164.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 23:01:16 +08:00
zky.zhoukeyong
ea0fff057f [CELEBORN-1166] Avoid calling parmap when setup endpoint
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the third one. In more detail, this PR eliminates `parmap` when setting up connections to workers, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2154 from waitinfuture/1166.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 17:07:28 +08:00
zky.zhoukeyong
4303be3231 [CELEBORN-1165] Avoid calling parmap when reserve slots
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the second one. In more detail, this PR eliminates `parmap` when reserving slots, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2152 from waitinfuture/1165-1.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 16:37:20 +08:00
Fu Chen
0f2a9a3a63 [CELEBORN-1160][FOLLOWUP] Update the version for celeborn.client.rpc.shared.threads to 0.3.2
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Since we are backporting #2145 to branch-0.3, and the configuration entry `celeborn.client.rpc.shared.threads` in #2145 has a start version of 0.4.0, this update aligns the version accordingly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2153 from cfmcgrady/celeborn-1160-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 15:12:50 +08:00
zky.zhoukeyong
92bebd305d [CELEBORN-1160] Avoid calling parmap when commit files
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

![image](https://github.com/apache/incubator-celeborn/assets/948245/1e9a0b83-32fe-40d5-8739-2b370e030fc8)

There are four places where parmap is called:

1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close

This PR fixes the first one. In more detail, this PR eliminates `parmap` when committing files, and also replaces `askSync` with `ask`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and GA.

Closes #2145 from waitinfuture/1160.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 14:36:48 +08:00
宪英
c1120e8b44 [CELEBORN-1139] Master's follower clean state before install snapshot
### What changes were proposed in this pull request?

The master follower now cleans its state before installing a snapshot, instead of adding the snapshot contents on top of the existing state.

### Why are the changes needed?
When a master's follower node receives a status snapshot from the leader, it updates the state machine directly without cleaning up the outdated status. This can cause problems; for example, the worker list may end up with an extra copy of the registered workers.
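
The duplication described above can be reproduced with a small sketch. It is illustrative only: `FollowerStateSketch` and its methods are hypothetical and model the state as a plain worker list rather than the actual master state machine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical follower state holding only a worker list; names are illustrative.
class FollowerStateSketch {
  final List<String> workers = new ArrayList<>();

  // Problematic behaviour: snapshot entries are appended to the existing state,
  // so workers that were already registered appear twice.
  void installWithoutClean(List<String> snapshotWorkers) {
    workers.addAll(snapshotWorkers);
  }

  // Behaviour after the fix: drop the outdated state first, then load the snapshot.
  void installAfterClean(List<String> snapshotWorkers) {
    workers.clear();
    workers.addAll(snapshotWorkers);
  }
}
```

If the follower already holds `w1` and the snapshot contains `w1` and `w2`, the first method leaves three entries while the second leaves exactly the two from the snapshot.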

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.
org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterStateMachineSuiteJ

Closes #2147 from liying919/main.

Authored-by: 宪英 <xianying.ly@antgroup.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-12-13 09:55:36 +08:00