### What changes were proposed in this pull request?
To eliminate build failure when using SBT.
### Why are the changes needed?
If the maven local cache is enabled, SBT can't find the correct dependencies.
If the maven local cache is disabled, SBT can find the correct dependencies.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
GA
Closes #2199 from FMX/b1205.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Unify the logic that parses a uniqueId into a WorkerInfo.
### Why are the changes needed?
Keep the uniqueId parsing behavior consistent and avoid having to change it in multiple places.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #2202 from zwangsheng/CELEBORN-1208.
Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Rename FileWriter to PartitionLocationDataWriter; in the constructor, add storageManager and remove fileinfo and flusher.
FileInfo(userIdentifier,partitionSplitEnabled,fileMeta)
– NonMemoryFileInfo(streams,filePath,storageType,bytesFlushed)
– MemoryFileInfo(length,buffer)
FileMeta
– reduceFileMeta(chunkOffsets,sorted)
– mapFileMeta(bufferSize,numSubPartitions)
### Why are the changes needed?
1. To make the concepts clearer.
2. To support memory storage and HDFS slot management.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and cluster test with worker kill.
Closes #2130 from FMX/b1133.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
It's 2024 now, and this patch is expected to be applied to all active branches (those that are planned to be released).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Review
Closes #2198 from pan3793/CELEBORN-1204.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Remove pipeline feature for sort based writer
### Why are the changes needed?
The pipeline feature was added as part of CELEBORN-295 for performance. Later, an unresolvable issue that would crash the JVM was identified in https://github.com/apache/incubator-celeborn/pull/1807, and after discussion we decided to remove the feature.
### Does this PR introduce _any_ user-facing change?
No. The pipeline feature is disabled by default, so nothing changes for users who use the default settings.
### How was this patch tested?
Pass GA.
Closes #2196 from pan3793/CELEBORN-891.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2197 from cfmcgrady/license.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
`LICENSE` mentions third-party components under other open source licenses like Apache Spark etc.
### Why are the changes needed?
`LICENSE` mentions only one third-party file, from Guava. However, `NOTICE` lists both Apache Spark and Apache Flink. `LICENSE` should mention all third-party components that are under other open source licenses.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2193 from SteNicholas/CELEBORN-1202.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/2189#issuecomment-1870940496
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested
Closes #2195 from cfmcgrady/license-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
This PR introduces LICENSE and NOTICE files for service related sub-projects
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```shell
$ ./build/sbt celeborn-worker/package
$ jar tf worker/target/scala-2.12/celeborn-worker_2.12-0.5.0-SNAPSHOT.jar | grep -i 'license\|notice'
META-INF/LICENSE
META-INF/NOTICE
```
Closes #2189 from cfmcgrady/license.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
MapperAttempts for a shuffle replies with `MAP_ENDED` when push data or push merged data is received from a speculative task after the mapper has already ended.
Follow up #1591.
### Why are the changes needed?
When push data or push merged data is received from a speculative task after the mapper has already ended, `PushDataHandler` should trigger MapEnd instead of StageEnd on the worker. Meanwhile, `ShuffleClientImpl` should handle `STAGE_ENDED` as MapEnd; otherwise other tasks of the stage can no longer send shuffle data, which leads to data loss.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal test.
Closes #2190 from SteNicholas/CELEBORN-678.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
When `inflightBatchesPerAddress` is cleared in `InFlightRequestTracker.cleanup`, `totalInflightReqs` should also be reset to avoid getting stuck when exiting.
### Why are the changes needed?
`inflightBatchesPerAddress` has been cleared and is empty, but `totalInflightReqs` stays greater than 0.

This occurred during the first attempt of the task, where the request for map end failed, but the driver marked that the map has already ended.
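A minimal sketch of the reset, assuming a simplified tracker. The class `SketchInFlightTracker` and its method names are hypothetical stand-ins for illustration and are not the actual `InFlightRequestTracker` code:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for InFlightRequestTracker.
class SketchInFlightTracker {
  private final Map<String, Set<Integer>> inflightBatchesPerAddress = new ConcurrentHashMap<>();
  private final AtomicInteger totalInflightReqs = new AtomicInteger();

  void addBatch(int batchId, String hostAndPushPort) {
    Set<Integer> batches =
        inflightBatchesPerAddress.computeIfAbsent(hostAndPushPort, k -> ConcurrentHashMap.newKeySet());
    // Count a batch only once, even if it is registered twice.
    if (batches.add(batchId)) {
      totalInflightReqs.incrementAndGet();
    }
  }

  void cleanup() {
    inflightBatchesPerAddress.clear();
    // Reset the counter together with the map; otherwise a waiter that
    // polls total() would keep seeing a stale positive value.
    totalInflightReqs.set(0);
  }

  int total() {
    return totalInflightReqs.get();
  }
}
```

Without the `set(0)` call in `cleanup`, the map would be empty while `total()` still reported the batches that were in flight before the cleanup.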

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Through existing UTs.
Closes #2191 from lyy-pineapple/celebron-1036.
Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
```java
org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile
```
### Why are the changes needed?
When the number of workers is large, parsing them one by one becomes slow.
YARN-9332. RackResolver tool should accept multiple hosts
https://issues.apache.org/jira/browse/YARN-9332
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2185 from cxzl25/CELEBORN-1195.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `RunningApplicationCount` metric and `/applications` API to record running applications for Celeborn worker.
### Why are the changes needed?
The `RunningApplicationCount` metric currently only monitors the count of running applications in the cluster for the master. Meanwhile, the `/listTopDiskUsedApps` API lists the application ids with the top disk usage for both master and worker. Therefore, a `RunningApplicationCount` metric and an `/applications` API can be introduced to record the running applications of a worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2172 from SteNicholas/CELEBORN-1189.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove the s suffix of sleep
### Why are the changes needed?
MacOS
```bash
./sbin/start-all.sh
```
```
usage: sleep seconds
```
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2187 from cxzl25/CELEBORN-1197.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.
Follow up #2171.
### Why are the changes needed?
`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2186 from SteNicholas/CELEBORN-1187.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Unify the size and file count of active shuffle metrics for `MasterSource` and `WorkerSource`.
### Why are the changes needed?
`MasterSource` uses `PartitionWritten` and `PartitionFileCount` metrics as the size and file count of active shuffle for all workers. Meanwhile, `WorkerSource` uses `ActiveShuffleSize` and `ActiveShuffleFileCount` metrics as the size and file count of active shuffle for a worker including master replica and slave replica. It's recommended to unify the size and file count of active shuffle metrics between `MasterSource` and `WorkerSource`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2171 from SteNicholas/CELEBORN-1187.
Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: 蒋晓峰 <jiangxiaofeng@bilibili.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The error message for a Celeborn wait-task timeout should show the correct batch and the corresponding target host and port.
### Why are the changes needed?
The current error log is confusing: it is not possible to find out which target hostAndPushPort has the problem.
### Does this PR introduce _any_ user-facing change?
The log message is refactored to help debugging.
### How was this patch tested?
Closes #2183 from AngersZhuuuu/CELEBORN-1192.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Fix a bug about incorrect diskIndex calculation
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
PASS GA
Closes #2181 from jiaoqingbo/diskIndex.
Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
When `ResettableSlidingWindowReservoir` is reset, it should reset `full` to `false` and `index` to `0`.
### Why are the changes needed?
The ResettableSlidingWindowReservoir class, after invoking the reset operation, resets the data to zero, but fails to reset the 'index' and 'full' variables. Consequently, when retrieving a snapshot in the next operation, it is possible to obtain a considerable amount of zeros. This issue extends to the inaccurate calculation of metrics such as average and minimum values.
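The effect of the missing reset can be sketched with a simplified sliding window. The class `SketchSlidingWindow` below is a hypothetical illustration, not the actual `ResettableSlidingWindowReservoir` implementation; only the field names `index` and `full` are taken from the description above:

```java
import java.util.Arrays;

// Hypothetical, simplified sliding-window reservoir.
class SketchSlidingWindow {
  private final long[] measurements;
  private int index = 0;        // next slot to write
  private boolean full = false; // true once every slot has been written

  SketchSlidingWindow(int size) {
    measurements = new long[size];
  }

  void update(long value) {
    measurements[index] = value;
    index = (index + 1) % measurements.length;
    if (index == 0) {
      full = true;
    }
  }

  // Returns only the slots that actually hold recorded values.
  long[] snapshot() {
    int count = full ? measurements.length : index;
    return Arrays.copyOf(measurements, count);
  }

  void reset() {
    Arrays.fill(measurements, 0L);
    // Without these two lines, snapshot() would still report the zeroed
    // slots as valid values and skew averages and minimums.
    index = 0;
    full = false;
  }
}
```

In this sketch, a snapshot taken right after `reset()` is empty, and the first `update` afterwards yields a snapshot with exactly one value instead of one value padded with zeros.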
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs.
Closes #2182 from lyy-pineapple/slide-bug.
Lead-authored-by: liangyongyuan <2081248500@qq.com>
Co-authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #2184 from cfmcgrady/sbt-pgp-plugin.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Dynamically determine the writing mode in Spark based on the number of partitions.
### Why are the changes needed?
Enhance the flexibility of shuffle writes to improve performance.
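A minimal sketch of such a selection, under stated assumptions: the description does not say which mode is chosen on which side of the threshold, so the rule below (hash-based writer for few partitions, sort-based writer for many) and the names `SketchWriteModeSelector` and `partitionThreshold` are hypothetical:

```java
// Hypothetical selector; the threshold rule is an assumption for illustration.
class SketchWriteModeSelector {
  enum WriteMode { HASH, SORT }

  // Assumed rule: above the threshold pick SORT, otherwise pick HASH.
  static WriteMode choose(int numPartitions, int partitionThreshold) {
    return numPartitions > partitionThreshold ? WriteMode.SORT : WriteMode.HASH;
  }
}
```

For example, with an assumed threshold of 100, `choose(10, 100)` returns `HASH` and `choose(1000, 100)` returns `SORT`.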
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs.
Closes #2160 from lyy-pineapple/dynamic-write-mode.
Lead-authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
1. extract `RELEASE_VERSION` from version.sbt instead of pom.xml
2. enable sbt when making binary distribution package
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #2179 from cfmcgrady/celeborn-1191-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
Update the snakeyaml version from 1.33 to 2.2 to reduce direct CVE vulnerabilities.
### Why are the changes needed?
The current snakeyaml version has the following CVE vulnerability, see
https://scout.docker.com/vulnerabilities/id/CVE-2022-1471
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `./build/make-distribution.sh` to package, then ran tests locally.
Closes #2170 from dev-lpq/snakeyaml_version.
Authored-by: pengqli <pengqli@cisco.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
1. Fix MissingOverride, DefaultCharset, UnnecessaryParentheses Rule
2. Exclude generated sources, FutureReturnValueIgnored, TypeParameterUnusedInFormals, UnusedVariable
### Why are the changes needed?
```
./build/make-distribution.sh --release
```
We get a lot of WARNINGs.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes #2177 from cxzl25/error_prone_patch.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
1. Migrated the release script from Maven to SBT.
2. Added new clients for publishing:
- `celeborn-client-spark-3-shaded_2.13`
- `celeborn-client-mr-shaded_2.12`
### Why are the changes needed?
As title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #2178 from cfmcgrady/release-sbt.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
### What changes were proposed in this pull request?
`totalInflightReqs` is decremented only when `batchIdSet` contains the `batchId`, which guards against duplicate calls to `removeBatch` in `InFlightRequestTracker`.
### Why are the changes needed?
`InFlightRequestTracker#removeBatch` may be called more than once for the same batch, which can make `totalInflightReqs` negative. `totalInflightReqs` should be decremented only when `batchIdSet` contains the `batchId`; if `batchIdSet` does not contain it, there is nothing to decrement.
```
23/12/05 20:05:01 [Executor task launch worker for task 17.0 in stage 10.0 (TID 206)] ERROR InFlightRequestTracker: After waiting for 1200000 ms, there are still -1 batches in flight for hostAndPushPort [], which exceeds the current limit 0.
23/12/05 20:05:01 [Executor task launch worker for task 17.0 in stage 10.0 (TID 206)] WARN InFlightRequestTracker: Clear InFlightRequestTracker
23/12/05 20:05:01 [Executor task launch worker for task 17.0 in stage 10.0 (TID 206)] ERROR Executor: Exception in task 17.0 in stage 10.0 (TID 206)
org.apache.celeborn.common.exception.CelebornIOException: Waiting timeout for task 4-17-0 while limiting zero in-flight requests
at org.apache.celeborn.client.ShuffleClientImpl.limitZeroInFlight(ShuffleClientImpl.java:598)
at org.apache.celeborn.client.ShuffleClientImpl.prepareForMergeData(ShuffleClientImpl.java:1175)
at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.close(HashBasedShuffleWriter.java:455)
at org.apache.spark.shuffle.celeborn.HashBasedShuffleWriter.write(HashBasedShuffleWriter.java:210)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:100)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:589)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1545)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:594)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
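The guard can be sketched as follows. This is a simplified illustration with a hypothetical class `SketchBatchTracker`, not the actual `InFlightRequestTracker` code; it relies only on the fact that `Set.remove` reports whether the element was present:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified tracker for a single target address.
class SketchBatchTracker {
  private final Set<Integer> batchIdSet = ConcurrentHashMap.newKeySet();
  private final AtomicInteger totalInflightReqs = new AtomicInteger();

  void addBatch(int batchId) {
    if (batchIdSet.add(batchId)) {
      totalInflightReqs.incrementAndGet();
    }
  }

  void removeBatch(int batchId) {
    // remove() returns true only if the id was still tracked, so a
    // duplicate call for the same batch leaves the counter untouched.
    if (batchIdSet.remove(batchId)) {
      totalInflightReqs.decrementAndGet();
    }
  }

  int total() {
    return totalInflightReqs.get();
  }
}
```

With this guard, calling `removeBatch` twice for the same `batchId` leaves the counter at 0 rather than at -1 as in the log above.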
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal tests.
Closes #2134 from SteNicholas/CELEBORN-1036.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Update log level of CommitFiles success for `CommitHandler` from error to info.
### Why are the changes needed?
The log level of sending CommitFiles success for `CommitHandler` should not be error.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
Closes #2174 from SteNicholas/commit-files-log.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Support IO encryption for Spark.
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
GA and manual testing on a cluster.
Closes #2135 from FMX/B1150.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
This adds the server side Sasl authentication support in the transport layer. Most of this code is taken from Apache Spark.
### Why are the changes needed?
The changes are needed for adding authentication to Celeborn. See [CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs.
Closes #2164 from otterc/CELEBORN-1176.
Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
To avoid NPE in `val future = workerInfo.endpoint.ask[DestroyWorkerSlotsResponse](destroy)`
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes #2166 from waitinfuture/1181.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
While testing the main branch I encountered the following scenario.
I ran `sbin/stop-worker.sh` nearly simultaneously on 3 out of 6 workers, expecting the 3 workers
to shut down soon because I had enabled graceful shutdown. However, only the first worker I stopped
shut down within 15s as expected; the other two did not shut down until the shutdown timeout.
After digging into it, I found that `LifecycleManager#reserveSlotsWithRetry` reserves the same location
twice:
1. At T1, only worker1 has shut down; pushes receive HARD_SPLIT and go to revive.
2. At T2, LifecycleManager handles the revive requests in batch and tries to reallocate the locs to other workers.
3. At T3, the reserve on worker3 succeeds because it has not shut down yet, but the reserve on worker2 fails because it has shut down.
4. At T4, LifecycleManager re-allocates the failed slots to workers other than worker1 and worker2. However, by this time worker3 has also shut down, so the reserve on worker3 fails.
5. At T5, it re-allocates the slots that failed on worker3. However, `getFailedPartitionLocations` returns the slots allocated to worker3 in step 3, and increments the epoch to 2. At this time, worker3 has slots of epoch 1, but they will never be pushed to because newer epoch 3 is generated at the same time.
6. Since the epoch 2 locs in worker3 will never be pushed to, they never get a chance to return HARD_SPLIT; as a result worker3 can't shut down quickly and waits until the timeout.
This PR fixes this by destroying the slots that failed to be reserved during `reserveSlotsWithRetry`.
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test.
Closes #2163 from waitinfuture/1178.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Upgrade the netty-all version from 4.1.93.Final to 4.1.101.Final to reduce direct CVE vulnerabilities.
### Why are the changes needed?
The current netty version has the following CVE vulnerabilities, see
https://scout.docker.com/vulnerabilities/id/CVE-2023-4586
https://scout.docker.com/vulnerabilities/id/CVE-2023-44487
https://scout.docker.com/vulnerabilities/id/GHSA-xpw8-rcwv-8f8p
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `./build/make-distribution.sh` to package, then ran tests locally.
Closes #2150 from dev-lpq/update_netty_all_version.
Lead-authored-by: pengqli <pengqli@cisco.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Changes the version of the config to 0.5 given that 0.4 will be released soon.
### Why are the changes needed?
See above.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
NA
Closes #2165 from otterc/CELEBORN-1180.
Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title.
### Why are the changes needed?
As title.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passes UTs.
Closes #2162 from waitinfuture/1175-2.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

There are four places where parmap is called:
1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When LifecycleManager calls destroy slots
This PR fixes the fourth one. In more detail, this PR eliminates `parmap` when destroying slots, and also replaces `askSync` with `ask`.
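The general idea of replacing a blocking, thread-per-call fan-out with non-blocking requests can be sketched as below. This is an illustration only: `SketchAsyncFanOut`, `askAll`, and the `ask` function parameter are hypothetical names, and `CompletableFuture` stands in for whatever future type the RPC layer actually returns:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical helper: send every request first, then wait once for all replies.
class SketchAsyncFanOut {
  static <W, R> List<R> askAll(
      List<W> workers, Function<W, CompletableFuture<R>> ask, long timeoutMs) {
    List<CompletableFuture<R>> futures = new ArrayList<>();
    for (W worker : workers) {
      // `ask` is assumed to return immediately with a future, so no
      // extra thread is needed per worker.
      futures.add(ask.apply(worker));
    }
    try {
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
          .get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      throw new IllegalStateException("Waiting for worker replies failed", e);
    }
    List<R> results = new ArrayList<>();
    for (CompletableFuture<R> future : futures) {
      results.add(future.join()); // already completed at this point
    }
    return results;
  }
}
```

The calling thread is the only one that blocks, and it blocks once for the whole batch, so the number of threads no longer grows with the number of workers.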
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test and GA.
Closes #2156 from waitinfuture/1167.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: cxzl25 <cxzl25@users.noreply.github.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
as title
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passes GA
Closes #2159 from waitinfuture/1171.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
[Spark 3.4.2 released](https://spark.apache.org/news/spark-3-4-2-released.html)
November 30, 2023
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #2157 from cxzl25/CELEBORN-1169.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
This adds the client side Sasl authentication support in the transport layer. Most of this code is taken from Apache Spark.
### Why are the changes needed?
The changes are needed for adding authentication to Celeborn. See [CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011).
### Does this PR introduce _any_ user-facing change?
Added a configuration for Sasl request timeout
### How was this patch tested?
Will be adding `CelebornSaslSuiteJ.java` (https://github.com/apache/incubator-celeborn/pull/2105) that tests the end-to-end Sasl flow.
Closes #2139 from otterc/CELEBORN-1157.
Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce the `FetchChunkFailCount` metric to expose the count of failed chunk fetches in the current worker.
### Why are the changes needed?
Metrics for the count of failed PushData and PushMergedData requests in the current worker are already supported. It's better to also support a `FetchChunkFailCount` metric that exposes the count of failed chunk fetches in the current worker.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Internal test.
Closes #2151 from SteNicholas/CELEBORN-1164.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

There are four places where parmap is called:
1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close
This PR fixes the third one. In more detail, this PR eliminates `parmap` when setting up connections to workers, and also replaces `askSync` with `ask`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test and GA.
Closes #2154 from waitinfuture/1166.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

There are four places where parmap is called:
1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close
This PR fixes the second one. In more detail, this PR eliminates `parmap` when reserving slots, and also replaces `askSync` with `ask`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test and GA.
Closes #2152 from waitinfuture/1165-1.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
Since we are backporting #2145 to branch-0.3, and the configuration entry `celeborn.client.rpc.shared.threads` in #2145
has a start version of 0.4.0, this update aligns the version accordingly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #2153 from cfmcgrady/celeborn-1160-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
One user reported that LifecycleManager's parmap can create a huge number of threads and cause OOM.

There are four places where parmap is called:
1. When LifecycleManager commits files
2. When LifecycleManager reserves slots
3. When LifecycleManager sets up connections to workers
4. When StorageManager calls close
This PR fixes the first one. In more detail, this PR eliminates `parmap` when committing files, and also replaces `askSync` with `ask`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test and GA.
Closes #2145 from waitinfuture/1160.
Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
A master follower now clears its state before installing a snapshot, instead of adding the snapshot on top of the existing state.
### Why are the changes needed?
When a master's follower node receives a status snapshot from the leader, it updates the state machine directly without cleaning up the outdated status. This can cause problems; for example, the worker list may end up with an extra copy of the registered workers.
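The difference between adding and replacing can be sketched with a toy state holder. `SketchMetaState` and its `workers` list are hypothetical and only illustrate the clear-then-install order, not the actual master state machine:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified follower state.
class SketchMetaState {
  final List<String> workers = new ArrayList<>();

  void installSnapshot(List<String> snapshotWorkers) {
    // Drop the outdated state first; adding the snapshot on top of it
    // would duplicate workers that are already registered locally.
    workers.clear();
    workers.addAll(snapshotWorkers);
  }
}
```

If the follower already holds `worker-1` and the snapshot contains `worker-1` and `worker-2`, the state after installation has two entries instead of three.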
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
org.apache.celeborn.service.deploy.master.clustermeta.ha.MasterStateMachineSuiteJ
Closes #2147 from liying919/main.
Authored-by: 宪英 <xianying.ly@antgroup.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>