### What changes were proposed in this pull request?
In `TimeSlidingHub.add()`, `_deque` is cleared and then the new pair is added:
```
if (nodesToAdd >= maxQueueSize) {
// The new node exceed existing sliding list, need to clear all old nodes
// and create a new sliding list
_deque.clear();
_deque.add(Pair.of(currentTimestamp, (N) newNode.clone()));
sumNode = (N) newNode.clone();
return;
}
```
Then, when `BufferStatusHub.avgBytesPerSec()` is called, `currentNumBytes` can be `> 0` while `getCurrentTimeWindowsInMills` returns 0, which causes a division-by-zero error.
```
public long avgBytesPerSec() {
long currentNumBytes = sum().numBytes();
if (currentNumBytes > 0) {
return currentNumBytes * 1000 / (long) getCurrentTimeWindowsInMills();
}
return 0L;
}
```
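One possible guard can be sketched as follows. This is a minimal sketch, not the actual patch: the class `AvgRateSketch` and its parameters are hypothetical stand-ins for `sum().numBytes()` and `getCurrentTimeWindowsInMills()`, and returning 0 for an empty time window is an assumption.

```java
// Hypothetical stand-in for BufferStatusHub.avgBytesPerSec(); names are assumptions.
final class AvgRateSketch {
  // Returns bytes per second, or 0 when there is no data or the window is empty.
  static long avgBytesPerSec(long currentNumBytes, long timeWindowMillis) {
    // Checking the window as well avoids ArithmeticException: / by zero.
    if (currentNumBytes > 0 && timeWindowMillis > 0) {
      return currentNumBytes * 1000 / timeWindowMillis;
    }
    return 0L;
  }
}
```

With this guard, a non-empty sum combined with a zero-length window yields 0 instead of throwing.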
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1690 from AngersZhuuuu/CELEBORN-777.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Fix the refactor bug of CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517).
### Why are the changes needed?
This is a bug fix. The condition `writer.getException != null` was accidentally inverted during CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517), which turned the trim into a no-op.
### Does this PR introduce _any_ user-facing change?
No. The bug was caused by an unreleased commit.
### How was this patch tested?
Set Worker off-heap memory to 2G, and run 1T tera sort.
Before: the trim does not trigger a disk buffer flush, so the worker cannot recover from the pause pushdata state, and the Job fails.
After: the trim correctly triggers disk buffer flush, releases the worker memory, and the Job succeeded.
<img width="1653" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ef62c78-e6a9-497f-9dac-d3f712e830cc">
Closes #1689 from pan3793/CELEBORN-614-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
The recently added `nanoDurationToString` is not correct; this PR fixes it and adds a UT to verify.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No, unreleased change.
### How was this patch tested?
UT.
Closes #1688 from pan3793/CELEBORN-747-followup2.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
`avgFlushTime` and `avgFetchTime` are in nanoseconds, but they were accidentally formatted with `msDurationToString`, which produced unreasonable logs.
```
usableSpace: 1602.1 GiB, avgFlushTime: 1.35 h, avgFetchTime: 1187.66 h
```
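The unit mix-up can be illustrated with a small sketch. The helper names below are hypothetical and are not Celeborn's actual utilities; the point is only that a nanosecond value has to be converted before it reaches a millisecond formatter.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical formatter used only for illustration; not Celeborn's implementation.
final class DurationFormatSketch {
  // Formats a duration given in milliseconds.
  static String msToString(long ms) {
    if (ms < 1000L) return ms + " ms";
    if (ms < 60_000L) return String.format(Locale.ROOT, "%.2f s", ms / 1000.0);
    if (ms < 3_600_000L) return String.format(Locale.ROOT, "%.2f min", ms / 60_000.0);
    return String.format(Locale.ROOT, "%.2f h", ms / 3_600_000.0);
  }

  // Converts nanoseconds to milliseconds first, then reuses the ms formatter.
  static String nanosToString(long nanos) {
    return msToString(TimeUnit.NANOSECONDS.toMillis(nanos));
  }
}
```

For example, a value of 4,860,000 ns (about 4.86 ms) is rendered as `1.35 h` by `msToString` if it is mistaken for milliseconds, and as `4 ms` by `nanosToString`.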
### Why are the changes needed?
Fix logs.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT is updated
Closes #1687 from pan3793/CELEBORN-747-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
```shell
$ bin/spark-shell --master local[2]
23/07/06 16:11:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/06 16:11:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[2], app id = local-1688631101733).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.sparkContext.getConf.get("spark.executor.cores")
java.util.NoSuchElementException: spark.executor.cores
at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.SparkConf.get(SparkConf.scala:245)
... 47 elided
scala>
```
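The session above shows that `spark.executor.cores` is not set in local mode, so an unconditional lookup throws. A defensive lookup could look like the sketch below. It is only an illustration under stated assumptions: the class `ExecutorCoresSketch`, the use of a plain `Map` instead of `SparkConf`, the parsing of `local[N]` from `spark.master`, and the fallback of 1 are all hypothetical and not necessarily what this PR does.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a plain Map stands in for SparkConf to keep the sketch self-contained.
final class ExecutorCoresSketch {
  private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");

  static int resolve(Map<String, String> conf) {
    // Prefer the explicit setting when present.
    String explicit = conf.get("spark.executor.cores");
    if (explicit != null) {
      return Integer.parseInt(explicit);
    }
    // Otherwise try to derive the core count from a local[N] master URL.
    Matcher m = LOCAL_N.matcher(conf.getOrDefault("spark.master", ""));
    if (m.matches()) {
      String n = m.group(1);
      return "*".equals(n) ? Runtime.getRuntime().availableProcessors() : Integer.parseInt(n);
    }
    // Assumed fallback when nothing else is known.
    return 1;
  }
}
```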
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CelebornPipelineSortSuite should cover this change
Closes #1685 from cfmcgrady/local-core-number.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Pull the hardcoded `celeborn.rpc.dispatcher.numThreads` out into `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with the existing configuration style.
### Why are the changes needed?
Pull the inline configuration out into `CelebornConf` and expose it in the configuration docs.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1684 from pan3793/CELEBORN-774.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
…Flight to 16
### What changes were proposed in this pull request?
Change default value of celeborn.client.push.maxReqsInFlight to 16.
### Why are the changes needed?
The previous value of 4 is too small; 16 is more reasonable.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1683 from waitinfuture/769.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.
### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories.
This can cause app directories that are still in use to be deleted unexpectedly.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
UT and cluster.
Closes #1678 from FMX/CELEBORN-764.
Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…memory allocator
### What changes were proposed in this pull request?
Changes the following configs' default values
| config | previous value | current value |
| ------------- | ------------- | ------------- |
| celeborn.network.memory.allocator.share | false | true |
| celeborn.client.shuffle.batchHandleChangePartition.enabled | false | true |
| celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | true |
### Why are the changes needed?
In my test, when graceful shutdown is enabled but `celeborn.client.shuffle.batchHandleChangePartition.enabled` and `celeborn.client.shuffle.batchHandleCommitPartition.enabled` are disabled, the worker takes much longer to stop than when the two configs are enabled.
In another test, where the worker is quite small (2 cores, 4 G) and replication is on, disabling the shared allocator makes Netty's onTrim fail to release memory, which in turn causes push data timeouts.
### Does this PR introduce _any_ user-facing change?
No, these configs were introduced in 0.3.0.
### How was this patch tested?
Passes GA.
Closes #1682 from waitinfuture/768.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…upgrading
### What changes were proposed in this pull request?
upgrade notice-binary for commons-lang and commons-io
### Why are the changes needed?
In previous PRs, the following dependencies were upgraded, and their NOTICE files changed as well.
Bump commons-io to 2.13.0 [[CELEBORN-743]](https://issues.apache.org/jira/projects/CELEBORN/issues/CELEBORN-743?filter=allissues)
Bump commons-lang3 to 3.12.0 [[CELEBORN-736]](https://issues.apache.org/jira/projects/CELEBORN/issues/CELEBORN-736?filter=allissues)
1. NOTICE.txt in commons-lang3-3.12.0.jar
```
Apache Commons Lang
Copyright 2001-2021 The Apache Software Foundation
```
2. NOTICE.txt in commons-io-2.13.0.jar
```
Apache Commons IO
Copyright 2002-2023 The Apache Software Foundation
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
manually
Closes #1681 from zhongqiangczq/notice-0705.
Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1680 from cfmcgrady/docs.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…nfigurations
### What changes were proposed in this pull request?
In Doc Readme, setting partitionSplit to false should be added in Flink engine related configurations.
### Why are the changes needed?
Currently, MapPartition split is not supported, but shuffle partition split is enabled by default, so an error is thrown when a Flink task's shuffle data size exceeds 1G (the default).
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
manually
Closes #1679 from zhongqiangczq/readme.
Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
### What changes were proposed in this pull request?
Add --add-opens to bootstrap shell scripts
### Why are the changes needed?
Additional `--add-opens` options are required for Java 17. Note that the `--add-opens` list is copied from Spark and was used for UT; I am not sure each of them is required, but at least the UT passed with them.
Details supplied by cfmcgrady
[JEP 403](https://openjdk.java.net/jeps/403), targeted for [JDK 17](https://openjdk.java.net/projects/jdk/17/), will remove the `--illegal-access` flag, which is equivalent to always running with `--illegal-access=deny`.
This means that using reflection to invoke protected methods of exported `java.*` APIs will no longer work. For example:
```shell
> /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/bin/jshell
| Welcome to JShell -- Version 17.0.7
| For an introduction type: /help intro
jshell> java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1);
direct ==> java.nio.DirectByteBuffer[pos=0 lim=1 cap=1]
jshell> direct.getClass().getDeclaredConstructor(long.class, int.class).setAccessible(true);
| Exception java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module 34c45dca
| at AccessibleObject.checkCanSetAccessible (AccessibleObject.java:354)
| at AccessibleObject.checkCanSetAccessible (AccessibleObject.java:297)
| at Constructor.checkCanSetAccessible (Constructor.java:188)
| at Constructor.setAccessible (Constructor.java:181)
| at (#2:1)
jshell>
```
```shell
> /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/bin/jshell -R --add-opens=java.base/java.nio=ALL-UNNAMED
| Welcome to JShell -- Version 17.0.7
| For an introduction type: /help intro
jshell> java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1);
direct ==> java.nio.DirectByteBuffer[pos=0 lim=1 cap=1]
jshell> direct.getClass().getDeclaredConstructor(long.class, int.class).setAccessible(true);
jshell>
```
### Does this PR introduce _any_ user-facing change?
Yes, for Java 17 support.
### How was this patch tested?
CI and review
Closes #1677 from pan3793/CELEBORN-763.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Always set JVM opts `-XX:+IgnoreUnrecognizedVMOptions`
### Why are the changes needed?
By default, the JVM fails to start when unknown opts are set, which is unfriendly to users who want to use different versions of the JDK.
### Does this PR introduce _any_ user-facing change?
Yes, users can successfully start Celeborn even if they provide unknown JVM opts.
### How was this patch tested?
Review.
Closes #1676 from pan3793/CELEBORN-762.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1664 from AngersZhuuuu/CELEBORN-751.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
mention configuration behavior change in migration guide
### Does this PR introduce _any_ user-facing change?
Yes, the migration guide is updated
### How was this patch tested?
review
Closes #1673 from pan3793/CELEBORN-637-followup.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
eliminate comments introduced in https://github.com/apache/incubator-celeborn/pull/1650
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1672 from cfmcgrady/primary-replica-followup.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
- gauge method definition improvement. i.e.
before
```
def addGauge[T](name: String, f: Unit => T, labels: Map[String, String])
```
after
```
def addGauge[T](name: String, labels: Map[String, String])(f: () => T)
```
which improves the caller-side code style
```
addGauge(name, labels) { () =>
...
}
```
- remove unnecessary Java/Scala collection conversion. Since Scala 2.11 does not support SAM, the extra implicit function is required.
- leverage Logging to defer message evaluation
- UPPER_CASE string constants
### Why are the changes needed?
Improve code quality and (possibly) performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1670 from pan3793/CELEBORN-757.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Match the `TransportMessage` type by number rather than by enum name, so that `MessageType` names can be changed after this PR.
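The idea of dispatching on a stable number instead of a name can be sketched as below. The enum `MessageKindSketch` and its constants are hypothetical and do not mirror Celeborn's actual `MessageType` definition; the sketch only shows that a lookup keyed by number keeps working when a constant is renamed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical message kinds; the numbers, not the names, are the wire-level identity.
enum MessageKindSketch {
  UNKNOWN(0),
  REGISTER_WORKER(1),
  HEARTBEAT_FROM_WORKER(2);

  private static final Map<Integer, MessageKindSketch> BY_NUMBER = new HashMap<>();

  static {
    // Build the number -> constant index once.
    for (MessageKindSketch kind : values()) {
      BY_NUMBER.put(kind.number, kind);
    }
  }

  private final int number;

  MessageKindSketch(int number) {
    this.number = number;
  }

  int number() {
    return number;
  }

  // Resolves a received number; unknown numbers map to UNKNOWN instead of failing.
  static MessageKindSketch fromNumber(int number) {
    return BY_NUMBER.getOrDefault(number, UNKNOWN);
  }
}
```

Renaming `HEARTBEAT_FROM_WORKER` in this sketch would not affect `fromNumber(2)`, which is the property that makes later renames safe.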
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1658 from AngersZhuuuu/CELEBORN-745.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
…sion disabled
### What changes were proposed in this pull request?
Avoid memory copy for code path where compression is disabled. Followup of https://github.com/apache/incubator-celeborn/pull/1669
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1671 from waitinfuture/755.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Support to decide whether to compress shuffle data through configuration.
### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.
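A configurable compression step could look roughly like the sketch below. It is an illustration only: `OptionalCompressionSketch`, the `compressionEnabled` flag, and the use of `java.util.zip.Deflater` as a stand-in codec are assumptions, not Celeborn's actual code path or configuration key.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical helper; Deflater is only a stand-in for whichever codec is configured.
final class OptionalCompressionSketch {
  static byte[] maybeCompress(byte[] data, boolean compressionEnabled) {
    if (!compressionEnabled) {
      // Already-compressed input is passed through as-is, without copying.
      return data;
    }
    Deflater deflater = new Deflater();
    try {
      deflater.setInput(data);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      while (!deflater.finished()) {
        int written = deflater.deflate(buffer);
        out.write(buffer, 0, written);
      }
      return out.toByteArray();
    } finally {
      // Release the native resources held by the deflater.
      deflater.end();
    }
  }
}
```

When the flag is off, the method returns the same array instance, so the disabled path adds neither CPU cost nor an extra copy.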
### Does this PR introduce _any_ user-facing change?
no.
Closes #1669 from kerwin-zk/celeborn-755.
Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1667 from AngersZhuuuu/CELEBORN-754.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename the Spark patch files to make their names clearer.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1666 from AngersZhuuuu/CELEBORN-753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Update Grafana dashboard and its setup demo to remove the old name "RSS"
### Why are the changes needed?
Ditto.
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
No test needed.
Closes #1663 from FMX/CELEBORN-749.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
The benchmark shows that `computeIfAbsent` still performs better in the simple case:
```
================================================================================================
HashMap
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1
Apple M1 Pro
HashMap:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
putIfAbsent                                         701            702           0         95.7          10.4       1.0X
computeIfAbsent                                     534            535           1        125.6           8.0       1.3X
================================================================================================
ConcurrentHashMap
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1
Apple M1 Pro
ConcurrentHashMap:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
putIfAbsent                                         712            716           3         94.2          10.6       1.0X
computeIfAbsent                                     702            705           2         95.6          10.5       1.0X
```
### Why are the changes needed?
Introduce a Benchmark framework for future performance sensitive case measurement.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1657 from pan3793/CELEBORN-744.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Netty has exposed the public API `PlatformDependent.usedDirectMemory()` for reading its used direct memory since [netty-4.1.35.Final](https://github.com/netty/netty/pull/8945); this PR uses it to simplify the logic.
### Why are the changes needed?
Simplify the logic for getting the direct memory used by Netty.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes #1662 from cfmcgrady/netty-used-memory.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
As title
### Why are the changes needed?
In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #1639 from cfmcgrady/primary-replica.
Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename RssHARetryClient to MasterClient
### Why are the changes needed?
Code refactor
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1661 from AngersZhuuuu/CELEBORN-748.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Print time/bytes in human-readable format
### Why are the changes needed?
Make logs readable
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1659 from pan3793/minor.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Replace `putIfAbsent` with `computeIfAbsent` in `ConcurrentHashMap`
### Why are the changes needed?
Calling `putIfAbsent` always evaluates its value argument, even when the key is already present, which is wasteful if computing the value is time-consuming. So we'd better replace `putIfAbsent` with `computeIfAbsent` in some critical paths.
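The difference can be seen in a small sketch. `LazyInitSketch` and `expensiveValue` are hypothetical names used only for illustration; the counter records how often the value is actually built.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; counts how many times the "expensive" value is built.
final class LazyInitSketch {
  static final AtomicInteger BUILDS = new AtomicInteger();

  static String expensiveValue(String key) {
    BUILDS.incrementAndGet();
    return "value-for-" + key;
  }

  // The argument is evaluated before every call, even when the key already exists.
  static int buildsWithPutIfAbsent() {
    BUILDS.set(0);
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    for (int i = 0; i < 3; i++) {
      map.putIfAbsent("k", expensiveValue("k"));
    }
    return BUILDS.get();
  }

  // The mapping function runs only when the key is absent.
  static int buildsWithComputeIfAbsent() {
    BUILDS.set(0);
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    for (int i = 0; i < 3; i++) {
      map.computeIfAbsent("k", LazyInitSketch::expensiveValue);
    }
    return BUILDS.get();
  }
}
```

In this sketch `buildsWithPutIfAbsent()` returns 3 and `buildsWithComputeIfAbsent()` returns 1.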
### Does this PR introduce _any_ user-facing change?
No, it does not affect the user-facing API
### How was this patch tested?
current UT
Closes #1567 from cchung100m/CELEBORN-478.
Lead-authored-by: cchung100m <cchung100m@cs.ccu.edu.tw>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1660 from AngersZhuuuu/CELEBORN-746.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Enable CI for Celeborn Master/Worker and Client with Spark 3.3/3.4
### Why are the changes needed?
Ensure Celeborn works on Java 17.
Note: there may be some code paths that are not covered by tests, we should fix them in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA
Closes #1649 from pan3793/CELEBORN-738.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Rename HeartbeatResponse to HeartbeatFromWorkerResponse
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1651 from AngersZhuuuu/CELEBORN-739.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Bump Hadoop from 3.2.1 to 3.2.4.
### Why are the changes needed?
Always use the latest patched version.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1654 from pan3793/CELEBORN-742.
Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump Spark
- from 3.2.2 to 3.2.4
- from 3.3.1 to 3.3.2
- from 3.4.0 to 3.4.1
### Why are the changes needed?
Keep the Spark versions up-to-date.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA.
Closes #1653 from pan3793/CELEBORN-741.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove usage of deprecated `java.security.AccessController`
### Why are the changes needed?
`AccessController` is deprecated for removal since Java 17
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/security/AccessController.html
Recover building for Java 17
```
[INFO] compiling 72 Scala sources and 209 Java sources to /home/runner/work/incubator-celeborn/incubator-celeborn/common/target/classes ...
Error: /home/runner/work/incubator-celeborn/incubator-celeborn/common/src/main/scala/org/apache/celeborn/common/serializer/SerializationDebugger.scala:71: class AccessController in package security is deprecated
Error: [ERROR] one error found
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
scala> System.getProperty("java.version")
res0: String = 1.8.0_332
scala> System.getProperty("sun.io.serialization.extendedDebugInfo")
res1: String = null
scala> java.lang.Boolean.getBoolean("sun.io.serialization.extendedDebugInfo")
res2: Boolean = false
```
Closes #1652 from pan3793/CELEBORN-740.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
This PR is an integral part of #1639. It primarily focuses on updating configuration settings and metrics terminology, while keeping compatibility with older client versions by not introducing any RPC-related changes.
### Why are the changes needed?
In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests.
Closes #1650 from cfmcgrady/primary-replica-metrics.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1647 from AngersZhuuuu/CELEBORN-735.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
Remove unused RPC ReregisterWorkerResonse
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1646 from AngersZhuuuu/CELEBORN-734.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
### What changes were proposed in this pull request?
In this PR, we rename all RPC blacklist fields; this does not introduce compatibility issues.
The RPCs `GetBlacklist` and `GetBlacklistResponse` are left unchanged: they will not be used in the next release, so we can remove these two RPCs then.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1643 from AngersZhuuuu/CELEBORN-666-RPC.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Remove unused RPC ThreadDump & ThreadDumpResponse
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1645 from AngersZhuuuu/CELEBORN-732.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Remove unused SlaveLostResponse
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1644 from AngersZhuuuu/CELEBORN-730.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
Refine the congestion relevant code/log/comments
### Why are the changes needed?
ditto
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
manually test
Closes #1637 from onebox-li/improve-congestion.
Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix typo `numMapppers`, should be `numMappers`
### Why are the changes needed?
Fix typo
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No test needed: Protobuf serde depends on the message field number, not the field name.
Closes #1642 from pan3793/CELEBORN-729.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix the flaky test by enlarging `celeborn.client.shuffle.expired.checkInterval`
### Why are the changes needed?
```
RssHashCheckDiskSuite:
- celeborn spark integration test - hash-checkDiskFull *** FAILED ***
868 was not less than 0 (RssHashCheckDiskSuite.scala:83)
```
https://github.com/apache/incubator-celeborn/actions/runs/5396767745/jobs/9800766633
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass GA; CI should be observed afterwards to confirm the fix.
Closes #1640 from pan3793/CELEBORN-727.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### What changes were proposed in this pull request?
To clean the remnant application directory after Celeborn Worker is restarted.
### Why are the changes needed?
Remnant application directories will not be deleted, because `hadoopFs.listFiles(path,false)` will not list directories.
### Does this PR introduce _any_ user-facing change?
No.
Closes #1641 from Demon-Liang/0.3-dev.
Authored-by: Demon Liang <liangdingwen.ldw@alipay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 42a9160c8ceaf79bae514c54dafcb5b8e12d5251)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
### What changes were proposed in this pull request?
When removing the workers of newly allocated locations from `pushExcludedWorkers`, their peers should also be removed.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #1636 from AngersZhuuuu/CELEBORN-696-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>