1061 Commits

Author SHA1 Message Date
Angerszhuuuu
52dcd3b5df [CELEBORN-777][BUG] CongestionControl getPotentialConsumeSpeed throw /zero error
### What changes were proposed in this pull request?
In `TimeSlidingHub.add()`, when the new node exceeds the existing sliding list, `_deque` is cleared and then the new pair is added.

```
      if (nodesToAdd >= maxQueueSize) {
        // The new node exceed existing sliding list, need to clear all old nodes
        // and create a new sliding list
        _deque.clear();
        _deque.add(Pair.of(currentTimestamp, (N) newNode.clone()));
        sumNode = (N) newNode.clone();
        return;
      }

```

Then, when `BufferStatusHub.avgBytesPerSec()` is called, `currentNumBytes` can be `> 0` while `getCurrentTimeWindowsInMills` returns 0, which causes the divide-by-zero error.

```
  public long avgBytesPerSec() {
    long currentNumBytes = sum().numBytes();
    if (currentNumBytes > 0) {
      return currentNumBytes * 1000 / (long) getCurrentTimeWindowsInMills();
    }
    return 0L;
  }
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1690 from AngersZhuuuu/CELEBORN-777.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-08 21:46:37 +08:00
Cheng Pan
e314f85087
[CELEBORN-614][FOLLOWUP] Fix flushOnMemoryPressure condition
### What changes were proposed in this pull request?

Fix the refactor bug of CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517).

### Why are the changes needed?

This is a bug fix. The condition `writer.getException != null` was accidentally inverted during CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517), which turned the trim into a no-op.

### Does this PR introduce _any_ user-facing change?

No. The bug was caused by an unreleased commit.

### How was this patch tested?

Set Worker off-heap memory to 2G, and run 1T tera sort.

Before: the trim does not trigger a disk buffer flush, so the worker cannot recover from the pause pushdata state, and the Job fails.

After: the trim correctly triggers disk buffer flush, releases the worker memory, and the Job succeeded.

<img width="1653" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ef62c78-e6a9-497f-9dac-d3f712e830cc">

Closes #1689 from pan3793/CELEBORN-614-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-07-07 15:33:09 +08:00
Cheng Pan
a556b02bc1
[CELEBORN-747][FOLLOWUP] Fix nanoDurationToString and add UT
### What changes were proposed in this pull request?

Sorry, the recently added `nanoDurationToString` is not correct; this PR fixes it and adds a UT to verify.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No, unreleased change.

### How was this patch tested?

UT.

Closes #1688 from pan3793/CELEBORN-747-followup2.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 19:34:35 +08:00
Cheng Pan
ed035d7ab0
[CELEBORN-747][FOLLOWUP] avgFlushTime and avgFetchTime are in nano seconds
### What changes were proposed in this pull request?

`avgFlushTime` and `avgFetchTime` are in nanoseconds, but they were accidentally formatted with `msDurationToString`, which produced unreasonable logs.

```
usableSpace: 1602.1 GiB, avgFlushTime: 1.35 h, avgFetchTime: 1187.66 h
```
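
The log line above is what a nanosecond value looks like when it is formatted as if it were milliseconds. The following is a minimal sketch of that unit mix-up. `formatMillis` and `formatNanos` are hypothetical helpers written for illustration only; they are not Celeborn's `msDurationToString` or `nanoDurationToString`, and their output format is an assumption.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

final class DurationFormatSketch {
  private DurationFormatSketch() {}

  /** Formats a duration that is given in milliseconds. */
  static String formatMillis(long ms) {
    if (ms >= 3_600_000L) return String.format(Locale.ROOT, "%.2f h", ms / 3_600_000.0);
    if (ms >= 60_000L) return String.format(Locale.ROOT, "%.2f min", ms / 60_000.0);
    if (ms >= 1_000L) return String.format(Locale.ROOT, "%.2f s", ms / 1_000.0);
    return ms + " ms";
  }

  /** Formats a duration that is given in nanoseconds by converting the unit first. */
  static String formatNanos(long ns) {
    if (ns < 1_000_000L) return ns + " ns";
    // Whole milliseconds are enough for a log line; the remainder is dropped.
    return formatMillis(TimeUnit.NANOSECONDS.toMillis(ns));
  }
}
```

For a raw value of 4,860,000, `formatMillis` reports `1.35 h`, while `formatNanos` reports `4 ms`, which is the kind of difference seen in the log.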

### Why are the changes needed?

Fix logs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT is updated

Closes #1687 from pan3793/CELEBORN-747-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 17:59:49 +08:00
Fu Chen
2bd1d86d41
[CELEBORN-775] Fix executorCores calculation in SparkShuffleManager for Spark local mode
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

```shell
$ bin/spark-shell --master local[2]
23/07/06 16:11:30 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/06 16:11:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as 'sc' (master = local[2], app id = local-1688631101733).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sparkContext.getConf.get("spark.executor.cores")
java.util.NoSuchElementException: spark.executor.cores
  at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.SparkConf.get(SparkConf.scala:245)
  ... 47 elided

scala>
```
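
The session above shows that `spark.executor.cores` is not set in local mode, so reading it without a fallback throws. A minimal sketch of one possible fallback follows, assuming the core count can be derived from the `local[N]` master string. The class `ExecutorCoresSketch`, the plain `Map` used as configuration, and the default of 1 for other masters are all assumptions for illustration; the actual logic in `SparkShuffleManager` may differ.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExecutorCoresSketch {
  // Matches "local[4]" or "local[*]"; other local forms are not handled here.
  private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");

  private ExecutorCoresSketch() {}

  static int resolve(Map<String, String> conf) {
    String explicit = conf.get("spark.executor.cores");
    if (explicit != null) {
      return Integer.parseInt(explicit);
    }
    String master = conf.getOrDefault("spark.master", "");
    Matcher m = LOCAL_N.matcher(master);
    if (m.matches()) {
      String n = m.group(1);
      return "*".equals(n) ? Runtime.getRuntime().availableProcessors() : Integer.parseInt(n);
    }
    // Assumption: plain "local" and any other master fall back to a single core.
    return 1;
  }
}
```

With `spark.master` set to `local[2]` and no explicit `spark.executor.cores`, `resolve` returns 2 instead of failing with `NoSuchElementException`.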

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CelebornPipelineSortSuite should cover this change

Closes #1685 from cfmcgrady/local-core-number.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 16:29:59 +08:00
Cheng Pan
4f8e72f217
[CELEBORN-774] Pullout celeborn.rpc.dispatcher.threads to CelebornConf
### What changes were proposed in this pull request?

Pull out the hardcoded `celeborn.rpc.dispatcher.numThreads` into `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with the existing configuration style.

### Why are the changes needed?

Pull out the inline configuration into `CelebornConf` and expose it in the configuration docs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1684 from pan3793/CELEBORN-774.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-06 16:23:32 +08:00
zky.zhoukeyong
09881f5cff [CELEBORN-769] Change default value of celeborn.client.push.maxReqsInFlight to 16

### What changes were proposed in this pull request?
Change default value of celeborn.client.push.maxReqsInFlight to 16.

### Why are the changes needed?
Previous value 4 is too small, 16 is more reasonable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #1683 from waitinfuture/769.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-06 10:22:06 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager cleans expired app directories when it starts, so a newly created worker will try to delete any app directories it does not know about.
This can cause app directories that are still in use to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
zky.zhoukeyong
4300835363 [CELEBORN-768] Change default config values for batch rpcs and netty memory allocator

### What changes were proposed in this pull request?
Changes the following configs' default values
| config  | previous value | current value |
| ------------- | ------------- | ------------- |
| celeborn.network.memory.allocator.share  | false | true |
| celeborn.client.shuffle.batchHandleChangePartition.enabled  | false | true |
| celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | true |

### Why are the changes needed?
In my test, when graceful shutdown is enabled but `celeborn.client.shuffle.batchHandleChangePartition.enabled` and `celeborn.client.shuffle.batchHandleCommitPartition.enabled` are disabled, the worker takes much longer to stop than when the two configs are enabled.
In another test where the worker is quite small (2 cores, 4 G) and replication is on, if the shared allocator is disabled, Netty's onTrim fails to release memory, which in turn causes push data timeouts.

### Does this PR introduce _any_ user-facing change?
No, these configs were introduced in 0.3.0.

### How was this patch tested?
Passes GA.

Closes #1682 from waitinfuture/768.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:16:41 +08:00
zhongqiang.czq
95f08300e5 [CELEBORN-766][LICENSE] Upgrade notice file responding to dependency upgrading

### What changes were proposed in this pull request?

upgrade notice-binary for commons-lang and commons-io

### Why are the changes needed?
In previous PRs, the following dependencies were upgraded, and their notice files were updated too.
Bump commons-io to 2.13.0 [[CELEBORN-743]](https://issues.apache.org/jira/projects/CELEBORN/issues/CELEBORN-743?filter=allissues)
Bump commons-lang3 to 3.12.0 [[CELEBORN-736]](https://issues.apache.org/jira/projects/CELEBORN/issues/CELEBORN-736?filter=allissues)
1. NOTICE.txt  in commons-lang3-3.12.0.jar
```
Apache Commons Lang
Copyright 2001-2021 The Apache Software Foundation
```
2. NOTICE.txt  in commons-io-2.13.0.jar
```
Apache Commons IO
Copyright 2002-2023 The Apache Software Foundation
```

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manually

Closes #1681 from zhongqiangczq/notice-0705.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-07-05 18:13:21 +08:00
Fu Chen
3af5c231c7 [CELEBORN-767][DOC] Update the docs of celeborn.client.spark.push.sort.memory.threshold
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1680 from cfmcgrady/docs.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 18:07:09 +08:00
zhongqiang.czq
a0f4be67a9 [CELEBORN-765][DOC] Disable partitionSplit in Flink engine related configurations

### What changes were proposed in this pull request?
In the README, a setting that disables partitionSplit should be added to the Flink engine related configurations.

### Why are the changes needed?
Currently, MapPartition split is not supported, but shuffle partition split is enabled by default, so an error is thrown when a Flink task's shuffle data size exceeds 1G (the default).

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manually

Closes #1679 from zhongqiangczq/readme.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-07-05 18:04:10 +08:00
Cheng Pan
5b3f43dffc
[CELEBORN-763] Add --add-opens to bootstrap shell scripts
### What changes were proposed in this pull request?

Add --add-opens to bootstrap shell scripts

### Why are the changes needed?

Additional `--add-opens` options are required for Java 17. Note that the `--add-opens` list is copied from Spark and was used for UT; I am not sure each of them is required, but at least the UT passed with them.

Details supplied by cfmcgrady

[JEP 403](https://openjdk.java.net/jeps/403), targeted for [JDK 17](https://openjdk.java.net/projects/jdk/17/), will remove the `--illegal-access` flag. That will be equivalent to `--illegal-access=deny`.

This means using reflection to invoke protected methods of exported `java.*` APIs will no longer work. For example:

```shell
> /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/bin/jshell
|  Welcome to JShell -- Version 17.0.7
|  For an introduction type: /help intro

jshell> java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1);
direct ==> java.nio.DirectByteBuffer[pos=0 lim=1 cap=1]

jshell> direct.getClass().getDeclaredConstructor(long.class, int.class).setAccessible(true);
|  Exception java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module 34c45dca
|        at AccessibleObject.checkCanSetAccessible (AccessibleObject.java:354)
|        at AccessibleObject.checkCanSetAccessible (AccessibleObject.java:297)
|        at Constructor.checkCanSetAccessible (Constructor.java:188)
|        at Constructor.setAccessible (Constructor.java:181)
|        at (#2:1)

jshell>

```

```shell
>  /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/bin/jshell -R --add-opens=java.base/java.nio=ALL-UNNAMED
|  Welcome to JShell -- Version 17.0.7
|  For an introduction type: /help intro

jshell> java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1);
direct ==> java.nio.DirectByteBuffer[pos=0 lim=1 cap=1]

jshell> direct.getClass().getDeclaredConstructor(long.class, int.class).setAccessible(true);

jshell>
```

### Does this PR introduce _any_ user-facing change?

Yes, for Java 17 support.

### How was this patch tested?

CI and review

Closes #1677 from pan3793/CELEBORN-763.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-05 11:31:21 +08:00
Cheng Pan
de0fd8cc44 [CELEBORN-762] Always set JVM opts -XX:+IgnoreUnrecognizedVMOptions
### What changes were proposed in this pull request?

Always set JVM opts `-XX:+IgnoreUnrecognizedVMOptions`

### Why are the changes needed?

By default, the JVM fails to start when unknown options are set, which is unfriendly to users who want to use different versions of the JDK.

### Does this PR introduce _any_ user-facing change?

Yes, users can successfully start Celeborn even if they provide unknown JVM opts.

### How was this patch tested?

Review.

Closes #1676 from pan3793/CELEBORN-762.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-04 21:37:19 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Cheng Pan
26aaba14d4 [CELEBORN-637][FOLLOWUP] Mention configurations change in migration guide
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

mention configuration behavior change in migration guide

### Does this PR introduce _any_ user-facing change?

Yes, the migration guide is updated

### How was this patch tested?

review

Closes #1673 from pan3793/CELEBORN-637-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 14:26:43 +08:00
Fu Chen
3964861fd7
[CELEBORN-726][FOLLOWUP] Eliminate 'TODO' comments within the Controller class
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

eliminate comments introduced in https://github.com/apache/incubator-celeborn/pull/1650

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1672 from cfmcgrady/primary-replica-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-03 13:40:30 +08:00
Cheng Pan
a1be02b4fa [CELEBORN-757] Improve metrics method signature and code style
### What changes were proposed in this pull request?

- gauge method definition improvement. i.e.

  before
  ```
  def addGauge[T](name: String, f: Unit => T, labels: Map[String, String])
  ```
  after
  ```
  def addGauge[T](name: String, labels: Map[String, String])(f: () => T)
  ```
  which improves the caller-side code style
  ```
  addGauge(name, labels) { () =>
    ...
  }
  ```

- remove unnecessary Java/Scala collection conversion. Since Scala 2.11 does not support SAM, the extra implicit function is required.

- leverage Logging to defer message evaluation

- UPPER_CASE string constants

### Why are the changes needed?

Improve code quality and performance(maybe)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1670 from pan3793/CELEBORN-757.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 11:56:43 +08:00
Angerszhuuuu
7880c52fff [CELEBORN-745] Match TransportMessage type use number instead of enum
### What changes were proposed in this pull request?
Match the TransportMessage type by number instead of by enum, so that MessageType names can be changed after this PR.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1658 from AngersZhuuuu/CELEBORN-745.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-01 18:50:02 +08:00
zky.zhoukeyong
af0f5e5784 [CELEBORN-755][FOLLOWUP] Avoid unnecessary memory copy when compression disabled

### What changes were proposed in this pull request?
Avoid memory copy for code path where compression is disabled. Followup of https://github.com/apache/incubator-celeborn/pull/1669

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #1671 from waitinfuture/755.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-01 18:27:33 +08:00
xiyu.zk
381165d4e7
[CELEBORN-755] Support disable shuffle compression
### What changes were proposed in this pull request?
Support deciding, through configuration, whether to compress shuffle data.

### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.

### Does this PR introduce _any_ user-facing change?
no.

Closes #1669 from kerwin-zk/celeborn-755.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-01 00:03:50 +08:00
Fu Chen
047e90b17b
[CELEBORN-756][WORKER] Refactor PushDataHandler class to utilize while loop
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, use `while` loop for performance-sensitive code

worker's flame graph before:

![Screenshot 2023-06-30 5.58.02 PM](https://github.com/apache/incubator-celeborn/assets/8537877/28c199b6-a29b-4501-8064-e0f2ddb2a8b9)

after:

![Screenshot 2023-06-30 5.58.18 PM](https://github.com/apache/incubator-celeborn/assets/8537877/c6134959-5f78-436b-aa29-a78882b09e84)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1668 from cfmcgrady/while-loop-2.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 18:13:53 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Angerszhuuuu
6e35745736
[CELEBORN-753] Rename spark patch file name to make it more clear
### What changes were proposed in this pull request?
Rename spark patch file name to make it more clear

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1666 from AngersZhuuuu/CELEBORN-753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-30 11:41:12 +08:00
mingji
742815f285
[CELEBORN-749] Update grafana dashboard to remove "RSS"
### What changes were proposed in this pull request?
Update Grafana dashboard and its setup demo to remove the old name "RSS"

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
No test needed.

Closes #1663 from FMX/CELEBORN-749.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 20:44:09 +08:00
Cheng Pan
df3fc194fb
[CELEBORN-744] Add Benchmark framework and ComputeIfAbsentBenchmark
### What changes were proposed in this pull request?

The benchmark shows that `computeIfAbsent` still has better performance in the simple case.

```
================================================================================================
HashMap
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1
Apple M1 Pro
HashMap:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
putIfAbsent                                         701            702           0         95.7          10.4       1.0X
computeIfAbsent                                     534            535           1        125.6           8.0       1.3X

================================================================================================
ConcurrentHashMap
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1
Apple M1 Pro
ConcurrentHashMap:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
putIfAbsent                                         712            716           3         94.2          10.6       1.0X
computeIfAbsent                                     702            705           2         95.6          10.5       1.0X
```

### Why are the changes needed?

Introduce a Benchmark framework for future performance sensitive case measurement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1657 from pan3793/CELEBORN-744.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 20:19:30 +08:00
Fu Chen
baa0d0b3b4
[CELEBORN-750] Simplify get Netty used direct memory logic
### What changes were proposed in this pull request?

Netty has exposed the public API `PlatformDependent.usedDirectMemory()` for getting its used direct memory since [netty-4.1.35.Final](https://github.com/netty/netty/pull/8945); this PR uses it to simplify the logic.

### Why are the changes needed?

Simplify the logic for getting Netty's used direct memory.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1662 from cfmcgrady/netty-used-memory.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-29 18:24:16 +08:00
Fu Chen
adbd38a926
[CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00
Angerszhuuuu
1fd8833756
[CELEBORN-748] Rename RssHARetryClient to MasterClient
### What changes were proposed in this pull request?

Rename RssHARetryClient to MasterClient

### Why are the changes needed?

Code refactor

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1661 from AngersZhuuuu/CELEBORN-748.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:47:15 +08:00
Cheng Pan
11569689be
[CELEBORN-747] Print time/bytes in human-readable format
### What changes were proposed in this pull request?

Print time/bytes in human-readable format

### Why are the changes needed?

Make logs readable

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1659 from pan3793/minor.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:46:12 +08:00
cchung100m
4e1e97b2c2
[CELEBORN-478] Replace putIfAbsent with computeIfAbsent in ConcurrentHashMap in Java code
### What changes were proposed in this pull request?

Replace `putIfAbsent` with `computeIfAbsent` in `ConcurrentHashMap`

### Why are the changes needed?

`putIfAbsent` always evaluates its value argument, even when the key is already present, which is wasteful if building the value is a time-consuming operation. So we'd better replace `putIfAbsent` with `computeIfAbsent` in some critical paths.
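
A minimal sketch of the difference, assuming a value whose construction is expensive. The class `LazyInitSketch` and its counter are hypothetical and exist only to make the number of constructions observable.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyInitSketch {
  private LazyInitSketch() {}

  /** Counts how often the value is built when using putIfAbsent. */
  static int buildsWithPutIfAbsent(int calls) {
    AtomicInteger builds = new AtomicInteger();
    ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    for (int i = 0; i < calls; i++) {
      // The argument expression is evaluated before the map is consulted.
      cache.putIfAbsent("key", expensive(builds));
    }
    return builds.get();
  }

  /** Counts how often the value is built when using computeIfAbsent. */
  static int buildsWithComputeIfAbsent(int calls) {
    AtomicInteger builds = new AtomicInteger();
    ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    for (int i = 0; i < calls; i++) {
      // The mapping function only runs when "key" has no value yet.
      cache.computeIfAbsent("key", k -> expensive(builds));
    }
    return builds.get();
  }

  // Stand-in for a costly construction; it only records that it was called.
  private static String expensive(AtomicInteger builds) {
    builds.incrementAndGet();
    return "value";
  }
}
```

For three calls on the same key, the `putIfAbsent` variant builds the value three times, while the `computeIfAbsent` variant builds it once.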

### Does this PR introduce _any_ user-facing change?

No, it does not affect the user-facing API

### How was this patch tested?

current UT

Closes #1567 from cchung100m/CELEBORN-478.

Lead-authored-by: cchung100m <cchung100m@cs.ccu.edu.tw>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:44:18 +08:00
Angerszhuuuu
bd7c2ea35a [CELEBORN-746][BUILD] Rename project files from rss-xx to celeborn-xx
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1660 from AngersZhuuuu/CELEBORN-746.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-29 16:30:02 +08:00
Cheng Pan
fd0cf11eca
[CELEBORN-738] Enable CI for Java 17
### What changes were proposed in this pull request?

Enable CI for Celeborn Master/Worker and Client with Spark 3.3/3.4

### Why are the changes needed?

Ensure Celeborn works on Java 17.

Note: there may be some code paths that are not covered by tests, we should fix them in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA

Closes #1649 from pan3793/CELEBORN-738.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 13:47:55 +08:00
Angerszhuuuu
1a53db22ce
[CELEBORN-739] Rename HeartbeatResponse to HeartbeatFromWorkerResponse
### What changes were proposed in this pull request?
Rename HeartbeatResponse to HeartbeatFromWorkerResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1651 from AngersZhuuuu/CELEBORN-739.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 13:08:08 +08:00
Cheng Pan
b308ac6717
[CELEBORN-742][BUILD] Bump Hadoop 3.2.4
### What changes were proposed in this pull request?

Bump Hadoop from 3.2.1 to 3.2.4.

### Why are the changes needed?

Always use the latest patched version.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1654 from pan3793/CELEBORN-742.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-29 11:48:45 +08:00
Cheng Pan
78327ebd4a [CELEBORN-743][BUILD] Bump commons-io to 2.13.0
### What changes were proposed in this pull request?

Bump commons-io to 2.13.0

### Why are the changes needed?

- https://commons.apache.org/proper/commons-io/changes-report.html#a2.9.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.10.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.11.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.12.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.13.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1655 from pan3793/CELEBORN-743.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-29 10:26:57 +08:00
Cheng Pan
3c6d90b5e5 [CELEBORN-741][BUILD] Bump Spark to latest patched version
### What changes were proposed in this pull request?

Bump Spark

- from 3.2.2 to 3.2.4
- from 3.3.1 to 3.3.2
- from 3.4.0 to 3.4.1

### Why are the changes needed?

Keep the Spark versions up-to-date

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1653 from pan3793/CELEBORN-741.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-29 10:23:40 +08:00
Cheng Pan
c33aabfa37 [CELEBORN-740] Remove usage of AccessController
### What changes were proposed in this pull request?

Remove usage of deprecated `java.security.AccessController`

### Why are the changes needed?

`AccessController` is deprecated for removal since Java 17

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/security/AccessController.html

Recover building for Java 17

```
[INFO] compiling 72 Scala sources and 209 Java sources to /home/runner/work/incubator-celeborn/incubator-celeborn/common/target/classes ...
Error:  /home/runner/work/incubator-celeborn/incubator-celeborn/common/src/main/scala/org/apache/celeborn/common/serializer/SerializationDebugger.scala:71: class AccessController in package security is deprecated
Error: [ERROR] one error found
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
scala> System.getProperty("java.version")
res0: String = 1.8.0_332

scala> System.getProperty("sun.io.serialization.extendedDebugInfo")
res1: String = null

scala> java.lang.Boolean.getBoolean("sun.io.serialization.extendedDebugInfo")
res2: Boolean = false
```
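
The REPL session above reads the flag with `java.lang.Boolean.getBoolean`, which suggests the property is now read directly rather than through `AccessController.doPrivileged`. A minimal sketch of that pattern follows; the class and method names are hypothetical and not taken from the Celeborn source.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class ExtendedDebugInfoFlag {
  static final String PROPERTY = "sun.io.serialization.extendedDebugInfo";

  // Boolean.getBoolean returns true only when the system property is set
  // and equals "true" (ignoring case); an unset property yields false.
  static boolean isEnabled() {
    return Boolean.getBoolean(PROPERTY);
  }

  public static void main(String[] args) {
    System.out.println(PROPERTY + " enabled: " + isEnabled());
  }
}
```

With the property unset, as in the session above, `isEnabled()` returns `false`, matching `res2`.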

Closes #1652 from pan3793/CELEBORN-740.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-29 10:21:02 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This PR is part of #1639. It updates the terminology used in configuration settings and metrics, and stays compatible with older client versions by making no RPC-related changes.

### Why are the changes needed?

To distinguish data replication from the existing master/worker roles, rename its terminology to 'primary/replica', which improves clarity and inclusivity in the codebase.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Cheng Pan
c2352a2f9f [CELEBORN-736][BUILD] Bump commons-lang3 3.12.0
### What changes were proposed in this pull request?

Bump commons-lang3 to the latest version

### Why are the changes needed?

- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1648 from pan3793/CELEBORN-736.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 21:15:44 +08:00
Angerszhuuuu
4c4e18b0d6 [CELEBORN-735] Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse
### What changes were proposed in this pull request?
Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1647 from AngersZhuuuu/CELEBORN-735.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 20:17:56 +08:00
Angerszhuuuu
a672db719a [CELEBORN-734] Remove unused RPC ReregisterWorkerResonse
### What changes were proposed in this pull request?
Remove unused RPC ReregisterWorkerResonse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1646 from AngersZhuuuu/CELEBORN-734.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 19:59:53 +08:00
Angerszhuuuu
590198ecea [CELEBORN-666][FOLLOWUP] Rename all RPC blacklist fields
### What changes were proposed in this pull request?
In this PR, we rename all RPC blacklist fields; this does not cause compatibility issues.

The RPCs `GetBlacklist` and `GetBlacklistResponse` are left unchanged: they will not be used in the next release, so both can be removed then.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1643 from AngersZhuuuu/CELEBORN-666-RPC.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:49:44 +08:00
Angerszhuuuu
ad13b04f2e [CELEBORN-732] Remove unused RPC ThreadDump & ThreadDumpResponse
### What changes were proposed in this pull request?
Remove unused RPC ThreadDump & ThreadDumpResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1645 from AngersZhuuuu/CELEBORN-732.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:43:39 +08:00
Angerszhuuuu
63f22342e9
[CELEBORN-730] Remove unused SlaveLostResponse
### What changes were proposed in this pull request?
Remove unused SlaveLostResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1644 from AngersZhuuuu/CELEBORN-730.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:35:23 +08:00
onebox-li
1b74d85fb1 [CELEBORN-725][MINOR] Refine congestion code
### What changes were proposed in this pull request?
Refine the congestion-related code, logs, and comments

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested.

Closes #1637 from onebox-li/improve-congestion.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:31:40 +08:00
Cheng Pan
3d7c1fa0ae [CELEBORN-729] Fix typo PbRegisterShuffle#numMappers
### What changes were proposed in this pull request?

Fix typo `numMapppers`, should be `numMappers`

### Why are the changes needed?

Fix typo

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Protobuf serialization depends on the message field number, not the field name, so the rename does not change the encoded bytes.
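
To illustrate the point, here is a minimal hand-rolled sketch of how a single varint field is laid out on the wire: the key is `(field_number << 3) | wire_type`, followed by the value. This is not the protobuf library, and the field number `3` used in the usage note is a hypothetical placeholder, not the actual number of `numMappers` in `PbRegisterShuffle`.

```java
import java.io.ByteArrayOutputStream;

// Illustrative encoder for one varint field; not the protobuf library itself.
final class VarintFieldEncoder {
  // Writes a base-128 varint, least-significant 7-bit group first.
  static void writeVarint(ByteArrayOutputStream out, long value) {
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // set continuation bit
      value >>>= 7;
    }
    out.write((int) value);
  }

  // fieldName is deliberately unused: only the field number reaches the wire.
  static byte[] encodeInt32Field(String fieldName, int fieldNumber, int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVarint(out, ((long) fieldNumber << 3) | 0); // key with wire type 0 (varint)
    writeVarint(out, value); // int widens to long with sign extension
    return out.toByteArray();
  }
}
```

Under these assumptions, `encodeInt32Field("numMapppers", 3, 300)` and `encodeInt32Field("numMappers", 3, 300)` produce the same bytes, while changing the field number changes the output.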

Closes #1642 from pan3793/CELEBORN-729.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:28:34 +08:00
Cheng Pan
b821349c4a
[CELEBORN-727][TEST] Fix flaky test RssHashCheckDiskSuite
### What changes were proposed in this pull request?

Fix the flaky test by enlarging `celeborn.client.shuffle.expired.checkInterval`

### Why are the changes needed?

```
RssHashCheckDiskSuite:
- celeborn spark integration test - hash-checkDiskFull *** FAILED ***
  868 was not less than 0 (RssHashCheckDiskSuite.scala:83)
```

https://github.com/apache/incubator-celeborn/actions/runs/5396767745/jobs/9800766633

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA, and keep observing CI.

Closes #1640 from pan3793/CELEBORN-727.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 17:59:54 +08:00
Demon Liang
a1199a9895 [CELEBORN-728] Celeborn won't clean remnant application directory on HDFS if worker is restarted
### What changes were proposed in this pull request?
Clean up remnant application directories after a Celeborn Worker is restarted.

### Why are the changes needed?
Remnant application directories are never deleted, because `hadoopFs.listFiles(path,false)` returns only files and does not list directories.

### Does this PR introduce _any_ user-facing change?
No.

Closes #1641 from Demon-Liang/0.3-dev.

Authored-by: Demon Liang <liangdingwen.ldw@alipay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 42a9160c8ceaf79bae514c54dafcb5b8e12d5251)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 17:54:08 +08:00
Angerszhuuuu
afab4a0a3b [CELEBORN-696][FOLLOWUP] Remove new allocated peer workers from pushExcludedWorkers
### What changes were proposed in this pull request?
When removing a newly allocated location's workers from pushExcludedWorkers, also remove its peer workers.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1636 from AngersZhuuuu/CELEBORN-696-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 17:38:36 +08:00